DOC: Extra float values appear after resampling in column used by groupby #50893
The result is a consequence of floating point arithmetic. In the resample step, new values appear in the groupby_value column because you take the mean of a floating point value repeated a different number of times: 60 times for each of the first three hours and 20 times for the remaining 20 minutes. These results will not necessarily be equal. Consider the following pure NumPy example:

import numpy as np

val = 0.03
np.full(20, val).mean() == np.full(60, val).mean()

On my system the arrays are of type float64 and the comparison evaluates to False.

Going back to the original example, here we can see that the new values arise from resample groups of different sizes:

>>> pd.set_option("display.precision", 20)
>>> df.set_index("Timestamp").groupby("groupby_value").resample("H").agg(mean=("groupby_value", "mean"), count=("groupby_value", "count"))
mean count
groupby_value Timestamp
0.00000000000000000000 2020-01-01 00:00:00 0.00000000000000000000 60
2020-01-01 01:00:00 0.00000000000000000000 60
2020-01-01 02:00:00 0.00000000000000000000 60
2020-01-01 03:00:00 0.00000000000000000000 20
0.01000000000000000021 2020-01-01 00:00:00 0.01000000000000000021 60
2020-01-01 01:00:00 0.01000000000000000021 60
2020-01-01 02:00:00 0.01000000000000000021 60
2020-01-01 03:00:00 0.01000000000000000021 20
0.02000000000000000042 2020-01-01 00:00:00 0.02000000000000000042 60
2020-01-01 01:00:00 0.02000000000000000042 60
2020-01-01 02:00:00 0.02000000000000000042 60
2020-01-01 03:00:00 0.02000000000000000042 20
0.02999999999999999889 2020-01-01 00:00:00 0.02999999999999999542 60
2020-01-01 01:00:00 0.02999999999999999542 60
2020-01-01 02:00:00 0.02999999999999999542 60
2020-01-01 03:00:00 0.02999999999999999889 20
0.04000000000000000083 2020-01-01 00:00:00 0.04000000000000000083 60
2020-01-01 01:00:00 0.04000000000000000083 60
2020-01-01 02:00:00 0.04000000000000000083 60
2020-01-01 03:00:00 0.04000000000000000083 20
0.05000000000000000278 2020-01-01 00:00:00 0.05000000000000000278 60
2020-01-01 01:00:00 0.05000000000000000278 60
2020-01-01 02:00:00 0.05000000000000000278 60
2020-01-01 03:00:00 0.05000000000000000278 20
0.05999999999999999778 2020-01-01 00:00:00 0.05999999999999999084 60
2020-01-01 01:00:00 0.05999999999999999084 60
2020-01-01 02:00:00 0.05999999999999999084 60
2020-01-01 03:00:00 0.05999999999999999778 20
0.07000000000000000666 2020-01-01 00:00:00 0.07000000000000000666 60
2020-01-01 01:00:00 0.07000000000000000666 60
2020-01-01 02:00:00 0.07000000000000000666 60
2020-01-01 03:00:00 0.07000000000000000666 20
0.08000000000000000167 2020-01-01 00:00:00 0.08000000000000000167 60
2020-01-01 01:00:00 0.08000000000000000167 60
2020-01-01 02:00:00 0.08000000000000000167 60
2020-01-01 03:00:00 0.08000000000000000167 20
0.08999999999999999667 2020-01-01 00:00:00 0.08999999999999999667 60
2020-01-01 01:00:00 0.08999999999999999667 60
2020-01-01 02:00:00 0.08999999999999999667 60
2020-01-01 03:00:00 0.08999999999999999667 20

INSTALLED VERSIONS (abridged): commit : 87cfe4e, pandas : 1.5.0
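To make the mechanism concrete, here is a small sketch (not from the original thread) showing the 20-element and 60-element means landing on different float64 values, matching the 0.02999999999999999889 and 0.02999999999999999542 rows in the table above:

import numpy as np

val = np.float64(0.03)
for n in (20, 60):
    m = np.full(n, val).mean()
    # the 60-element sum accumulates rounding error that dividing
    # by 60 does not undo, so m can drift off val's bit pattern
    print(n, repr(m), m == val)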
Thank you for pointing out the root cause; that makes sense. What would be the suggested way to deal with this, then? And will it become part of pandas, or at least be mentioned in the documentation? I am asking because pandas is used by data scientists who are not necessarily computer scientists expecting to deal with floating point arithmetic. That said, I understand that the decision about what to do with it belongs to the pandas contributors.
That is very much dependent on the purpose of the computation. For most computations, in my experience, it is okay to ignore. Other times, rounding prior to output is sufficient. If attempting to count distinct values, then binning may be appropriate (e.g. https://pandas.pydata.org/docs/reference/api/pandas.cut.html); a sketch of both workarounds follows.
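A minimal sketch of the two workarounds mentioned above, assuming the df from the reproducible example; the number of decimals and the bin edges are illustrative choices, not part of the original discussion:

import numpy as np
import pandas as pd

# Workaround 1: round the float key so near-equal values collapse
# onto one bit pattern before any further aggregation
df["key"] = df["groupby_value"].round(6)

# Workaround 2: bin the float key with pd.cut, which makes the grouping
# tolerance explicit (bin edges centered on 0.00, 0.01, ..., 0.09)
df["key"] = pd.cut(df["groupby_value"], bins=np.linspace(-0.005, 0.095, 11))

# Then group by the new key instead of the raw float:
out = df.set_index("Timestamp").groupby("key").resample("H").agg(
    n=("groupby_value", "count")
)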
I wouldn't be opposed to a section in the Basic Functionality page of the User Guide. It would be great to link to an authoritative source if possible.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
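(The original code block did not survive extraction; the following is a sketch that reproduces the output shown in the discussion above, assuming ten float keys 0.00 through 0.09, each observed once per minute for 200 minutes.)

import numpy as np
import pandas as pd

# Ten float group keys, each observed once per minute for 200 minutes
timestamps = pd.date_range("2020-01-01", periods=200, freq="min")
keys = np.arange(10) / 100  # 0.00, 0.01, ..., 0.09

df = pd.DataFrame(
    {
        "Timestamp": np.tile(timestamps, len(keys)),
        "groupby_value": np.repeat(keys, len(timestamps)),
    }
)

result = (
    df.set_index("Timestamp")
    .groupby("groupby_value")
    .resample("H")
    .agg(mean=("groupby_value", "mean"))
)
# 0.029999999999999995 appears alongside 0.03 among the hourly means
print(sorted(result["mean"].unique()))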
Issue Description
After groupby and resample, where the groupby column has float values, new values appear in that column which are close to, but not equal to, some of the existing values.
For instance, in the example above we get both 0.029999999999999995 and 0.03 instead of just 0.03.
This leads to incorrect aggregations as well (see the last output in the example).
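(The last output referenced above is also missing from this extract; a hypothetical illustration, reusing result from the sketch in the Reproducible Example section: 0.03 and 0.06 each split into two nearby floats, so an aggregation keyed on the hourly means sees more groups than the ten original keys.)

# Aggregating on the hourly means treats 0.029999999999999995 and 0.03
# as distinct keys, so ten logical groups become twelve
print(result.groupby("mean").size())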
Expected Behavior
Expected: keep original values of floats used for grouping.
Installed Versions
INSTALLED VERSIONS
commit : 3fa869e
python : 3.10.9.final.0
python-bits : 64
OS : Darwin
OS-release : 22.2.0
Version : Darwin Kernel Version 22.2.0: Fri Nov 11 02:08:47 PST 2022; root:xnu-8792.61.2~4/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.0.dev0+1265.g3fa869ef90
numpy : 1.24.1
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None