Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Extra float values appear after resampling in column used by groupby #50893

Open
3 tasks done
takarzyna opened this issue Jan 20, 2023 · 3 comments
Open
3 tasks done
Labels
Docs Numeric Operations Arithmetic, Comparison, and Logical operations

Comments

@takarzyna
Copy link

takarzyna commented Jan 20, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

timestamps = [pd.Timestamp("2020-01-01") + i * pd.Timedelta(minutes=1) for i in range(200)]
dfs = []
for i in range(10):
    df_i = pd.DataFrame.from_dict(
        {
            "Timestamp": timestamps, 
            "groupby_value": [i * 0.01] * len(timestamps),
            "value": [i + 2] * len(timestamps)
        }
    )
    dfs += [df_i]
df = pd.concat(dfs)
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

df_grouped = df.set_index("Timestamp").groupby("groupby_value").resample("1H").mean()

print(sorted(list(set(df["groupby_value"]))), "\n", sorted(list(set(df_grouped["groupby_value"]))))
""" Output:
[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09] 
 [0.0, 0.01, 0.02, 0.029999999999999995, 0.03, 0.04, 0.05, 0.05999999999999999, 0.06, 0.07, 0.08, 0.09] """
print(df_grouped.reset_index(drop=True).groupby("groupby_value").count())
""" Output:
               value
groupby_value       
0.00               4
0.01               4
0.02               4
0.03               3
0.03               1  # <- another 0.03?
0.04               4
0.05               4
0.06               3
0.06               1  # <- another 0.06?
0.07               4
0.08               4
0.09               4
"""

Issue Description

After groupby and resample, where the groupby column has float values, new values appear in that column, which are close but not equal to some of the existing values.
For instance, in the example above we get 0.029999999999999995 and 0.03 instead of 0.03.
This leads to incorrect aggregations as well (see the last output in the example).

Expected Behavior

Expected: keep original values of floats used for grouping.

Installed Versions

INSTALLED VERSIONS

commit : 3fa869e
python : 3.10.9.final.0
python-bits : 64
OS : Darwin
OS-release : 22.2.0
Version : Darwin Kernel Version 22.2.0: Fri Nov 11 02:08:47 PST 2022; root:xnu-8792.61.2~4/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.0.dev0+1265.g3fa869ef90
numpy : 1.24.1
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

@takarzyna takarzyna added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 20, 2023
@dicristina
Copy link
Contributor

The result is a consequence of floating point arithmetic. In the resample step new values appear on the groupby_value column because you take the mean of a floating point value repeated a different number of times: repeated 60 times for the first three hours and repeated 20 times for the remaining 20 minutes. These result will not necessarily be the same.


Consider the following pure numpy example:

import numpy as np
val = 0.03
np.full(20, val).mean() == np.full(60, val).mean()

In my system the arrays are of type float64 and the result is False.


Going back to the original example, here we can see that the new values arise from resample groups of different sizes:

>>> pd.set_option("display.precision", 20)

>>> df.set_index("Timestamp").groupby("groupby_value").resample("H").agg(mean=("groupby_value", "mean"), count=("groupby_value", "count"))
                                                              mean  count
groupby_value          Timestamp                                         
0.00000000000000000000 2020-01-01 00:00:00  0.00000000000000000000     60
                       2020-01-01 01:00:00  0.00000000000000000000     60
                       2020-01-01 02:00:00  0.00000000000000000000     60
                       2020-01-01 03:00:00  0.00000000000000000000     20
0.01000000000000000021 2020-01-01 00:00:00  0.01000000000000000021     60
                       2020-01-01 01:00:00  0.01000000000000000021     60
                       2020-01-01 02:00:00  0.01000000000000000021     60
                       2020-01-01 03:00:00  0.01000000000000000021     20
0.02000000000000000042 2020-01-01 00:00:00  0.02000000000000000042     60
                       2020-01-01 01:00:00  0.02000000000000000042     60
                       2020-01-01 02:00:00  0.02000000000000000042     60
                       2020-01-01 03:00:00  0.02000000000000000042     20
0.02999999999999999889 2020-01-01 00:00:00  0.02999999999999999542     60
                       2020-01-01 01:00:00  0.02999999999999999542     60
                       2020-01-01 02:00:00  0.02999999999999999542     60
                       2020-01-01 03:00:00  0.02999999999999999889     20
0.04000000000000000083 2020-01-01 00:00:00  0.04000000000000000083     60
                       2020-01-01 01:00:00  0.04000000000000000083     60
                       2020-01-01 02:00:00  0.04000000000000000083     60
                       2020-01-01 03:00:00  0.04000000000000000083     20
0.05000000000000000278 2020-01-01 00:00:00  0.05000000000000000278     60
                       2020-01-01 01:00:00  0.05000000000000000278     60
                       2020-01-01 02:00:00  0.05000000000000000278     60
                       2020-01-01 03:00:00  0.05000000000000000278     20
0.05999999999999999778 2020-01-01 00:00:00  0.05999999999999999084     60
                       2020-01-01 01:00:00  0.05999999999999999084     60
                       2020-01-01 02:00:00  0.05999999999999999084     60
                       2020-01-01 03:00:00  0.05999999999999999778     20
0.07000000000000000666 2020-01-01 00:00:00  0.07000000000000000666     60
                       2020-01-01 01:00:00  0.07000000000000000666     60
                       2020-01-01 02:00:00  0.07000000000000000666     60
                       2020-01-01 03:00:00  0.07000000000000000666     20
0.08000000000000000167 2020-01-01 00:00:00  0.08000000000000000167     60
                       2020-01-01 01:00:00  0.08000000000000000167     60
                       2020-01-01 02:00:00  0.08000000000000000167     60
                       2020-01-01 03:00:00  0.08000000000000000167     20
0.08999999999999999667 2020-01-01 00:00:00  0.08999999999999999667     60
                       2020-01-01 01:00:00  0.08999999999999999667     60
                       2020-01-01 02:00:00  0.08999999999999999667     60
                       2020-01-01 03:00:00  0.08999999999999999667     20

INSTALLED VERSIONS

commit : 87cfe4e
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-25-generic
Version : #25-Ubuntu SMP Wed Mar 30 15:54:22 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : None
pytest : 7.1.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.2
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@takarzyna
Copy link
Author

Thank you for pointing out the root cause, that makes sense. What would be the suggested way to deal with this, then? And will it become a part of Pandas or at least be mentioned in documentation? I am asking because Pandas is used by Data Scientists who are not necessarily computer scientists expecting to deal with floating point arithmetic, however I do understand that the decision about what to do with it belongs to Pandas contributors.

@rhshadrach
Copy link
Member

What would be the suggested way to deal with this, then?

That is very much dependent on the purpose of the computation. For most computation in my experience, it is okay to ignore. Other times, rounding prior to output is sufficient. If attempting to count distinct values, then binning may be appropriate (e.g. https://pandas.pydata.org/docs/reference/api/pandas.cut.html).

And will it become a part of Pandas or at least be mentioned in documentation?

I wouldn't be opposed to a section in the Basic Functionality page of the User Guide. It would be great to link to an authoritative source if possible.

@rhshadrach rhshadrach added Docs Numeric Operations Arithmetic, Comparison, and Logical operations and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 21, 2023
@rhshadrach rhshadrach changed the title BUG: Extra float values appear after resampling in column used by groupby DOC: Extra float values appear after resampling in column used by groupby Jan 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

No branches or pull requests

3 participants