mean() with option na_rm=False does not work (#65)
pandas ignores NAs anyway in `agg`:

```python
>>> df.groupby('id').agg(np.mean)
        value
id  <float64>
A         2.0
B         3.0
>>> df.groupby('id').agg(np.nanmean)
        value
id  <float64>
A         2.0
B         3.0
```

Actually, the NAs in the first case should not be ignored, but pandas did that. I think this is also related: pandas-dev/pandas#15675

The current solution for [...]
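Part of why both calls above look identical: when `np.mean` receives a pandas Series, it delegates to the object's own `.mean()`, and pandas' default is `skipna=True`. A minimal standalone sketch (plain pandas/numpy, not datar-specific):

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan])

# np.mean delegates to the object's own .mean(), so pandas' default
# skipna=True kicks in and the NaN silently disappears:
print(np.mean(s))             # 2.0

# On the raw ndarray, NaN propagates as expected:
print(s.to_numpy().mean())    # nan

# Being explicit restores the strict behavior on the Series too:
print(s.mean(skipna=False))   # nan
```

So even "NaN-aware vs. NaN-naive" numpy functions converge to the same skipping behavior once a Series is involved.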
Hey @pwwang, thanks for the very good feedback as always. Yes, it is the standard behavior of Pandas, which is wrong and dreadful in my opinion... Python would be much better if, someday in the future, https://github.com/h2oai/datatable could replace Pandas as the default data library.
It's fixed by ba8b3e7, and will be released in the next version.

```python
>>> df.groupby('id').agg(value=('value', lambda x: mean(x)))
        value
id  <float64>
A         NaN
B         3.0
```

But then we lose pandas' optimization on:

```python
df >> group_by(f.id) >> summarise(m=mean(f.value, na_rm=True))
# since pandas ignores NAs anyway
```

This needs to be documented, for sure. For the [...]
Great man! Thank you!
* 🔧 Add metadata for datasets
* 📝 Mention datar-cli in README
* 🔊 Send logs to stderr
* 📌 Pin dependency versions; 🚨 Switch to flake8
* 🔖 0.5.2
* 🔊 Update CHANGELOG
* ⚡️ Optimize dplyr.arrange when data are series from the df
* 🔧 Update coveragerc
* 🐛 Fix #63
* 📝 Update doc for argument `by` for join functions (#62)
* 🐛 Fix #65
* 🔖 0.5.3
* 🔥 Remove prints from tests
Hey @pwwang, I believe this issue regressed. In the latest version:

```python
import datar.all as d
from datar import f
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A']*2 + ['B']*2,
    'date': ['2020-01-01', '2020-02-01']*2,
    'value': [2, np.nan, 3, 3],
})

df >> d.group_by(f.id) >> d.summarise(m=d.mean(f.value, na_rm=True))
```

returns: [...]

Also, this used to work:

```python
df_mean = (
    df
    >> d.group_by(f.id)
    >> d.summarize(
        value_np_nanmean=np.nanmean(f.value),
        value_np_mean=np.mean(f.value),
    )
)
```

But now it throws these errors, respectively: [...]
It's all because [...]. See: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.mean.html

Let's say we have [...]. In the old days, [...]. Now, with datar v0.15.3 and datar-pandas v0.5.3, [...].
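For reference, the two groupby code paths being contrasted in this thread can be reproduced in plain pandas (a sketch with made-up data, independent of datar):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'A', 'B', 'B'],
    'value': [2.0, np.nan, 3.0, 3.0],
})
grouped = df.groupby('id')['value']

# Fast cythonized path: NaNs are always skipped.
fast = grouped.mean()
print(fast['A'])    # 2.0

# Python-level aggregation keeps the NaN, at the cost of the fast path:
strict = grouped.agg(lambda s: s.mean(skipna=False))
print(strict['A'])  # nan
```

The strict version calls back into Python once per group, which is exactly the optimization loss mentioned earlier in the thread.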
We should be able to support [...] into functions on different types of objects (i.e. Series, SeriesGroupBy). (The [...]) I am just short of time to do that. Pandas 3 will require [...]
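Dispatching on the argument type, as suggested above, could look roughly like this. This is a hypothetical sketch using `functools.singledispatch`; the `mean`/`na_rm` names mirror datar's API, but the implementation is not datar's actual code:

```python
from functools import singledispatch

import numpy as np
import pandas as pd
from pandas.core.groupby import SeriesGroupBy


@singledispatch
def mean(x, na_rm: bool = False):
    # Fallback for plain sequences / ndarrays.
    return np.nanmean(x) if na_rm else np.mean(np.asarray(x, dtype=float))


@mean.register
def _(x: pd.Series, na_rm: bool = False):
    # skipna maps directly onto na_rm for a plain Series.
    return x.mean(skipna=na_rm)


@mean.register
def _(x: SeriesGroupBy, na_rm: bool = False):
    if na_rm:
        return x.mean()  # keep pandas' fast cythonized path
    # Fall back to a per-group Python mean that preserves NaN.
    return x.agg(lambda s: s.mean(skipna=False))
```

With this shape, grouped and ungrouped inputs get the correct `na_rm` semantics while only the strict grouped case pays the slow-path cost.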
@pwwang, yes, man, this is ridiculous, another reason why I hate Pandas.

No problem.

Yes, let's see if it will finally be solved. Thank you!
Please, consider the MWE below:

In `df_mean`, the first observation of `value_np_mean` and `value_datar_mean` should be `NaN` instead of `2`. This is the same issue found in pandas, which discards `NaN`/`None` observations automatically during calculations.

The only workaround I found is this: https://stackoverflow.com/questions/54106112/pandas-groupby-mean-not-ignoring-nans/54106520
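For completeness, one vectorized workaround in plain pandas (not necessarily the approach from the linked answer) keeps the fast groupby path and then re-inserts `NaN` for any group that contained one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'A', 'B', 'B'],
    'value': [2.0, np.nan, 3.0, 3.0],
})

# Which groups contain at least one NaN?
has_na = df['value'].isna().groupby(df['id']).any()

# Fast groupby mean (skips NaN), then blank out the tainted groups:
m = df.groupby('id')['value'].mean().mask(has_na)
print(m)  # A -> NaN, B -> 3.0
```

This avoids a per-group Python lambda entirely, trading it for one extra vectorized pass over the data.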