BUG: 0/frame numeric ops buggy (GH9144) #9308

Merged
merged 1 commit into from
Feb 16, 2015


Garrett-R
Contributor

closes #9144
closes #8445


Here are the results from testing the vbenches related to DataFrames (I also added 6 vbenches).

Invoked with :
--ncalls: 10
--repeats: 10


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_float_div_by_zero                      |   1.5656 |  23.4006 |   0.0669 |
frame_float_floor_by_zero                    |   3.1077 |  24.2786 |   0.1280 |
groupby_frame_nth_none                       |   1.4918 |   2.1483 |   0.6944 |
groupby_frame_nth_any                        |   4.0633 |   5.4162 |   0.7502 |
dataframe_resample_max_string                |   1.3970 |   1.5892 |   0.8791 |
stat_ops_frame_mean_int_axis_1               |   3.4506 |   3.9006 |   0.8846 |
stat_ops_frame_mean_float_axis_1             |   3.6501 |   4.1200 |   0.8859 |
dataframe_resample_max_numpy                 |   1.4134 |   1.5861 |   0.8911 |
frame_reindex_upcast                         |   5.6926 |   6.3748 |   0.8930 |
stat_ops_frame_sum_int_axis_1                |   3.2424 |   3.6136 |   0.8973 |
dataframe_resample_min_numpy                 |   1.4320 |   1.5937 |   0.8985 |
stat_ops_frame_mean_float_axis_0             |   3.5722 |   3.9361 |   0.9075 |
frame_ctor_dtindex_Nanox2                    |   0.8544 |   0.9404 |   0.9086 |
frame_dropna_axis1_any                       |  17.6609 |  19.4360 |   0.9087 |
frame_ctor_dtindex_Hourx1                    |   0.8695 |   0.9554 |   0.9101 |
frame_ctor_dtindex_Secondx1                  |   0.8586 |   0.9384 |   0.9150 |
frame_ctor_dtindex_Microx2                   |   0.8564 |   0.9341 |   0.9168 |
dataframe_resample_mean_string               |   1.8672 |   2.0302 |   0.9197 |
frame_ctor_dtindex_Nanox1                    |   0.8526 |   0.9242 |   0.9225 |
stat_ops_level_frame_sum_multiple            |   4.7174 |   5.0987 |   0.9252 |
frame_ctor_dtindex_Hourx2                    |   0.8645 |   0.9342 |   0.9254 |
frame_xs_row                                 |   0.0242 |   0.0261 |   0.9269 |
frame_ctor_dtindex_BMonthBeginx1             |   1.0908 |   1.1755 |   0.9280 |
stat_ops_frame_sum_float_axis_0              |   3.6003 |   3.8713 |   0.9300 |
eval_frame_mult_python_one_thread            |  12.8115 |  13.7715 |   0.9303 |
stat_ops_frame_sum_int_axis_0                |   3.2616 |   3.4839 |   0.9362 |
eval_frame_mult_python                       |  12.8203 |  13.6336 |   0.9403 |
dataframe_resample_min_string                |   1.4186 |   1.5066 |   0.9416 |
frame_shift_axis0                            |   7.3947 |   7.8502 |   0.9420 |
stat_ops_frame_sum_float_axis_1              |   3.5332 |   3.7481 |   0.9427 |
frame_ctor_dtindex_BYearEndx2                |   1.0797 |   1.1452 |   0.9428 |
frame_fillna_many_columns_pad                |   4.3826 |   4.6457 |   0.9434 |
frame_ctor_dtindex_Secondx2                  |   0.8635 |   0.9153 |   0.9434 |
frame_get_dtype_counts                       |   0.0601 |   0.0635 |   0.9467 |
frame_dropna_axis1_all                       |  34.1749 |  36.0852 |   0.9471 |
frame_float_equal                            |   2.1541 |   2.2727 |   0.9478 |
dataframe_resample_mean_numpy                |   1.8673 |   1.9664 |   0.9496 |
append_frame_single_mixed                    |   1.2900 |   1.3544 |   0.9525 |
eval_frame_add_python                        |  13.0182 |  13.6672 |   0.9525 |
eval_frame_and_python_one_thread             |  23.5453 |  24.5942 |   0.9574 |
frame_drop_dup_inplace                       |   1.8290 |   1.9081 |   0.9585 |
groupby_frame_cython_many_columns            |   2.4763 |   2.5806 |   0.9596 |
frame_get_numeric_data                       |   0.0697 |   0.0725 |   0.9612 |
eval_frame_and_python                        |  23.6922 |  24.6444 |   0.9614 |
frame_mask_bools                             |   8.5507 |   8.8782 |   0.9631 |
frame_shift_axis_1                           |  11.5263 |  11.9162 |   0.9673 |
eval_frame_chained_cmp_python                |  68.9274 |  71.2359 |   0.9676 |
frame_multi_and_st                           |  20.8511 |  21.5107 |   0.9693 |
eval_frame_add_python_one_thread             |  11.9996 |  12.3749 |   0.9697 |
frame_mask_floats                            |   5.7624 |   5.9394 |   0.9702 |
frame_multi_and_no_ne                        |  21.2904 |  21.9065 |   0.9719 |
frame_apply_axis_1                           |  45.9807 |  47.2289 |   0.9736 |
join_dataframe_integer_2key                  |   3.5381 |   3.6330 |   0.9739 |
frame_ctor_nested_dict_int64                 |  48.7418 |  50.0032 |   0.9748 |
eval_frame_chained_cmp_python_one_thread     |  67.2400 |  68.9405 |   0.9753 |
frame_to_csv_mixed                           | 404.9059 | 414.6069 |   0.9766 |
groupby_frame_singlekey_integer              |   1.4465 |   1.4805 |   0.9770 |
frame_dropna_axis1_all_mixed_dtypes          | 137.6919 | 140.7975 |   0.9779 |
frame_insert_500_columns_end                 |  65.0207 |  66.4810 |   0.9780 |
frame_drop_dup_na_inplace                    |   1.6683 |   1.7045 |   0.9788 |
frame_iteritems                              |  17.1694 |  17.5347 |   0.9792 |
stat_ops_level_frame_sum                     |   2.1278 |   2.1730 |   0.9792 |
reindex_frame_level_align                    |   0.6034 |   0.6153 |   0.9807 |
frame_count_level_axis0_mixed_dtypes_multi   |  73.6246 |  75.0608 |   0.9809 |
dataframe_reindex                            |   0.2569 |   0.2615 |   0.9824 |
frame_reindex_axis1                          |  44.7410 |  45.5110 |   0.9831 |
groupby_frame_median                         |   5.1782 |   5.2658 |   0.9834 |
frame_to_csv_date_formatting                 |   6.7534 |   6.8611 |   0.9843 |
frame_fancy_lookup_all                       |  11.3160 |  11.4933 |   0.9846 |
frame_ctor_dtindex_DateOffsetx2              |   0.7447 |   0.7562 |   0.9848 |
join_dataframe_integer_key                   |   1.2000 |   1.2166 |   0.9863 |
frame_reindex_columns                        |   0.2247 |   0.2276 |   0.9872 |
frame_from_records_generator                 |  41.6522 |  42.1751 |   0.9876 |
frame_ctor_dtindex_QuarterBeginx2            |   0.9039 |   0.9145 |   0.9884 |
frame_sort_index_by_columns                  |  24.9708 |  25.2553 |   0.9887 |
join_dataframe_index_single_key_bigger       |   8.9800 |   9.0787 |   0.9891 |
frame_apply_ref_by_name                      |   8.7795 |   8.8757 |   0.9892 |
stat_ops_frame_mean_int_axis_0               |   3.2407 |   3.2742 |   0.9898 |
indexing_dataframe_boolean                   |  82.9152 |  83.7444 |   0.9901 |
frame_add                                    |   3.7720 |   3.8059 |   0.9911 |
frame_ctor_dtindex_BYearBeginx2              |   1.0778 |   1.0870 |   0.9915 |
frame_drop_duplicates_na                     |  14.2417 |  14.3591 |   0.9918 |
frame_ctor_dtindex_CBMonthBeginx1            |   2.2563 |   2.2738 |   0.9923 |
frame_ctor_dtindex_Weekx2                    |   0.7799 |   0.7859 |   0.9924 |
join_dataframe_index_single_key_small        |   8.2365 |   8.2989 |   0.9925 |
frame_dropna_axis1_any_mixed_dtypes          | 124.3344 | 125.2386 |   0.9928 |
frame_ctor_list_of_dict                      |  42.8818 |  43.1860 |   0.9930 |
frame_ctor_dtindex_YearEndx2                 |   0.8680 |   0.8737 |   0.9935 |
append_frame_single_homogenous               |   0.8964 |   0.9022 |   0.9936 |
groupby_frame_apply                          |  22.1934 |  22.3310 |   0.9938 |
frame_mult_no_ne                             |   3.8051 |   3.8274 |   0.9942 |
frame_nonunique_equal                        |   7.3191 |   7.3515 |   0.9956 |
frame_ctor_dtindex_BMonthBeginx2             |   1.0844 |   1.0890 |   0.9958 |
frame_apply_lambda_mean                      |   3.8462 |   3.8609 |   0.9962 |
frame_ctor_nested_dict                       |  46.5123 |  46.6852 |   0.9963 |
frame_count_level_axis1_mixed_dtypes_multi   |  61.2177 |  61.4444 |   0.9963 |
frame_ctor_dtindex_BMonthEndx2               |   0.9260 |   0.9288 |   0.9970 |
frame_html_repr_trunc_mi                     |  22.1488 |  22.2066 |   0.9974 |
frame_multi_and                              |  21.2545 |  21.3085 |   0.9975 |
frame_add_st                                 |   3.7661 |   3.7741 |   0.9979 |
frame_ctor_dtindex_BYearBeginx1              |   1.0904 |   1.0926 |   0.9980 |
join_dataframe_index_multi                   |  13.3128 |  13.3311 |   0.9986 |
frame_getitem_single_column                  |  12.7089 |  12.7029 |   1.0005 |
frame_object_equal                           |   7.3267 |   7.3224 |   1.0006 |
frame_from_records_generator_nrows           |   0.5950 |   0.5940 |   1.0016 |
frame_to_string_floats                       |  15.5732 |  15.5458 |   1.0018 |
frame_add_no_ne                              |   3.8073 |   3.7999 |   1.0019 |
frame_ctor_dtindex_CBMonthBeginx2            |   1.9428 |   1.9389 |   1.0020 |
frame_ctor_dtindex_BusinessDayx2             |   0.8365 |   0.8348 |   1.0020 |
frame_ctor_dtindex_BDayx2                    |   0.8400 |   0.8381 |   1.0023 |
frame_float_mod                              |   2.4775 |   2.4717 |   1.0024 |
frame_reindex_axis0                          |  42.1150 |  42.0129 |   1.0024 |
frame_drop_duplicates                        |  13.3927 |  13.3401 |   1.0039 |
sparse_frame_constructor                     |   3.7919 |   3.7743 |   1.0047 |
frame_iloc_big                               |   0.1020 |   0.1015 |   1.0049 |
frame_interpolate_some_good_infer            |   1.8705 |   1.8610 |   1.0051 |
indexing_dataframe_boolean_st                |  85.3268 |  84.8837 |   1.0052 |
frame_dropna_axis0_any                       |  18.0998 |  18.0034 |   1.0054 |
frame_count_level_axis1_multi                |  57.8269 |  57.5086 |   1.0055 |
indexing_dataframe_boolean_rows              |   0.2319 |   0.2306 |   1.0056 |
indexing_dataframe_boolean_rows_object       |   0.3911 |   0.3880 |   1.0080 |
frame_dropna_axis0_any_mixed_dtypes          | 125.2823 | 124.2290 |   1.0085 |
frame_repr_wide                              |   8.4213 |   8.3428 |   1.0094 |
frame_apply_pass_thru                        |   2.7872 |   2.7612 |   1.0094 |
frame_dtypes                                 |   0.0732 |   0.0725 |   1.0099 |
frame_to_html_mixed                          | 120.9245 | 119.7213 |   1.0101 |
frame_ctor_dtindex_DateOffsetx1              |   0.7530 |   0.7452 |   1.0105 |
frame_ctor_dtindex_Dayx1                     |   0.8844 |   0.8746 |   1.0112 |
groupby_frame_apply_overhead                 |   5.2761 |   5.2173 |   1.0113 |
frame_ctor_dtindex_Millix1                   |   0.9011 |   0.8910 |   1.0113 |
frame_count_level_axis0_multi                |  43.2266 |  42.7371 |   1.0115 |
reindex_frame_level_reindex                  |   0.6016 |   0.5946 |   1.0117 |
frame_ctor_dtindex_BQuarterBeginx1           |   1.1092 |   1.0958 |   1.0122 |
frame_reindex_both_axes                      |  13.9118 |  13.7180 |   1.0141 |
join_dataframe_index_single_key_bigger_sort  |  11.0337 |  10.8757 |   1.0145 |
frame_ctor_dtindex_Weekx1                    |   0.7509 |   0.7395 |   1.0154 |
frame_ctor_dtindex_BMonthEndx1               |   0.9711 |   0.9547 |   1.0172 |
indexing_dataframe_boolean_no_ne             |  87.0329 |  85.4171 |   1.0189 |
frame_fancy_lookup                           |   2.0553 |   2.0135 |   1.0208 |
frame_mult_st                                |   3.8758 |   3.7959 |   1.0210 |
frame_repr_tall                              |  12.1078 |  11.8041 |   1.0257 |
frame_insert_100_columns_begin               |  24.3260 |  23.6877 |   1.0269 |
frame_ctor_dtindex_QuarterEndx2              |   1.0204 |   0.9927 |   1.0279 |
frame_iteritems_cached                       |   0.3542 |   0.3440 |   1.0297 |
frame_ctor_dtindex_Easterx2                  |   0.9328 |   0.9050 |   1.0307 |
frame_interpolate                            |  64.3447 |  62.4263 |   1.0307 |
frame_html_repr_trunc_si                     |  17.4936 |  16.9715 |   1.0308 |
frame_mult                                   |   3.9272 |   3.8050 |   1.0321 |
frame_dropna_axis0_all_mixed_dtypes          | 142.3560 | 137.7486 |   1.0334 |
frame_from_series                            |   0.0670 |   0.0648 |   1.0338 |
frame_apply_np_mean                          |   4.2445 |   4.0938 |   1.0368 |
frame_interpolate_some_good                  |   1.0546 |   1.0159 |   1.0381 |
frame_ctor_dtindex_MonthBeginx1              |   0.9428 |   0.9080 |   1.0383 |
frame_ctor_dtindex_Minutex1                  |   0.8931 |   0.8599 |   1.0386 |
frame_constructor_ndarray                    |   0.0554 |   0.0532 |   1.0408 |
frame_ctor_dtindex_BQuarterEndx2             |   1.0568 |   1.0139 |   1.0423 |
frame_ctor_dtindex_QuarterEndx1              |   1.0460 |   1.0035 |   1.0423 |
frame_getitem_single_column2                 |  12.9310 |  12.3968 |   1.0431 |
frame_ctor_dtindex_MonthBeginx2              |   0.9435 |   0.9042 |   1.0435 |
frame_ctor_dtindex_Microx1                   |   0.9086 |   0.8704 |   1.0439 |
frame_ctor_dtindex_CustomBusinessDayx2       |   0.8865 |   0.8490 |   1.0441 |
frame_ctor_dtindex_CustomBusinessDayx1       |   0.8898 |   0.8517 |   1.0447 |
frame_to_csv2                                |  82.5006 |  78.9100 |   1.0455 |
frame_ctor_dtindex_BQuarterBeginx2           |   1.1443 |   1.0945 |   1.0455 |
frame_ctor_dtindex_BQuarterEndx1             |   1.0723 |   1.0249 |   1.0462 |
frame_ctor_dtindex_Dayx2                     |   0.9089 |   0.8669 |   1.0485 |
frame_ctor_dtindex_CDayx1                    |   0.8952 |   0.8536 |   1.0487 |
frame_ctor_dtindex_Easterx1                  |   0.9384 |   0.8919 |   1.0521 |
frame_ctor_dtindex_YearBeginx1               |   0.8962 |   0.8514 |   1.0526 |
frame_float_div                              |   4.8441 |   4.6001 |   1.0530 |
frame_reindex_both_axes_ix                   |  14.6171 |  13.8707 |   1.0538 |
frame_ctor_dtindex_YearEndx1                 |   0.9240 |   0.8754 |   1.0555 |
frame_ctor_dtindex_QuarterBeginx1            |   0.9843 |   0.9323 |   1.0558 |
frame_ctor_dtindex_BDayx1                    |   0.8610 |   0.8153 |   1.0561 |
frame_ctor_dtindex_MonthEndx2                |   0.9667 |   0.9151 |   1.0564 |
frame_ctor_dtindex_Millix2                   |   0.9122 |   0.8625 |   1.0576 |
frame_to_csv                                 |  96.1645 |  90.8887 |   1.0580 |
frame_ctor_dtindex_CDayx2                    |   0.9058 |   0.8557 |   1.0585 |
frame_boolean_row_select                     |   0.1848 |   0.1741 |   1.0615 |
frame_loc_dups                               |   0.7166 |   0.6750 |   1.0616 |
frame_assign_timeseries_index                |   0.5794 |   0.5451 |   1.0629 |
frame_ctor_dtindex_Minutex2                  |   0.9103 |   0.8556 |   1.0639 |
frame_ctor_dtindex_CBMonthEndx1              |   3.0158 |   2.8304 |   1.0655 |
frame_apply_user_func                        |  57.2865 |  53.6327 |   1.0681 |
frame_ctor_dtindex_CBMonthEndx2              |   3.0658 |   2.8691 |   1.0686 |
frame_ctor_dtindex_BusinessDayx1             |   0.8704 |   0.8101 |   1.0744 |
frame_ctor_dtindex_YearBeginx2               |   0.9055 |   0.8417 |   1.0758 |
frame_ctor_dtindex_MonthEndx1                |   0.9801 |   0.9096 |   1.0775 |
frame_iloc_dups                              |   0.1877 |   0.1736 |   1.0813 |
frame_dropna_axis0_all                       |  31.8504 |  29.3697 |   1.0845 |
frame_fillna_inplace                         |   8.8447 |   8.0231 |   1.1024 |
frame_ctor_dtindex_BYearEndx1                |   1.2137 |   1.0994 |   1.1040 |
frame_isnull                                 |   0.6357 |   0.5477 |   1.1606 |
frame_xs_mi_ix                               |   2.3058 |   1.9489 |   1.1831 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [e33f3bc] : BUG: Fix #9144 #8445  Fix how core.common._fill_zeros handles div and mod by zero
Base   [76195fb] : Merge pull request #9498 from jreback/consist

@Garrett-R
Contributor Author

@jreback, my code is failing because I had assumed that we want 0/0 to be NaN, so implemented that as part of the bugfix, but there's a test that thinks 0/0 should be inf, which seems mistaken.

Should I go ahead and change that test?

@jreback
Contributor

jreback commented Jan 20, 2015

see here: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#id31

also IIRC there was some discussion w.r.t. 0/0 (but might be in another issue). pls see if you can locate it.

@Garrett-R
Contributor Author

Hmmm... I couldn't find any 0/0 discussion (there was a discussion in #3590 about mod 0 though), but in the release notes you linked to for v.0.12.0, it said "Fix modulo and integer division on Series,DataFrames to act similary to float dtypes to return np.nan or np.inf as appropriate" and it gave 4 examples. I think the fix (PR #3600) worked except for the third example as demonstrated here:

>>> p = pd.DataFrame({ 'first' : [4,5,8], 'second' : [0,0,3] })

>>> q = p.astype('float')

>>> print(p%0,'\n', q%0)
   first  second
0    NaN     NaN
1    NaN     NaN
2    NaN     NaN 
    first  second
0    NaN     NaN
1    NaN     NaN
2    NaN     NaN

>>> print(p%p,'\n', q%q)
   first  second
0      0     NaN
1      0     NaN
2      0       0 
    first  second
0      0     NaN
1      0     NaN
2      0       0

>>> print(p/p,'\n', q/q)    # Inconsistent!
   first    second
0      1       inf
1      1       inf
2      1  1.000000 
    first  second
0      1     NaN
1      1     NaN
2      1       1

>>> print(p/0,'\n', q/0)
   first  second
0    inf     inf
1    inf     inf
2    inf     inf 
    first  second
0    inf     inf
1    inf     inf
2    inf     inf

The results are the same for Python 2.

This PR would correct this.
(Edit: Actually, my PR fails to correct the inconsistency that's also present with print(p//p,'\n', q//q) and print(p//0,'\n', q//0), but I'll wait to hear what you think I should do before looking into how to change that.)
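The comparison above can be turned into a small check. This is a minimal sketch, assuming a pandas release that already includes this fix (0.16.0 or later); the variable names `consistent` and `zero_over_zero_is_nan` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same frames as in the session above: an integer frame and its float copy.
p = pd.DataFrame({"first": [4, 5, 8], "second": [0, 0, 3]})
q = p.astype("float")

int_div = p / p      # integer inputs
float_div = q / q    # float inputs

# DataFrame.equals treats NaNs in the same position as equal.
consistent = int_div.equals(float_div)

# 0/0 in the integer frame should be NaN rather than inf.
zero_over_zero_is_nan = bool(np.isnan(int_div.loc[0, "second"]))
```

If both flags are true, integer and float division agree for the 0/0 case.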

I see NumPy has some strange things going on (although it sounds like they're discussing fixing this in numpy/numpy#899), for example,

>>> x = np.int_(0)
>>> x / x
nan
>>> x // x
0

but I haven't ever seen 0/0 being infinite. I think we should change that...
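For reference, the NumPy behavior mentioned above can be reproduced on arrays as well. This is a small sketch, not part of the PR; it only shows that integer floor division cannot return NaN because the result stays an integer array.

```python
import numpy as np

zeros = np.array([0, 0], dtype="int64")

# Silence the divide-by-zero / invalid-value warnings NumPy would emit.
with np.errstate(divide="ignore", invalid="ignore"):
    true_div = zeros / zeros     # float64 result, so 0/0 can be NaN
    floor_div = zeros // zeros   # int64 result, which cannot hold NaN

true_div_is_nan = bool(np.isnan(true_div).all())
floor_div_is_int = floor_div.dtype.kind == "i"
```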

BTW, I'll also write a release note once we decide how to handle this.

@jreback
Contributor

jreback commented Jan 22, 2015

cc @seth-p

IIRC didn't we have some discussions about this?

what are your thoughts here?

@seth-p
Contributor

seth-p commented Jan 22, 2015

I agree that 0/0, 0//0, and 0.0/0.0, should be NaN and not inf.

Is the issue with 0//0 that NaN is actually a float, and so an integer series or DataFrame column cannot contain NaN? So in cases where the result must be an integer (are there such cases?), I'm not sure what we want 0//0 to be -- 0 seems about as good as anything, I suppose. If there are no such cases, then never mind...

[A completely separate issue is that NaN is used both for (a) results that are not numbers (e.g. 0.0/0.0), and (b) missing values (e.g. None). I could see a reasonable argument for representing those differently.]

@jreback
Contributor

jreback commented Jan 22, 2015

missing and undefined values are both represented by nan

I know others have different ways of doing this - bottom line is pandas chose a long time ago to only have a single nan represent both - mainly this is a perf issue (to have multiple missing value types)

@jreback
Contributor

jreback commented Jan 22, 2015

@Garrett-R go ahead and make the change to make 0/0 be nan (both floats and ints)
will be an api change so needs a small example / mention in the API page

@jreback added the labels Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Numeric Operations (Arithmetic, Comparison, and Logical operations), and Compat (pandas objects compatibility with NumPy or Python functions) Jan 22, 2015
@jreback added this to the 0.16.0 milestone Jan 22, 2015
@Garrett-R
Contributor Author

@seth-p, right, integer types don't support NaN, so you have to do one of the following for 0//0:

Given that we're already casting to float, and assuming no one's planning on changing that behavior, we might as well make it NaN. @jreback, cool, I'll get on it!

@jreback changed the title from "Fix gh9144" to "BUG: 0/frame numeric ops buggy (GH9144)" Jan 22, 2015
@jreback
Contributor

jreback commented Jan 22, 2015

@Garrett-R the int/float issue is already solved. Pandas handles this correctly. That said I am thinking about this again.

So now 1/0 -> inf, but 0/0 -> NaN. Isn't this inconsistent?

@Garrett-R
Contributor Author

@jreback, I believe 1/0 -> inf and 0/0 -> NaN is conventional (assuming we're casting to floats). I wanted to try and find it in the IEEE floating point arithmetic document, but I think it's hidden behind an evil paywall.

Anyway, just check out the different languages:


C++

#include <iostream>
int main() {
    float x=0.0;
    std::cout << x/x << std::endl;
    std::cout << 1.0/x << std::endl;
}

outputs

-nan
inf

(the negative is irrelevant)


Julia

julia> 0/0
NaN

julia> 1/0
Inf

R

> 0/0
[1] NaN
> 1/0
[1] Inf

Mathematica

0/0
Indeterminate
1/0
ComplexInfinity
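Plain Python raises ZeroDivisionError for these expressions, so the convention shown in the other languages has to be written out by hand. The helper below is a hypothetical sketch (the name `ieee_divide` is made up for illustration) of the rule that 0/0 is NaN while a nonzero value divided by zero is a signed infinity.

```python
import math


def ieee_divide(numerator: float, denominator: float) -> float:
    """Divide two floats, following the IEEE 754 results for zero divisors."""
    if math.isnan(numerator) or math.isnan(denominator):
        return math.nan
    if denominator != 0:
        return numerator / denominator
    if numerator == 0:
        # 0/0 is indeterminate, so the result is NaN.
        return math.nan
    # Nonzero / zero: infinity whose sign combines both operand signs
    # (copysign distinguishes +0.0 from -0.0).
    sign = math.copysign(1.0, numerator) * math.copysign(1.0, denominator)
    return math.copysign(math.inf, sign)
```

For example, `ieee_divide(1.0, 0.0)` gives positive infinity and `ieee_divide(0.0, 0.0)` gives NaN, matching the C++, Julia, and R sessions.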

@jreback
Contributor

jreback commented Jan 23, 2015

pls add a release note (maybe short example as well), just to make it obvious
put in api change section in 0.16.0

@Garrett-R
Contributor Author

Sorry for the delay. I've added a release note.

I also modified one more thing. For modulus, there's an issue:

In [1]: s = pd.Series([0,1])

In [2]: s % 0
Out[2]: 
0   NaN
1   NaN
dtype: float64

In [3]: 0 % s        # Should give a NaN for 0%0
Out[3]: 
0    0
1    0
dtype: int64

The modified behavior is:

In [1]: s = pd.Series([0,1])

In [2]: s % 0
Out[2]: 
0   NaN
1   NaN
dtype: float64

In [3]: 0 % s
Out[3]: 
0   NaN
1     0
dtype: float64
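The modified modulo behavior can be sketched on plain NumPy arrays. This is an illustration under stated assumptions rather than the code in the PR: `mod_with_nan` is a hypothetical helper, and it uses `np.where` where pandas works with `np.putmask` inside `_fill_zeros`.

```python
import numpy as np


def mod_with_nan(x, y):
    """Elementwise x % y where every zero divisor yields NaN."""
    x, y = np.asarray(x), np.asarray(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        raw = np.mod(x, y)  # integer inputs give 0 for a zero divisor
    # Cast to float so NaN can be stored, then overwrite zero-divisor slots.
    return np.where(y == 0, np.nan, raw.astype("float64"))


forward = mod_with_nan([0, 1], 0)   # corresponds to s % 0
reverse = mod_with_nan(0, [0, 1])   # corresponds to 0 % s
```

Here `forward` is NaN in both positions, while `reverse` is NaN only where the divisor is zero, as in the session above.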

# Floor division must be treated specially since NumPy
# causes 0//0 to be 0, but we want it to be NaN. (PR 9308)
if "floordiv" in name:
    if not isinstance(x, np.ndarray) and x == 0:
Contributor

you can do: np.isscalar(..)

@Garrett-R force-pushed the fix_GH9144 branch 2 times, most recently from e0fb040 to aecae18 Compare February 10, 2015 08:12
if "floordiv" in name: # (PR 9308)
    nan_mask = ((y == 0) & (x == 0)).ravel()
    np.putmask(result, nan_mask, np.nan)

Contributor Author

@jreback, you asked if it's possible to not special case this. I rewrote it to be cleaner and in fact, as it's written now, you can get rid of the if "floordiv" in name: line, and it'll be fine. It turns out that it'll only affect the floordiv operators' results anyway. However, I think we should leave it as is for two reasons: (1) efficiency and (2) I think it makes it more readable, since we really are special casing floordiv to ameliorate NumPy's strange treatment of floor division.

As for efficiency, we don't need to run these two lines for any other operator, for example,

xx = pd.DataFrame(np.random.random((10000, 10000)))
yy = xx / 0

The second line took 1.59s on average, while if you get rid of the if statement above, it takes 2.05s on average.
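The guard discussed in this comment can be sketched in isolation. This is a simplified, hypothetical stand-in for the relevant part of `_fill_zeros` (the function name `fill_floordiv_nan` and its signature are assumptions), showing that the mask is built only when the operator name contains "floordiv".

```python
import numpy as np


def fill_floordiv_nan(result, x, y, name):
    """Set 0 // 0 positions to NaN, but only for floor-division operators."""
    if "floordiv" not in name:
        return result  # no mask is built for any other operator
    shape = result.shape
    flat = result.ravel().astype("float64")
    nan_mask = np.broadcast_to((y == 0) & (x == 0), shape).ravel()
    np.putmask(flat, nan_mask, np.nan)
    return flat.reshape(shape)


x = np.array([[0, 1], [0, 4]])
y = np.array([[0, 0], [2, 2]])
with np.errstate(divide="ignore", invalid="ignore"):
    raw = x // y  # NumPy returns 0 wherever the divisor is 0

filled = fill_floordiv_nan(raw, x, y, "__floordiv__")
untouched = fill_floordiv_nan(raw, x, y, "__truediv__")
```

Only the 0 // 0 slot becomes NaN here; the 1 // 0 slot is left for the separate infinity handling, and the non-floordiv call returns its input unchanged.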

Contributor

ok

hmm 2s seems quite long for this type of operation
can u profile and see what's up?

Contributor Author

I guess it's taking so long because my DataFrame holds 1e8 floats (~0.8GB).

Profiling the _fill_zeros method (without the if statement), it spends 0.36s on the nan_mask = line (and nan_mask uses 94MB). It spends 0.06s on the following line.

Contributor

hmm, maybe try taking out the .ravel(). I think they are making copies here (and prob are not necessary).

Contributor Author

@jreback, I just attempted to refactor _fill_zeros to remove all the raveling and reshaping (which aren't necessary). It actually slowed down my test code (shown below) from 42s to 57s.

According to the NumPy docs for ravel, a "copy is made only if needed". In our case a copy wasn't being made. And the np.putmask operation seemed to take much longer for the original array than the raveled one: the line np.putmask(result, mask, fill) (which I didn't modify) had its average runtime increase from 87ms to 548ms.

I'm also adding a comment to explain why we ravel and reshape.
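Whether `ravel` copies can be checked directly with `np.shares_memory`. The sketch below is independent of pandas and only illustrates the "copy is made only if needed" rule on a small array.

```python
import numpy as np

contiguous = np.arange(6).reshape(2, 3)

flat_view = contiguous.ravel()     # C-contiguous input: a view, no copy
flat_copy = contiguous.T.ravel()   # transposed input: a copy is required

view_shares = np.shares_memory(contiguous, flat_view)
copy_shares = np.shares_memory(contiguous, flat_copy)
```

`view_shares` is true and `copy_shares` is false, so the cost of `ravel` depends on the memory layout of the array it is called on.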


Test code:

def f():
    xx % 0
    xx / 0
    xx // 0
    yy % 0
    yy / 0
    yy // 0

xx = pd.DataFrame(np.random.random((10000, 10000)))
yy = pd.DataFrame(np.random.random_integers(0,100, size=(10000,10000)))
for _ in range(5):
    f()

Contributor

profiling is the best as sometimes intuition is wrong :)

you might want to try reshaping explicitly first (just a guess) or taking a view

I think it's forcing a copy when one is not necessary

this op should be pretty fast - it's the copies that are getting in the way

Contributor Author

Yeah, good call on the profiling!

Raveling and reshaping is helping. They're not causing any copies to be made which I confirmed by memory profiling.

But actually, there was one thing causing an unnecessary copy to be made (independent of whether we ravel or not) which was using .astype('float64'). I've changed it to .astype('float', copy=False). The reason for changing 'float64' to 'float' is that I don't see any reason to force float64; isn't it better to allow the computer architecture to decide between float32 and float64?

After this change, I tried time profiling again to verify that the raveling/reshaping is still helping.
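The effect of `copy=False` can be verified on small arrays. This is a sketch outside of pandas; note that in NumPy the dtype string 'float' names the same dtype as 'float64', so the two spellings do not differ by platform.

```python
import numpy as np

floats = np.zeros(3, dtype="float64")
ints = np.zeros(3, dtype="int64")

# Already float64: copy=False hands back the very same array object.
same_object = floats.astype("float64", copy=False) is floats

# The default (copy=True) always allocates a new array.
default_copies = floats.astype("float64") is not floats

# A real dtype conversion needs new memory even with copy=False.
int_shares = np.shares_memory(ints, ints.astype("float64", copy=False))

# 'float' is the Python float type, which NumPy maps to float64.
float_is_float64 = np.dtype("float") == np.dtype("float64")
```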

Contributor Author

BTW, are you surprised that a 1GB data frame takes 1.5s to have a division operator performed on it and have all its values updated? That seemed pretty reasonable to me...

Contributor

no we always use float64 except if the user is explicit in not using it
so you cannot change what it is
and astype usually copies btw and not sure it's actually necessary

Contributor

So my comparison is direct numpy, e.g. what is the overhead that pandas is adding. Here is a 100MM frame (I constructed same as you). Note that using .values like this works (with no copy) only if it's only a single dtype (as is the case here). So yes I think a 3x slowdown is too much. (granted we are doing more/better ops, but should be only a limited overhead).

In [65]: xx.size
Out[65]: 100000000

In [66]: xx.size/1e6
Out[66]: 100.0

In [67]: %timeit xx.values*2
1 loops, best of 3: 521 ms per loop

In [68]: %timeit xx.values/0
1 loops, best of 3: 613 ms per loop

@@ -131,6 +131,37 @@ methods (:issue:`9088`).
dtype: int64


- During division involving a ``Series`` or ``DataFrame``, `0/0` and `0//0` now give `np.nan` instead of `np.inf`. (:issue:`9144`)
Contributor Author

@jreback, I tried summarizing this more as suggested. Do you think it's fine now?

Contributor

this is good

@Garrett-R force-pushed the fix_GH9144 branch 2 times, most recently from 16529c3 to 19ecfc7 Compare February 10, 2015 23:36
@jreback
Contributor

jreback commented Feb 10, 2015

this should also fix #8445 yes? (if so, pls add that as a test as well), if not, see if it's simple to extend to fix.

@Garrett-R force-pushed the fix_GH9144 branch 3 times, most recently from 5ba0f70 to 885407b Compare February 11, 2015 06:46
@Garrett-R
Contributor Author

Thanks for explaining the float issue. I undid that change as suggested.


Yes, it fixes #8445. I've added that to the commit message.


You're not gonna like this but I added more special casing. It was to correct some new unexpected behavior I found: Series([-1,1]) / 0 gave signed infinities, while Series([-1,1]) // 0 gave only positive infinities. I believe there's no way to get around the special casing. I've added the appropriate test.
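The sign handling described here can be sketched for the simple case of an integer array floor-divided by the scalar zero. The helper `floordiv_by_zero` is hypothetical and much narrower than the special casing in `_fill_zeros`; it only illustrates the intended results.

```python
import numpy as np


def floordiv_by_zero(x):
    """Return x // 0 using the same sign convention as x / 0."""
    x = np.asarray(x)
    result = np.full(x.shape, np.inf)
    np.putmask(result, x < 0, -np.inf)   # negative numerators give -inf
    np.putmask(result, x == 0, np.nan)   # 0 // 0 gives NaN
    return result


signed = floordiv_by_zero([-1, 1, 0])
```

With this convention `[-1, 1, 0] // 0` yields negative infinity, positive infinity, and NaN, which matches what true division by zero returns.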


As for the slow division, that existed before this PR. But good call on checking into that. I found that the _fill_zeros method is unnecessary if the result is already a float since then everything will already be good. Therefore, this PR improves the truediv operation (or any Series/DataFrame operation resulting in a float) runtime by up to a factor of 10.

old behavior

In [1]: xx = pd.DataFrame(np.random.random((10000, 10000)))

In [2]: %timeit xx / 0
1 loops, best of 3: 1.59 s per loop

new behavior

In [1]: xx = pd.DataFrame(np.random.random((10000, 10000)))

In [2]: %timeit xx / 0
10 loops, best of 3: 165 ms per loop

  shape = result.shape
- result = result.ravel().astype('float64')
+ result = result.ravel().astype('float64', copy=False)
Contributor

you can also do .view('float64') which IIRC is more idiomatic (but is basically the same)

Contributor Author

I think view is slightly different since it "can cause a reinterpretation of the bytes of memory".

In [1]: x = np.array([1,2])

In [2]: x.astype('float64')
Out[2]: array([ 1.,  2.])

In [3]: x.view('float64')
Out[3]: array([  4.94065646e-324,   9.88131292e-324])

@jreback
Contributor

jreback commented Feb 11, 2015

pls add a release note for #8445 (you can simply add it on where you have #9144)

ok on the special casing, sometimes it is unavoidable.

can you add a couple of vbenches (though use a smaller matrix, maybe 1000x1000), the proportion should still be the same; do for float/int results (for floor/div), e.g. add 4 or so (or more if you think are necessary, e.g. maybe for modulo too). name them consistently and add somewhere in the vbench suite.
help on vbenches here: https://github.com/pydata/pandas/wiki/Performance-Testing.

These are mainly to prevent performance regressions if/when things are changed in the future. Post the results in the top of the PR as well.

thxs

@Garrett-R
Contributor Author

@jreback, how's it looking now?

Also, I've edited my first comment in the PR to include the results of the vbenches (including the ones I added). Is this what you meant by "the top of the PR"?

@jreback
Contributor

jreback commented Feb 15, 2015

@Garrett-R only show the relevant vbenches and compare vs current master

e.g. bring your master up to date, then
./test_perf.sh -b master -t HEAD -r frame_

the top of the PR is correct (just replace it with the revised results).

@@ -131,6 +131,37 @@ methods (:issue:`9088`).
dtype: int64


- During division involving a ``Series`` or ``DataFrame``, `0/0` and `0//0` now give `np.nan` instead of `np.inf`. (:issue:`9144`, :issue:`8445`)
Contributor

specify these with 2 backticks on each side (like you did for Series/DataFrame); this will highlight it as code

@jreback
Contributor

jreback commented Feb 15, 2015

@Garrett-R minor doc changes, and revise the vbench, otherwise looks good to go.

@Garrett-R
Contributor Author

@jreback, I've made the suggested doc changes and also revised the vbenches.

jreback added a commit that referenced this pull request Feb 16, 2015
BUG: 0/frame numeric ops buggy (GH9144)
@jreback merged commit 0c95fef into pandas-dev:master Feb 16, 2015
@jreback
Contributor

jreback commented Feb 16, 2015

@Garrett-R thanks for this! nice work!

@Garrett-R deleted the fix_GH9144 branch February 16, 2015 21:00
# GH 6178
if np.isinf(fill):
    np.putmask(result,(signs<0) & mask, -fill)
if "floordiv" in name: # (PR 9308)
Member

Working on #19322 it looks like the problem would be solved by making this condition include other "div" operations. Was there a specific reason to only include floordiv here?

Contributor Author

I'm afraid I don't remember why only floordiv is here, but I do tend to comment non-obvious, intentional choices, so I'm guessing it was not intended that only floordiv be here.

Labels
Compat (pandas objects compatibility with NumPy or Python functions), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Numeric Operations (Arithmetic, Comparison, and Logical operations)
4 participants