BUG: 0/frame numeric ops buggy (GH9144) #9308
Conversation
see here: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#id31 also IIRC there was some discussion w.r.t. 0/0 (but might be in another issue). pls see if you can locate it.
Hmmm... I couldn't find any 0/0 discussion (there was a discussion in #3590 about mod 0 though), but in the release notes you linked to for v0.12.0, it said "Fix modulo and integer division on Series,DataFrames to act similarly to float dtypes to return np.nan or np.inf as appropriate" and it gave 4 examples. I think the fix (PR #3600) worked except for the third example as demonstrated here:
The results are the same for Python 2. This PR would correct this. I see NumPy has some strange things going on (although it sounds like they're discussing fixing this in numpy/numpy#899), for example,
but I haven't ever seen 0/0 being infinite. I think we should change that... BTW, I'll also write a release note once we decide how to handle this.
cc @seth-p IIRC didn't we have a discussion about this? what are your thoughts here?
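For context, a quick sketch of the NumPy inconsistency under discussion (this is plain NumPy, not the pandas code being changed): integer `0 // 0` silently yields 0, while float `0.0 / 0.0` yields nan.

```python
import numpy as np

# Suppress the RuntimeWarnings NumPy emits for zero division.
with np.errstate(divide="ignore", invalid="ignore"):
    int_result = np.array([0]) // np.array([0])       # integer floordiv
    float_result = np.array([0.0]) / np.array([0.0])  # float truediv

print(int_result)    # [0]
print(float_result)  # [nan]
```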
I agree that Is the issue with [A completely separate issue is that
missing and undefined values are both represented by nan. I know others have different ways of doing this - bottom line is pandas chose a long time ago to only have a single nan represent both - mainly this is a perf issue (having multiple missing value types)
@Garrett-R go ahead and make the change to make 0/0 be nan (both floats and ints)
@seth-p, right, integer types don't support NaN.
Given that we're already casting to float, and assuming no one's planning on changing that behavior, we might as well make it NaN.
@Garrett-R the int/float issue is already solved. Pandas handles this correctly. That said I am thinking about this again. So now
@jreback, I believe Anyway, just check out the different languages: C++
outputs
(the negative is irrelevant) Julia
R
Mathematica
pls add a release note (maybe a short example as well), just to make it obvious
Sorry for the delay. I've added a release note. I also modified one more thing. For modulus, there's an issue:
The modified behavior is:
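As a reference point for the modulus discussion above, here is the behavior that ended up shipping (checked against current pandas, not the pre-PR code): modulo by zero returns NaN throughout.

```python
import numpy as np
import pandas as pd

# Modulo by zero on a Series returns NaN for every element,
# regardless of the sign or value of the numerator.
s = pd.Series([1.0, 0.0, -1.0])
r = s % 0
print(r)  # NaN, NaN, NaN
```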
# Floor division must be treated specially since NumPy
# causes 0//0 to be 0, but we want it to be NaN. (PR 9308)
if "floordiv" in name:
    if not isinstance(x, np.ndarray) and x == 0:
you can do: np.isscalar(..)
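To illustrate the suggestion: `np.isscalar` distinguishes genuine scalars (including NumPy scalar types) from arrays, which is why it was proposed over `not isinstance(x, np.ndarray)`.

```python
import numpy as np

print(np.isscalar(0))              # True  -- Python int
print(np.isscalar(np.float64(0)))  # True  -- NumPy scalar
print(np.isscalar(np.array(0)))    # False -- 0-d array, not a scalar
```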
Force-pushed from e0fb040 to aecae18
if "floordiv" in name:  # (PR 9308)
    nan_mask = ((y == 0) & (x == 0)).ravel()
    np.putmask(result, nan_mask, np.nan)
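A toy, self-contained version of what this masking accomplishes (assumed example arrays, not the actual pandas internals): NumPy's integer `0 // 0` yields 0, so the result is cast to float and the `0 // 0` slots are overwritten with NaN.

```python
import numpy as np

x = np.array([0, 1, 4])
y = np.array([0, 0, 2])
with np.errstate(divide="ignore"):
    result = (x // y).astype("float64")  # NumPy gives [0., 0., 2.]
# Mark the 0 // 0 positions as NaN, as the PR does.
nan_mask = (y == 0) & (x == 0)
np.putmask(result, nan_mask, np.nan)
print(result)  # [nan  0.  2.]
```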
@jreback, you asked if it's possible to not special-case this. I rewrote it to be cleaner, and in fact, as it's written now, you can get rid of the if "floordiv" in name:
line and it'll be fine; it turns out it only affects the floordiv operators' results anyway. However, I think we should leave it as is for two reasons: (1) efficiency, and (2) I think it makes it more readable, since we really are special-casing floordiv to ameliorate NumPy's strange treatment of floor division.
As for efficiency, we don't need to run these two lines for any other operator, for example,
xx = pd.DataFrame(np.random.random((10000, 10000)))
yy = xx / 0
The second line took 1.59s on average, while without the if statement it took 2.05s on average.
ok
hmm 2s seems quite long for this type of operation
can u profile and see what's up?
I guess it's taking so long because my DataFrame is 1e8 floats (~1GB).
Profiling the _fill_zeros method (without the if statement), it spends 0.36s on the nan_mask = line (and nan_mask uses 94MB), and 0.06s on the following line.
hmm, maybe try taking out the .ravel(). I think they are making copies here (and prob are not necessary).
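For context on whether `ravel()` copies: for a C-contiguous array it returns a view, and a copy is made only when the memory layout forces one (e.g. raveling a transposed array). A quick check:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
print(np.shares_memory(a, a.ravel()))    # True  -- a view, no copy
print(np.shares_memory(a, a.T.ravel()))  # False -- layout forces a copy
```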
@jreback, I just attempted to refactor _fill_zeros to remove all the raveling and reshaping (which aren't necessary). It actually slowed down my test code (shown below) from 42s to 57s.
According to the NumPy docs for ravel, a "copy is made only if needed". In our case a copy wasn't being made. And the np.putmask operation seemed to take much longer on the original array than on the raveled one: the line np.putmask(result, mask, fill) (which I didn't modify) had its average runtime increase from 87ms to 548ms.
I'm also adding a comment to explain why we ravel and reshape.
Test code:
def f():
xx % 0
xx / 0
xx // 0
yy % 0
yy / 0
yy // 0
xx = pd.DataFrame(np.random.random((10000, 10000)))
yy = pd.DataFrame(np.random.random_integers(0,100, size=(10000,10000)))
for _ in range(5):
f()
profiling is the best as sometimes intuition is wrong :)
you might want to try reshaping explicitly first (just a guess) or taking a view
I think it's forcing a copy when one is not necessary
this op should be pretty fast - it's the copies that are getting in the way
Yeah, good call on the profiling!
Raveling and reshaping are helping; they're not causing any copies to be made, which I confirmed by memory profiling.
But actually, there was one thing causing an unnecessary copy to be made (independent of whether we ravel or not), which was using .astype('float64'). I've changed it to .astype('float', copy=False). The reason for changing 'float64' to 'float' is that I don't see any reason to force float64; isn't it better to let the computer architecture decide between float32 and float64?
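The copy question discussed here is easy to check directly: `astype` copies by default even when the dtype already matches, and `copy=False` returns the original array untouched in that case.

```python
import numpy as np

a = np.zeros(4, dtype="float64")
print(a.astype("float64") is a)              # False -- a fresh copy
print(a.astype("float64", copy=False) is a)  # True  -- same array back
```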
After this change, I tried time profiling again to verify that the raveling/reshaping is still helping.
BTW, are you surprised that a 1GB data frame takes 1.5s to have a division operator performed on it and all its values updated? That seemed pretty reasonable to me...
no, we always use float64 unless the user explicitly opts out
so you cannot change what it is
and astype usually copies btw, and I'm not sure it's actually necessary
So my comparison is direct numpy, e.g. what is the overhead that pandas is adding. Here is a 100MM frame (I constructed it the same as you). Note that using .values like this works (with no copy) only if it's a single dtype (as is the case here). So yes, I think a 3x slowdown is too much (granted we are doing more/better ops, but there should be only a limited overhead).
In [65]: xx.size
Out[65]: 100000000
In [66]: xx.size/1e6
Out[66]: 100.0
In [67]: %timeit xx.values*2
1 loops, best of 3: 521 ms per loop
In [68]: %timeit xx.values/0
1 loops, best of 3: 613 ms per loop
Force-pushed from aecae18 to 940a98b
@@ -131,6 +131,37 @@ methods (:issue:`9088`).
    dtype: int64

- During division involving a ``Series`` or ``DataFrame``, `0/0` and `0//0` now give `np.nan` instead of `np.inf`. (:issue:`9144`)
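The behavior this release note describes can be seen directly in current pandas: `0/0` and `0//0` give NaN, while division of a nonzero numerator by zero still gives infinity.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0])
print(s / 0)   # [nan, inf]
print(s // 0)  # [nan, inf]
```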
@jreback, I tried summarizing this more as suggested. Do you think it's fine now?
this is good
Force-pushed from 16529c3 to 19ecfc7
this should also fix #8445, yes? (if so, pls add that as a test as well); if not, see if it's simple to extend to fix.
Force-pushed from 5ba0f70 to 885407b
Thanks for explaining the float issue. I undid that change as suggested. Yes, it fixes #8445; I've added that to the commit message.
You're not gonna like this, but I added more special casing. It was to correct some new unexpected behavior I found.
As for the slow division, that was existent before this PR, but good call on checking into that. I found:
old behavior
In [1]: xx = pd.DataFrame(np.random.random((10000, 10000)))
In [2]: %timeit xx / 0
new behavior
In [1]: xx = pd.DataFrame(np.random.random((10000, 10000)))
In [2]: %timeit xx / 0
  shape = result.shape
- result = result.ravel().astype('float64')
+ result = result.ravel().astype('float64', copy=False)
you can also do .view('float64'), which IIRC is more idiomatic (but is basically the same)
I think view is slightly different since it "can cause a reinterpretation of the bytes of memory":
In [1]: x = np.array([1,2])
In [2]: x.astype('float64')
Out[2]: array([ 1., 2.])
In [3]: x.view('float64')
Out[3]: array([ 4.94065646e-324, 9.88131292e-324])
pls add a release note for #8445 (you can simply add it on where you have #9144). ok on the special casing, sometimes it is unavoidable. can you add a couple of vbenches (though use a smaller matrix, maybe 1000x1000; the proportion should still be the same)? do for float/int results (for floor/div), e.g. add 4 or so (or more if you think necessary, e.g. maybe for modulo too). name them consistently and add somewhere in the vbench suite. These are mainly to prevent performance regressions if/when things are changed in the future. Post the results in the top of the PR as well. thxs
Force-pushed from 885407b to 38856f3
@jreback, how's it looking now? Also, I've edited my first comment in the PR to include the results of the vbenches (including the ones I added). Is this what you meant by "the top of the PR"?
@Garrett-R only show the relevant vbenches and compare vs current master, e.g. bring your master up to date; then the top of the PR is correct (just replace it with the revised results).
@@ -131,6 +131,37 @@ methods (:issue:`9088`).
    dtype: int64

- During division involving a ``Series`` or ``DataFrame``, `0/0` and `0//0` now give `np.nan` instead of `np.inf`. (:issue:`9144`, :issue:`8445`)
specify these with 2 backticks on each side (like you did for Series/DataFrame); this will highlight it as code
@Garrett-R minor doc changes, and revise the vbench, otherwise looks good to go.
…ros handles div and mod by zero
Force-pushed from 38856f3 to 85342ee
@jreback, I've made the suggested doc changes and also revised the vbenches.
BUG: 0/frame numeric ops buggy (GH9144)
@Garrett-R thanks for this! nice work!
# GH 6178
if np.isinf(fill):
    np.putmask(result, (signs < 0) & mask, -fill)
if "floordiv" in name:  # (PR 9308)
Working on #19322 it looks like the problem would be solved by making this condition include other "div" operations. Was there a specific reason to only include floordiv here?
I'm afraid I don't remember why only floordiv is here, but I do tend to comment non-obvious, intentional choices, so I'm guessing it was not intended that only floordiv be here.
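For what it's worth, in current pandas the zero-division masking does apply to true division and floor division alike, preserving the sign of the numerator, which is the direction #19322 was heading:

```python
import numpy as np
import pandas as pd

s = pd.Series([-1.0, 0.0, 1.0])
print(s / 0)   # [-inf, nan, inf]
print(s // 0)  # [-inf, nan, inf]
```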
closes #9144
closes #8445
Here are the results from testing the vbenches related to DataFrames (I also added 6 vbenches).