implement bits of numpy_helper in cython where possible #19450

Merged 1 commit into pandas-dev:master from the unhelper branch on Jan 31, 2018


jbrockmendel
Member

As with the transition to tslibs.np_datetime, this implements pieces of numpy_helper.h directly in Cython, in util.pxd. The generated C should be equivalent to the existing versions, but that is worth double-checking.

This also removes one dependency from setup.py that was missed in #19415; it should have been deleted there.

@codecov

codecov bot commented Jan 30, 2018

Codecov Report

Merging #19450 into master will decrease coverage by 0.02%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #19450      +/-   ##
==========================================
- Coverage   91.62%    91.6%   -0.03%     
==========================================
  Files         150      150              
  Lines       48724    48724              
==========================================
- Hits        44644    44632      -12     
- Misses       4080     4092      +12

Flag         Coverage Δ
#multiple    89.97% <ø> (-0.03%) ⬇️
#single      41.74% <ø> (ø) ⬆️

Impacted Files                   Coverage Δ
pandas/plotting/_converter.py    65.22% <0%> (-1.74%) ⬇️
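
The headline percentages can be re-derived from the Hits and Lines rows of the coverage diff. The sketch below only does that arithmetic; the helper name `coverage_pct` is made up for illustration, and how Codecov rounds the figures it displays is not something this example assumes.

```python
def coverage_pct(hits, lines):
    """Line coverage as a percentage of all tracked lines."""
    return 100.0 * hits / lines

# Hits and Lines values copied from the coverage diff in this report.
before = coverage_pct(44644, 48724)
after = coverage_pct(44632, 48724)

# 12 fewer hits out of 48724 lines is a change of about 0.025 points.
print(f"{before:.3f}% -> {after:.3f}% (change {after - before:+.3f} points)")
```

Since the line count is unchanged, the whole difference comes from the 12 lines that moved from Hits to Misses.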

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4618a09...7034c8e.

jbrockmendel changed the title from "implement of numpy_helper in cython where possible" to "implement bits of numpy_helper in cython where possible" Jan 30, 2018
@jreback
Contributor

jreback commented Jan 30, 2018

Looks OK. Needs a perf test; please run with affinity!

jreback added the Performance (memory or execution speed performance) and Clean labels Jan 30, 2018
@jbrockmendel
Member Author

Will do. Might take a while. In the interim, I'm hopeful that #19301 may be good to go, possibly also #19336.

@jbrockmendel
Member Author

One run without affinity, three with. I recently rebooted the machine, which might explain the uncharacteristically stable results.

asv continuous -E virtualenv -f 1.1 master HEAD -b timeseries
[...]
    before     after       ratio
  [238499ab] [7034c8ef]
+    2.32ms     3.41ms      1.47  timeseries.ResampleSeries.time_resample('period', '5min', 'ohlc')
+    2.40ms     3.28ms      1.36  timeseries.ToDatetimeCache.time_dup_seconds_and_unit(False)
-  128.73ms   116.98ms      0.91  timeseries.DatetimeIndex.time_to_time('tz_aware')
-    2.59ms     2.35ms      0.91  timeseries.ResampleSeries.time_resample('datetime', '5min', 'ohlc')
-    6.95ms     3.58ms      0.51  timeseries.ToDatetimeCache.time_dup_seconds_and_unit(True)

taskset 4 asv continuous -E virtualenv -f 1.1 master HEAD -b timeseries
[...]
    before     after       ratio
  [238499ab] [7034c8ef]
+    2.38ms     2.67ms      1.12  timeseries.ResampleSeries.time_resample('period', '5min', 'ohlc')
-    4.38μs     3.94μs      0.90  timeseries.DatetimeIndex.time_get('dst')
-  498.15ms   440.07ms      0.88  timeseries.ToDatetimeFormat.time_exact
-    7.55ms     6.48ms      0.86  timeseries.Factorize.time_factorize('Asia/Tokyo')

taskset 4 asv continuous -E virtualenv -f 1.1 master HEAD -b timeseries
[...]
    before     after       ratio
  [238499ab] [7034c8ef]
+    2.05ms     2.31ms      1.13  timeseries.ResampleSeries.time_resample('datetime', '1D', 'ohlc')
-    1.86ms     1.68ms      0.91  timeseries.DatetimeIndex.time_add_timedelta('repeated')

taskset 4 asv continuous -E virtualenv -f 1.1 master HEAD -b timeseries
[...]
    before     after       ratio
  [238499ab] [7034c8ef]
+    2.40ms     2.66ms      1.11  timeseries.ResampleSeries.time_resample('datetime', '5min', 'ohlc')
-   17.84μs    16.14μs      0.90  timeseries.AsOf.time_asof_single('Series')
-  137.94ms   123.97ms      0.90  timeseries.DatetimeIndex.time_to_pydatetime('tz_aware')
-   11.96ms    10.71ms      0.90  timeseries.Iteration.time_iter_preexit(<function period_range at 0x7f37a14670c8>)
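
For reference, the `taskset 4` prefix on the last three runs pins the process with an affinity bitmask. taskset reads a bare mask as hexadecimal, so `4` is 0x4, which selects CPU 2 only. A minimal Python sketch of that decoding follows; the helper name `cpus_from_mask` is made up for illustration and is not part of asv or pandas.

```python
def cpus_from_mask(mask):
    """Return the CPU indices selected by an affinity bitmask (bit i -> CPU i)."""
    return {i for i in range(mask.bit_length()) if (mask >> i) & 1}

# A bare taskset mask argument is hexadecimal, so "4" is parsed as 0x4.
mask = int("4", 16)
print(cpus_from_mask(mask))
```

On Linux, `os.sched_setaffinity(0, cpus_from_mask(0x4))` should apply the same pinning from inside a Python process.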

@jreback
Contributor

jreback commented Jan 31, 2018

Please run a lot more benchmarks; this change affects practically everything. I don't expect anything, but we need to check.

@jbrockmendel
Member Author

[...]
    before     after       ratio
  [238499ab] [7034c8ef]
+    1.47ms     2.32ms      1.58  sparse.ArithmeticBlock.time_intersect(nan)
+    1.11ms     1.49ms      1.34  inference.NumericInferOps.time_subtract(<type 'numpy.int64'>)
+    2.10ms     2.77ms      1.32  gil.ParallelRolling.time_rolling('rolling_max')
+    9.46μs    12.00μs      1.27  multiindex_object.GetLoc.time_med_get_loc
+    6.33ms     7.94ms      1.25  gil.ParallelReadCSV.time_read_csv('object')
+   52.96ms    65.09ms      1.23  gil.ParallelGroupbyMethods.time_parallel(4, 'max')
+   17.04μs    20.85μs      1.22  offset.OffestDatetimeArithmetic.time_subtract_10(<BusinessYearEnd: month=12>)
+    8.98μs    10.98μs      1.22  offset.OffestDatetimeArithmetic.time_apply(<BusinessMonthBegin>)
+   24.36ms    29.60ms      1.21  gil.ParallelGroupbyMethods.time_parallel(2, 'max')
+    2.02ms     2.45ms      1.21  stat_ops.Correlation.time_corr('pearson')
+   22.94μs    27.18μs      1.18  offset.OffestDatetimeArithmetic.time_subtract_10(<Day>)
+  808.74μs   952.88μs      1.18  inference.NumericInferOps.time_subtract(<type 'numpy.int32'>)
+    9.21μs    10.81μs      1.17  offset.OffestDatetimeArithmetic.time_apply(<BusinessYearEnd: month=12>)
+   10.13μs    11.67μs      1.15  offset.OffestDatetimeArithmetic.time_apply(<SemiMonthBegin: day_of_month=15>)
+    1.76ms     2.03ms      1.15  categoricals.Concat.time_concat
+    6.81μs     7.73μs      1.14  timestamp.TimestampProperties.time_is_quarter_end(None, 'B')
+    2.06ms     2.33ms      1.13  timeseries.ResampleSeries.time_resample('datetime', '1D', 'ohlc')
+     2.41s      2.73s      1.13  groupby.FirstLast.time_groupby_last('object')
+   11.65μs    13.06μs      1.12  offset.OffestDatetimeArithmetic.time_apply_np_dt64(<BusinessDay>)
+   17.12μs    19.16μs      1.12  period.PeriodUnaryMethods.time_asfreq('min')
+   63.06ms    70.42ms      1.12  timeseries.ToDatetimeISO8601.time_iso8601_tz_spaceformat
+   10.90μs    12.17μs      1.12  offset.OffestDatetimeArithmetic.time_apply_np_dt64(<DateOffset: kwds={'months': 2, 'days': 2}>)
+   69.86ms    77.84ms      1.11  gil.ParallelGroupbyMethods.time_parallel(4, 'var')
+    5.59ms     6.22ms      1.11  timeseries.Factorize.time_factorize('Asia/Tokyo')
+   15.32μs    17.02μs      1.11  offset.OffestDatetimeArithmetic.time_add_10(<SemiMonthEnd: day_of_month=15>)
+  837.40ns   928.48ns      1.11  period.PeriodProperties.time_property('M', 'quarter')
+   16.62μs    18.42μs      1.11  offset.OffestDatetimeArithmetic.time_subtract_10(<BusinessDay>)
+   11.01μs    12.20μs      1.11  offset.OffestDatetimeArithmetic.time_apply_np_dt64(<BusinessMonthEnd>)
+   58.41ms    64.71ms      1.11  rolling.Methods.time_rolling('Series', 1000, 'float', 'median')
+    9.21ms    10.20ms      1.11  stat_ops.FrameOps.time_op('skew', 'int', 0, False)
+  125.17ms   138.37ms      1.11  gil.ParallelGroupbyMethods.time_parallel(8, 'count')
+    5.92ms     6.53ms      1.10  groupby.Float32.time_sum
+  763.14μs   840.71μs      1.10  inference.NumericInferOps.time_subtract(<type 'numpy.uint32'>)
+   21.81ms    24.00ms      1.10  offset.OffsetDatetimeIndexArithmetic.time_add_offset(<CustomBusinessDay>)
+    4.74ms     5.22ms      1.10  stat_ops.SeriesMultiIndexOps.time_op(0, 'sem')
-    3.48μs     3.15μs      0.91  timedelta.TimedeltaConstructor.time_from_int
-   10.60μs     9.59μs      0.91  offset.OffestDatetimeArithmetic.time_apply(<QuarterBegin: startingMonth=3>)
-  119.58ms   107.69ms      0.90  gil.ParallelGroupbyMethods.time_parallel(8, 'min')
-   12.22μs    10.97μs      0.90  offset.OffestDatetimeArithmetic.time_apply_np_dt64(<YearBegin: month=1>)
-   30.60ms    27.43ms      0.90  gil.ParallelFactorize.time_parallel(4)
-   11.00μs     9.83μs      0.89  timedelta.TimedeltaConstructor.time_from_components
-   20.33μs    18.12μs      0.89  offset.OffestDatetimeArithmetic.time_add(<CustomBusinessDay>)
-  144.10ms   128.25ms      0.89  timeseries.DatetimeIndex.time_to_pydatetime('tz_aware')
-   29.40ms    26.11ms      0.89  gil.ParallelGroupbyMethods.time_parallel(2, 'sum')
-  325.11μs   288.09μs      0.89  inference.NumericInferOps.time_multiply(<type 'numpy.uint8'>)
-   17.48μs    15.46μs      0.88  offset.OffestDatetimeArithmetic.time_subtract_10(<BusinessQuarterEnd: startingMonth=3>)
-    6.91μs     6.09μs      0.88  algorithms.Duplicated.time_duplicated_string(False)
-  103.69ms    91.25ms      0.88  timedelta.ToTimedeltaErrors.time_convert('coerce')
-   16.28μs    14.30μs      0.88  offset.OffestDatetimeArithmetic.time_subtract(<YearBegin: month=1>)
-   12.94μs    11.32μs      0.87  timestamp.TimestampProperties.time_is_month_end(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, 'B')
-    3.12ms     2.72ms      0.87  algorithms.Duplicated.time_duplicated_int(False)
-  376.11μs   327.12μs      0.87  inference.NumericInferOps.time_multiply(<type 'numpy.uint16'>)
-  322.78μs   280.49μs      0.87  inference.NumericInferOps.time_multiply(<type 'numpy.int8'>)
-   12.12ms    10.52ms      0.87  multiindex_object.GetLoc.time_small_get_loc_warm
-  601.16ns   516.43ns      0.86  timestamp.TimestampProperties.time_weekday_name(None, 'B')
-   15.89μs    13.65μs      0.86  offset.OffestDatetimeArithmetic.time_add_10(<QuarterBegin: startingMonth=3>)
-  150.32ms   128.97ms      0.86  gil.ParallelGroupbyMethods.time_parallel(8, 'var')
-   11.73ms    10.03ms      0.85  multiindex_object.GetLoc.time_med_get_loc_warm
-   18.55ms    15.82ms      0.85  period.DataFramePeriodColumn.time_setitem_period_column
-  754.02μs   642.96μs      0.85  inference.NumericInferOps.time_add(<type 'numpy.uint32'>)
-   21.19μs    18.02μs      0.85  offset.OffestDatetimeArithmetic.time_add(<Day>)
-    7.92μs     6.73μs      0.85  timestamp.TimestampProperties.time_is_year_start(None, 'B')
-  123.69ms   101.66ms      0.82  gil.ParallelGroupbyMethods.time_parallel(8, 'max')
-   12.10μs     9.86μs      0.81  multiindex_object.GetLoc.time_string_get_loc
-   23.77μs    19.35μs      0.81  period.PeriodUnaryMethods.time_now('M')
-   77.71ms    63.17ms      0.81  groupby.MultiColumn.time_col_select_lambda_sum
-   85.82ms    69.71ms      0.81  join_merge.ConcatDataFrames.time_c_ordered(1, False)

taskset 4 asv continuous -E virtualenv -f 1.1 master HEAD
[...]
    before     after       ratio
  [238499ab] [7034c8ef]
+    2.05ms     3.27ms      1.59  gil.ParallelRolling.time_rolling('rolling_max')
+    1.16μs     1.46μs      1.25  timedelta.TimedeltaConstructor.time_from_missing
+    5.65ms     7.03ms      1.24  gil.ParallelReadCSV.time_read_csv('object')
+   65.03ms    79.15ms      1.22  gil.ParallelGroupbyMethods.time_parallel(4, 'var')
+    2.07ms     2.52ms      1.22  gil.ParallelRolling.time_rolling('rolling_min')
+    6.33μs     7.66μs      1.21  timedelta.TimedeltaConstructor.time_from_string
+    3.26μs     3.90μs      1.19  timedelta.TimedeltaConstructor.time_from_unit
+    9.94μs    11.80μs      1.19  multiindex_object.GetLoc.time_string_get_loc
+  288.30ns   336.67ns      1.17  timestamp.TimestampProperties.time_dayofweek(None, 'B')
+  720.24μs   839.99μs      1.17  inference.NumericInferOps.time_add(<type 'numpy.uint32'>)
+    5.20μs     6.06μs      1.16  timestamp.TimestampOps.time_replace_None('US/Eastern')
+  289.00ns   332.24ns      1.15  timestamp.TimestampProperties.time_is_quarter_start(None, None)
+  888.09μs     1.01ms      1.14  inference.NumericInferOps.time_multiply(<type 'numpy.int32'>)
+   15.10μs    17.12μs      1.13  offset.OffestDatetimeArithmetic.time_add_10(<BusinessDay>)
+  771.36μs   872.03μs      1.13  inference.ToNumericDowncast.time_downcast('datetime64', None)
+   11.50μs    12.98μs      1.13  timestamp.TimestampProperties.time_is_month_start(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, 'B')
+   56.30ms    63.53ms      1.13  gil.ParallelGroupbyMethods.time_parallel(4, 'sum')
+   18.32μs    20.61μs      1.12  offset.OffestDatetimeArithmetic.time_subtract_10(<SemiMonthBegin: day_of_month=15>)
+    6.82μs     7.65μs      1.12  timestamp.TimestampProperties.time_is_quarter_end(None, 'B')
+   30.52μs    34.03μs      1.12  offset.OffestDatetimeArithmetic.time_subtract_10(<CustomBusinessDay>)
+    6.20ms     6.90ms      1.11  timeseries.Factorize.time_factorize(None)
+  985.93μs     1.10ms      1.11  inference.NumericInferOps.time_multiply(<type 'numpy.uint64'>)
+   11.13μs    12.37μs      1.11  offset.OffestDatetimeArithmetic.time_apply_np_dt64(<BusinessQuarterEnd: startingMonth=3>)
+  273.67μs   303.79μs      1.11  inference.NumericInferOps.time_multiply(<type 'numpy.int8'>)
+   15.61μs    17.33μs      1.11  offset.OffestDatetimeArithmetic.time_subtract_10(<YearEnd: month=12>)
+  771.44μs   854.41μs      1.11  inference.NumericInferOps.time_add(<type 'numpy.float32'>)
+  152.52μs   168.34μs      1.10  join_merge.Concat.time_concat_empty_right(1)
+    5.26ms     5.79ms      1.10  stat_ops.SeriesMultiIndexOps.time_op([0, 1], 'sem')
+   11.27μs    12.41μs      1.10  timestamp.TimestampProperties.time_is_year_start(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, 'B')
+  134.58ms   148.19ms      1.10  gil.ParallelGroupbyMethods.time_parallel(8, 'var')
-   21.32μs    19.37μs      0.91  offset.OffestDatetimeArithmetic.time_add_10(<Day>)
-    1.21ms     1.09ms      0.91  inference.NumericInferOps.time_multiply(<type 'numpy.int64'>)
-   15.67μs    14.19μs      0.91  offset.OffestDatetimeArithmetic.time_add_10(<BusinessQuarterEnd: startingMonth=3>)
-   29.13ms    26.36ms      0.90  join_merge.Concat.time_concat_small_frames(0)
-    4.09ms     3.70ms      0.90  timeseries.ToDatetimeISO8601.time_iso8601
-   17.84μs    16.13μs      0.90  timeseries.AsOf.time_asof_single('Series')
-   18.21μs    16.35μs      0.90  offset.OffestDatetimeArithmetic.time_subtract_10(<BusinessYearEnd: month=12>)
-   65.01ms    58.22ms      0.90  gil.ParallelGroupbyMethods.time_parallel(4, 'max')
-   65.68ms    58.76ms      0.89  gil.ParallelGroupbyMethods.time_parallel(4, 'mean')
-    8.40ms     7.49ms      0.89  frame_methods.MaskBool.time_frame_mask_floats
-   31.52ms    28.11ms      0.89  gil.ParallelFactorize.time_parallel(4)
-  310.70μs   274.38μs      0.88  inference.NumericInferOps.time_add(<type 'numpy.int8'>)
-   29.68ms    26.20ms      0.88  gil.ParallelGroupbyMethods.time_parallel(2, 'last')
-  128.31ms   112.87ms      0.88  gil.ParallelGroupbyMethods.time_parallel(8, 'max')
-    1.01ms   886.81μs      0.88  inference.NumericInferOps.time_subtract(<type 'numpy.int32'>)
-   52.00ms    45.52ms      0.88  plotting.TimeseriesPlotting.time_plot_regular_compat
-   14.88μs    13.03μs      0.88  offset.OffestDatetimeArithmetic.time_add_10(<MonthBegin>)
-   18.81μs    16.36μs      0.87  offset.OffestDatetimeArithmetic.time_apply_np_dt64(<CustomBusinessDay>)
-    3.89ms     3.37ms      0.87  algorithms.Duplicated.time_duplicated_float('first')
-    1.38ms     1.19ms      0.86  stat_ops.FrameOps.time_op('prod', 'int', 0, False)
-   12.26μs    10.60μs      0.86  offset.OffestDatetimeArithmetic.time_apply_np_dt64(<BusinessYearEnd: month=12>)
-    3.09ms     2.62ms      0.85  inference.ToNumeric.time_from_numeric_str('coerce')
-    3.90ms     3.31ms      0.85  algorithms.Duplicated.time_duplicated_float('last')
-   17.10μs    14.35μs      0.84  offset.OffestDatetimeArithmetic.time_subtract(<BusinessMonthEnd>)
-   28.26ms    23.44ms      0.83  gil.ParallelGroupbyMethods.time_parallel(2, 'max')
-   29.33ms    23.86ms      0.81  gil.ParallelGroupbyMethods.time_parallel(2, 'prod')
-   85.89ms    69.82ms      0.81  join_merge.ConcatDataFrames.time_c_ordered(1, False)
-    2.79ms     2.14ms      0.76  stat_ops.Correlation.time_corr('pearson')
-   66.96ms    49.72ms      0.74  gil.ParallelGroupbyMethods.time_parallel(4, 'prod')
-  163.15ms   108.13ms      0.66  gil.ParallelGroupbyMethods.time_loop(8, 'sum')
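
Several benchmarks appear as a slowdown in one full run and a speedup in the other (for example, stat_ops.Correlation.time_corr('pearson') at 1.21 and then 0.76), which suggests noise rather than a real change. One way to spot such cases is to parse both outputs and list the benchmarks whose direction flipped. This is a rough sketch only: the row pattern is an assumption about the pasted layout, not an asv API, and `parse_rows` and `flipped` are hypothetical helper names.

```python
import re

# One result row: sign, before, after, ratio, benchmark name.
ROW = re.compile(r"^([+-])\s+(\S+)\s+(\S+)\s+(\d+\.\d+)\s+(.+)$")

def parse_rows(text):
    """Map benchmark name -> ratio for each '+'/'-' row in the pasted output."""
    ratios = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            ratios[m.group(5)] = float(m.group(4))
    return ratios

def flipped(run_a, run_b):
    """Benchmarks reported slower in one run and faster in the other."""
    common = run_a.keys() & run_b.keys()
    return sorted(n for n in common if (run_a[n] - 1) * (run_b[n] - 1) < 0)

# Three rows copied from each of the two full runs above.
run_1 = parse_rows("""
+    2.02ms     2.45ms      1.21  stat_ops.Correlation.time_corr('pearson')
+   52.96ms    65.09ms      1.23  gil.ParallelGroupbyMethods.time_parallel(4, 'max')
-   85.82ms    69.71ms      0.81  join_merge.ConcatDataFrames.time_c_ordered(1, False)
""")
run_2 = parse_rows("""
-    2.79ms     2.14ms      0.76  stat_ops.Correlation.time_corr('pearson')
-   65.01ms    58.22ms      0.90  gil.ParallelGroupbyMethods.time_parallel(4, 'max')
-   85.89ms    69.82ms      0.81  join_merge.ConcatDataFrames.time_c_ordered(1, False)
""")

for name in flipped(run_1, run_2):
    print(name, run_1[name], run_2[name])
```

A benchmark that keeps the same direction in both runs, such as join_merge.ConcatDataFrames.time_c_ordered(1, False), is not listed and is a better candidate for a real effect.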

jreback added this to the 0.23.0 milestone Jan 31, 2018
jreback merged commit 01cbc64 into pandas-dev:master Jan 31, 2018
@jreback
Contributor

jreback commented Jan 31, 2018

thanks!

jbrockmendel deleted the unhelper branch February 11, 2018 21:58
harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018