[CLN] Make more of numpy_helper unnecessary #22344

jbrockmendel · 2018-08-14T15:26:50Z

No description provided.

codecov · 2018-08-15T01:18:29Z

Codecov Report

Merging #22344 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #22344   +/-   ##
=======================================
  Coverage   92.05%   92.05%           
=======================================
  Files         169      169           
  Lines       50709    50709           
=======================================
  Hits        46679    46679           
  Misses       4030     4030

Flag	Coverage Δ
#multiple	`90.46% <ø> (ø)`	⬆️
#single	`42.25% <ø> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf70d11...29972ea. Read the comment docs.

jorisvandenbossche · 2018-08-15T21:16:09Z

Do you know if there is any benefit in using get_value_1d? (what was the original reason for using it?)

jreback

pls perf check this

jbrockmendel · 2018-08-15T21:25:19Z

Do you know if there is any benefit in using get_value_1d? (what was the original reason for using it?)

This and #22329 were both spun off of a branch that was seeing if that could be made unnecessary.
For the most part, we can replace get_value_1d(arr, i) with arr[i] just fine. The one place in the tests where that doesn't work is a code both that goes through SparseArray.__getitem__, because that leads to a RecursionError. IIUC several other pandas objects used to subclass np.ndarray, so may have needed the same workaround to effectively call ndarray.__getitem__(arr, i).

After this and #22329 I'll be ready to make a PR to get rid of numpy_helper altogether, since get_value_1d and set_value_1d can be made unnecessary.

jorisvandenbossche · 2018-08-15T21:32:21Z

OK. It seems to have been introduced to make sure Series getitem would return numpy scalars: 96375da. But I suppose that is now covered by the tests.

jbrockmendel · 2018-08-16T03:11:25Z

pls perf check this

lib.infer_dtype with mixed-dtype should go through one of the affected paths:

In [3]: arr = np.array(range(10000), dtype='O')

In [4]: arr[100] = 'foo'

master:

In [10]: %timeit pd.lib.infer_dtype(arr)
10000 loops, best of 3: 21.5 µs per loop

In [11]: %timeit pd.lib.infer_dtype(arr)
10000 loops, best of 3: 25.6 µs per loop

In [12]: %timeit pd.lib.infer_dtype(arr)
10000 loops, best of 3: 19.9 µs per loop

PR:

In [7]: %timeit pd.lib.infer_dtype(arr)
10000 loops, best of 3: 20.1 µs per loop

In [8]: %timeit pd.lib.infer_dtype(arr)
10000 loops, best of 3: 24.3 µs per loop

In [9]: %timeit pd.lib.infer_dtype(arr)
10000 loops, best of 3: 19.9 µs per loop

jorisvandenbossche · 2018-08-16T07:20:45Z

Timings look good.

jreback · 2018-08-16T10:47:32Z

thanks. can you do a PR with those as asv's? (if not already there)

jbrockmendel · 2018-08-16T13:38:40Z

can you do a PR with those as asv's?

Sure. I think for low-level cython stuff we may need an extra layer (in cython) to call these tiny functions in a loop.

jorisvandenbossche · 2018-08-16T13:40:52Z

Sure. I think for low-level cython stuff we may need an extra layer (in cython) to call these tiny functions in a loop.

That should not be needed for infer_dtype (and also for the others I would try to check what the actual "higher level" usage is)

jbrockmendel · 2018-08-16T13:56:23Z

That should not be needed for infer_dtype

You're right about infer_dtype, but I'm suggesting we directly benchmark the affected code.

jorisvandenbossche · 2018-08-16T14:01:10Z

IMO we should always try to test higher level code / (semi) public APIs (to the extent possible of course). So if possible (if there is a reasonable higher level code, such as infer_dtypes already is) I would not directly benchmark the affected code. Otherwise that makes the benchmarks very internals-specific (making it more difficult to change internals, or to compare to alternative implementations)

jbrockmendel · 2018-08-16T16:14:43Z

So if possible (if there is a reasonable higher level code, such as infer_dtypes already is) I would not directly benchmark the affected code.

My issue with this is that the code affected by this PR constitutes a tiny fraction of the runtime of infer_dtypes. asv results are mostly noise anyway; more focused benchmarks can chip away at that problem.

Otherwise that makes the benchmarks very internals-specific (making it more difficult to change internals, or to compare to alternative implementations)

Part of the motivation I have in mind comes from a change I was looking at to implement pure-cython implementations of np.zeros(n, dtype=np.float64). The benchmark I wrote for that (unfortunately now-deleted) branch looked something like:

def time_zeros(size=10000, loops=1000, version=0):
     if version == 0:
          for n in range(loops):
              out = np.zeros(size, dtype=np.float64)
    else:
        for n in range(loops):
            out = zeros1d(size)
    return out

With comments documenting the runtime measurements for each version. The idea being to measure+document the alternative implementations.

commit b29dfc6 Author: Tom Augspurger <[email protected]> Date: Thu Aug 16 10:45:38 2018 -0500 Support NDFrame.shift with EAs Uses take internally. Closes pandas-dev#22386 commit b5d81cf Author: William Ayd <[email protected]> Date: Thu Aug 16 03:54:18 2018 -0700 Bump pytest (pandas-dev#22320) commit f07a790 Author: jbrockmendel <[email protected]> Date: Thu Aug 16 03:46:58 2018 -0700 Make more of numpy_helper unnecessary (pandas-dev#22344) commit 7b80d4d Author: Graham Inggs <[email protected]> Date: Thu Aug 16 12:43:02 2018 +0200 Drop redundant TestLocale (pandas-dev#22349) commit 6bcfc46 Author: Matthew Roeschke <[email protected]> Date: Thu Aug 16 03:32:31 2018 -0700 Fix failing dateutil test (pandas-dev#22354) commit 86e8f23 Author: jbrockmendel <[email protected]> Date: Thu Aug 16 03:08:09 2018 -0700 remove last cython: nprofile comments (pandas-dev#22371) commit 70e6f7c Author: Joris Van den Bossche <[email protected]> Date: Wed Aug 15 18:09:50 2018 +0200 DOC: edit docstring example to prevent segfault (pandas-dev#21824) (pandas-dev#22368)

commit c4b0b97 Author: Tom Augspurger <[email protected]> Date: Thu Aug 16 14:36:39 2018 -0500 Slice based commit c980035 Author: Tom Augspurger <[email protected]> Date: Thu Aug 16 14:20:21 2018 -0500 Updated commit b29dfc6 Author: Tom Augspurger <[email protected]> Date: Thu Aug 16 10:45:38 2018 -0500 Support NDFrame.shift with EAs Uses take internally. Closes pandas-dev#22386 commit b5d81cf Author: William Ayd <[email protected]> Date: Thu Aug 16 03:54:18 2018 -0700 Bump pytest (pandas-dev#22320) commit f07a790 Author: jbrockmendel <[email protected]> Date: Thu Aug 16 03:46:58 2018 -0700 Make more of numpy_helper unnecessary (pandas-dev#22344) commit 7b80d4d Author: Graham Inggs <[email protected]> Date: Thu Aug 16 12:43:02 2018 +0200 Drop redundant TestLocale (pandas-dev#22349) commit 6bcfc46 Author: Matthew Roeschke <[email protected]> Date: Thu Aug 16 03:32:31 2018 -0700 Fix failing dateutil test (pandas-dev#22354) commit 86e8f23 Author: jbrockmendel <[email protected]> Date: Thu Aug 16 03:08:09 2018 -0700 remove last cython: nprofile comments (pandas-dev#22371) commit 70e6f7c Author: Joris Van den Bossche <[email protected]> Date: Wed Aug 15 18:09:50 2018 +0200 DOC: edit docstring example to prevent segfault (pandas-dev#21824) (pandas-dev#22368)

Make more of numpy_helper unnecessary

29972ea

jbrockmendel changed the title ~~Make more of numpy_helper unnecessary~~ [CLN] Make more of numpy_helper unnecessary Aug 14, 2018

gfyoung added Internals Related to non-user accessible pandas implementation Clean labels Aug 14, 2018

gfyoung requested a review from jreback August 14, 2018 18:59

jbrockmendel mentioned this pull request Aug 15, 2018

CLN: de-duplicate index validation code #22329

Merged

jreback requested changes Aug 15, 2018

View reviewed changes

jreback added this to the 0.24.0 milestone Aug 16, 2018

jreback approved these changes Aug 16, 2018

View reviewed changes

jreback merged commit f07a790 into pandas-dev:master Aug 16, 2018

jbrockmendel deleted the nohelp branch August 16, 2018 13:37

Moons08 pushed a commit to Moons08/pandas that referenced this pull request Aug 17, 2018

Make more of numpy_helper unnecessary (pandas-dev#22344)

dfd1264

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

Make more of numpy_helper unnecessary (pandas-dev#22344)

f333f0c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CLN] Make more of numpy_helper unnecessary #22344

[CLN] Make more of numpy_helper unnecessary #22344

jbrockmendel commented Aug 14, 2018

codecov bot commented Aug 15, 2018 •

edited

Loading

jorisvandenbossche commented Aug 15, 2018

jreback left a comment

jbrockmendel commented Aug 15, 2018

jorisvandenbossche commented Aug 15, 2018

jbrockmendel commented Aug 16, 2018

jorisvandenbossche commented Aug 16, 2018

jreback commented Aug 16, 2018

jbrockmendel commented Aug 16, 2018

jorisvandenbossche commented Aug 16, 2018

jbrockmendel commented Aug 16, 2018

jorisvandenbossche commented Aug 16, 2018

jbrockmendel commented Aug 16, 2018

[CLN] Make more of numpy_helper unnecessary #22344

[CLN] Make more of numpy_helper unnecessary #22344

Conversation

jbrockmendel commented Aug 14, 2018

codecov bot commented Aug 15, 2018 • edited Loading

Codecov Report

jorisvandenbossche commented Aug 15, 2018

jreback left a comment

Choose a reason for hiding this comment

jbrockmendel commented Aug 15, 2018

jorisvandenbossche commented Aug 15, 2018

jbrockmendel commented Aug 16, 2018

jorisvandenbossche commented Aug 16, 2018

jreback commented Aug 16, 2018

jbrockmendel commented Aug 16, 2018

jorisvandenbossche commented Aug 16, 2018

jbrockmendel commented Aug 16, 2018

jorisvandenbossche commented Aug 16, 2018

jbrockmendel commented Aug 16, 2018

codecov bot commented Aug 15, 2018 •

edited

Loading