Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CLN] Make more of numpy_helper unnecessary #22344

Merged
merged 1 commit into from
Aug 16, 2018

Conversation

jbrockmendel
Copy link
Member

No description provided.

@jbrockmendel jbrockmendel changed the title Make more of numpy_helper unnecessary [CLN] Make more of numpy_helper unnecessary Aug 14, 2018
@gfyoung gfyoung added Internals Related to non-user accessible pandas implementation Clean labels Aug 14, 2018
@gfyoung gfyoung requested a review from jreback August 14, 2018 18:59
@codecov
Copy link

codecov bot commented Aug 15, 2018

Codecov Report

Merging #22344 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #22344   +/-   ##
=======================================
  Coverage   92.05%   92.05%           
=======================================
  Files         169      169           
  Lines       50709    50709           
=======================================
  Hits        46679    46679           
  Misses       4030     4030
Flag Coverage Δ
#multiple 90.46% <ø> (ø) ⬆️
#single 42.25% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf70d11...29972ea. Read the comment docs.

@jorisvandenbossche
Copy link
Member

Do you know if there is any benefit in using get_value_1d? (what was the original reason for using it?)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls perf check this

@jbrockmendel
Copy link
Member Author

Do you know if there is any benefit in using get_value_1d? (what was the original reason for using it?)

This and #22329 were both spun off of a branch that was seeing if that could be made unnecessary.
For the most part, we can replace get_value_1d(arr, i) with arr[i] just fine. The one place in the tests where that doesn't work is a code both that goes through SparseArray.__getitem__, because that leads to a RecursionError. IIUC several other pandas objects used to subclass np.ndarray, so may have needed the same workaround to effectively call ndarray.__getitem__(arr, i).

After this and #22329 I'll be ready to make a PR to get rid of numpy_helper altogether, since get_value_1d and set_value_1d can be made unnecessary.

@jorisvandenbossche
Copy link
Member

OK. It seems to have been introduced to make sure Series getitem would return numpy scalars: 96375da. But I suppose that is now covered by the tests.

@jbrockmendel
Copy link
Member Author

pls perf check this

lib.infer_dtype with mixed-dtype should go through one of the affected paths:

In [3]: arr = np.array(range(10000), dtype='O')

In [4]: arr[100] = 'foo'

master:

In [10]: %timeit pd.lib.infer_dtype(arr)
10000 loops, best of 3: 21.5 µs per loop

In [11]: %timeit pd.lib.infer_dtype(arr)
10000 loops, best of 3: 25.6 µs per loop

In [12]: %timeit pd.lib.infer_dtype(arr)
10000 loops, best of 3: 19.9 µs per loop

PR:

In [7]: %timeit pd.lib.infer_dtype(arr)
10000 loops, best of 3: 20.1 µs per loop

In [8]: %timeit pd.lib.infer_dtype(arr)
10000 loops, best of 3: 24.3 µs per loop

In [9]: %timeit pd.lib.infer_dtype(arr)
10000 loops, best of 3: 19.9 µs per loop

@jorisvandenbossche
Copy link
Member

Timings look good.

@jreback jreback added this to the 0.24.0 milestone Aug 16, 2018
@jreback jreback merged commit f07a790 into pandas-dev:master Aug 16, 2018
@jreback
Copy link
Contributor

jreback commented Aug 16, 2018

thanks. can you do a PR with those as asv's? (if not already there)

@jbrockmendel jbrockmendel deleted the nohelp branch August 16, 2018 13:37
@jbrockmendel
Copy link
Member Author

can you do a PR with those as asv's?

Sure. I think for low-level cython stuff we may need an extra layer (in cython) to call these tiny functions in a loop.

@jorisvandenbossche
Copy link
Member

Sure. I think for low-level cython stuff we may need an extra layer (in cython) to call these tiny functions in a loop.

That should not be needed for infer_dtype (and also for the others I would try to check what the actual "higher level" usage is)

@jbrockmendel
Copy link
Member Author

That should not be needed for infer_dtype

You're right about infer_dtype, but I'm suggesting we directly benchmark the affected code.

@jorisvandenbossche
Copy link
Member

IMO we should always try to test higher level code / (semi) public APIs (to the extent possible of course). So if possible (if there is a reasonable higher level code, such as infer_dtypes already is) I would not directly benchmark the affected code. Otherwise that makes the benchmarks very internals-specific (making it more difficult to change internals, or to compare to alternative implementations)

@jbrockmendel
Copy link
Member Author

So if possible (if there is a reasonable higher level code, such as infer_dtypes already is) I would not directly benchmark the affected code.

My issue with this is that the code affected by this PR constitutes a tiny fraction of the runtime of infer_dtypes. asv results are mostly noise anyway; more focused benchmarks can chip away at that problem.

Otherwise that makes the benchmarks very internals-specific (making it more difficult to change internals, or to compare to alternative implementations)

Part of the motivation I have in mind comes from a change I was looking at to implement pure-cython implementations of np.zeros(n, dtype=np.float64). The benchmark I wrote for that (unfortunately now-deleted) branch looked something like:

def time_zeros(size=10000, loops=1000, version=0):
     if version == 0:
          for n in range(loops):
              out = np.zeros(size, dtype=np.float64)
    else:
        for n in range(loops):
            out = zeros1d(size)
    return out

With comments documenting the runtime measurements for each version. The idea being to measure+document the alternative implementations.

Moons08 pushed a commit to Moons08/pandas that referenced this pull request Aug 17, 2018
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Aug 18, 2018
commit b29dfc6
Author: Tom Augspurger <[email protected]>
Date:   Thu Aug 16 10:45:38 2018 -0500

    Support NDFrame.shift with EAs

    Uses take internally.

    Closes pandas-dev#22386

commit b5d81cf
Author: William Ayd <[email protected]>
Date:   Thu Aug 16 03:54:18 2018 -0700

    Bump pytest (pandas-dev#22320)

commit f07a790
Author: jbrockmendel <[email protected]>
Date:   Thu Aug 16 03:46:58 2018 -0700

    Make more of numpy_helper unnecessary (pandas-dev#22344)

commit 7b80d4d
Author: Graham Inggs <[email protected]>
Date:   Thu Aug 16 12:43:02 2018 +0200

    Drop redundant TestLocale (pandas-dev#22349)

commit 6bcfc46
Author: Matthew Roeschke <[email protected]>
Date:   Thu Aug 16 03:32:31 2018 -0700

    Fix failing dateutil test (pandas-dev#22354)

commit 86e8f23
Author: jbrockmendel <[email protected]>
Date:   Thu Aug 16 03:08:09 2018 -0700

    remove last cython: nprofile comments (pandas-dev#22371)

commit 70e6f7c
Author: Joris Van den Bossche <[email protected]>
Date:   Wed Aug 15 18:09:50 2018 +0200

    DOC: edit docstring example to prevent segfault (pandas-dev#21824) (pandas-dev#22368)
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Aug 18, 2018
commit c4b0b97
Author: Tom Augspurger <[email protected]>
Date:   Thu Aug 16 14:36:39 2018 -0500

    Slice based

commit c980035
Author: Tom Augspurger <[email protected]>
Date:   Thu Aug 16 14:20:21 2018 -0500

    Updated

commit b29dfc6
Author: Tom Augspurger <[email protected]>
Date:   Thu Aug 16 10:45:38 2018 -0500

    Support NDFrame.shift with EAs

    Uses take internally.

    Closes pandas-dev#22386

commit b5d81cf
Author: William Ayd <[email protected]>
Date:   Thu Aug 16 03:54:18 2018 -0700

    Bump pytest (pandas-dev#22320)

commit f07a790
Author: jbrockmendel <[email protected]>
Date:   Thu Aug 16 03:46:58 2018 -0700

    Make more of numpy_helper unnecessary (pandas-dev#22344)

commit 7b80d4d
Author: Graham Inggs <[email protected]>
Date:   Thu Aug 16 12:43:02 2018 +0200

    Drop redundant TestLocale (pandas-dev#22349)

commit 6bcfc46
Author: Matthew Roeschke <[email protected]>
Date:   Thu Aug 16 03:32:31 2018 -0700

    Fix failing dateutil test (pandas-dev#22354)

commit 86e8f23
Author: jbrockmendel <[email protected]>
Date:   Thu Aug 16 03:08:09 2018 -0700

    remove last cython: nprofile comments (pandas-dev#22371)

commit 70e6f7c
Author: Joris Van den Bossche <[email protected]>
Date:   Wed Aug 15 18:09:50 2018 +0200

    DOC: edit docstring example to prevent segfault (pandas-dev#21824) (pandas-dev#22368)
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants