CLN: de-duplicate index validation code #22329

Merged: 6 commits merged into pandas-dev:master from the check_index branch on Aug 22, 2018

Conversation

@jbrockmendel (Member)

There are currently 3 nearly-identical versions of this code. I'm pretty sure the strictest one is the most-correct, so that is made into the only one.

@jorisvandenbossche (Member) left a comment:

You mention you used the "strictest" one: what exactly is the difference between the current versions?

Can you also do a quick check of performance of a call that hits those functions?

@@ -44,23 +44,50 @@ ctypedef fused numeric:
cnp.float64_t


cdef inline object get_value_at(ndarray arr, object loc):
cdef inline Py_ssize_t validate_indexer(ndarray arr, object loc) except? -1:
Member:

I don't think you need the question mark in except? -1 (since -1 can never actually be returned from the function without it being an error)

Contributor:

the except is to allow for the IndexError

Member:

my comment is about the ?, not the except itself

@jreback (Contributor) commented Aug 14, 2018

yeah these are in critical paths of perf. pls do a check.

@jreback added the Indexing and Clean labels on Aug 14, 2018
@codecov (bot) commented Aug 14, 2018

Codecov Report

Merging #22329 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #22329   +/-   ##
=======================================
  Coverage   92.05%   92.05%           
=======================================
  Files         169      169           
  Lines       50709    50709           
=======================================
  Hits        46679    46679           
  Misses       4030     4030
Flag Coverage Δ
#multiple 90.46% <ø> (ø) ⬆️
#single 42.25% <ø> (ø) ⬆️


@jbrockmendel (Member, Author)

> yeah these are in critical paths of perf. pls do a check.

Running asv now.

> I don't think you need the question mark in except? -1 (since -1 can never actually be returned from the function without it being an error)

The ? is there purely out of habit. Is there a cost to having it there?

> what exactly is the difference between the current versions?

Corner-case handling. E.g. if len(arr) == 10 and i = -4, all of them will increment i += len(arr) and be OK. But if i = -14, only the strict one will check that i + len(arr) is still negative (and raise IndexError).
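
For concreteness, a minimal sketch of the strict check being described (hypothetical name and message, not the PR's actual code):

from numpy cimport ndarray

cdef inline Py_ssize_t validate_indexer_sketch(ndarray arr, object loc) except -1:
    # Wrap a negative index once, then verify it actually landed in range.
    cdef:
        Py_ssize_t n = len(arr)
        Py_ssize_t i = loc

    if i < 0:
        i += n               # n == 10, i == -4  -> 6, in range
    if i < 0 or i >= n:      # n == 10, i == -14 -> -4 after wrapping, still invalid
        raise IndexError('index out of bounds')
    return i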

@jbrockmendel (Member, Author)

asv results:

asv continuous -f 1.1 -E virtualenv master HEAD -b frame
[...]
       before           after         ratio
     [ffae1587]       [7d2e4330]
+      9.97±0.5ms       16.1±0.3ms     1.61  frame_methods.Apply.time_apply_lambda_mean

asv continuous -f 1.1 -E virtualenv master HEAD -b frame
[...]
       before           after         ratio
     [ffae1587]       [7d2e4330]
+        894±80μs      2.19±0.04ms     2.46  frame_ctor.FromRecords.time_frame_from_records_generator(1000)

taskset 8 time asv continuous -f 1.1 -E virtualenv master HEAD -b frame
[...]
       before           after         ratio
     [ffae1587]       [7d2e4330]
-        172±10ms        946±100μs     0.01  frame_ctor.FromRecords.time_frame_from_records_generator(None)

time taskset 4 asv continuous -f 1.1 -E virtualenv master HEAD -b frame
[...]
       before           after         ratio
     [ffae1587]       [7d2e4330]
-     2.20±0.02ms      1.01±0.01ms     0.46  frame_ctor.FromRecords.time_frame_from_records_generator(1000)

@jorisvandenbossche (Member)

The asv results are clearly not really relevant. Can you do a quick check of the impacted functions with a direct call? E.g. get_value_box is only used in Index.get_value and is hit when doing:

In [4]: idx = pd.Index(['a', 'b', 'c', 'd'])

In [5]: s = pd.Series(range(len(idx)), index=idx)

In [6]: idx.get_value(s, 1)
Out[6]: 1

You can do a quick %timeit before/after (but with a bigger index) to see if there is any significant change.

> The ? is there purely out of habit. Is there a cost to having it there?

There is an extra check for whether an error was raised or not. I don't assume this is costly, but since we know that it will always be an error, I think it is cleaner code-wise to reflect this in the except clause.
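
In Cython terms, the distinction being discussed looks roughly like this (a sketch with made-up names, not code from the PR):

cdef Py_ssize_t strict_decl(object loc) except -1:
    # ``except -1``: a return value of -1 always means an exception was raised,
    # so no extra error check is needed at the call site.
    return loc

cdef Py_ssize_t relaxed_decl(object loc) except? -1:
    # ``except? -1``: -1 might be a legitimate result, so Cython additionally
    # calls PyErr_Occurred() after a -1 return to tell the two cases apart.
    return loc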

@jorisvandenbossche (Member)

> But if i = -14 only the strict one will check that i + len(arr) is still negative.

OK, that sounds like a good change!

@jbrockmendel (Member, Author) commented Aug 14, 2018 via email

@jbrockmendel (Member, Author)

Indistinguishable:

Master:

In [2]: idx = pd.Int64Index(range(10**6))

In [3]: ser = pd.Series(idx, index=idx)

In [4]: %timeit idx.get_value(ser, 1000)
10000 loops, best of 3: 26.6 µs per loop

In [5]: %timeit idx.get_value(ser, 1000)
10000 loops, best of 3: 26.7 µs per loop

PR:

In [2]: idx = pd.Int64Index(range(10**6))

In [3]: ser = pd.Series(idx, index=idx)

In [4]: %timeit idx.get_value(ser, 1000)
10000 loops, best of 3: 26.8 µs per loop

In [5]: %timeit idx.get_value(ser, 1000)
10000 loops, best of 3: 25.6 µs per loop

@jorisvandenbossche (Member)

I don't think your example with Int64Index hits the function you modified (note my example used a string index, so it falls back to integer-position indexing), but you can check that by adding a pdb trace or print statement to be sure.

@jorisvandenbossche changed the title from "de-duplicate index validation code" to "CLN: de-duplicate index validation code" on Aug 14, 2018
@jbrockmendel (Member, Author)

In [2]: idx = pd.Index([str(x) for x in range(10**6)])

In [3]: ser = pd.Series(range(len(idx)), index=idx)

In [4]: %timeit idx.get_value(ser, 1000)

Master:

The slowest run took 7246.54 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 37.9 µs per loop

The slowest run took 5.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 19.5 µs per loop

The slowest run took 5.11 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 18.8 µs per loop

The slowest run took 5.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 18.4 µs per loop

The slowest run took 4.27 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop

PR:

The slowest run took 9550.30 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 29.1 µs per loop

The slowest run took 5.77 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 19.1 µs per loop

The slowest run took 5.49 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 19.3 µs per loop

10000 loops, best of 3: 25.7 µs per loop

The slowest run took 4.35 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 18.2 µs per loop

@jreback (Contributor) left a comment:

can u add these as asvs?

@jorisvandenbossche (Member)

@jreback I am not sure that is needed. It should already be covered by other indexing benchmarks that use get_value under the hood.
(I mainly wanted to check this very specific code path as a sanity check, to be sure we are checking the correct thing.)


i = util.validate_indexer(arr, loc)
Member:

Since get_value_at (the util version below, which is called by the get_value_at in this file) now has the same validation, isn't it then unnecessary to call the validation here as well?

Member (Author):

I think you’re right, will update.
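
Roughly, the simplification being agreed on here (a sketch, not the final committed code; util is the cimported helpers module referenced in the diff above):

cdef inline object get_value_at(ndarray arr, object loc):
    # Delegate straight to the util-level helper, which now performs the
    # single strict bounds check, rather than validating a second time here.
    return util.get_value_at(arr, loc)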

@jbrockmendel (Member, Author)

Following this and #22344 it tentatively looks like we'll be ready to get rid of numpy_helper (and chunks of util) altogether.

@jbrockmendel (Member, Author)

@jreback gentle ping. After this we can get rid of a bunch of old numpy_helper code.

@jreback added this to the 0.24.0 milestone on Aug 22, 2018
@jreback merged commit 9346e79 into pandas-dev:master on Aug 22, 2018
@jreback (Contributor) commented Aug 22, 2018

thanks!

yeah lots of PRs!

@jbrockmendel deleted the check_index branch on August 22, 2018 at 13:47
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Labels: Clean, Indexing

3 participants