CLN: de-duplicate index validation code #22329

Merged: 6 commits merged into pandas-dev:master from the check_index branch on Aug 22, 2018

Conversation

@jbrockmendel (Member)

There are currently 3 nearly-identical versions of this code. I'm pretty sure the strictest one is the most-correct, so that is made into the only one.

@jorisvandenbossche (Member) left a comment:

You mention you used the "strictest" one: what exactly is the difference between the current versions?

Can you also do a quick check of performance of a call that hits those functions?

@@ -44,23 +44,50 @@ ctypedef fused numeric:
cnp.float64_t


cdef inline object get_value_at(ndarray arr, object loc):
cdef inline Py_ssize_t validate_indexer(ndarray arr, object loc) except? -1:
Member:

I don't think you need the question mark in except? -1 (since -1 can never actually be returned from the function without it being an error)

Contributor:

the except is to allow for the IndexError

Member:

my comment is about the ?, not the except itself

@jreback (Contributor) commented Aug 14, 2018

yeah these are in critical paths of perf. pls do a check.

@jreback added the Indexing and Clean labels on Aug 14, 2018
@codecov (bot) commented Aug 14, 2018

Codecov Report

Merging #22329 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #22329   +/-   ##
=======================================
  Coverage   92.05%   92.05%           
=======================================
  Files         169      169           
  Lines       50709    50709           
=======================================
  Hits        46679    46679           
  Misses       4030     4030
Flag Coverage Δ
#multiple 90.46% <ø> (ø) ⬆️
#single 42.25% <ø> (ø) ⬆️


@jbrockmendel (Member, Author)

> yeah these are in critical paths of perf. pls do a check.

Running asv now.

> I don't think you need the question mark in except? -1 (since -1 can never actually be returned from the function without it being an error)

The ? is there purely out of habit. Is there a cost to having it there?

> what exactly is the difference between the current versions?

Corner-case handling. E.g. if len(arr) == 10 and i = -4, all of them will increment i += len(arr) and be OK. But if i = -14, only the strict one will check that i + len(arr) is still negative (and raise IndexError).
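
For concreteness, a minimal sketch of the strict check being described (hypothetical name and message, not the PR's actual code):

from numpy cimport ndarray

cdef inline Py_ssize_t validate_indexer_sketch(ndarray arr, object loc) except -1:
    # Wrap a negative index once, then verify it actually landed in range.
    cdef:
        Py_ssize_t n = len(arr)
        Py_ssize_t i = loc

    if i < 0:
        i += n               # n == 10, i == -4  -> 6, in range
    if i < 0 or i >= n:      # n == 10, i == -14 -> -4 after wrapping, still invalid
        raise IndexError('index out of bounds')
    return i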

@jbrockmendel (Member, Author)

asv results:

asv continuous -f 1.1 -E virtualenv master HEAD -b frame
[...]
       before           after         ratio
     [ffae1587]       [7d2e4330]
+      9.97±0.5ms       16.1±0.3ms     1.61  frame_methods.Apply.time_apply_lambda_mean

asv continuous -f 1.1 -E virtualenv master HEAD -b frame
[...]
       before           after         ratio
     [ffae1587]       [7d2e4330]
+        894±80μs      2.19±0.04ms     2.46  frame_ctor.FromRecords.time_frame_from_records_generator(1000)

taskset 8 time asv continuous -f 1.1 -E virtualenv master HEAD -b frame
[...]
       before           after         ratio
     [ffae1587]       [7d2e4330]
-        172±10ms        946±100μs     0.01  frame_ctor.FromRecords.time_frame_from_records_generator(None)

time taskset 4 asv continuous -f 1.1 -E virtualenv master HEAD -b frame
[...]
       before           after         ratio
     [ffae1587]       [7d2e4330]
-     2.20±0.02ms      1.01±0.01ms     0.46  frame_ctor.FromRecords.time_frame_from_records_generator(1000)

@jorisvandenbossche (Member)

The asv results are clearly not really relevant. Can you do a quick check of the impacted functions with a direct call? E.g. get_value_box is only used in Index.get_value and is hit when doing:

In [4]: idx = pd.Index(['a', 'b', 'c', 'd'])

In [5]: s = pd.Series(range(len(idx)), index=idx)

In [6]: idx.get_value(s, 1)
Out[6]: 1

You can do a quick %timeit before/after (but with a bigger index) to see if there is any significant change.

> The ? is there purely out of habit. Is there a cost to having it there?

There is an extra check for whether an error was raised or not. I don't assume this is costly, but since we know that it will always be an error, I think it is cleaner code-wise to reflect this in the except clause.
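
In Cython terms, the distinction being discussed looks roughly like this (a sketch with made-up names, not code from the PR):

cdef Py_ssize_t strict_decl(object loc) except -1:
    # ``except -1``: a return value of -1 always means an exception was raised,
    # so no extra error check is needed at the call site.
    return loc

cdef Py_ssize_t relaxed_decl(object loc) except? -1:
    # ``except? -1``: -1 might be a legitimate result, so Cython additionally
    # calls PyErr_Occurred() after a -1 return to tell the two cases apart.
    return loc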

@jorisvandenbossche (Member)

> But if i = -14 only the strict one will check that i + len(arr) is still negative.

OK, that sounds like a good change!

@jbrockmendel (Member, Author) commented Aug 14, 2018 via email

@jbrockmendel (Member, Author)

Indistinguishable:

Master:

In [2]: idx = pd.Int64Index(range(10**6))

In [3]: ser = pd.Series(idx, index=idx)

In [4]: %timeit idx.get_value(ser, 1000)
10000 loops, best of 3: 26.6 µs per loop

In [5]: %timeit idx.get_value(ser, 1000)
10000 loops, best of 3: 26.7 µs per loop

PR:

In [2]: idx = pd.Int64Index(range(10**6))

In [3]: ser = pd.Series(idx, index=idx)

In [4]: %timeit idx.get_value(ser, 1000)
10000 loops, best of 3: 26.8 µs per loop

In [5]: %timeit idx.get_value(ser, 1000)
10000 loops, best of 3: 25.6 µs per loop

@jorisvandenbossche (Member)

I don't think your example with Int64Index hits the function you modified (note my example used a string index, so it falls back to integer-position indexing), but you can check that by adding a pdb trace or print statement to be sure.

@jorisvandenbossche changed the title from "de-duplicate index validation code" to "CLN: de-duplicate index validation code" on Aug 14, 2018
@jbrockmendel (Member, Author)

In [2]: idx = pd.Index([str(x) for x in range(10**6)])

In [3]: ser = pd.Series(range(len(idx)), index=idx)

In [4]: %timeit idx.get_value(ser, 1000)

Master:

The slowest run took 7246.54 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 37.9 µs per loop

The slowest run took 5.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 19.5 µs per loop

The slowest run took 5.11 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 18.8 µs per loop

The slowest run took 5.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 18.4 µs per loop

The slowest run took 4.27 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop

PR:

The slowest run took 9550.30 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 29.1 µs per loop

The slowest run took 5.77 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 19.1 µs per loop

The slowest run took 5.49 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 19.3 µs per loop

10000 loops, best of 3: 25.7 µs per loop

The slowest run took 4.35 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 18.2 µs per loop

@jreback (Contributor) left a comment:

can u add these as asvs?

@jorisvandenbossche (Member)

@jreback I am not sure that is needed. It should already be covered by other indexing benchmarks that use get_value under the hood.
(I mainly wanted to check this very specific code path as a sanity check, to be sure we are checking the correct thing.)


i = util.validate_indexer(arr, loc)
Member:

Since get_value_at (the util version below, which is called by the get_value_at in this file) now has the same validation, isn't it then unnecessary to call the validation here as well?

Member (Author):

I think you’re right, will update.
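
Roughly, the simplification being agreed on here (a sketch, not the final committed code; util is the cimported helpers module referenced in the diff above):

cdef inline object get_value_at(ndarray arr, object loc):
    # Delegate straight to the util-level helper, which now performs the
    # single strict bounds check, rather than validating a second time here.
    return util.get_value_at(arr, loc)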

@jbrockmendel (Member, Author)

Following this and #22344 it tentatively looks like we'll be ready to get rid of numpy_helper (and chunks of util) altogether.

@jbrockmendel (Member, Author)

@jreback gentle ping. After this we can get rid of a bunch of old numpy_helper code.

@jreback added this to the 0.24.0 milestone on Aug 22, 2018
@jreback merged commit 9346e79 into pandas-dev:master on Aug 22, 2018
@jreback (Contributor) commented Aug 22, 2018

thanks!

yeah lots of PRs!

@jbrockmendel deleted the check_index branch on August 22, 2018 at 13:47
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Labels: Clean, Indexing

3 participants