CLN: handle EAs and fast path (no bounds checking) in safe_sort #25696

jorisvandenbossche · 2019-03-12T20:33:59Z

This is a possible alternative solution to what we have been discussing in #25592.

This moves the logic into safe_sort, by:

- adding a check_outofbounds keyword to disable the extra checks (otherwise the performance benefit of take_1d is lost)
- fixing safe_sort to work for EAs

The check_outofbounds keyword makes it a bit more complicated, but without it we can't benefit from the performance improvement for which take_1d was originally used in factorize.

(Another solution is to decide that this performance improvement is not worth the extra code, and simply use the current safe_sort, fixed to work for EAs, in factorize.)

Still need to add some more tests for the combination of EAs with a custom na_sentinel (a case that is currently broken).
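
For readers following the discussion, the label remapping that safe_sort performs after sorting the unique values can be sketched roughly as below. This is a NumPy-only illustration and not the code in this PR: `remap_labels` and its argument names are hypothetical, and `check_outofbounds` only mirrors the keyword proposed here. It is meant to show why the bounds check is the part a fast path would want to skip.

```python
import numpy as np


def remap_labels(sorter, labels, na_sentinel=-1, check_outofbounds=True):
    """Translate labels that index the unsorted uniques into labels that
    index the sorted uniques (hypothetical helper, for illustration only).

    sorter[i] is the old position of the value that ends up at sorted
    position i, i.e. the result of an argsort on the unique values.
    """
    sorter = np.asarray(sorter, dtype=np.intp)
    labels = np.asarray(labels, dtype=np.intp)
    n = len(sorter)

    # reverse_indexer[old_position] -> new_position
    reverse_indexer = np.empty(n, dtype=np.intp)
    reverse_indexer.put(sorter, np.arange(n))

    # mode="wrap" never raises; whatever it yields for the sentinel or for
    # masked out-of-bounds labels is overwritten below.
    new_labels = reverse_indexer.take(labels, mode="wrap")

    mask = labels == na_sentinel
    if check_outofbounds:
        # These extra comparisons are the cost the unchecked path avoids.
        mask |= (labels < -n) | (labels >= n)
    np.putmask(new_labels, mask, na_sentinel)
    return new_labels


# uniques ["b", "a", "c"] sort to ["a", "b", "c"], so sorter is [1, 0, 2]
print(remap_labels([1, 0, 2], [0, 1, 2, -1, 0]).tolist())  # [1, 0, 2, -1, 1]
print(remap_labels([1, 0, 2], [0, 99, 2, 1], na_sentinel=99).tolist())  # [1, 99, 2, 0]
```

In this sketch, passing `check_outofbounds=False` means an out-of-bounds label is silently wrapped instead of being replaced by the sentinel, so it is only safe when the caller already knows the labels are valid (as factorize does for labels it produced itself).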

pep8speaks · 2019-03-12T20:34:02Z

Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-06 18:52:38 UTC

jreback · 2019-03-12T20:50:29Z

pandas/core/sorting.py

@@ -425,6 +427,10 @@ def safe_sort(values, labels=None, na_sentinel=-1, assume_unique=False):
    assume_unique : bool, default False
        When True, ``values`` are assumed to be unique, which can speed up
        the calculation. Ignored when ``labels`` is None.
+    check_outofbounds : bool, default True


this is not a bad name, but it's not consistent across pandas; we use verify elsewhere.

can you update & add a versionadded tag

jreback · 2019-03-12T20:53:02Z

pandas/core/sorting.py

-    # deal with them here without performance loss using `mode='wrap'`.)
-    new_labels = reverse_indexer.take(labels, mode='wrap')
-    np.putmask(new_labels, mask, na_sentinel)
+    if na_sentinel == -1:


would rather just fix take_1d

codecov · 2019-03-12T21:32:05Z

Codecov Report

Merging #25696 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25696      +/-   ##
==========================================
+ Coverage   91.28%   91.28%   +<.01%     
==========================================
  Files         173      173              
  Lines       52967    52969       +2     
==========================================
+ Hits        48351    48353       +2     
  Misses       4616     4616

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `89.86% <100%> (ø)` | ⬆️ |
| #single | `41.74% <57.89%> (+0.01%)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/sorting.py | `98.36% <100%> (+0.06%)` | ⬆️ |
| pandas/core/algorithms.py | `94.72% <100%> (-0.07%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a8fad16...c6203cb.

codecov · 2019-03-12T21:32:27Z

Codecov Report

Merging #25696 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25696      +/-   ##
==========================================
- Coverage   91.98%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52374    52375       +1     
==========================================
- Hits        48178    48175       -3     
- Misses       4196     4200       +4

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `90.53% <100%> (ø)` | ⬆️ |
| #single | `40.75% <55.55%> (-0.12%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/sorting.py | `98.35% <100%> (+0.06%)` | ⬆️ |
| pandas/core/algorithms.py | `94.73% <100%> (-0.07%)` | ⬇️ |
| pandas/io/gbq.py | `78.94% <0%> (-10.53%)` | ⬇️ |
| pandas/core/frame.py | `96.9% <0%> (-0.12%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7120725...151aa6a.

jreback · 2019-03-20T01:34:19Z

pandas/core/sorting.py

@@ -425,6 +427,10 @@ def safe_sort(values, labels=None, na_sentinel=-1, assume_unique=False):
    assume_unique : bool, default False
        When True, ``values`` are assumed to be unique, which can speed up
        the calculation. Ignored when ``labels`` is None.
+    check_outofbounds : bool, default True


can you update & add a versionadded tag

jreback · 2019-03-20T01:34:32Z

pandas/core/sorting.py

@@ -461,7 +467,8 @@ def sort_mixed(values):
        return np.concatenate([nums, np.asarray(strs, dtype=object)])

    sorter = None
-    if PY3 and lib.infer_dtype(values, skipna=False) == 'mixed-integer':
+    if (PY3 and not isinstance(values, ABCExtensionArray)


hah no more PY3 needed!

jreback · 2019-03-20T01:34:50Z

pandas/core/sorting.py

@@ -461,7 +467,8 @@ def sort_mixed(values):
        return np.concatenate([nums, np.asarray(strs, dtype=object)])

    sorter = None
-    if PY3 and lib.infer_dtype(values, skipna=False) == 'mixed-integer':
+    if (PY3 and not isinstance(values, ABCExtensionArray)


use is_extension_array

jorisvandenbossche · 2019-04-05T07:10:36Z

Coming back to this:

> would rather just fix take_1d

For take_1d itself, I am not sure it makes much sense to support this. Currently, -1 values in the indexer passed to take signal missing values by definition. The only use case for making this configurable would be this one, while it would complicate the interface throughout pandas (e.g. EAs would also need to support it, as take_1d dispatches to them).
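
To make the -1 convention concrete, here is a minimal NumPy sketch. `take_with_fill` is a hypothetical stand-in written for this illustration, not pandas' take_1d; it only shows the behaviour being discussed, where -1 in the indexer means "missing" rather than "last element".

```python
import numpy as np


def take_with_fill(values, indexer, fill_value=np.nan):
    """Take along `indexer`, treating -1 as "missing" (illustrative only)."""
    values = np.asarray(values, dtype=float)
    indexer = np.asarray(indexer, dtype=np.intp)
    result = values.take(indexer)        # -1 picks the last element for now
    result[indexer == -1] = fill_value   # ...and is then overwritten
    return result


values = [10.0, 20.0, 30.0]

# Plain NumPy indexing semantics: -1 is the last element.
print(np.asarray(values).take([2, -1, 0]).tolist())  # [30.0, 30.0, 10.0]

# Fill semantics: -1 marks a missing value.
print(take_with_fill(values, [2, -1, 0]).tolist())   # [30.0, nan, 10.0]
```

Because the meaning of -1 is fixed in this convention, a sentinel other than -1 has to be handled by the caller, which is the situation safe_sort is in when factorize passes a custom na_sentinel.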

jreback · 2019-04-20T16:59:47Z

can you merge master

jorisvandenbossche · 2019-05-06T18:14:43Z

@jreback conflicts resolved, if you can take another look

jreback · 2019-05-07T01:01:55Z

thanks @jorisvandenbossche nice cleanup & tests.

jorisvandenbossche added 6 commits March 7, 2019 15:55

BUG: fix usage of na_sentinel with sort=True in factorize()

7356997

fix dtype

e1ab3a4

Merge remote-tracking branch 'upstream/master' into factorize-na-sent…

a9c880e

…inel

Merge remote-tracking branch 'upstream/master' into factorize-na-sent…

db30797

…inel

Attempt to include it in safe_sort

ba944eb

Merge remote-tracking branch 'upstream/master' into factorize-na-sent…

c6203cb

…inel2

jorisvandenbossche added the Algos and ExtensionArray labels Mar 12, 2019

jorisvandenbossche added this to the 0.25.0 milestone Mar 12, 2019

jorisvandenbossche mentioned this pull request Mar 12, 2019

BUG: fix usage of na_sentinel with sort=True in factorize() #25592

Merged

jreback requested changes Mar 12, 2019

jreback requested changes Mar 20, 2019

jorisvandenbossche added 8 commits April 5, 2019 09:11

Merge remote-tracking branch 'upstream/master' into factorize-na-sent…

d70b447

…inel2

feedback Jeff

fdf330a

add tests for safe_sort

b08ea6d

additional test for other na_sentinel in case of out of bound indices

9de26fc

additional test for EA with custom na_sentinel

bcb8c7e

update factorize test for EAs with custom na_sentinel (which now work…

13f6706

…s) + add whatsnew

Merge remote-tracking branch 'upstream/master' into factorize-na-sent…

8db84e7

…inel2

Merge remote-tracking branch 'upstream/master' into factorize-na-sent…

d0cef9e

…inel2

jorisvandenbossche added 2 commits May 6, 2019 20:12

Merge remote-tracking branch 'upstream/master' into factorize-na-sent…

5157e89

…inel2

linting

e350641

more linting

151aa6a

jreback approved these changes May 7, 2019

jreback merged commit a2686c6 into pandas-dev:master May 7, 2019
