CLN: handle EAs and fast path (no bounds checking) in safe_sort #25696

Merged
jreback merged 17 commits into pandas-dev:master on May 7, 2019

jorisvandenbossche
Member

This is a possible alternative solution to what we have been discussing in #25592.

This moves the logic into safe_sort by:

  • adding a check_outofbounds keyword to disable the extra bounds checks (otherwise the performance benefit of take_1d is lost)
  • fixing safe_sort to work for EAs

The check_outofbounds keyword makes safe_sort a bit more complicated, but without it we can't benefit from the performance improvement for which take_1d was originally used in factorize.

(Another solution is to decide that this performance improvement is not worth the extra code, and to use the current safe_sort, fixed to work for EAs, in factorize.)

I still need to add some more tests for the combination of EAs with a custom na_sentinel (a case that is currently broken).
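
As a rough illustration of the two code paths described above, here is a NumPy-only sketch. The function name `sort_and_remap` is hypothetical and this is not the pandas implementation; it only assumes the behavior discussed in this PR: sort the unique values, remap the labels to the new positions, and optionally treat out-of-bounds labels as missing.

```python
import numpy as np


def sort_and_remap(values, labels, na_sentinel=-1, check_outofbounds=True):
    """Hypothetical sketch of the idea in this PR, not pandas' safe_sort."""
    values = np.asarray(values)
    labels = np.asarray(labels, dtype=np.intp)

    sorter = values.argsort()        # positions that sort `values`
    ordered = values.take(sorter)

    # reverse_indexer[old_position] == new_position after sorting
    reverse_indexer = np.empty(len(sorter), dtype=np.intp)
    reverse_indexer[sorter] = np.arange(len(sorter))

    if check_outofbounds:
        # Slower path: any label outside [0, len(values)) is marked missing.
        mask = (labels < 0) | (labels >= len(values)) | (labels == na_sentinel)
    else:
        # Fast path: trust the caller, only the sentinel marks missing.
        mask = labels == na_sentinel

    # mode="wrap" avoids an IndexError for the sentinel; the wrapped
    # positions are overwritten with the sentinel right afterwards.
    new_labels = reverse_indexer.take(labels, mode="wrap")
    np.putmask(new_labels, mask, na_sentinel)
    return ordered, new_labels


ordered, new_labels = sort_and_remap(["b", "a", "c"], [0, 1, 2, -1, 0])
```

Note that in this sketch the fast path silently wraps labels that are out of bounds, so it is only safe when the caller already guarantees valid labels (as factorize does for its own output).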

@pep8speaks

pep8speaks commented Mar 12, 2019

Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-06 18:52:38 UTC

@jorisvandenbossche added the Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and ExtensionArray (extending pandas with custom dtypes or arrays) labels Mar 12, 2019
@jorisvandenbossche added this to the 0.25.0 milestone Mar 12, 2019
@@ -425,6 +427,10 @@ def safe_sort(values, labels=None, na_sentinel=-1, assume_unique=False):
assume_unique : bool, default False
When True, ``values`` are assumed to be unique, which can speed up
the calculation. Ignored when ``labels`` is None.
check_outofbounds : bool, default True
Contributor

This is not a bad name, but it is not consistent across pandas; we use verify elsewhere.

Contributor

can you update & add a versionadded tag

# deal with them here without performance loss using `mode='wrap'`.)
new_labels = reverse_indexer.take(labels, mode='wrap')
np.putmask(new_labels, mask, na_sentinel)
if na_sentinel == -1:
Contributor

would rather just fix take_1d
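
For context on the `mode='wrap'` call in the hunk above, the following small sketch shows how NumPy's `take` treats out-of-bounds and negative indices under each mode. The array and index values are made up for illustration.

```python
import numpy as np

arr = np.array([10, 20, 30])
idx = np.array([0, 3, -1])  # 3 is out of bounds, -1 is a typical sentinel

wrapped = arr.take(idx, mode="wrap")  # indices are taken modulo len(arr)
clipped = arr.take(idx, mode="clip")  # indices are clamped to [0, len(arr) - 1]

try:
    arr.take(idx, mode="raise")       # default mode: out-of-bounds raises
    raised = False
except IndexError:
    raised = True
```

Because the wrapped positions hold arbitrary values, they have to be overwritten afterwards, which is what the `np.putmask` call in the hunk does.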

@codecov

codecov bot commented Mar 12, 2019

Codecov Report

Merging #25696 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25696      +/-   ##
==========================================
+ Coverage   91.28%   91.28%   +<.01%     
==========================================
  Files         173      173              
  Lines       52967    52969       +2     
==========================================
+ Hits        48351    48353       +2     
  Misses       4616     4616
Flag Coverage Δ
#multiple 89.86% <100%> (ø) ⬆️
#single 41.74% <57.89%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/sorting.py 98.36% <100%> (+0.06%) ⬆️
pandas/core/algorithms.py 94.72% <100%> (-0.07%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a8fad16...c6203cb. Read the comment docs.

@codecov

codecov bot commented Mar 12, 2019

Codecov Report

Merging #25696 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25696      +/-   ##
==========================================
- Coverage   91.98%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52374    52375       +1     
==========================================
- Hits        48178    48175       -3     
- Misses       4196     4200       +4
Flag Coverage Δ
#multiple 90.53% <100%> (ø) ⬆️
#single 40.75% <55.55%> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/sorting.py 98.35% <100%> (+0.06%) ⬆️
pandas/core/algorithms.py 94.73% <100%> (-0.07%) ⬇️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7120725...151aa6a. Read the comment docs.

@@ -461,7 +467,8 @@ def sort_mixed(values):
return np.concatenate([nums, np.asarray(strs, dtype=object)])

sorter = None
-    if PY3 and lib.infer_dtype(values, skipna=False) == 'mixed-integer':
+    if (PY3 and not isinstance(values, ABCExtensionArray)
Contributor

hah no more PY3 needed!

Contributor

use is_extension_array

@jorisvandenbossche
Member Author

Coming back to this:

> would rather just fix take_1d

For take_1d itself, I am not sure it makes a lot of sense to support this. Currently, -1 values in the indexer passed to take signal missing values by definition. The only use case for making this configurable would be this one, while it would complicate the interface throughout pandas (e.g. EAs would also need to support it, as take_1d dispatches to them).
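
To make the convention mentioned above concrete, here is a NumPy-only sketch contrasting plain positional `take`, where -1 means "last element", with a fill-aware take, where -1 means "missing". The helper `take_with_fill` is hypothetical and only mimics the convention; it is not take_1d.

```python
import numpy as np


def take_with_fill(arr, indexer, fill_value=np.nan):
    """Hypothetical helper: -1 in `indexer` marks a missing value."""
    arr = np.asarray(arr, dtype=float)  # float so that NaN can be stored
    indexer = np.asarray(indexer, dtype=np.intp)
    if (indexer < -1).any() or (indexer >= len(arr)).any():
        raise IndexError("indexer out of bounds")
    out = arr.take(indexer)             # -1 temporarily picks the last element
    out[indexer == -1] = fill_value     # then gets replaced by the fill value
    return out


plain = np.asarray([10, 20, 30]).take([0, -1, 2])   # -1 is the last element
filled = take_with_fill([10, 20, 30], [0, -1, 2])   # -1 is missing
```

A configurable sentinel would mean every such take implementation, including those of extension arrays, has to accept and honor an extra parameter, which is the interface cost the comment above refers to.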

@jreback
Contributor

jreback commented Apr 20, 2019

can you merge master

@jorisvandenbossche
Member Author

@jreback conflicts resolved, if you can take another look

@jreback jreback merged commit a2686c6 into pandas-dev:master May 7, 2019
@jreback
Contributor

jreback commented May 7, 2019

thanks @jorisvandenbossche nice cleanup & tests.
