I've made a start with implementations of `topk` and `argtopk`. This'll work uncontroversially for DataArrays with and without NaNs, although I'm getting mostly stuck on what `skipna=True` should entail.

There are a number of choices to make, however, which are probably best illustrated with this PR.

- `nputils.py`: The question that arises here for me: I guess this would work with `cupy` too, but I don't quite see what the best way to integrate it is. Because dask provides a `topk` of its own, it appears a bit exceptional.
- `partition` and `argpartition`: this version differs from `numpy.partition` in its handling of NaNs, however. Using numpy's partition has the benefit that it's consistent with dask.
- `topk` feels mostly similar to `quantile`, since it shortens the dimension but doesn't reduce it entirely. `argmin` also supports an `axis` argument next to `dim` (though the two are mutually exclusive). Is the `axis` argument desirable?
- `variable.py` borrows somewhat from `quantile`, since the result has an axis of size `k` instead of `len(q)`. Unlike `quantile`, dask's `topk` and `argtopk` do not support tuple arguments for the axis (although `topk` accepts one and produces an unexpected result), so part of the stacking and unraveling functionality of `_unravel_argminmax` is required. I've currently duplicated the relevant lines to keep the changes clearly visible.
- `apply_ufuncs`: I can't fully judge whether `dask="allowed"` works gracefully with the dask `topk` and `argtopk` functions; my guess is that it should.
- `quantile` returns a result with a new dimension and coordinate called `quantile`. I've mimicked this: `topk` and `argtopk` return a result with a new `topk` or `argtopk` dimension, respectively. I was thinking no labels are required for `topk`, but since both positive k values (for largest) and negative k values (for smallest) are possible, it's probably smart to return labels `range(0, k)` and `range(-k, 0)`?
- `idxtopk` would make sense too?
- `skipna=False` is giving me some headaches. A (naive) implementation as in this PR is asymmetric. Numpy partition (and thus the dask version too) sorts NaNs towards the end of the array, such that k > 0 will return NaNs, but k < 0 will not. For the testing, I figured `da.topk(k=-1, skipna=False)` should equal `da.min(skipna=False)` and `da.topk(k=1, skipna=False)` should equal `da.max()`, but this isn't the case. k=1 will return a NaN value since numpy partition moves the NaN to the end; k=-1 will not. I currently gravitate towards accepting this asymmetry, since e.g. `np.sort` will also move NaNs to the back and it feels forced to fetch NaNs for k=-1 to match `.min(skipna=False)`. On the other hand, Python's `sorted` behaves differently, according to IEEE 754 NaNs are not orderable ... and I reckon you'd mostly use `skipna=False` when you want to ensure that no NaNs are present?
- `duck_array_ops`: maybe it belongs in `nanops`, as it resembles `_nan_argminmax_object` and `_nan_minmax_object`, but it is again slightly different. But I didn't like the circular imports that this seems to require; in `duck_array_ops` it decides whether to use dask or numpy (via `nputils`), but the masking of NaNs is required for both.