Add k_smallest_relaxed and variants #925
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #925      +/-   ##
==========================================
+ Coverage   94.38%   94.65%   +0.26%
==========================================
  Files          48       49       +1
  Lines        6665     7215     +550
==========================================
+ Hits         6291     6829     +538
- Misses        374      386      +12
```

☔ View full report in Codecov by Sentry.
I did not check the docs yet. I would if we decide to eventually add this.
As you mentioned, it breaks the MSRV a little, so it won't be added right now, but the MSRV should become 1.51 soon enough.
Good first iteration. Thanks for adding tests.
I appreciate the idea here, and it sure makes sense to use select_nth_unstable_by, which I was not familiar with.
The complexity sure seems interesting. However, I wonder what the performance difference with the un-relaxed versions is in practice. A comparative benchmark might be valuable; we don't have one yet, so I think this might help if you want to add one.
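For concreteness, such a comparison might look like the following Criterion sketch (assuming a `criterion` dev-dependency and this PR's `k_smallest_relaxed`; the input data and the value of k are made up for illustration):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use itertools::Itertools;

fn bench_k_smallest(c: &mut Criterion) {
    // Pseudo-shuffled input; the size and k are arbitrary choices.
    let data: Vec<u64> = (0..100_000u64)
        .map(|i| i.wrapping_mul(2_654_435_761) % 100_000)
        .collect();

    c.bench_function("k_smallest (heap, k = 100)", |b| {
        b.iter(|| black_box(data.iter().copied().k_smallest(100).collect::<Vec<_>>()))
    });
    c.bench_function("k_smallest_relaxed (2k buffer, k = 100)", |b| {
        b.iter(|| black_box(data.iter().copied().k_smallest_relaxed(100).collect::<Vec<_>>()))
    });
}

criterion_group!(benches, bench_k_smallest);
criterion_main!(benches);
```

Running both cases on the same input for a few combinations of n and k would show how much the linear-time selection buys over the heap-based approach in practice.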
I note that we would have 12 methods, `k_(smallest|largest)[_relaxed][_by[_key]]`, which is a lot!
I agree, and if pressed to make a choice, I would argue that the relaxed version would be a reasonable general choice, i.e. it could completely replace the heap-based versions. However, making this choice does not appear obviously better than having twelve methods. Alternatively, the combinatorics could be reined in by adding a "strategy-like" type parameter. Finally, I am not sure the
@phimuemue @jswrenn I don't know if you would agree on the current PR, but having 12 similar methods seems a bit much to me.
Implementation draft:

```rust
trait Itertools: Iterator {
    fn top_k(self, k: usize) -> Top<Self> { ... }

    #[deprecate...]
    fn k_smallest(self, k: usize) -> ... { self.top_k(k).smallest() }
}

pub struct GeneralTop<I, A> {
    iter: I,
    k: usize,
    marker: PhantomData<A>,
}

pub type Top<I> = GeneralTop<I, Unrelaxed>;
pub type TopRelaxed<I> = GeneralTop<I, Relaxed>;

impl<I: Iterator> Top<I> {
    pub fn relaxed(self) -> TopRelaxed<I> { ... }
}

impl<I: Iterator, A> GeneralTop<I, A> {
    pub fn smallest(self) -> ...
    pub fn smallest_by<F: FnMut...>(self, f: F) -> ...
    pub fn smallest_by_key<F: FnMut...>(self, key: F) -> ...
    pub fn largest(self) -> ...
    pub fn largest_by<F: FnMut...>(self, f: F) -> ...
    pub fn largest_by_key<F: FnMut...>(self, key: F) -> ...
}
```

The relaxed version would be added once the MSRV is high enough. "Minimal allocation" vs. "linear complexity" are both valuable trade-offs here, but maybe it would be nice not to multiply methods.

EDIT: There are also the alternatives @adamreichold described above.
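For illustration, call sites under such a builder-style API might look like the sketch below; note that neither `top_k` nor `relaxed` exist in itertools, and the return types are deliberately left open in the draft above:

```rust
// Hypothetical usage of the sketched builder-style API; these methods do not
// exist in itertools and only illustrate how a strategy type parameter would
// keep the method count down.
let smallest: Vec<_> = (0..1_000).top_k(5).smallest().collect();
let smallest_relaxed: Vec<_> = (0..1_000).top_k(5).relaxed().smallest().collect();
```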
Hi there, nice algorithm. I'm unsure about what to do best. The algorithm also begs the question of the scaling factor: could a capacity of 3k or 4k be even more performant? As for the 12 methods: I think that, if we decide to keep all of them, we should just offer the 12 methods directly. Sidenote: what's the unstable/stable behavior of the current implementation?
It is unstable, at least because it calls into the unstable selection/sorting routines from the standard library.
The original issue discussed this, and I think I agree with the assessment there: further increasing the capacity is not too interesting, as it will only change the constant factor attached to the linear running time.
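To make that constant factor concrete, here is a minimal sketch of the 2k-buffer approach (illustrative only, not this PR's actual implementation; the function name is made up):

```rust
// Minimal sketch of the relaxed/linear-time idea, assuming `T: Ord`.
// The buffer holds at most 2 * k elements; whenever it fills up,
// `select_nth_unstable` moves the k smallest elements into the front half
// and the rest is truncated away. Every element is pushed once and takes
// part in O(1) selection work on average, so the total cost is linear in
// the input length; a larger capacity (3k, 4k, ...) only changes the
// constant factor of that linear term.
fn k_smallest_relaxed_sketch<T: Ord>(iter: impl IntoIterator<Item = T>, k: usize) -> Vec<T> {
    if k == 0 {
        return Vec::new();
    }
    let mut buf = Vec::with_capacity(2 * k);
    for item in iter {
        buf.push(item);
        if buf.len() == 2 * k {
            // Partition so that the k smallest elements occupy buf[..k].
            buf.select_nth_unstable(k - 1);
            buf.truncate(k);
        }
    }
    buf.sort_unstable();
    buf.truncate(k);
    buf
}
```

Choosing a capacity of 3k or 4k would only change how often the selection pass runs per element, not the overall O(n) behavior.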
OK with having 12 methods. The scaling factor of 2 is fine. TODO:
Note that I am very much not invested in that name and made it up on the spot when I had to name the methods somehow to avoid conflicting with the existing ones. I think the only requirement is that the names should somehow reflect the space-time trade-off.
I've been thinking about changing "relaxed", but the only alternative I found is "linear", which I kinda find worse (comparing the docs is necessary to understand them anyway), so I'm fine with "relaxed".
I finally reviewed the docs (one small change to do).
Then I would be fine with merging this once our MSRV is high enough (probably later this year).
Thanks!
Now we wait.
This implements the algorithm described in [1], which consumes twice the amount of memory as the existing `k_smallest` algorithm but achieves linear time in the number of elements in the input.

[1] https://quickwit.io/blog/top-k-complexity
@adamreichold The long wait has ended, thanks again for this!
This implements the algorithm described in https://quickwit.io/blog/top-k-complexity, which consumes twice the amount of memory as the existing `k_smallest` algorithm but achieves linear time in the number of elements in the input.

I expect this to fail the MSRV job, as `select_nth_unstable` was stabilized in 1.49 and is therefore not available in 1.43.1. I decided to propose it anyway, for the time when an MSRV increase is planned or in case it is deemed sufficiently useful as a reason for one.
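For reference, a minimal example of what `select_nth_unstable` guarantees, which is also why the resulting order is unstable with respect to equal elements (the values here are made up):

```rust
fn main() {
    let mut v = [5, 1, 4, 1, 3, 2, 9];
    // Places the element with sorted index 2 at position 2; everything before
    // it is <= and everything after is >=, but within those halves the
    // relative order of equal elements is not preserved (hence "unstable").
    let (smaller, third, larger) = v.select_nth_unstable(2);
    assert_eq!(*third, 2);
    assert!(smaller.iter().all(|&x| x <= 2));
    assert!(larger.iter().all(|&x| x >= 2));
}
```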