Add k_smallest_relaxed and variants #925
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #925      +/-   ##
==========================================
+ Coverage   94.38%   94.65%   +0.26%
==========================================
  Files          48       49       +1
  Lines        6665     7215     +550
==========================================
+ Hits         6291     6829     +538
- Misses        374      386      +12
```

☔ View full report in Codecov by Sentry.
I did not check the docs yet. I would if we decide to eventually add this.
As you mentioned, it breaks the MSRV a little, so it won't be added right now, but the MSRV should become 1.51 soon enough.
Good first iteration. Thanks for adding tests.
I appreciate the idea here, and it sure makes sense to use select_nth_unstable_by, which I was not familiar with.
The complexity sure seems interesting. However, I wonder what the performance difference with the un-relaxed versions is in practice. A comparative benchmark might be valuable; we don't have one yet, so I think this might help if you want to add one.
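For concreteness, such a comparison might look like the following Criterion sketch (assuming a `criterion` dev-dependency and this PR's `k_smallest_relaxed`; the input data and the value of k are made up for illustration):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use itertools::Itertools;

fn bench_k_smallest(c: &mut Criterion) {
    // Pseudo-shuffled input; the size and k are arbitrary choices.
    let data: Vec<u64> = (0..100_000u64)
        .map(|i| i.wrapping_mul(2_654_435_761) % 100_000)
        .collect();

    c.bench_function("k_smallest (heap, k = 100)", |b| {
        b.iter(|| black_box(data.iter().copied().k_smallest(100).collect::<Vec<_>>()))
    });
    c.bench_function("k_smallest_relaxed (2k buffer, k = 100)", |b| {
        b.iter(|| black_box(data.iter().copied().k_smallest_relaxed(100).collect::<Vec<_>>()))
    });
}

criterion_group!(benches, bench_k_smallest);
criterion_main!(benches);
```

Running both cases on the same input for a few combinations of n and k would show how much the linear-time selection buys over the heap-based approach in practice.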
I note that we would have 12 methods, `k_(smallest|largest)[_relaxed][_by[_key]]`, which is a lot!
I agree, and if pressed to make a choice, I would argue that the relaxed version would be a reasonable general choice, i.e. it could completely replace the heap-based versions. However, making this choice does not appear obviously better than having twelve methods. Alternatively, the combinatorics could be reined in by adding a "strategy-like" type parameter. Finally, I am not sure the
@phimuemue @jswrenn I don't know if you would agree on the current PR, but having 12 similar methods seems a bit much to me.
Implementation draft:

```rust
trait Itertools: Iterator {
    fn top_k(self, k: usize) -> Top<Self> { ... }

    #[deprecate...]
    fn k_smallest(self, k: usize) -> ... { self.top_k(k).smallest() }
}

pub struct GeneralTop<I, A> {
    iter: I,
    k: usize,
    marker: PhantomData<A>,
}

pub type Top<I> = GeneralTop<I, Unrelaxed>;
pub type TopRelaxed<I> = GeneralTop<I, Relaxed>;

impl<I: Iterator> Top<I> {
    pub fn relaxed(self) -> TopRelaxed<I> { ... }
}

impl<I: Iterator, A> GeneralTop<I, A> {
    pub fn smallest(self) -> ...
    pub fn smallest_by<F: FnMut...>(self, f: F) -> ...
    pub fn smallest_by_key<F: FnMut...>(self, key: F) -> ...
    pub fn largest(self) -> ...
    pub fn largest_by<F: FnMut...>(self, f: F) -> ...
    pub fn largest_by_key<F: FnMut...>(self, key: F) -> ...
}
```

The relaxed version would be added once the MSRV is high enough. "Minimal allocation" vs. "linear complexity" are both valuable trade-offs here, but maybe it would be nice not to multiply methods.

EDIT: There are also the alternatives @adamreichold described above.
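For illustration, call sites under such a builder-style API might look like the sketch below; note that neither `top_k` nor `relaxed` exist in itertools, and the return types are deliberately left open in the draft above:

```rust
// Hypothetical usage of the sketched builder-style API; these methods do not
// exist in itertools and only illustrate how a strategy type parameter would
// keep the method count down.
let smallest: Vec<_> = (0..1_000).top_k(5).smallest().collect();
let smallest_relaxed: Vec<_> = (0..1_000).top_k(5).relaxed().smallest().collect();
```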
Hi there, nice algorithm. I'm unsure about what to do best. The algorithm also begs the question of the scaling factor: could a capacity of 3k or 4k be even more performant? As for the 12 methods: I think that, if we decide to keep all of them, we should just offer the 12 methods directly. Sidenote: what's the unstable/stable behavior of the current implementation?
It is unstable, at least because it calls into the unstable selection/sorting routines from the standard library.
The original issue discussed this, and I think I agree with the assessment there: further increasing the capacity is not too interesting, as it will only change the constant factor attached to the linear running time.
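To make that constant factor concrete, here is a minimal sketch of the 2k-buffer approach (illustrative only, not this PR's actual implementation; the function name is made up):

```rust
// Minimal sketch of the relaxed/linear-time idea, assuming `T: Ord`.
// The buffer holds at most 2 * k elements; whenever it fills up,
// `select_nth_unstable` moves the k smallest elements into the front half
// and the rest is truncated away. Every element is pushed once and takes
// part in O(1) selection work on average, so the total cost is linear in
// the input length; a larger capacity (3k, 4k, ...) only changes the
// constant factor of that linear term.
fn k_smallest_relaxed_sketch<T: Ord>(iter: impl IntoIterator<Item = T>, k: usize) -> Vec<T> {
    if k == 0 {
        return Vec::new();
    }
    let mut buf = Vec::with_capacity(2 * k);
    for item in iter {
        buf.push(item);
        if buf.len() == 2 * k {
            // Partition so that the k smallest elements occupy buf[..k].
            buf.select_nth_unstable(k - 1);
            buf.truncate(k);
        }
    }
    buf.sort_unstable();
    buf.truncate(k);
    buf
}
```

Choosing a capacity of 3k or 4k would only change how often the selection pass runs per element, not the overall O(n) behavior.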
OK with having 12 methods. The scaling factor of 2 is fine. TODO:
Note that I am very much not invested in that name and made it up on the spot when I had to name the methods somehow to avoid conflicting with the existing ones. I think the only requirement is that the names should somehow reflect the space-time trade-off.
I've been thinking about changing "relaxed", but the only alternative I found is "linear", which I kinda find worse (comparing the docs is necessary to understand them anyway), so I'm fine with "relaxed".
I finally reviewed the docs (one small change to do).
Then I would be fine with merging this once our MSRV is high enough (probably later this year).
Thanks!
Now we wait.
This implements the algorithm described in [1], which consumes twice the amount of memory as the existing `k_smallest` algorithm but achieves linear time in the number of elements in the input.

[1] https://quickwit.io/blog/top-k-complexity
@adamreichold The long wait has ended, thanks again for this!
This implements the algorithm described in https://quickwit.io/blog/top-k-complexity, which consumes twice the amount of memory as the existing `k_smallest` algorithm but achieves linear time in the number of elements in the input.

I expect this to fail the MSRV job, as `select_nth_unstable` was stabilized in 1.49 and is therefore not available in 1.43.1. I decided to propose it anyway, for the time when an MSRV increase is planned or in case it is deemed sufficiently useful as a reason for one.
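For reference, a minimal example of what `select_nth_unstable` guarantees, which is also why the resulting order is unstable with respect to equal elements (the values here are made up):

```rust
fn main() {
    let mut v = [5, 1, 4, 1, 3, 2, 9];
    // Places the element with sorted index 2 at position 2; everything before
    // it is <= and everything after is >=, but within those halves the
    // relative order of equal elements is not preserved (hence "unstable").
    let (smaller, third, larger) = v.select_nth_unstable(2);
    assert_eq!(*third, 2);
    assert!(smaller.iter().all(|&x| x <= 2));
    assert!(larger.iter().all(|&x| x >= 2));
}
```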