-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the ability to set the number of hits to track accurately #36357
Conversation
Pinging @elastic/es-search |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments but like it in general.
[[search-request-track-total-hits]] | ||
=== Track total hits | ||
|
||
The total hit count can't be computed accurately without visiting all matches, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add "generally" since match_all and term
queries without deletions can still get their hit count computed for free
which is costly for queries that match lots of documents. The `track_total_hits` | ||
parameter allows you to control how the total number of hits should be tracked. | ||
When set to `true` the search response will track the number of hits that match | ||
the query accurately: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe be explicit and say that total.relation will always be equal to eq
when track_total_hits is set to true?
{ | ||
"track_total_hits": true, | ||
"query" : { | ||
"match_all" : {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this example might be a bit confusing since the hit count of match_all can be computed easily. Use eg. a range query instead?
<1> The total number of hits is unknown. | ||
|
||
Given that it is often enough to have a lower bound of the number of hits, | ||
such as "there are more than 1000 hits", it is also possible to set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe rephrase to "there are 1000 hits or more" since the lower bound is inclusive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or "at least"
if (totalHits < trackTotalHitsUpTo) { | ||
return new TotalHits(totalHits, totalHitsRelation); | ||
} else { | ||
return new TotalHits(trackTotalHitsUpTo, Relation.GREATER_THAN_OR_EQUAL_TO); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a comment: Lucene can give us more information but it seems that we never want to return accurate hit counts if the count is greater than the upTo
so that users do not get used to accurate hit counts when they didn't actually ask for accuracy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
} | ||
if (restRequest.paramAsBoolean(TOTAL_HITS_AS_INT_PARAM, false)) { | ||
throw new IllegalArgumentException("[" + TOTAL_HITS_AS_INT_PARAM + "] cannot be used " + | ||
"if the tracking of total hits is not accurate (true) or disabled (false), got " + trackTotalHitsUpTo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the message right? "not accurate" happens when the parameter is an integer value rather than true
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message is completely off, thanks. I pushed a fix
boolean hasFilterCollector) throws IOException { | ||
super(reader, query, sortAndFormats, scrollContext.lastEmittedDoc, numHits, trackMaxScore, | ||
trackTotalHits, hasFilterCollector); | ||
SearchContext.TRACK_TOTAL_HITS_ACCURATE, hasFilterCollector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why accurate? we don't need hit counts since we already have it via numHits
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the logic is wrong here, I reverted it to disable the tracking if the scroll context already contains the total hits:
331d8fe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jpountz, I pushed some commits to address your comments.
63c233e
to
c840019
Compare
In Lucene 8 searches can skip non-competitive hits if the total hit count is not requested. It is also possible to track the number of hits up to a certain threshold. This is a trade off to speed up searches while still being able to know a lower bound of the total hit count. This change adds the ability to set this threshold directly in the track_total_hits search option. A boolean value (true, false) indicates whether the total hit count should be tracked in the response. When set as an integer this option allows to compute a lower bound of the total hits while preserving the ability to skip non-competitive hits when enough matches have been collected. Relates elastic#33028
The merge sort of sorted segments can produce an invalid sort if the sort field is an Integer/Long that uses reverse order and contains values equal to Integer/Long#MIN_VALUE. These values are always sorted first during a merge (instead of last because of the reverse order) due to this bug. Indices affected by the bug can be detected by running the CheckIndex command on a distribution that contains the fix (7.6+).
c840019
to
6287563
Compare
run gradle build tests 1 |
run gradle build tests 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
of the total number of hits that match the query. | ||
(see <<index-modules-index-sorting,_Index Sorting_>> for more details). | ||
Defaults to true. | ||
It also accepts an integer which in this case represents the number of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi , this is problematic in strongly-typed languages, since the track_total_hits
parameter is defined as a boolean
, and eg. the Go test suite fails with cannot use 4 (type int) as type bool in field value
:
elasticsearch/rest-api-spec/src/main/resources/rest-api-spec/test/search/220_total_hits_object.yml
Line 117 in b34e7d4
track_total_hits: 4 |
I understand the motivation here, but we should at least revisit the parameter definitions, and support something like "type" : ["boolean","number]"
, so the code generators can make decisions here.
In Lucene 8 searches can skip non-competitive hits if the total hit count is not requested.
It is also possible to track the number of hits up to a certain threshold. This is a trade off to speed up searches while still being able to know a lower bound of the total hit count. This change adds the ability to set this threshold directly in the track_total_hits search option. A boolean value (true, false) indicates whether the total hit count should be tracked in the response. When set as an integer this option allows to compute a lower bound of the total hits while preserving the ability to skip non-competitive hits when enough matches have been collected.
Relates #33028