Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the ability to set the number of hits to track accurately #36357

Merged
merged 14 commits into from
Jan 4, 2019

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Dec 7, 2018

In Lucene 8 searches can skip non-competitive hits if the total hit count is not requested.
It is also possible to track the number of hits up to a certain threshold. This is a trade off to speed up searches while still being able to know a lower bound of the total hit count. This change adds the ability to set this threshold directly in the track_total_hits search option. A boolean value (true, false) indicates whether the total hit count should be tracked in the response. When set as an integer this option allows to compute a lower bound of the total hits while preserving the ability to skip non-competitive hits when enough matches have been collected.

Relates #33028

@jimczi jimczi added >feature :Search/Search Search-related issues that do not fall into other categories v7.0.0 labels Dec 7, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments but like it in general.

[[search-request-track-total-hits]]
=== Track total hits

The total hit count can't be computed accurately without visiting all matches,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add "generally" since match_all and term queries without deletions can still get their hit count computed for free

which is costly for queries that match lots of documents. The `track_total_hits`
parameter allows you to control how the total number of hits should be tracked.
When set to `true` the search response will track the number of hits that match
the query accurately:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe be explicit and say that total.relation will always be equal to eq when track_total_hits is set to true?

{
"track_total_hits": true,
"query" : {
"match_all" : {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example might be a bit confusing since the hit count of match_all can be computed easily. Use eg. a range query instead?

<1> The total number of hits is unknown.

Given that it is often enough to have a lower bound of the number of hits,
such as "there are more than 1000 hits", it is also possible to set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rephrase to "there are 1000 hits or more" since the lower bound is inclusive

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or "at least"

if (totalHits < trackTotalHitsUpTo) {
return new TotalHits(totalHits, totalHitsRelation);
} else {
return new TotalHits(trackTotalHitsUpTo, Relation.GREATER_THAN_OR_EQUAL_TO);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a comment: Lucene can give us more information but it seems that we never want to return accurate hit counts if the count is greater than the upTo so that users do not get used to accurate hit counts when they didn't actually ask for accuracy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

}
if (restRequest.paramAsBoolean(TOTAL_HITS_AS_INT_PARAM, false)) {
throw new IllegalArgumentException("[" + TOTAL_HITS_AS_INT_PARAM + "] cannot be used " +
"if the tracking of total hits is not accurate (true) or disabled (false), got " + trackTotalHitsUpTo);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the message right? "not accurate" happens when the parameter is an integer value rather than true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message is completely off, thanks. I pushed a fix

boolean hasFilterCollector) throws IOException {
super(reader, query, sortAndFormats, scrollContext.lastEmittedDoc, numHits, trackMaxScore,
trackTotalHits, hasFilterCollector);
SearchContext.TRACK_TOTAL_HITS_ACCURATE, hasFilterCollector);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why accurate? we don't need hit counts since we already have it via numHits?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the logic is wrong here, I reverted it to disable the tracking if the scroll context already contains the total hits:
331d8fe

Copy link
Contributor Author

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jpountz, I pushed some commits to address your comments.

@jimczi jimczi force-pushed the track_total_hits_lower_bound branch from 63c233e to c840019 Compare December 9, 2018 19:12
In Lucene 8 searches can skip non-competitive hits if the total hit count is not requested.
It is also possible to track the number of hits up to a certain threshold. This is a trade off to speed up searches while still being able to know a lower bound of the total hit count. This change adds the ability to set this threshold directly in the track_total_hits search option. A boolean value (true, false) indicates whether the total hit count should be tracked in the response. When set as an integer this option allows to compute a lower bound of the total hits while preserving the ability to skip non-competitive hits when enough matches have been collected.

Relates elastic#33028
The merge sort of sorted segments can produce an invalid
sort if the sort field is an Integer/Long that uses reverse order and contains values equal to
Integer/Long#MIN_VALUE. These values are always sorted first during a merge
(instead of last because of the reverse order) due to this bug.
Indices affected by the bug can be detected by running the CheckIndex command on a
distribution that contains the fix (7.6+).
@jimczi jimczi force-pushed the track_total_hits_lower_bound branch from c840019 to 6287563 Compare December 9, 2018 19:55
@jimczi
Copy link
Contributor Author

jimczi commented Dec 26, 2018

run gradle build tests 1

@jimczi
Copy link
Contributor Author

jimczi commented Dec 26, 2018

run gradle build tests 2

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jimczi jimczi merged commit e38cf1d into elastic:master Jan 4, 2019
@jimczi jimczi deleted the track_total_hits_lower_bound branch January 4, 2019 19:36
of the total number of hits that match the query.
(see <<index-modules-index-sorting,_Index Sorting_>> for more details).
Defaults to true.
It also accepts an integer which in this case represents the number of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi , this is problematic in strongly-typed languages, since the track_total_hits parameter is defined as a boolean, and eg. the Go test suite fails with cannot use 4 (type int) as type bool in field value:

I understand the motivation here, but we should at least revisit the parameter definitions, and support something like "type" : ["boolean","number]", so the code generators can make decisions here.

@colings86 colings86 removed the v7.0.0 label Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search/Search Search-related issues that do not fall into other categories v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants