
Added max/min Scores to Percentage Heuristic #26027

Closed · wants to merge 1 commit

Conversation

@nirmalc (Contributor) commented Aug 2, 2017

Adds two params to the Percentage heuristic: min_score and max_score.

Adding minScore and maxScore to the Percentage heuristic allows interesting aggregations, like exclusive terms. Might be a solution for #23818.
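For context, a minimal conceptual sketch of what the proposal amounts to (not the PR's actual diff; the "drop terms outside the window" behavior and all names here are assumptions). The percentage heuristic scores a term as the number of foreground documents containing it divided by the number of background documents containing it, and min_score/max_score would keep only terms whose score falls inside the window, so a min_score of 1.0 would keep only terms whose background occurrences all lie in the foreground ("exclusive" terms).

```java
// Conceptual sketch of the proposal (not the actual PR code).
public class ClampedPercentageSketch {

    // Percentage heuristic: foreground docs with the term / background docs with the term.
    static double percentageScore(long subsetFreq, long supersetFreq) {
        return supersetFreq == 0 ? 0.0 : (double) subsetFreq / supersetFreq;
    }

    /** Score if it lies inside [minScore, maxScore], otherwise 0 so the term is dropped. */
    static double clampedScore(long subsetFreq, long supersetFreq,
                               double minScore, double maxScore) {
        double score = percentageScore(subsetFreq, supersetFreq);
        return (score >= minScore && score <= maxScore) ? score : 0.0;
    }

    public static void main(String[] args) {
        // Exclusive term: all 10 of its occurrences are in the foreground set.
        System.out.println(clampedScore(10, 10, 1.0, 1.0));  // 1.0 -> kept
        // Non-exclusive term: 10 of its 100 occurrences are in the foreground.
        System.out.println(clampedScore(10, 100, 1.0, 1.0)); // 0.0 -> dropped
    }
}
```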

@elasticmachine (Collaborator)

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@rjernst rjernst added the review label Aug 2, 2017
@rjernst rjernst requested a review from markharwood August 2, 2017 16:35
@rjernst (Member) commented Aug 2, 2017

@markharwood Can you take a look at this? I don't know a lot about the percentage heuristic, but this seems pretty hacky to me. I would prefer a more composable solution.

@markharwood (Contributor) commented Aug 2, 2017

From a quick look at the code and the motivation behind adding it (finding distinct terms), I'm not sure it will work as intended. The mistake is in assuming that there is a single shard/index in play and that the heuristic sees all the data. In a distributed system the significance heuristic is invoked at shard level and then again at the reducer node.
Even if we could bypass the min/max constraints at shard level, this sort of "distinct term" calculation typically requires a global view of the data, and we can't assume it is practical to stream all stats for all terms back to a central reducer node when we have distributed indices and high-cardinality fields.

I can't see a way around that limitation so I'm reluctant to accept this as a solution to the problem as it stands.
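To make the shard-then-reduce point concrete, here is a toy simulation (an illustrative sketch, not Elasticsearch internals; the class and method names are made up). It assumes, as described above, that each shard scores and filters candidate terms using only its local counts, and that only the surviving terms' stats reach the reducer, so a term can still look "exclusive" after the reduce even though another shard holds background-only occurrences of it.

```java
import java.util.*;

// Toy two-phase simulation of shard-level scoring followed by a reduce.
public class ShardReduceSketch {

    record Stats(long subsetFreq, long supersetFreq) {}

    static double score(Stats s) {
        return s.supersetFreq() == 0 ? 0.0 : (double) s.subsetFreq() / s.supersetFreq();
    }

    public static void main(String[] args) {
        // Term "x": 5 foreground hits on shard A, but 7 background-only hits on shard B.
        Map<String, Stats> shardA = Map.of("x", new Stats(5, 5));
        Map<String, Stats> shardB = Map.of("x", new Stats(0, 7));

        double minScore = 1.0; // hypothetical "exclusive terms" setting from the PR

        // Shard phase: each shard only returns terms it saw in its foreground set
        // and that pass the clamp. Shard B returns nothing for "x".
        Map<String, Stats> merged = new HashMap<>();
        for (Map<String, Stats> shard : List.of(shardA, shardB)) {
            for (var e : shard.entrySet()) {
                if (e.getValue().subsetFreq() > 0 && score(e.getValue()) >= minScore) {
                    merged.merge(e.getKey(), e.getValue(), (a, b) ->
                            new Stats(a.subsetFreq() + b.subsetFreq(),
                                      a.supersetFreq() + b.supersetFreq()));
                }
            }
        }

        // Reduce phase: "x" still scores 1.0 because shard B's background-only
        // stats were never shipped, even though its global score is 5/12.
        System.out.println(score(merged.get("x"))); // prints 1.0
    }
}
```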

@nirmalc (Contributor, Author) commented Aug 2, 2017

Thanks for taking a look. Yes, I couldn't think of an inexpensive way to be truly distinct across the cluster, so this is more or less "approximately between x% and y%" for the percentage heuristic.

@markharwood (Contributor)

> "approximately between x% and y%" for the percentage heuristic

For your use case, and on time-based indices, the error margin is unbounded. Consider a query for new IPs today across 7 single-sharded daily indices. Six of the indices have an empty foreground set (they do not hold today's content), so they supply no background stats for any IPs because there are no hits. In contrast, the index for today has a foreground set that is exactly the same as its background set, and it will score every IP a perfect "1" without any consideration of the other 6 days.
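Putting rough numbers to that 7-day scenario (same toy assumptions as the sketch above): the six older indices return no candidate IPs at all, while on today's index every IP's foreground count equals its local background count.

```java
// Toy illustration of the unbounded error described above (not Elasticsearch code).
public class DailyIndexSketch {

    // Percentage heuristic: foreground docs with the term / background docs with the term.
    static double percentage(long subsetFreq, long supersetFreq) {
        return supersetFreq == 0 ? 0.0 : (double) subsetFreq / supersetFreq;
    }

    public static void main(String[] args) {
        // Today's index: an IP seen 42 times today, all 42 of them in the foreground set.
        System.out.println(percentage(42, 42)); // 1.0: a "perfect" score, even if the
        // same IP appeared thousands of times on days 1-6, because those indices
        // contribute no stats when their foreground set is empty.
    }
}
```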

@nirmalc nirmalc closed this Aug 8, 2017