
Added max/min Scores to Percentage Heuristic #26027

Closed · wants to merge 1 commit

Conversation

@nirmalc (Contributor) commented Aug 2, 2017

Adds two params to the Percentage heuristic: min_score and max_score.

Adding minScore and maxScore to the Percentage heuristic allows interesting aggregations, like exclusive terms. Might be a solution for #23818.
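For context, a minimal conceptual sketch of what the proposal amounts to (not the PR's actual diff; the "drop terms outside the window" behavior and all names here are assumptions). The percentage heuristic scores a term as the number of foreground documents containing it divided by the number of background documents containing it, and min_score/max_score would keep only terms whose score falls inside the window, so a min_score of 1.0 would keep only terms whose background occurrences all lie in the foreground ("exclusive" terms).

```java
// Conceptual sketch of the proposal (not the actual PR code).
public class ClampedPercentageSketch {

    // Percentage heuristic: foreground docs with the term / background docs with the term.
    static double percentageScore(long subsetFreq, long supersetFreq) {
        return supersetFreq == 0 ? 0.0 : (double) subsetFreq / supersetFreq;
    }

    /** Score if it lies inside [minScore, maxScore], otherwise 0 so the term is dropped. */
    static double clampedScore(long subsetFreq, long supersetFreq,
                               double minScore, double maxScore) {
        double score = percentageScore(subsetFreq, supersetFreq);
        return (score >= minScore && score <= maxScore) ? score : 0.0;
    }

    public static void main(String[] args) {
        // Exclusive term: all 10 of its occurrences are in the foreground set.
        System.out.println(clampedScore(10, 10, 1.0, 1.0));  // 1.0 -> kept
        // Non-exclusive term: 10 of its 100 occurrences are in the foreground.
        System.out.println(clampedScore(10, 100, 1.0, 1.0)); // 0.0 -> dropped
    }
}
```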

@elasticmachine (Collaborator)

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@rjernst rjernst added the review label Aug 2, 2017
@rjernst rjernst requested a review from markharwood August 2, 2017 16:35
@rjernst (Member) commented Aug 2, 2017

@markharwood Can you take a look at this? I don't know a lot about the percentage heuristic, but this seems pretty hacky to me. I would prefer a more composable solution.

@markharwood (Contributor) commented Aug 2, 2017

From a quick look at the code and the motivation behind adding it (finding distinct terms), I'm not sure it will work as intended. The mistake is in assuming that there is a single shard/index in play and that the heuristic sees all the data. In a distributed system the significance heuristic is invoked at shard level and then again at the reducer node.
Even if we could bypass the min/max constraints at shard level, this sort of "distinct term" calculation typically requires a global view of the data, and we can't assume it is practical to stream all stats for all terms back to a central reducer node when we have distributed indices and high-cardinality fields.

I can't see a way around that limitation so I'm reluctant to accept this as a solution to the problem as it stands.
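To make the shard-then-reduce point concrete, here is a toy simulation (an illustrative sketch, not Elasticsearch internals; the class and method names are made up). It assumes, as described above, that each shard scores and filters candidate terms using only its local counts, and that only the surviving terms' stats reach the reducer, so a term can still look "exclusive" after the reduce even though another shard holds background-only occurrences of it.

```java
import java.util.*;

// Toy two-phase simulation of shard-level scoring followed by a reduce.
public class ShardReduceSketch {

    record Stats(long subsetFreq, long supersetFreq) {}

    static double score(Stats s) {
        return s.supersetFreq() == 0 ? 0.0 : (double) s.subsetFreq() / s.supersetFreq();
    }

    public static void main(String[] args) {
        // Term "x": 5 foreground hits on shard A, but 7 background-only hits on shard B.
        Map<String, Stats> shardA = Map.of("x", new Stats(5, 5));
        Map<String, Stats> shardB = Map.of("x", new Stats(0, 7));

        double minScore = 1.0; // hypothetical "exclusive terms" setting from the PR

        // Shard phase: each shard only returns terms it saw in its foreground set
        // and that pass the clamp. Shard B returns nothing for "x".
        Map<String, Stats> merged = new HashMap<>();
        for (Map<String, Stats> shard : List.of(shardA, shardB)) {
            for (var e : shard.entrySet()) {
                if (e.getValue().subsetFreq() > 0 && score(e.getValue()) >= minScore) {
                    merged.merge(e.getKey(), e.getValue(), (a, b) ->
                            new Stats(a.subsetFreq() + b.subsetFreq(),
                                      a.supersetFreq() + b.supersetFreq()));
                }
            }
        }

        // Reduce phase: "x" still scores 1.0 because shard B's background-only
        // stats were never shipped, even though its global score is 5/12.
        System.out.println(score(merged.get("x"))); // prints 1.0
    }
}
```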

@nirmalc (Contributor, Author) commented Aug 2, 2017

Thanks for taking a look. Yes, I couldn't think of an inexpensive way to be truly distinct across the cluster, so this is more or less "approximately between x% and y%" for the percentage heuristic.

@markharwood (Contributor)

> "approximately between x% and y%" for the percentage heuristic

For your use case, and on time-based indices, the error margin is unbounded. Consider a query for new IPs today across 7 single-sharded daily indices. Six of the indices have an empty foreground set (they do not hold today's content), so they supply no background stats for any IPs because there are no hits. In contrast, the index for today has a foreground set that is exactly the same as its background set, and it will score every IP a perfect "1" without any consideration of the other 6 days.
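Putting rough numbers to that 7-day scenario (same toy assumptions as the sketch above): the six older indices return no candidate IPs at all, while on today's index every IP's foreground count equals its local background count.

```java
// Toy illustration of the unbounded error described above (not Elasticsearch code).
public class DailyIndexSketch {

    // Percentage heuristic: foreground docs with the term / background docs with the term.
    static double percentage(long subsetFreq, long supersetFreq) {
        return supersetFreq == 0 ? 0.0 : (double) subsetFreq / supersetFreq;
    }

    public static void main(String[] args) {
        // Today's index: an IP seen 42 times today, all 42 of them in the foreground set.
        System.out.println(percentage(42, 42)); // 1.0: a "perfect" score, even if the
        // same IP appeared thousands of times on days 1-6, because those indices
        // contribute no stats when their foreground set is empty.
    }
}
```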

@nirmalc nirmalc closed this Aug 8, 2017