Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Range Field Histogram Aggregation #41545

Conversation

not-napoleon
Copy link
Member

@not-napoleon not-napoleon commented Apr 25, 2019

Work in progress

TODO:
[x] Range ValuesSource
[x] Fork the HistogramAggregationFactory to return different concrete classes based on field type (similar to Terms agg)
[x] Binary decode logic for ranges - see #41206
[x] Implement RangeHistogramAggregation builder logic
[x] Implement RangeHistogramAggregation leaf collector logic
[x] Deal with IP range related edge cases
[ ] Yaml & Doc tests

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo

@polyfractal polyfractal mentioned this pull request Apr 29, 2019
11 tasks
if (valuesSource instanceof ValuesSource.Numeric) {
return createAggregator((ValuesSource.Numeric) valuesSource, parent, pipelineAggregators, metaData);
}
else if (valuesSource instanceof ValuesSource.Bytes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to our future-selves: we should make sure this doesn't "bleed" over to other types that also use bytes... I think Strings will show up as ValuesSource.Bytes (from ValueType.String -> ValuesSourceType.BYTES -> ValuesSource.Bytes)

Unless that's handled elsewhere and we can ignore :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, actually, we should have a more specific ValuesSource here, now that I've added a range specific choice. Nice catch.

@@ -87,7 +88,7 @@ public Builder scriptFunction(Function<SortedSetDocValues, ScriptDocValues<?>> s
CircuitBreakerService breakerService, MapperService mapperService) {
// Ignore Circuit Breaker
final String fieldName = fieldType.name();
if (BINARY_INDEX_FIELD_NAMES.contains(fieldName)) {
if (BINARY_INDEX_FIELD_NAMES.contains(fieldName)|| fieldType.getClass() == RangeFieldMapper.RangeFieldType.class) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: @polyfractal and I kicked this around, and we think the correct short term solution (i.e. for this PR) is to add a rangeType() method or similar to flag this, similar to how numeric and script work, rather than relying on the field type.

…stogram

 Conflicts:
	server/src/main/java/org/elasticsearch/index/mapper/RangeFieldMapper.java
	server/src/test/java/org/elasticsearch/search/aggregations/pipeline/DerivativeAggregatorTests.java
	x-pack/plugin/rollup/src/test/java/org/elasticsearch/xpack/rollup/RollupJobIdentifierUtilTests.java
@not-napoleon not-napoleon changed the base branch from master to feature-range-aggregations May 7, 2019 19:20
@@ -71,6 +72,7 @@ public final Index index() {

private NumericType numericType;
private Function<SortedSetDocValues, ScriptDocValues<?>> scriptFunction = AbstractAtomicOrdinalsFieldData.DEFAULT_SCRIPT_FUNCTION;
private RangeType rangeType;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently only being used as a flag, but the overhead between storing a boolean and storing an enum reference isn't high, so it seemed worth leaving ourselves access to the more robust data.

@@ -179,6 +180,17 @@ public FieldData(IndexFieldData<?> indexFieldData) {
public SortedBinaryDocValues bytesValues(LeafReaderContext context) {
return indexFieldData.load(context).getBytesValues();
}

public static class RangeFieldData extends FieldData {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type hierarchy in ValuesSource is a little arcane, and I just picked something that looked reasonable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, was wondering about this as well. Perhaps we should add another "top-level" VS instead of extending Bytes.FieldData? So we'd have a total of Numeric, Bytes, GeoPoint, WithScript and now Range ?

Not fully sure of the consequences of doing that. But it seems like Ranges are more of a distinct "thing" rather than a specialization of Bytes (even though it technically is).

Not sure, thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having just done some more thinking on the ValuesSource refactor, I'm in agreement that Range should be a peer to Numeric, GeoPoint, etc. I'll wire that up shortly.

}

@Override
protected Aggregator createUnmapped(Aggregator parent, List<PipelineAggregator> pipelineAggregators, Map<String, Object> metaData)
throws IOException {
return createAggregator(null, parent, pipelineAggregators, metaData);
return new NumericHistogramAggregator(name, factories, interval, offset, order, keyed, minDocCount, minBound, maxBound,
null, config.format(), context, parent, pipelineAggregators, metaData);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm almost positive this is the wrong thing to do here, but I'm not sure what the right thing to do is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so this is a bit of a split personality in the framework right now. Setting the valuesSource to null will use a no-op collector in the histo's getLeafCollector(). Then it uses the same histo aggregator to build an empty result.

Terms agg (and a few others) instead have an "unmapped" version of the agg (UnmappedTerms) which is used instead. This unmapped version uses the same no-op collector, and knows how to build an empty response.

I'm not sure if there is a preference or why it's a split-personality. I've pinged @colings86 to see if he knows the history here :)

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Did a pass and left some comments/questions, and a few style nits (sorry!). :)

if (valuesSource instanceof ValuesSource.Numeric) {
return new NumericHistogramAggregator(name, factories, interval, offset, order, keyed, minDocCount, minBound, maxBound,
(ValuesSource.Numeric) valuesSource, config.format(), context, parent, pipelineAggregators, metaData);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit: else if/else should go next to last clause's bracket

@Override
public Double doubleValue(Object endpointValue) {
assert endpointValue instanceof Double;
return (Double)endpointValue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit: should have space between cast and variable

@@ -407,6 +429,11 @@ public BytesRef encodeRanges(Set<RangeFieldMapper.Range> ranges) throws IOExcept
return LONG.decodeRanges(bytes);
}

@Override
public Double doubleValue (Object endpointValue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit: space between method name and arguments

@@ -179,6 +180,17 @@ public FieldData(IndexFieldData<?> indexFieldData) {
public SortedBinaryDocValues bytesValues(LeafReaderContext context) {
return indexFieldData.load(context).getBytesValues();
}

public static class RangeFieldData extends FieldData {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, was wondering about this as well. Perhaps we should add another "top-level" VS instead of extending Bytes.FieldData? So we'd have a total of Numeric, Bytes, GeoPoint, WithScript and now Range ?

Not fully sure of the consequences of doing that. But it seems like Ranges are more of a distinct "thing" rather than a specialization of Bytes (even though it technically is).

Not sure, thoughts?

private final LongHash bucketOrds;

RangeHistogramAggregator(String name, AggregatorFactories factories, double interval, double offset,
BucketOrder order, boolean keyed, long minDocCount, double minBound, double maxBound,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nittiest of nits: spacing of the rest of the args aren't consistent with the first line :)

}

@Override
protected Aggregator createUnmapped(Aggregator parent, List<PipelineAggregator> pipelineAggregators, Map<String, Object> metaData)
throws IOException {
return createAggregator(null, parent, pipelineAggregators, metaData);
return new NumericHistogramAggregator(name, factories, interval, offset, order, keyed, minDocCount, minBound, maxBound,
null, config.format(), context, parent, pipelineAggregators, metaData);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so this is a bit of a split personality in the framework right now. Setting the valuesSource to null will use a no-op collector in the histo's getLeafCollector(). Then it uses the same histo aggregator to build an empty result.

Terms agg (and a few others) instead have an "unmapped" version of the agg (UnmappedTerms) which is used instead. This unmapped version uses the same no-op collector, and knows how to build an empty response.

I'm not sure if there is a preference or why it's a split-personality. I've pinged @colings86 to see if he knows the history here :)

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left two questions, I think this is good to go once the ANY/missing stuff is integrated! Gimme a ping once that gets merged into this PR and I'll do a quick skim. 👍

assert bucket == 0;
if (values.advanceExact(doc)) {
// Is it possible for valuesCount to be > 1 here? Multiple ranges are encoded into the same BytesRef in the binary doc
// values, so it isn't clear what we'd be iterating over.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^^^ ?


for (int i = 0; i < valuesCount; i++) {
BytesRef encodedRanges = values.nextValue();
// This list should be sorted by start-of-range, I think?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will things go pear shaped if we ever get non-sorted ranges? Should we toss an assertion in here?

Or does it not particularly matter what order the ranges come in?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It matters, but the encoding sorts them. Doesn't hurt to throw an assert in though, I'll do that.

 Conflicts:
	server/src/test/java/org/elasticsearch/search/aggregations/bucket/histogram/NumericHistogramAggregatorTests.java
…stogram

 Conflicts:
	server/src/main/java/org/elasticsearch/search/aggregations/bucket/histogram/HistogramAggregationBuilder.java
	server/src/main/java/org/elasticsearch/search/aggregations/bucket/histogram/HistogramAggregatorFactory.java
@not-napoleon
Copy link
Member Author

@elasticmachine run elasticsearch-ci/bwc

@not-napoleon
Copy link
Member Author

@elasticmachine update branch

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think there are a few out-of-date comments, and curious about comments left in last review for RangeHistogramAggregator (about assertions, sorted input), but otherwise LGTM! 🎉

@@ -252,6 +269,7 @@ public VS toValuesSource(QueryShardContext context, Function<Object, ValuesSourc
} else if (valueSourceType() == ValuesSourceType.ANY) {
vs = (VS) resolveMissingAny.apply(missing());
} else {
// TODO: Do we need a missing case for Range values type?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think we need this comment anymore?

@@ -293,6 +311,7 @@ private VS originalValuesSource() {
if (valueSourceType() == ValuesSourceType.BYTES) {
return (VS) bytesScript();
}
// TODO: Do we need a range script case?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, think this is outdated?

@not-napoleon not-napoleon merged commit 1d83f4d into elastic:feature-range-aggregations Jul 16, 2019
@not-napoleon not-napoleon deleted the feature/range-field-histogram branch July 16, 2019 13:36
@polyfractal
Copy link
Contributor

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants