Aggregations: SingleBucketAggregation should create a single bucket #8510
Comments
+1 While I agree that it makes sense to have This is especially an issue in NEST, since it's a strongly-typed client and we need to know the exact C# type to deserialize the agg response to. The cc @Mpdreamz |
+1, being able to program against both in a similar fashion would be great. Something being either a leaf or a node is an easier mental model than having two types of nodes where the shape depends on parent metadata not actually returned in the response. That said: |
-1 on the JSON side ... Single bucket aggs both semantically and logically hold no buckets... It'd be like saying that Java shouldn't have objects and instead you should always work with collections.... And to work with a single object you'll need to have a collection of size 1... Just so you'd work with objects the same way you work with collections. Yes... You need to know what you're working with... On the lang clients on the other hand it might make sense to generalize the tree traversal |
@uboness I think it depends on the consumer. If a user writes a single agg then it makes sense to return that value without the intermediate bucket. For general purpose tools like Kibana it may make more sense to have a standard representation. I'm wondering if this more verbose syntax should be a query time option. |
even with the generic consumers (e.g. kibana), they still need to distinguish between metrics and buckets right? and I also wonder how generic is it? For example, I'm sure the kibana is very much aware of the different types of aggs it exposes to the users and so there's dedicated code for each anyway (visualize the different aggs differently regardless of their nature). One of the things I always liked in the json structure of aggs is that it's very intuitive and human readable. Navigating like |
I cannot say that for myself, but if enough people feel that way then maybe @clintongormley's proposal to have a query time parameter flag for that might make sense.
This is actually the problem and why I opened the issue. It would be great to have the result in such a generic way that one would not have to know the type of aggregation because the structure is always the same. As for Kibana, I am unfamiliar with the implementation details and how much of a relief this change would actually be, so I summon @rashidkpc to join the discussion. |
Interesting discussion, I am +1 on making the structure generic if possible. I think it would help a lot those users who write client code that interacts with elasticsearch, by making their life easier and their code better. There is a little price in terms of readability for the human eye, if that's a big concern we could have some kind of |
From a Kibana perspective @javanna hits the nail on the head. Dealing with aggregations that don't have a buckets array is a real pain. While yes, we have different code for different aggs it would be really nice to be able to treat at least all bucket aggs the same |
well.. for me, having an output for a simple filter agg like the following qualifies as wrong & misleading:
"aggregations": {
  "last_year": {
    "doc_count": 130,
    "buckets": [
      {
        "avg_price" : { "value" : 56.3 }
      }
    ]
  }
}
Furthermore, a common functionality in multi bucket aggs is that every bucket has an identifying key... what is the key in the example above? One can also request the buckets to be returned as an object (instead of an array) where each bucket is keyed by its key... and now what? What would the response look like? Will we have a fake key for single bucket aggs? A fixed fake key? Or require the user to provide a key? And if we do... what will the user call it.. I can't think of any key that would make sense, because we're trying to name something that has no name. The way I look at JSON responses is that their structure should be as self-documenting as possible. We don't have schemas (and we don't want schemas) so the output needs to be semantically correct - that's what makes a good API IMO. Now, perhaps you can write tools around this API that make your life easier if you really need to treat all agg responses the same (as I mentioned above, you can always have this code on the client side). I don't consider this to be human vs. machine - returning semantically correct structures is not for "humans" only.. at least IMO. |
I am not a fan of returning a |
+1 making the result consistent, or at the very least providing a flag that enables the behavior
Conceptually, I think it is the exact opposite. When I explain buckets to people, I tell them a bucket is simply a criteria that documents can match. If the document matches the criteria, it is added to the bucket. Some buckets dynamically add criteria as they encounter new values ( A filter bucket has a single criteria: does this doc match the filter? If yes, add to the bucket. It's a boolean operation, so it only has one bucket, but conceptually it behaves identically to any other bucket. Ditto for Global bucket (criteria is "all").

For most practical applications, you don't really know what the results coming back are. The only way to definitively know is by keeping the original request and serializing each level based on the request. Alternatively, you get to play the introspection game at every level.

To make the point: if we followed that logic to its extreme, the

Real life example

Here is a very real, simple example that you might find on an ecommerce site. This isn't a "generic" tool like Kibana, but would definitely benefit from generic responses. When you hit the main page, you get a tree of all colors and all brands:
{
  "aggs":{
    "colors":{
      "terms":{
        "field":"color",
        "size":10
      },
      "aggs":{
        "brand":{
          "terms":{
            "field":"brand",
            "size":10
          },
          "aggs":{
            "avgPrice":{
              "avg":{ "field":"price" }
            }
          }
        }
      }
    }
  }
}
Now imagine a user selects
{
  "aggs":{
    "colors":{
      "filter":{
        "term":{
          "color":"Red"
        }
      },
      "aggs":{
        "brand":{
          "filter":{
            "term":{
              "brand":"Toyota"
            }
          },
          "aggs":{
            "avgPrice":{
              "avg":{ "field":"price" }
            }
          }
        }
      }
    }
  }
}
The
$first = $response['aggs']['colors'];
// Using `terms` for `color`
if (isset($first['buckets'])) {
    foreach ($first['buckets'] as $color) {
        $second = $color['brand']; // NOTE: the agg name is hardcoded here as well
        // using `terms` for `brand`
        if (isset($second['buckets'])) {
            foreach ($second['buckets'] as $brand) {
                $avgPrice = $brand['avgPrice']['value'];
                // ... do business logic here ...
            }
        } else { // using `filter` for `brand`
            $avgPrice = $second['avgPrice']['value'];
            // ... do business logic here ...
        }
    }
} else { // using `filter` for `color`
    $second = $first['brand']; // NOTE: we have to hardcode the agg name here!
    // using `terms` for `brand`
    if (isset($second['buckets'])) {
        foreach ($second['buckets'] as $brand) {
            $avgPrice = $brand['avgPrice']['value'];
            // ... do business logic here ...
        }
    } else { // using `filter` for `brand`
        $avgPrice = $second['avgPrice']['value'];
        // ... do business logic here ...
    }
}
If everything came back as a bucket, you can just iterate over everything:
foreach ($response['aggs']['colors']['buckets'] as $color) {
    foreach ($color['brand']['buckets'] as $brand) {
        $avgPrice = $brand['avgPrice']['value'];
        // ... do business logic here ...
    }
} |
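The client-side helper idea raised in this thread can be sketched roughly as follows, in Python rather than PHP. `iter_buckets` and `avg_prices` are hypothetical names, not part of any Elasticsearch client, and the `aggs` top-level key simply mirrors the PHP snippet above. Treat it as a sketch under those assumptions, not a definitive implementation.

```python
from typing import Any, Dict, Iterator, List


def iter_buckets(agg: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield the buckets of one aggregation result, whatever its shape.

    Hypothetical helper: it guesses the shape from the presence and type
    of the "buckets" entry, which is an assumption about the response.
    """
    buckets = agg.get("buckets")
    if buckets is None:
        # Single-bucket shape (e.g. filter): the agg itself is the only bucket.
        yield agg
    elif isinstance(buckets, dict):
        # Keyed multi-bucket shape: {"some_key": {...}, ...}.
        for key, bucket in buckets.items():
            yield {"key": key, **bucket}
    else:
        # Array multi-bucket shape (e.g. terms).
        yield from buckets


def avg_prices(response: Dict[str, Any]) -> List[float]:
    """Collect avgPrice for every color/brand combination in the response."""
    return [
        brand["avgPrice"]["value"]
        for color in iter_buckets(response["aggs"]["colors"])
        for brand in iter_buckets(color["brand"])
    ]
```

With a helper like this the calling code stays the same whether `colors` and `brand` were requested as `terms` or as `filter`, at the cost of one shape check per level.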
@polyfractal I'm totally with you on how you explain aggs (I invented it :D). As to your example, the way I look at it, this agg is wrongly constructed:
{
  "aggs":{
    "colors":{
      "filter":{
        "term":{
          "color":"Red"
        }
      },
      "aggs":{
        "brand":{
          "filter":{
            "term":{
              "brand":"Toyota"
            }
          },
          "aggs":{
            "avgPrice":{
              "avg":{ "field":"price" }
            }
          }
        }
      }
    }
  }
}
it should be:
{
  "aggs":{
    "red":{
      "filter":{
        "term":{
          "color":"Red"
        }
      },
      "aggs":{
        "toyota":{
          "filter":{
            "term":{
              "brand":"Toyota"
            }
          },
          "aggs":{
            "avgPrice":{
              "avg":{ "field":"price" }
            }
          }
        }
      }
    }
  }
}
because with the second agg, you're effectively asking for the avg price of a red toyota, and when the response comes back you do:
$response['aggs'][$selected_color][$selected_brand]['avgPrice']['value']
Perhaps it's just a purist way of looking at things, but it is semantically correct. That said, I understand the practical aspect of having it and as I mentioned above, like @jpountz I'm fine if the APIs provide helper methods there (so @polyfractal you can add it to your php client)... but in the JSON, it just feels wrong to me. |
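The named-path approach described in this comment can be sketched like this in Python. `build_request` and `read_avg_price` are hypothetical helper names, and both the lowercased agg names and the `aggs` response key follow the snippets above rather than any confirmed client behavior.

```python
from typing import Any, Dict


def build_request(color: str, brand: str) -> Dict[str, Any]:
    """Name each filter agg after the selected value (lowercased)."""
    return {
        "aggs": {
            color.lower(): {
                "filter": {"term": {"color": color}},
                "aggs": {
                    brand.lower(): {
                        "filter": {"term": {"brand": brand}},
                        "aggs": {"avgPrice": {"avg": {"field": "price"}}},
                    }
                },
            }
        }
    }


def read_avg_price(response: Dict[str, Any], color: str, brand: str) -> float:
    """Follow the same names back down the response, with no bucket arrays."""
    return response["aggs"][color.lower()][brand.lower()]["avgPrice"]["value"]
```

The caller has to keep the selected values around to read the response, which is the trade-off being debated: the path is self-describing, but it only works when the consumer knows which aggs were single-bucket.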
Looking at the discussion that happened on this ticket, I would say that there was no consensus on making this change, hence we may want to close it. Or should we discuss it again on the next FixItFriday @clintongormley ? |
@javanna Should probably be a discussion for the aggs team - part of the decision about whether we need to refactor the aggs framework and how we'd do it. Marking as high hanging fruit |
cc @colings86 |
We spoke about this in the aggs team meeting, and although there was almost a consensus that this would be good to fix, there currently isn't a way to change the response format of an aggregation without making a hard breaking change with no period of deprecation. That makes this change very tricky, and we didn't feel the current arguments warrant such a harsh break in the product, on a major version or not. If/when we implement #11184 we should revisit this, since it would provide a path for this change to be made, but until then this is effectively stalled. |
After discussing this again I am closing this issue because there is not a definitive argument for why this would be better and after the format being set for this long we should only change the format if there is a very compelling reason. In this case there are good arguments on both sides and there is no definitive argument for this. |
SingleBucketAggregations (like filter aggregation) have no method getBuckets() and also the json response contains no buckets array. This saves some space but also makes it harder to traverse the aggregation tree because when looking at the result one always has to know if the aggregation that produced the current level was a SingleBucketAggregation or a MultiBucketAggregation (like terms agg).

Example for json:
Request with top level multi bucket agg:
Request with top level single bucket agg:
multi bucket yields:
single bucket yields:
although the two requests have the same level of "nestedness". If I was to post process the result I would have to change whichever application is consuming it when I change the top level aggregation from single to multibucket or the other way round.
The following would be more convenient for the second request:
This also affects the coming soon getProperty method for aggregations, which is currently implemented to be consistent with the different behavior of single and multi buckets: #8421 (comment)
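For readers who want the uniform shape without a server-side change, the wrapped form debated in this thread (a single-bucket result carrying a one-element buckets array next to its doc_count) can be approximated on the client. The Python sketch below uses hypothetical names, assumes multi-bucket results use the array form of buckets, and treats any object with a doc_count but no buckets as a single-bucket agg, which is a heuristic rather than documented behavior.

```python
from typing import Any, Dict


def wrap_single_buckets(agg: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `agg` in which single-bucket results get a buckets list."""
    if "buckets" in agg:
        # Multi-bucket result (array form assumed): only normalize sub-aggs.
        return {**agg, "buckets": [_wrap_children(b) for b in agg["buckets"]]}
    if "doc_count" in agg:
        # Heuristic: doc_count without buckets means a single-bucket agg.
        children = {k: v for k, v in agg.items() if k != "doc_count"}
        return {"doc_count": agg["doc_count"], "buckets": [_wrap_children(children)]}
    # Anything else (e.g. a metric such as {"value": 56.3}) is left untouched.
    return agg


def _wrap_children(bucket: Dict[str, Any]) -> Dict[str, Any]:
    """Normalize every nested aggregation inside one bucket."""
    return {
        name: wrap_single_buckets(value) if isinstance(value, dict) else value
        for name, value in bucket.items()
    }
```

The wrapped bucket has no key, which is exactly the objection raised in the comments; a real helper would have to decide whether to invent one.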