Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregations: SingleBucketAggregation should create a single bucket #8510

Closed
brwe opened this issue Nov 17, 2014 · 17 comments
Closed

Aggregations: SingleBucketAggregation should create a single bucket #8510

brwe opened this issue Nov 17, 2014 · 17 comments

Comments

@brwe
Copy link
Contributor

brwe commented Nov 17, 2014

SingleBucketAggregations (like filter aggregation) have no method getBuckets() and also the json response contains no buckets array. This saves some space but also makes it harder to traverse the aggregation tree because when looking at the result one always has to know if the aggregation that produced the current level was a SingleBucketAggregation or a MultiBucketAggregation (like terms agg).

Example for json:

Request with top level multi bucket agg:

POST testidx/_search?size=0
{
  "aggs": {
    "terms": {
      "terms": {
        "field": "label",
        "size": 10
      },
      "aggs": {
        "histogram": {
          "histogram": {
            "field": "num",
            "interval": 1
          }
        }
      }
    }
  }
}

Request with top level single bucket agg:

POST testidx/_search?size=0
{
  "aggs": {
    "terms": {
      "filter": {
        "term": {
          "label": "label_a"
        }
      },
      "aggs": {
        "histogram": {
          "histogram": {
            "field": "num",
            "interval": 1
          }
        }
      }
    }
  }
}

multi bucket yields:

"aggregations": {
      "terms": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "label_a",
               "doc_count": 3,
               "histogram": {
                  "buckets": [
                     {
                        "key": 1,
                        "doc_count": 1
                     },
                    ...

single bucket yields:


"aggregations": {
      "terms": {
         "doc_count": 3,
         "histogram": {
            "buckets": [
               {
                  "key": 1,
                  "doc_count": 1
               },
               ...

although the two requests have the same level of "nestedness". If I was to post process the result I would have to change whichever application is consuming it when I change the top level aggregation from single to multibucket or the other way round.

The following would be more convenient for the second request:

"aggregations": {
    "terms": {
      "doc_count": 3,
      "buckets": [
        {
          "key": "some_key_that_makes_sense",
          "histogram": {
            "buckets": [
              {
                "key": 1,
                "doc_count": 1
              },
            ...

This also affects the coming soon getProperty method for aggregations which is currently implemented to be consistent with the different behavior of single and multi buckets: #8421 (comment)

@gmarz
Copy link
Contributor

gmarz commented Nov 17, 2014

+1

While I agree that it makes sense to have SingleBucketAggregation return the same structure as MultiBucketAggregation, you could also leverage the new meta field which was added in #8279 to specifically to deal with scenarios like this.

This is especially an issue in NEST, since it's a strongly-typed client and we need to know the exact C# type to deserialize the agg response to.

The meta field will greatly simplify this code in particular: https://github.com/elasticsearch/elasticsearch-net/blob/develop/src/Nest/Resolvers/Converters/Aggregations/AggregationConverter.cs#L26, since we'll be able to associate each agg request with the response type in the meta field.

cc @Mpdreamz

@Mpdreamz
Copy link
Member

+1, being able to program against both in a similar fashion would be great. Something being either a leave or node is an easier mental model then having two types of nodes where the shape is dependent on parent metadata not actually returned in the response.

That said: "key": "some_key_that_makes_sense", is that a trivial problem? If we can't give it any meaningful key for SingleBucketAggregation's, leaving it out might also cause code that looks for buckets and keys to misbehave. Having said this I remember generated keys being discussed somewhere but my memory is drawing a blank on where and what the conclusion was :)

@uboness
Copy link
Contributor

uboness commented Nov 18, 2014

-1 on the JSON side ... Single bucket ages both semantically and logically hold no buckets... It'd be like saying that java shouldn't have objects and instead you should always work with collections.... And to work with a single object you'll need to have a collection of size 1... Just so you'd work with objects the same way you work with collections.

Yes... You need to know what you're working with... On the lang clients on the other hand it might make sense to generalize the tree traversal

@clintongormley
Copy link
Contributor

@uboness i think it depends on the consumer. If a user writes a single agg then it makes sense to return that value without the intermediate bucket. For general purpose tools like Kibana it may make more sense to have a standard representation. I'm wondering if this more verbose syntax should be a query time option.

@uboness
Copy link
Contributor

uboness commented Nov 18, 2014

even with the generic consumers (e.g. kibana), they still need to distinguish between metrics and buckets right? and I also wonder how generic is it? For example, I'm sure the kibana is very much aware of the different types of aggs it exposes to the users and so there's dedicated code for each anyway (visualize the different aggs differently regardless of their nature).

One of the things I always liked in the json structure of aggs is that it's very intuitive and human readable. Navigating like last_year.monthly_purchases.january.avg_price... is more intuitive than last_year.xxxxx.monthly_purchases.january.avg_price where last_year is a filter agg (no clue what this xxxxx is as it has not logical explanation)

@brwe
Copy link
Contributor Author

brwe commented Nov 18, 2014

the json structure of aggs is that it's very intuitive and human readable.

Cannot say that for me but if enough people feel that way then maybe @clintongormley proposal to have query time parameter flag for that might make sense.

Yes... You need to know what you're working with...

This is actually the problem and why I opened the issue. It would be great to have the result in a such generic way that one would not have to know the type of aggregation because the structure is always the same.
Metrics and buckets would still have to be distinguished with this change and I wonder if this is not also something we could change in a second step but have no smart idea right now.

As for Kibana, I am unfamiliar with the implementation details and how much of a relief this change would actually be, so I summon @rashidkpc to join the discussion.

@javanna
Copy link
Member

javanna commented Nov 19, 2014

Interesting discussion, I am +1 on making the structure generic if possible. I think it would help a lot those users who write client code that interacts with elasticsearch, by making their life easier and their code better. There is a little price in terms of readability for the human eye, if that's a big concern we could have some kind of human flag as Clint mentioned...not sure how much that would end up complicating the codebase on our end though. Would be interesting to hear opinions from actual users here!

@rashidkpc
Copy link

From a Kibana perspective @javanna hits the nail on the head. Dealing with aggregations that don't have a buckets array is a real pain. While yes, we have different code for different aggs it would be really nice to be able to treat at least all bucket aggs the same

@uboness
Copy link
Contributor

uboness commented Nov 19, 2014

well.. for me, having an output for a simple filter agg like the following qualifies as wrong & misleading:

"aggregations": {
    "last_year": {
        "doc_count": 130,
        "buckets": [
            {
                "avg_price" : { "value" : 56.3 }
            }
        ]
    }
}
  • wrong - because filter aggs should not have buckets (it's meaningless)
  • misleading - because the response indicates that filter agg can have more than one
    bucket.

Further more, a common functionality in multi bucket aggs is that every bucket has an identifying key... what is the key in the example above? One can also request the buckets to be returned as an object (instead of an array) where each bucket is keyed by its key... and now what? how would the response look like? will we have a fake key for single bucket aggs? a fixed fake key? or require the user to provide a key? and if we do... what will the user call it.. I can't think of any key that would make sense, cause we're trying to name something that has no name.

The way I look at JSON responses is that their structure should be as self documenting as possible. We don't have schemas (and we don't want schemas) so the output needs to be semantically correct - that what makes a good API IMO. Now, perhaps you can write tools around this API that make your life easier if you really need to treat all agg responses the same (as I mentioned above, you can always have this code on the client side). I don't consider this to be human vs. machine - returning semantically correct structures is not for "humans" only.. at least IMO.

@jpountz
Copy link
Contributor

jpountz commented Nov 20, 2014

I am not a fan of returning a buckets element for all aggregations in the json either. However, I don't know if this would help, but I would be ok with having a getBuckets helper method on the Java API for all bucket aggregations that would return a list with a single element (the agg itself) for single-bucket aggregations.

@polyfractal
Copy link
Contributor

+1 making the result consistent, or at the very least providing a flag that enables the behavior

Single bucket ages both semantically and logically hold no buckets
wrong - because filter aggs should not have buckets (it's meaningless)

Conceptually, I think it is the exact opposite. When I explain buckets to people, I tell them a bucket is simply a criteria that documents can match. If the document matches the criteria, it is added to the bucket. Some buckets dynamically add criteria as they encounter new values (terms, histogram), while other's are hard-coded upfront (range)

A filter bucket has a single criteria: does this doc match the filter? If yes, add to the bucket. It's a boolean operation, so it only has one bucket, but conceptually it behaves identically to any other bucket. Ditto for Global bucket (criteria is "all").

For most practical applications, you don't really know what the results coming back are. The only way to definitively know is by keeping the original request and serializing each level based on the request. Alternatively, you get to play the introspection game at every level.

To make the point: if we followed that logic to it's extreme, the range bucket should change it's output when only one range is requested, since we are expecting the user to know that only one range is coming back.

Real life example

Here is a very real, simple example that you might find on an ecommerce site. This isn't a "generic" tool like Kibana, but would definitely benefit from generic responses. When you hit the main page, you get a tree of all colors and all brands:

{
   "aggs":{
      "colors":{
         "terms":{
            "field":"color",
            "size":10
         },
         "aggs":{
            "brand":{
               "terms":{
                  "field":"brand",
                  "size":10
               },
               "aggs":{
                  "avgPrice":{
                     "avg":{  "field":"price" }
                  }
               }
            }
         }
      }
   }
}

Now imagine a user selects "Red" and "Toyota". The aggregation changes to:

{
"aggs":{
   "colors":{
      "filter":{
         "term":{
            "color":"Red"
         }
      },
      "aggs":{
         "brand":{
            "filter":{
               "term":{
                  "brand":"Toyota"
               }
            },
            "aggs":{
               "avgPrice":{
                  "avg":{ "field":"price" }
               }
            }
         }
      }
   }
}

The terms buckets are replaced with a filter bucket for those two selected terms. Presently, to parse both of these responses, the application needs to introspect each level to determine if it switches from terms to filter:

$first = $response['aggs']['colors'];

// Using `terms` for `color`
if ($first['buckets'] !== null) {
  foreach ($first['buckets'] as $color) {
    $second = $color;

    //  using `terms` for `brand`
    if ($second['buckets'] !== null) {
      foreach ($second['buckets'] as $brand) {
        $avgPrice = $brand['avgPrice']['value'];
        // ... do business logic here ...
      }
    } else {   //  using `filter` for `brand`
      $avgPrice = $second['avgPrice']['value'];
      // ... do business logic here ...
    }
  }
} else {  // using `filter` for `color`
  $second = $first['brand'];              // NOTE: we have to hardcode the agg name here!

  //  using `terms` for `brand`
  if ($second['buckets'] !== null) {
    foreach ($second['buckets'] as $brand) {
      $avgPrice = $brand['avgPrice']['value'];
      // ... do business logic here ...
    }
  } else {   //  using `filter` for `brand`
      $avgPrice = $second['avgPrice']['value'];
      // ... do business logic here ...
    }
} 

If everything came back as a bucket, you can just iterate over everything:

foreach ($response['aggs']['colors']['buckets'] as $color) {
  foreach ($color['brand']['buckets'] as $brand) {
    $avgPrice = $brand['avgPrice']['value'];   
    // ... do business logic here ...
  }
}

@uboness
Copy link
Contributor

uboness commented Nov 26, 2014

@polyfractal I'm totally with you on how you explain aggs (I invented it :D)

to your example, the way I look at it, this agg is wrongly constructed:

{
"aggs":{
   "colors":{
      "filter":{
         "term":{
            "color":"Red"
         }
      },
      "aggs":{
         "brand":{
            "filter":{
               "term":{
                  "brand":"Toyota"
               }
            },
            "aggs":{
               "avgPrice":{
                  "avg":{ "field":"price" }
               }
            }
         }
      }
   }
}

it should be:

{
"aggs":{
   "red":{
      "filter":{
         "term":{
            "color":"Red"
         }
      },
      "aggs":{
         "toyota":{
            "filter":{
               "term":{
                  "brand":"Toyota"
               }
            },
            "aggs":{
               "avgPrice":{
                  "avg":{ "field":"price" }
               }
            }
         }
      }
   }
}

because with the second agg, you're effectively asking for the avg price of a red toyota, and when the response comes back you do:

$response['aggs'][$selected_color][$selected_brand]['avgPrice']['value']

perhaps it's just a purist way of looking at things, but it is semantically correct. That said, I understand the practical aspect of having it and as I mentioned above, like @jpountz I'm fine if the APIs provide helper methods there (so @polyfractal you can add it to your php client)... but in the JSON, it just feels wrong to me.

@javanna
Copy link
Member

javanna commented May 5, 2017

Looking at the discussion that happened on this ticket, I would say that there was no consensus on making this change hence we may want to close it. Or should we discuss it again on the next FixItFriday @clintongormley ?

@clintongormley
Copy link
Contributor

@javanna Should probably be a discussion for the aggs team - part of the decision about whether we need to refactor the aggs framework and how we'd do it. Marking has high hanging fruit

@javanna
Copy link
Member

javanna commented Jun 16, 2017

cc @colings86

@colings86
Copy link
Contributor

We spoke about this in the aggs team meeting and although there was almost a consensus feeling that this would be good to fix, there currently isn't a way to change the response format of an aggregation without making a hard braking change with no period of deprecation which makes this change very tricky and we didn't feel the current arguments for this change warrant such a harsh break int eh product on a major version or not. If/When we implement #11184 we should revisit this since it would provide a path for this change to be made but until then this is effectively stalled

@colings86
Copy link
Contributor

After discussing this again I am closing this issue because there is not a definitive argument for why this would be better and after the format being set for this long we should only change the format if there is a very compelling reason. In this case there are good arguments on both sides and there is no definitive argument for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants