Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support querying nested fields in KQL #44554

Closed
Bargs opened this issue Aug 30, 2019 · 13 comments · Fixed by #47070
Closed

Support querying nested fields in KQL #44554

Bargs opened this issue Aug 30, 2019 · 13 comments · Fixed by #47070
Assignees
Labels
enhancement New value added to drive a business result Team:Visualizations Visualization editors, elastic-charts and infrastructure

Comments

@Bargs
Copy link
Contributor

Bargs commented Aug 30, 2019

KQL doesn't currently support querying on nested fields. To support them properly we'll need to introduce some new syntax to the language. Nested fields can be queried in two different ways:

  1. Parts of the query may only match a single nested doc (bool inside nested). This is what most people want when querying on a nested field.
  2. Parts of the query may match different nested docs (nested inside bool). This is how a regular object field works but nested fields can be queried in the same way. Although generally less useful, there are occasions where one might want to query a nested field in this way.
@Bargs Bargs added enhancement New value added to drive a business result Team:Visualizations Visualization editors, elastic-charts and infrastructure labels Aug 30, 2019
@Bargs Bargs self-assigned this Aug 30, 2019
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-app

@ppadovani
Copy link

ppadovani commented Sep 4, 2019

IMHO - You should infer nested automatically based on the known properties of the fields, and use scoping in the query to define when and how you combine nested queries.

Using your examples above:
level1.foo:foo and level1.bar:qux
is equivalent to:
{level1.foo:foo} and {level1.bar:qux}

Now in terms of defining standard syntax, I would not use curly braces and instead use standard parens to define scope and infer from how the expression is defined within the parens to combine nested fields with common paths.

Continuing with the example from above:
(level1.foo:foo and level1.bar:qux)
would combine the two expressions into a nested query using a must.

To complete the example in my mind
(level2.bar:bar and level1.foo:foo)
scoping of the parens does not combine nested queries due to differing nested paths, but may imply additional scoping within the overall query should it be more complex.

One other point, you cannot ever assume that the field path defined in an expression can be used to define the nested path(s) for the nested query. I could easily have a non nested object on level1 that resolves into a longer dotted path whose nested path is still just level1. Therefore the underlying parser must already have full knowledge about the properties and paths of the fields in question and can leverage that during parsing.

{
  "level1": [
    {
      "foo": "foo",
      "bar": "bar",
       "mini": {
            "junk": "trash",
            "drink": "beer"
        }
    },
    {
      "foo": "baz",
      "bar": "qux"
    }
  ]
}

@Bargs
Copy link
Contributor Author

Bargs commented Sep 5, 2019

Thanks for the input @ppadovani, I think it's highly valuable since you've worked on this problem in your own plugin.

I have mixed feelings about inferring nested automatically. It's certainly doable and it would save the user a little typing, but I'm afraid it could cause confusion. Say you start with the following query:

foo:foo

We're automatically adding the nested query for the user. They may not even know this is a nested field. Now they expand their query, wanting to match multiple parts of individual nested documents. The natural thing to do is this:

foo:foo and bar:qux

We're automatically wrapping each side of the and in a nested query, but this probably isn't what the user wants. They actually want { foo:foo and bar:qux }. But because we inferred nested for them, we set them off on the wrong foot. If we require braces around every nested field, it gives the user an indication that there is something special about this field and they need to make an extra decision. We can display a helpful toast message the first time they query on a nested field to explain what it's all about. And with autocomplete, we can automatically insert the {} is most cases so there won't be much additional typing.

I liken this to if statements in Javascript. It's possible to write an if statement without curly braces. But then it's easy to add a line that's meant to be part of the conditional, but is not. So most JS developers agree it's a best practice to always include the curly braces, even for conditionals that wrap a single line.

====================================================

On the topic of using regular parens, I considered it before but I think it adds some troublesome ambiguity to the language. Take the following query:

level1.foo:foo and (level1.bar:bar or level1.bar:qux)

Today, the parens override precedence which creates an ES query like this:

{
  "bool": {
    "filter": [
      {
        "match": {
          "level1.foo": "foo"
        }
      },
      {
        "bool": {
          "should": [
            {
              "match": {
                "level1.bar": "bar"
              }
            },
            {
              "match": {
                "level1.bar": "qux"
              }
            }
          ]
        }
      }
    ]
  }
}

If we make parens indicate nesting groups as well, it won't be possible to override precedence without creating a nested group. In other words, it won't be possible to create the following query:

{
  "bool": {
    "filter": [
      {
        "match": {
          "level1.foo": "foo"
        }
      },
      {
        "bool": {
          "should": [
            {
              "nested": {
                "path": "level1",
                "query": {
                  "match": {
                    "level1.bar": "bar"
                  }
                }
              }
            },
            {
              "nested": {
                "path": "level1",
                "query": {
                  "match": {
                    "level1.bar": "qux"
                  }
                }
              }
            }          ]
        }
      }
    ]
  }
}

If we use the curly brace syntax I've proposed, we can generate the previous ES query with a KQL expression like this:

level1.foo:foo and ({level1.bar:bar} or {level1.bar:qux})

@ppadovani
Copy link

I was reviewing my notes and code last night in order to refresh my memory on this. You are correct that just using parens when they are already used for precedence would not allow for the flexibility that might be needed for more complex queries. I got around this problem by introducing the unary keyword EXISTS that must precede the precedence parens. The idea behind the keyword was that everything within the expression that follows the keyword must exist for it be evaluated to true. This would be the equivalent of using your curly braces.

I do not believe the EXISTS unary would be the correct symbol to use, but the basic idea of using a preceding unary to indicate a 'grouping' or 'scope' might be an alternative way to think about this problem. The key would be to find a unary symbol that is somewhat intuitive for the problem at hand.

Bottom line, I feel that introducing the curly braces may induce weariness around the use of braces and parens. I liken it to the language LISP which, if you ever tried to use it or took a class using it in college, quickly became synonymous with "Lots of Idiotic Stupid Parentheses". :-)

@Bargs
Copy link
Contributor Author

Bargs commented Sep 6, 2019

haha, I like Lisp 😅

I take your meaning though. Something like this could be an option:

NEST (level1.bar:bar or level1.bar:qux) though we'd come up with something better than NEST

I found another problem with the current syntax I've proposed, which I'm trying to think through now. Say you have the following document, where level1 and level2 are both nested type fields.

{
  "level1": [
    {
      "level2": [
        {
          "foo": "bar"
        },
        {
          "foo": "baz"
        }
      ]
    },
    {
      "level2": [
        {
          "foo": "tux"
        },
        {
          "foo": "asd"
        }
      ]
    }
  ]
}

With the curly brace syntax there's no way to specify at which level you want to do the grouping. In other words I can't create the following ES query where I want to match a single document at level1 but I'm happy to match any document at level2:

POST nesttest/_search
{
  "query": {
    "nested": {
      "path": "level1",
      "query": {
        "bool": {
          "filter": [
            {
              "nested": {
                "path": "level1.level2",
                "query": {
                  "match": {
                    "level1.level2.foo": "bar"
                  }
                }
              }
            },
            {
              "nested": {
                "path": "level1.level2",
                "query": {
                  "match": {
                    "level1.level2.foo": "baz"
                  }
                }
              }
            }
          ]
        }
      }
    }
  }
}

Maybe the syntax should require you to specify the level at which to do the grouping. For example, to mimic the above ES query I could do something like

level1:{ level2.foo:bar and level2.foo:baz }

If I wanted to group at level2 I would do this:

level1.level2:{ foo:bar and foo:baz }

I kind of like this because it makes the scope of the grouping really clear. It also sort of mirrors the shape of the actual nested objects which is nice. I thought I might be able to use regular parens in this case, but that would conflict with our existing field:(value1 and value2) syntax. I tried to think through whether that syntax could support both cases, but I don't think it can.

@ppadovani
Copy link

ppadovani commented Sep 7, 2019

This use case is why I like the unary approach. In the implementation of my plugin, the query would look like:

EXISTS(level1.level2.foo:bar and leval1.level2.foo:baz)

Let's assume for the moment the unary approach using a keyword is taken. In my mind what the query is doing here is a form of grouping/collecting query expressions by nested path. So what about COLLATE?

collate: to collect and combine (texts, information, or sets of figures) in proper order.

COLLATE( level1.level2.foo:bar and leval1.level2.foo:baz )

Let's say that level1 also had another field called 'drink', how would the unary approach look?

COLLATE( level1.drink:water and COLLATE( level1.level2.foo:bar and level1.level2.foo:baz ) )

You grouping approach might look like:

level1:{ drink:water and level2:{ foo:bar and foo:baz } }

I do like the second approach in terms of the feel of the presentation to the user, however it would require the user to know what the nested paths were defined as. If level2 had another non-nested object within which foo existed say 'fly', then the user would have to know to write the query like:

level1.level2:{ fly.foo:bar and fly.foo:baz }

Whereas a unary approach the user doesn't have to know/remember the nested path. This would look like:

COLLATE( level1.level2.fly.foo:bar and level1.level2.fly.foo:baz )

Your second approach is more compact but requires the user to understand the underlying structure and storage method of the data, where the more verbose unary approach requires them to just know the basic document structure. Additionally, I believe your second approach will require the parser to fully validate the outer 'level1' (path) against the internal terms adding complexity. How would type ahead work with your second approach?

Here is another query to chew on:

COLLATE( level1.drink:water and level1.level2.foo:bar )

This would be a valid query, as the outer nested 'level1' would be combined, but the inner 'level2' would still have its own inner nested query. I think the second approach might look like:

level1:{ drink:water and level2:{ foo:bar } }

Finally, what if you wanted to query for documents where you didn't want to group the second level?

Unary approach:

COLLATE( level1.drink:water and ( level1.level2.foo:bar and level1.level2.foo:baz ) )

I think this might be the other approach?

level1:{ drink:water and ( level2:{ foo:bar } and level2:{ foo:baz } ) }

I think there are pros and cons to both approaches, the key is choosing an approach with the fewest cons. ;-)

@Bargs
Copy link
Contributor Author

Bargs commented Sep 9, 2019

Your second approach is more compact but requires the user to understand the underlying structure and storage method of the data, where the more verbose unary approach requires them to just know the basic document structure.

I think the unary approach does still require the user to know which fields are nested though.

COLLATE( COLLATE( level1.level2.fly.foo:bar and level1.level2.fly.foo:baz ) )

In this example the user has to know that level1 and level2 are nested but fly is not, in order to get the right number of COLLATEs. And someone reading this query has to understand why there is a double collate at the beginning. To me level1.level2:{ fly.foo:bar and fly.foo:baz } is much more clear.

Additionally, I believe your second approach will require the parser to fully validate the outer 'level1' (path) against the internal terms adding complexity. How would type ahead work with your second approach?

I'm not 100% sure I understand what you mean here. If you mean that the outer field will need to be combined with the inner field to get the full field name, that's correct but I believe we've already established a pattern in our parser that can be used to solve this.

The typeahead code will definitely get more complex since it will need to know about the outer field if there is one, but it's certainly doable.

The unary approach (or even the curly braces I originally proposed, which differ only in the special symbol being used) would probably be less complicated to implement and I do like that we'd continue to have the full field names everywhere. But the biggest thing I'm struggling with right now is that double COLLATE. I think that'll look very strange and mysterious.

@ppadovani
Copy link

ppadovani commented Sep 10, 2019

Oops... I looked back at the double COLLATE, you only need one and I corrected my comment above. The reason you only need one, is that due to the internal knowledge of the nested field structure we can imply that there are two nested queries an outer one for path=level1 and an inner for path=level2. The user doesn't have to know that.

What I meant about the path validation, was that the parser can't just blindly combine the paths, it has to actually have logic around validating that the outer path given is correct based on the inner query parameters. What will the parser do if the inner query contains A and B and only A is valid for the level1 path?

@Bargs
Copy link
Contributor Author

Bargs commented Sep 10, 2019

@ppadovani hmmm, I assumed the double COLLATE indicated that the nesting should happen at level2. If that's not the case, how would you differentiate between these two queries:

level1.level2:{ fly.foo:bar and fly.foo:baz }
level1:{ level2.fly.foo:bar and level2.fly.foo:baz }

The first one requires both query clauses to match a single level2 document. The second query only requires each query clause to match a single level1 document. The matches could come from multiple level2 documents.

@Bargs
Copy link
Contributor Author

Bargs commented Sep 10, 2019

the parser can't just blindly combine the paths, it has to actually have logic around validating that the outer path given is correct based on the inner query parameters. What will the parser do if the inner query contains A and B and only A is valid for the level1 path?

We'll have to throw an error in that situation. But there's a similar issue with the unary approach. Say A and B are both nested fields, and the user tries to put them both in the same COLLATE. The nested query we create for the COLLATE can only have one path, so we can't include both A and B and we'll get an error.

@ppadovani
Copy link

Hmm... this is a good one to think through. So, the first example would generate something like:

nested {
   path= level1
   nested {
     path= level1.level2
      must {
          fly.foo=bar
          fly.foo=baz
     }
   }
}

whereas the second might look like:

nested {
  path=test1
   must {
      nested {
         path=test1.test2
         must {
             fly.foo=bar
         }
       }
      nested {
        path=test1.test2
        must {
           fly.foo=baz
         }
      }
   }
}

So, thinking through the unary approach, I think you are correct and I am mistaken. My original assertion of:
COLLATE(level1.level2.fly.foo:bar and level1.level2.fly.foo:baz)
only satisfies the second query but not the first.

To satisfy the fist query a double COLLATE must be applied and used to infer depth and therefore requires the user to understand the underlying structure. I guess nothing comes for free? ;-)

But even this isn't as intuitive as I would like:
COLLATE( COLLATE( level1.level2.fly.foo:bar and level1.level2.fly.foo:baz ) )

I guess you could possibly combine approaches?
COLLATE level1( COLLATE level2( fly.foo:bar and fly.foo:baz))

If we did combine approaches, wouldn't this still be valid, and one could infer that we are collating at level2 but would still generate the first nested query above?
COLLATE level2( level1.level2.foo:bar and level1.level2.foo:bar)

@Bargs
Copy link
Contributor Author

Bargs commented Sep 10, 2019

If we did combine approaches, wouldn't this still be valid, and one could infer that we are collating at level2 but would still generate the first nested query above?

What if there are two level2s? e.g.

COLLATE level2( level1.level2.level2.foo:bar and level1.level2.level2.foo:bar)

I don't think we can avoid requiring the full path

@Bargs
Copy link
Contributor Author

Bargs commented Oct 1, 2019

PR #47070

yctercero added a commit that referenced this issue Jul 22, 2020
… logic (#72490)

## Summary

This PR is meant to update the exception builder logic to handle nested fields. If you're unfamiliar with nested fields, you can read up more on it [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html) and [here](#44554). It also does a bit of cleanup, so though it may look like a lot of changes, parts of it were just moving some things around.
yctercero added a commit to yctercero/kibana that referenced this issue Jul 22, 2020
… logic (elastic#72490)

## Summary

This PR is meant to update the exception builder logic to handle nested fields. If you're unfamiliar with nested fields, you can read up more on it [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html) and [here](elastic#44554). It also does a bit of cleanup, so though it may look like a lot of changes, parts of it were just moving some things around.
yctercero added a commit to yctercero/kibana that referenced this issue Jul 22, 2020
… logic (elastic#72490)

## Summary

This PR is meant to update the exception builder logic to handle nested fields. If you're unfamiliar with nested fields, you can read up more on it [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html) and [here](elastic#44554). It also does a bit of cleanup, so though it may look like a lot of changes, parts of it were just moving some things around.
yctercero added a commit that referenced this issue Jul 23, 2020
… logic (#72490) (#72982)

## Summary

This PR is meant to update the exception builder logic to handle nested fields. If you're unfamiliar with nested fields, you can read up more on it [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html) and [here](#44554). It also does a bit of cleanup, so though it may look like a lot of changes, parts of it were just moving some things around.
yctercero added a commit that referenced this issue Jul 23, 2020
… logic (#72490) (#72983)

## Summary

This PR is meant to update the exception builder logic to handle nested fields. If you're unfamiliar with nested fields, you can read up more on it [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html) and [here](#44554). It also does a bit of cleanup, so though it may look like a lot of changes, parts of it were just moving some things around.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New value added to drive a business result Team:Visualizations Visualization editors, elastic-charts and infrastructure
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants