Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Fix issue with field names containing "." #37364

Merged
merged 11 commits into from
Jan 15, 2019
Merged

Conversation

matriv
Copy link
Contributor

@matriv matriv commented Jan 11, 2019

Adjust FieldExtractor to be handle fields which contain . in their
name, regardless where they fall in in the document hierarchy. E.g.:

{
  "a.b": "Elastic Search"
}

{
  "a": {
    "b.c": "Elastic Search"
  }
}

{
  "a.b": {
    "c": {
      "d.e" : "Elastic Search"
    }
  }
}

Fixes: #37128

Adjust FieldExtractor to be handle fields which contain `.` in their
name, regardless where they fall in in the document hierarchy. E.g.:

```
{
  "a.b": "Elastic Search"
}

{
  "a": {
    "b.c": "Elastic Search"
  }
}

{
  "a.b": {
    "c": {
      "d.e" : "Elastic Search"
    }
  }
}
```

Fixes: elastic#37128
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@matriv matriv requested a review from astefan January 11, 2019 14:00
@matriv
Copy link
Contributor Author

matriv commented Jan 11, 2019

I tried to make use of the XContentMapValues methods extractRawValues and filter to replace the "ugly" code in the FieldExtractor but this seems not possible as the extractRawValues doesn't distinguish between a missing entry in the map and a null entry. For example if we have an Array [null, null] then the extractRawValues will return an empty List which prevents us from handling it correctly and throwing the "Arrays not supported" exception.

} else {
throw new SqlIllegalArgumentException("Cannot extract value [{}] from source", fieldName);
}
}
return unwrapMultiValue(value);
}

private Tuple<Object, Integer> extractAsDottedField(Map<String, Object> map, int idx, String node) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking to remove this extracted code back to the body of extractFromSource() as now we create an extra Tuple and Integer object which happens for every document returned. What do you think?

Also the code might be a bit ugly but I wanted to avoid calling recursively a method (with every combination of a.b, a.b.c, etc.) Happy to hear better suggestions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. It looks like a small method and while it might be inlined the extra Tuple/Integer are boiler-plate.
If the method gets unrolled, both the index and the found value will be available without wrapping/boxing.

@matriv
Copy link
Contributor Author

matriv commented Jan 11, 2019

run gradle build tests 2

@matriv matriv requested a review from imotov January 11, 2019 16:43
Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
I like the lack of virtual calls - the structure is small enough that we can extract the values through loops. +1 on unrolling the method so there's no additional method call.

Regarding tests, I would add a few more examples such as : a.b { c { d.e } } a.b.c { d { e } and {a.b { c.d.e} } just to be on the safe side.

@@ -39,7 +40,7 @@
*/
private static String[] sourcePath(String name, boolean useDocValue, String hitName) {
return useDocValue ? Strings.EMPTY_ARRAY : Strings
.tokenizeToStringArray(hitName == null ? name : name.substring(hitName.length() + 1), ".");
.tokenizeToStringArray(hitName == null ? name : name.substring(hitName.length() + 1), ".");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's definitely some formatting differences here.
Anyway, make sure you set your IDE to format only the lines that were modified as that prevents unnecessary changes like the above.

} else {
throw new SqlIllegalArgumentException("Cannot extract value [{}] from source", fieldName);
}
}
return unwrapMultiValue(value);
}

private Tuple<Object, Integer> extractAsDottedField(Map<String, Object> map, int idx, String node) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. It looks like a small method and while it might be inlined the extra Tuple/Integer are boiler-plate.
If the method gets unrolled, both the index and the found value will be available without wrapping/boxing.

@@ -178,8 +204,8 @@ public boolean equals(Object obj) {
}
FieldHitExtractor other = (FieldHitExtractor) obj;
return fieldName.equals(other.fieldName)
&& hitName.equals(other.hitName)
&& useDocValue == other.useDocValue;
&& hitName.equals(other.hitName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again with the formatting...

@matriv
Copy link
Contributor Author

matriv commented Jan 11, 2019

@costin I don't know if you noticed already because it was my last commit but I added this: https://github.com/elastic/elasticsearch/pull/37364/files#diff-9aaee9be08445653bba7407b9f2b5ca3R269 Would you like to still have other cases in separate methods?

@costin
Copy link
Member

costin commented Jan 11, 2019

The random is fine :)

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice, elegant solution.

Copy link
Contributor

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can improve it a bit.

DELETE test

PUT /test/blah/1
{
  "name.full": "Shane Connelly",
  "name.parts.first": "Shane"
}

PUT /test/blah/2
{
  "name": {
    "full": "Shane Connelly"
  },
  "name.parts": {
    "first": "Shane"
  }
}

POST /_xpack/sql?format=txt
{
  "query": "DESCRIBE test"
}

POST /_xpack/sql?format=txt
{
  "query": "select * from test"
}

GET /test/_search
{
  "_source": {
    "includes": "name.parts.first"
  }
}

@matriv
Copy link
Contributor Author

matriv commented Jan 12, 2019

@imotov I don't get your comment, how to improve it? in your example

POST /_xpack/sql?format=txt
{
  "query": "select * from test"
}

returns:

   name.full   |name.parts.first
---------------+----------------
Shane Connelly |Shane
Shane Connelly |Shane

@imotov
Copy link
Contributor

imotov commented Jan 12, 2019

Interesting. I ran it twice with this PR by itself and with this PR merged into master and every time I am getting

   name.full   |name.parts.first
---------------+----------------
Shane Connelly |Shane           
Shane Connelly |null            

Could you add an additional record like this and see what do you get back?

PUT /test/blah/3
{
  "name.parts": {
    "first": "Shane I"
  },
  "name": {
    "parts.first": "Shane II",
    "full": "Shane Connelly"
  }
}

I am now getting

   name.full   |name.parts.first
---------------+----------------
Shane Connelly |Shane           
Shane Connelly |null            
Shane Connelly |Shane II        

@matriv
Copy link
Contributor Author

matriv commented Jan 12, 2019

I get:

   name.full   |name.parts.first
---------------+----------------
Shane Connelly |Shane
Shane Connelly |Shane
Shane Connelly |Shane II

:-)

@matriv
Copy link
Contributor Author

matriv commented Jan 12, 2019

Hm, I cleaned and recompiled and I get the null too, will check it, thx!

@matriv
Copy link
Contributor Author

matriv commented Jan 13, 2019

@imotov @costin @astefan Adjusted the code to handle multiple entries in the map with common prefixes. Please take another look.

Many thanks @imotov for catching that!

@matriv
Copy link
Contributor Author

matriv commented Jan 13, 2019

Now the query yields:

   name.full   |name.parts.first
---------------+----------------
Shane Connelly |Shane
Shane Connelly |Shane
Shane Connelly |Shane I

so it's Shane I and not Shane II since we're searching for the longer path in the map that yields non-null value. What do you think?

@matriv
Copy link
Contributor Author

matriv commented Jan 13, 2019

@costin Check this: https://github.com/elastic/elasticsearch/pull/37364/files#diff-9aaee9be08445653bba7407b9f2b5ca3R312
The previous tests only had a single entry in the map.

Will make the changes to catch the same path(s)/different hierarchy(ies) and throw an exception about not supporting arrays (multi-values).

@matriv
Copy link
Contributor Author

matriv commented Jan 13, 2019

The difference now is here: https://github.com/elastic/elasticsearch/pull/37364/files#diff-6d292f4ab3ddfc415649c328fd2faefeR157
So we don't stop once a non-null value is found in the map but we keep walking the path and holding the latest non-null value found.

@matriv
Copy link
Contributor Author

matriv commented Jan 14, 2019

Pushed the new approach to handle multiple values for the same path but different hierarchies.

@matriv
Copy link
Contributor Author

matriv commented Jan 14, 2019

run default distro tests

1 similar comment
@matriv
Copy link
Contributor Author

matriv commented Jan 14, 2019

run default distro tests

@matriv
Copy link
Contributor Author

matriv commented Jan 14, 2019

run gradle build tests 1

Copy link
Contributor

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@matriv
Copy link
Contributor Author

matriv commented Jan 14, 2019

@imotov thanks for checking and catching this.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments. Let's get this in.


// Used to avoid recursive method calls
// Holds the sub-maps in the document hierarchy that are pending to be inspected.
LinkedList<Map<String, Object>> queue = new LinkedList<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nitpick - use the Deque instead of LinkedList.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the index and the Map are associated, how about using only one Deque which holds a Tuple instead of two Deque:

Deque<Tuple<Map<String, Object> index>>` queue = ...

if (node instanceof Map) {
   queue.add(new Tuple<>(node, Integer.valueOf(i));
}

Copy link
Contributor Author

@matriv matriv Jan 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I had those 2 before, while implementing. :-)

I chose LinkedList as we don't need any special operations of queue (we don't care about the order of dequeuing). I've just read about the better performance of ArrayDequeue over LinkedList and I'll change.

For the Tuple I had it as you said but I changed to a separate structure that holds integers to avoid the instantiation of Tuple objects on top of the Integer boxing. What do you think?

Copy link
Contributor Author

@matriv matriv Jan 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And after benchmarking: https://gist.github.com/matriv/07e5909a49bed9794f350f458a0b2c60
The results show:

Benchmark                      Mode  Cnt       Score       Error  Units
MyBenchmark.testArrayDequeue  thrpt   25  608384,000 ±  7893,014  ops/s
MyBenchmark.testLinkedList    thrpt   25  581316,884 ±  8028,331  ops/s

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Autoboxing already happens and I wouldn't worry to much about it considering the depth is not that big. Same for Linked vs Array (in general arrays are faster except for inserting in the middle as that requires resizing/copying at which the linked structure excels).
I think the Tuple makes the code a bit more compact and safe (the queues cannot get out of sync) and more readable/simple code always trumps optimization (especially micro ones as here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI regarding the Tuples:
https://gist.github.com/matriv/3d2557ae1621e99a3c9505b5e4e6998f
and the result:

Benchmark                      Mode  Cnt       Score      Error  Units
MyBenchmark.testNoTuples      thrpt   25  601543,170 ± 7859,704  ops/s
MyBenchmark.testTuples        thrpt   25  573953,828 ± 5987,749  ops/s

if (node instanceof Map) {
// Add the sub-map to the queue along with the current path index
queue.add((Map<String, Object>) node);
idxQueue.add(i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add i+1 to the queue since the next loop should start from the next position?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no reason, I just preferred to start the i index of the for loop from prevPosition + 1 just because personally I see it more "visible". Would you prefer it your way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No.

@matriv matriv merged commit b594e81 into elastic:master Jan 15, 2019
@matriv matriv deleted the mt/fix-37128 branch January 15, 2019 07:41
matriv added a commit that referenced this pull request Jan 15, 2019
Adjust FieldExtractor to handle fields which contain `.` in their
name, regardless where they fall in, in the document hierarchy. E.g.:

```
{
  "a.b": "Elastic Search"
}

{
  "a": {
    "b.c": "Elastic Search"
  }
}

{
  "a.b": {
    "c": {
      "d.e" : "Elastic Search"
    }
  }
}
```

Fixes: #37128
@matriv
Copy link
Contributor Author

matriv commented Jan 15, 2019

Backported to 6.6 with 874b5ec

matriv added a commit that referenced this pull request Jan 15, 2019
Adjust FieldExtractor to handle fields which contain `.` in their
name, regardless where they fall in, in the document hierarchy. E.g.:

```
{
  "a.b": "Elastic Search"
}

{
  "a": {
    "b.c": "Elastic Search"
  }
}

{
  "a.b": {
    "c": {
      "d.e" : "Elastic Search"
    }
  }
}
```

Fixes: #37128
@matriv
Copy link
Contributor Author

matriv commented Jan 15, 2019

Backported to 6.5 with c72bd51

matriv added a commit that referenced this pull request Jan 15, 2019
Adjust FieldExtractor to handle fields which contain `.` in their
name, regardless where they fall in, in the document hierarchy. E.g.:

```
{
  "a.b": "Elastic Search"
}

{
  "a": {
    "b.c": "Elastic Search"
  }
}

{
  "a.b": {
    "c": {
      "d.e" : "Elastic Search"
    }
  }
}
```

Fixes: #37128
@matriv
Copy link
Contributor Author

matriv commented Jan 15, 2019

Backported to 6.x with fd3701b

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 15, 2019
* elastic/master:
  Docs be explicit on how to turn off deprecated auditing (elastic#37316)
  Fix line length for monitor and remove suppressions (elastic#37456)
  Fix IndexShardTestCase.recoverReplica(IndexShard, IndexShard, boolean) (elastic#37414)
  Update the Flush API documentation (elastic#33551)
  [TEST] Muted testDifferentRolesMaintainPathOnRestart
  Remove dead code from ShardSearchStats (elastic#37421)
  Simplify testSendSnapshotSendsOps (elastic#37445)
  SQL: Fix issue with field names containing "." (elastic#37364)
  Restore lost @Inject annotation (elastic#37452)
imotov added a commit to imotov/elasticsearch that referenced this pull request Jan 15, 2019
Throws an exception if hit extractor tries to retrieve unsupported
object. For example, selecting "a" from `{"a": {"b": "c"}}` now throws
an exception instead of returning null.

Relates to elastic#37364
@imotov
Copy link
Contributor

imotov commented Jan 15, 2019

@matriv I am sorry. It looks like I missed another edge case during review and immediately got bitten by it in geosql, which actually can extract objects from source in some cases. I opened #37502 to fix it.

imotov added a commit that referenced this pull request Jan 18, 2019
Throws an exception if hit extractor tries to retrieve unsupported
object. For example, selecting "a" from `{"a": {"b": "c"}}` now throws
an exception instead of returning null.

Relates to #37364
imotov added a commit that referenced this pull request Jan 18, 2019
Throws an exception if hit extractor tries to retrieve unsupported
object. For example, selecting "a" from `{"a": {"b": "c"}}` now throws
an exception instead of returning null.

Relates to #37364
imotov added a commit that referenced this pull request Jan 18, 2019
Throws an exception if hit extractor tries to retrieve unsupported
object. For example, selecting "a" from `{"a": {"b": "c"}}` now throws
an exception instead of returning null.

Relates to #37364
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants