Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: Ignore multivalued key columns in lookup index on JOIN #120726

Merged
merged 4 commits into from
Jan 28, 2025

Conversation

ivancea
Copy link
Contributor

@ivancea ivancea commented Jan 23, 2025

Fixes #118780

Second part of #120519

In the first PR, we avoid matching multivalue keys in lookup when they come from the query.
Now, we avoid matching multivalues when the lookup index has multivalues in the key column.

@ivancea ivancea added >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) auto-backport Automatically create backport pull requests when merged :Analytics/ES|QL AKA ESQL v9.0.0 v8.18.0 labels Jan 23, 2025
@ivancea ivancea requested review from nik9000 and alex-spies January 23, 2025 15:04
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

}
} catch (IOException e) {
// ignore
// TODO: Should we do something with the exception?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe wrap it into a runtime exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rethrowing it as unchecked

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap it, yeah. ESQL has kind of a rocky relationship with IOException. Pretty much everything should be allowed to throw it but we started wrapping it long ago and never stopped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this can throw if there's an IO error blowing up on the disk with lucene. It's not likely and probably terminal for the query. But we should bubble it up.

}
final int first = block.getFirstValueIndex(position);
return switch (count) {
Query doGetQuery(int position, int firstValueIndex, int valueCount) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those params are oddly specific, but at least we won't have to get them twice...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was in the main esql project, and we don't have access from compute. So as to avoid cloning the query class, and as it doesn't use anything other than compute and lucene classes, I moved it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't add new tests, as the existing ones failed successfully. So just updated them

/**
* LOOKUP JOIN without MV matching on lookup index key (https://github.com/elastic/elasticsearch/issues/118780)
*/
JOIN_LOOKUP_SKIP_MV_ON_LOOKUP_KEY(JOIN_LOOKUP_V11.isEnabled()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid renaming conficts, I'm adding a new cap. We can remove it later if we want

SingleValueMatchQuery singleValueQuery = new SingleValueMatchQuery(
searchExecutionContext.getForField(field, MappedFieldType.FielddataOperation.SEARCH),
// Not emitting warnings for multivalued fields not matching
Warnings.NOOP_WARNINGS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should just ignore this behavior. Today we do emit a warning if there's an equality operation involving a multi-value field, ie row x=[1,2,3],y=3 | where x == y. The joining key matching is basically doing the same thing, no?

Copy link
Contributor Author

@ivancea ivancea Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is the same, indeed.
I have doubts over the warnings here, as I thought that it would be "clear enough" in JOIN.
That said, yeah, we could use that argument for everything else, so I'll better add warnings. I'll customize them, as SingleValueMatchQuery sends a generic single-value function encountered multi-value. I'll add something more specific for JOIN

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some investigation and things, I have some local changes with this:

  1. Warnings are being correctly generated, nice!
  2. They are sent to coordinator correctly
  3. BUT the AsyncOperator (parent of the Lookup operator) doesn't merge them with the main ThreadLocal

It's not trivial to solve (Or at least, is a quite problematic topic). After commenting it a bit with Nik, we'll better leave the Warnings part for another PR.
If there's time to make it for the FF, that's fine. Otherwise, should be fine too (?). I'll keep working on it anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving a comment around warnings that should be addressed as a follow-up: we should limit the amount of warnings an operator throws for cases to avoid flooding the system with objects that end up being dropped due to the limit being reached - either by a single operator, or by multiple operators, either before or during the warnings being emitted.
This is similar to the logger usage where we're invoking methods without doing any args transformation in order to perform the check the logging statement first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do limit these warnings in other operators and think we mostly just stick to prior art here. I think it's fine to juggle them in a follow up. Honestly, I think we should rip out the thread local handling of warnings from ESQL..... It's too sneaky. But I don't know if we'll have time for that for a while.

@@ -277,14 +248,45 @@ private static DirectoryData makeDirectoryWith(List<List<String>> terms) throws
for (var termList : terms) {
Document doc = new Document();
for (String term : termList) {
doc.add(new StringField("uid", term, Field.Store.NO));
doc.add(new KeywordField("uid", term, Field.Store.NO));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to change this as to avoid an unexpected docvalues type NONE for field error in the tests, within SingleValueMatchQuery. And add some other objects to the SearchExecutionContext mock

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. StringField won't make doc values and we can't query them.

FWIW, I think we should make sure we fail correctly if you turn off doc values or searchability on a join key.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! That also means that text fields can't be joined. They just super don't have doc values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked it, but #119473 already disallowed joins on TEXT, so it's safe for now

@ivancea ivancea requested a review from astefan January 24, 2025 17:52
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with this change 👍

Also, I would keep this followup (null as a warning for MVs join key matching) somewhere on a list or new issue, just that we don't forget about it and we prioritize accordingly afterwards. Thanks.

# Conflicts:
#	x-pack/plugin/esql/qa/testFixtures/src/main/resources/lookup-join.csv-spec
SingleValueMatchQuery singleValueQuery = new SingleValueMatchQuery(
searchExecutionContext.getForField(field, MappedFieldType.FielddataOperation.SEARCH),
// Not emitting warnings for multivalued fields not matching
Warnings.NOOP_WARNINGS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do limit these warnings in other operators and think we mostly just stick to prior art here. I think it's fine to juggle them in a follow up. Honestly, I think we should rip out the thread local handling of warnings from ESQL..... It's too sneaky. But I don't know if we'll have time for that for a while.

}
} catch (IOException e) {
// ignore
// TODO: Should we do something with the exception?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap it, yeah. ESQL has kind of a rocky relationship with IOException. Pretty much everything should be allowed to throw it but we started wrapping it long ago and never stopped.

}
} catch (IOException e) {
// ignore
// TODO: Should we do something with the exception?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this can throw if there's an IO error blowing up on the disk with lucene. It's not likely and probably terminal for the query. But we should bubble it up.

@@ -277,14 +248,45 @@ private static DirectoryData makeDirectoryWith(List<List<String>> terms) throws
for (var termList : terms) {
Document doc = new Document();
for (String term : termList) {
doc.add(new StringField("uid", term, Field.Store.NO));
doc.add(new KeywordField("uid", term, Field.Store.NO));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. StringField won't make doc values and we can't query them.

FWIW, I think we should make sure we fail correctly if you turn off doc values or searchability on a join key.

@@ -277,14 +248,45 @@ private static DirectoryData makeDirectoryWith(List<List<String>> terms) throws
for (var termList : terms) {
Document doc = new Document();
for (String term : termList) {
doc.add(new StringField("uid", term, Field.Store.NO));
doc.add(new KeywordField("uid", term, Field.Store.NO));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! That also means that text fields can't be joined. They just super don't have doc values.

@ivancea ivancea enabled auto-merge (squash) January 28, 2025 12:41
@ivancea ivancea merged commit 8c3d9e1 into elastic:main Jan 28, 2025
16 checks passed
@ivancea ivancea deleted the esql-join-multivalue-on-index branch January 28, 2025 13:49
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

elasticsearchmachine pushed a commit that referenced this pull request Jan 28, 2025
#121037)

Fixes #118780

Second part of #120519

In the first PR, we avoid matching multivalue keys in lookup when they come from the query.
Now, we avoid matching multivalues when the lookup index has multivalues in the key column.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: LOOKUP JOIN should warn when join keys are multi-valued
5 participants