Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add calculated numeric fields #69531

Merged
merged 58 commits into from
Mar 31, 2021

Conversation

romseygeek
Copy link
Contributor

@romseygeek romseygeek commented Feb 24, 2021

This commit adds a script parameter to long and double fields that makes
it possible to calculate a value for these fields at index time. It uses the same
script context as the equivalent runtime fields, and allows for multiple index-time
scripted feels to cross-refer while still checking for indirection loops.

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a look. Don't know how I feel about the memory index yet. I guess I'll have to reread its source code....

@@ -476,6 +508,11 @@ public Short parse(XContentParser parser, boolean coerce) throws IOException {
return parser.shortValue(coerce);
}

@Override
public BiFunction<Script, ScriptService, MapperScript> compiler(String fieldName) {
return LONG.compiler(fieldName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to "silently" promote here? I think I'd prefer a more conservative "no, you can't make this one calculated, use long" message for now and to relax it in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++, have changed these to throw explicit errors

@romseygeek romseygeek changed the title [DRAFT] Add calculated fields, take 3 Add calculated fields, take 3 Mar 5, 2021
@romseygeek romseygeek marked this pull request as ready for review March 5, 2021 16:47
@romseygeek romseygeek changed the title Add calculated fields, take 3 Add calculated numeric fields Mar 8, 2021
private void executeFieldScript(String field) throws IOException {
assert mapperScripts.containsKey(field);
mapperScripts.get(field).script.accept(this);
mapperScripts.get(field).script = c -> {};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stops scripts being executed multiple times if they are referred to by other scripts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to set a boolean. I expect there won't be a real speed difference and it'd probably be a bit easier to follow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for when a calculated field depends on a runtime field? I think we said we didn't want that to work, but I think we should send a nice error message back to folks. It'd be lovely if we could do that on mapping update, but that may be a "for later" thing.

private void executeFieldScript(String field) throws IOException {
assert mapperScripts.containsKey(field);
mapperScripts.get(field).script.accept(this);
mapperScripts.get(field).script = c -> {};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to set a boolean. I expect there won't be a real speed difference and it'd probably be a bit easier to follow.


@Override
public void executeAndIndex(SearchLookup lookup, LeafReaderContext ctx, int doc, Consumer<Number> emitter) {
LongFieldScript s = scriptFactory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it'd be cool if LongFieldScript took a LongConsumer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in, runFocDoc(doc, consumer) type of thing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Runtime fields needs an array with a length. Here we just want to immediately sync everything into the emitter. So maybe runtime fields should have the array and this should just sync the result like s.runForDoc(doc, l -> emitter.accept(l)).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be reasonable to save that for a follow up. I'm convinced it'd save code and wouldn't slow things down but I'm not convinced it'd be a hell of a lot cleaner. It's not a blocker at least.

@romseygeek
Copy link
Contributor Author

Could you add a test for when a calculated field depends on a runtime field? I think we said we didn't want that to work, but I think we should send a nice error message back to folks. It'd be lovely if we could do that on mapping update, but that may be a "for later" thing.

At the moment this doesn't throw an error, because we make no distinction in the MappingLookup between runtime fields and concrete fields. @javanna and I talked about this a bit in #70065 and were unsure if we wanted to make the distinction, but the fact that it's needed in at least two places now (here + index sorting) suggests that we need it.

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this! I did a first review round and left some design comments, mostly around abstractions and naming. I think that the functionality is there but I am hoping that we can find ways to simplify the flow a bit, I also asked @jtibshirani to have a look.

}
}

private class LazyDocumentReader extends LeafReader {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this become a top level class? Then it could have a more indicative name and be tested separately?

@javanna
Copy link
Member

javanna commented Mar 15, 2021

At the moment this doesn't throw an error, because we make no distinction in the MappingLookup between runtime fields and concrete fields. @javanna and I talked about this a bit in #70065 and were unsure if we wanted to make the distinction, but the fact that it's needed in at least two places now (here + index sorting) suggests that we need it.

I don't know the answer yet, but I am wondering if there are ways to detect this without introducing a distinction between runtime fields and concrete fields in MappingLookup. I wonder about throwing an error at mapping update also: it sounds like what we should do, but how? Is the plan to find out by executing the script, or to retrieve the fields that the script refers to and check them one by one (and their dependencies, and so on)?

@romseygeek
Copy link
Contributor Author

Latest reworking:

  • PostParsePhase etc are gone, we now have a method on DocumentParser which makes much more sense
  • LazyDocumentReader is now DocumentLeafReader (it's not particularly lazy anymore!) and lives in its own class
  • PostParseExecutor is now IndexTimeScript
  • MapperScript is no longer a top-level class, although we still need an interface inside NumberFieldMapper so that we can build both long and double scripts.

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did another review round and left some more comments, I like the direction, it's getting simpler and hopefully we can simplify this even further, but it's close.

// does not attempt to do any analysis on text fields. It also supports stored
// fields where MemoryIndex does not. It is used to back index-time scripts that
// reference field data and stored fields from a document that has not yet been
// indexed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding this! any reason why it isn't javadocs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it was originally on a private class and I didn't convert it when I moved it to a top-level one. Will update!

@@ -61,7 +68,7 @@ ParsedDocument parseDocument(SourceToParse source,
source,
parser);
validateStart(parser);
internalParseDocument(mappingLookup.getMapping().getRoot(), metadataFieldsMappers, context, parser);
internalParseDocument(mappingLookup, metadataFieldsMappers, context, parser);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think that you can revert this change now, the root object should be enough as an argument.

return null;
}
assert compiler != null;
return type.compiler(name).apply(script.get(), compiler);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking that the type can already return the MapperScript, no need for the function, that gets applied straightaway?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I would make the method not abstract and have the default impl throw exception.

@@ -120,6 +149,10 @@ public NumberFieldMapper build(ContentPath contentPath) {
}
}

public interface MapperScript {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add javadocs? his interface will be shared throughout the different implementations later, so maybe it should be extracted from here already? I am thinking that given we don't care about auto-boxing, we can type it.

import static org.hamcrest.Matchers.containsString;
import static org.hamcrest.Matchers.equalTo;

public class CalculatedFieldTests extends MapperServiceTestCase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that we are not using the calculated terminology anywhere else at this point, maybe we should not use it here either. Do we need a specific test class for these tests or could they fit in the ordinary mapper tests? What tests will we add as we support more types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these tests are generic 'does this work at all' tests, and some are more specific to the impls. I think it will become clearer which are which when we add other field types.

@@ -49,6 +49,11 @@ protected void registerParameters(ParameterChecker checker) throws IOException {
m -> assertTrue(((HistogramFieldMapper)m).ignoreMalformed()));
}

@Override
protected boolean supportsScripts() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we unify naming between this and allowsIndexTimeScript ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is not actually about index time scripts, it's about access to field data via a lookup. Maybe it should be supportsLookup instead?


// field scripts can be called both by executeIndexTimeScripts() and via the document reader,
// so to ensure that we don't run field scripts multiple times we guard them with
// this one-time executor class
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be javadocs?

/**
* Returns an index time script for this field mapper, or {@code null} if none is defined
*/
public IndexTimeScript getIndexTimeScript() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish that we did not have to go from MapperScript to IndexTimeScript, and rather reuse the same concept. I still wonder if this could be void executeScript(SearchLookup, LeafReaderContext, ParseContext)? We could make the notion of scripted field known to FieldMapper, let MappingLookup collect all mappers that have a script declared, then each one of those has the execute method called. That way you can also ensure the same behaviour once you add this functionality to other mappers?

This suggestion goes against another one I made on making OneTimeFieldExecutor implement IndexTimeScript. MAybe with this suggestion IndexTimeScript could go away and we would have to see what to do with the one time executor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the tricky part here is that MapperScripts can be executed both at index time and at fetch time (for the value fetcher) and those two contexts have entirely different parameter requirements and so different signatures. I think I can get rid of OneTimeFieldExecutor entirely and just have an anonymous consumer within DocumentParser, and I may be able to unify these two as well by having a class with two methods one it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea I figured why MapperScript is there, and maybe that should be the only abstraction of a script for this functionality. I don't see the value of the indirection that IndexTimeScript causes now, as it could be just a method exposed directly by the mapper. Maybe you can give that a try and see what happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to have some way of knowing whether or not a FieldMapper has a script on it, so that we can correctly intercept calls to those fields from the index-time SearchLookup and allow scripted fields to refer to each other. The advantage of having this extra indirection is that it supplies this directly - if the IndexTimeScript is null, then the field doesn't have one. If we move this to FieldMapper#executeScript then we need to add an additional hasIndexTimeScript boolean, which seems like the wrong trade-off to me?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make FieldMapper script aware, more mappers will support it anyways, and FieldMapper already has getIndexTimeScript anyways for that purpose, plus executing the script itself. Would that work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I have updated things:

  • IndexTimeScript is gone
  • FieldMapper has two methods, hasIndexTimeScript and executeIndexTimeScript

@Override
public int advance(int target) {
return 0;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might expose ourselves to subtle bugs by not implementing the DocIdSetIterator contract correctly. Let's make sure that docID() returns -1 when the iterator is not positioned and make nextDoc/advance return NO_MORE_DOCS when appropriate?

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for all the iterations @romseygeek I left some super minor comments but this looks ready to me. The only thing that is left for us to settle on is the default behaviour when an error occurs running a script: reject or ignore? I know that @jimczi had an opinion on this. I am good with strict by default, and lenient when specifically configured. And maybe we should re-evaluate once we add the same handling to runtime fields.


/**
* A LeafReader over a lucene document that exposes doc values and stored fields.
* Note that unlike lucene's MemoryIndex implementation, this holds no state and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make LeafReader and MemoryIndex links? :)

@@ -167,6 +175,14 @@ FieldTypeLookup fieldTypesLookup() {
return fieldTypeLookup;
}

FieldTypeLookup indexTimeLookup() {
return indexTimeLookup;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it still bugs me slightly that we end up exposing this. Thinking of alternatives, we could move a portion of ParseContext#executeIndexTimeScripts to MappingLookup and rather expose a method that returns a Map<String, Consumer> but I am not convinced this is a good idea either ;) maybe you have other ideas

@@ -108,9 +129,17 @@ public Builder docValues(boolean hasDocValues) {
return this;
}

private FieldValues<Number> mapperScript() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename the method, looks like a leftover?

@@ -1100,8 +1210,27 @@ protected void parseCreateField(ParseContext context) throws IOException {
}
}

@Override
public boolean hasIndexTimeScript() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove indexTime from the name of these two methods. There is only one script in a mapper anyways.

}

@Override
public void executeIndexTimeScript(SearchLookup searchLookup, LeafReaderContext readerContext, ParseContext parseContext) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would add the docId as an argument here, even if it will always be zero, just because the other side is the one that knows about what type of reader there is and that it holds a single document.

/**
* @return whether this field mapper uses a script to generate its values
*/
public boolean hasIndexTimeScript() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how having two methods is not fantastic, and why you had done it differently before. I had envisioned script as a member of FieldMapper directly, but we are going to see if that is possible once we add support for script to other mappers. The type of the consumer will make it possibly harder to share the impl but we'll see.

I am happy though with the execute method, I find it much clearer than returning an executor like we had before, because it is evident what it does and easier to trace.

double[] vs = s.values();
for (int i = 0; i < s.count(); i++) {
consumer.accept(vs[i]);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked before if the bit from runForDoc till the end can be a new method exposed by the Script class, in the effort of consolidating how field script classes expose their values. Do you have thoughts on this? We could also do it later

@@ -54,7 +54,9 @@ private CacheKey() {}
private final Map<String, ObjectMapper> objectMappers;
private final boolean hasNested;
private final FieldTypeLookup fieldTypeLookup;
private final FieldTypeLookup indexTimeLookup;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have some docs around what this is for and why? I know that I will look at it in a few months and wonder...but maybe just reading "this resolves all of the fields besides runtime fields etc." would help me

import static org.hamcrest.Matchers.containsString;
import static org.hamcrest.Matchers.equalTo;

public class IndexTimeScriptTests extends MapperServiceTestCase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ on the new name ;)


public class IndexTimeScriptTests extends MapperServiceTestCase {

public void testCalculatedFieldLength() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: update the method name here too?

@romseygeek romseygeek merged commit 1653f2f into elastic:master Mar 31, 2021
@romseygeek romseygeek deleted the runtime/runtime-to-index-3 branch March 31, 2021 10:14
romseygeek added a commit to romseygeek/elasticsearch that referenced this pull request Mar 31, 2021
This commit adds a script parameter to long and double fields that makes
it possible to calculate a value for these fields at index time. It uses the same
script context as the equivalent runtime fields, and allows for multiple index-time
scripted fields to cross-refer while still checking for indirection loops.
romseygeek added a commit that referenced this pull request Mar 31, 2021
This commit adds a script parameter to long and double fields that makes
it possible to calculate a value for these fields at index time. It uses the same
script context as the equivalent runtime fields, and allows for multiple index-time
scripted fields to cross-refer while still checking for indirection loops.

`on_script_error`::

Defines what to do if the script defined by the `script` parameter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be leaving a comment so late -- was just looking how this turned out! I was wondering if we considered re-using the ignore_malformed boolean flag here instead of introducing a new parameter? That feels simple and consistent to me, and avoids introducing a new enum parameter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Julie, good question. It all started from "let's support some ignore_malformed behaviour" but then we decided to make it specific to scripts especially as we were not convinced that the two would have the same default values, as far as I recall. Another idea could be to support the new parameter as part of the definition of the script:

"field" : {
  "script" : {
    "source" : "here goes the script",
    "on_error" : "continue|fail"
  }
}

There is still time to change this if we want to, so let's discuss this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did discuss re-using ignore_malformed but there are some important differences in the semantics of the two parameters. In particular, ignore_malformed is specifically targeted at handling input data that cannot be converted into a number; because the scripts are typed, there is no way that they can return a malformed object, and so the same logic doesn't really apply. In addition, a script could throw an error for any number of reasons (eg logic errors, array index out of bounds exceptions) and ignore_malformed feels too specific to cover all of these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice that we are reusing the _ignored meta field though. 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks everyone for the context. I agree the name would feel a bit off, maybe if it had been called ignore_parse_errors we could have reused it!

`script`::

If this parameter is set, then the field will index values generated
by this script, rather than reading the values directly from the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rather suggests that values that are set on the document would always be ignored, is it correct? I wonder if we should make it more explicit.

Do we have tests for this case (I couldn't find them from a quick look but I might have just missed them), as well as for the case when the script computes values based on content of the document under the same field name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We reject documents that have a value for the field, if the field declares a script. We should probably clarify that in the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to explicitly documenting this, I will open a followup

@pugnascotia pugnascotia added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Apr 20, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Apr 20, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team v7.13.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants