-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add calculated numeric fields #69531
Add calculated numeric fields #69531
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look. Don't know how I feel about the memory index yet. I guess I'll have to reread its source code....
@@ -476,6 +508,11 @@ public Short parse(XContentParser parser, boolean coerce) throws IOException { | |||
return parser.shortValue(coerce); | |||
} | |||
|
|||
@Override | |||
public BiFunction<Script, ScriptService, MapperScript> compiler(String fieldName) { | |||
return LONG.compiler(fieldName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok to "silently" promote here? I think I'd prefer a more conservative "no, you can't make this one calculated, use long" message for now and to relax it in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++, have changed these to throw explicit errors
server/src/main/java/org/elasticsearch/index/mapper/IndexTimeScriptContext.java
Outdated
Show resolved
Hide resolved
private void executeFieldScript(String field) throws IOException { | ||
assert mapperScripts.containsKey(field); | ||
mapperScripts.get(field).script.accept(this); | ||
mapperScripts.get(field).script = c -> {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This stops scripts being executed multiple times if they are referred to by other scripts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to set a boolean. I expect there won't be a real speed difference and it'd probably be a bit easier to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test for when a calculated field depends on a runtime field? I think we said we didn't want that to work, but I think we should send a nice error message back to folks. It'd be lovely if we could do that on mapping update, but that may be a "for later" thing.
server/src/main/java/org/elasticsearch/index/mapper/IndexTimeScriptContext.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/IndexTimeScriptContext.java
Outdated
Show resolved
Hide resolved
private void executeFieldScript(String field) throws IOException { | ||
assert mapperScripts.containsKey(field); | ||
mapperScripts.get(field).script.accept(this); | ||
mapperScripts.get(field).script = c -> {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to set a boolean. I expect there won't be a real speed difference and it'd probably be a bit easier to follow.
server/src/main/java/org/elasticsearch/index/mapper/MappingLookup.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
Outdated
Show resolved
Hide resolved
|
||
@Override | ||
public void executeAndIndex(SearchLookup lookup, LeafReaderContext ctx, int doc, Consumer<Number> emitter) { | ||
LongFieldScript s = scriptFactory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it'd be cool if LongFieldScript
took a LongConsumer
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in, runFocDoc(doc, consumer)
type of thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. Runtime fields needs an array with a length. Here we just want to immediately sync everything into the emitter. So maybe runtime fields should have the array and this should just sync the result like s.runForDoc(doc, l -> emitter.accept(l))
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be reasonable to save that for a follow up. I'm convinced it'd save code and wouldn't slow things down but I'm not convinced it'd be a hell of a lot cleaner. It's not a blocker at least.
server/src/main/java/org/elasticsearch/index/mapper/ScriptParams.java
Outdated
Show resolved
Hide resolved
At the moment this doesn't throw an error, because we make no distinction in the MappingLookup between runtime fields and concrete fields. @javanna and I talked about this a bit in #70065 and were unsure if we wanted to make the distinction, but the fact that it's needed in at least two places now (here + index sorting) suggests that we need it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening this! I did a first review round and left some design comments, mostly around abstractions and naming. I think that the functionality is there but I am hoping that we can find ways to simplify the flow a bit, I also asked @jtibshirani to have a look.
server/src/main/java/org/elasticsearch/index/mapper/DocumentParser.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/IndexTimeScriptContext.java
Outdated
Show resolved
Hide resolved
} | ||
} | ||
|
||
private class LazyDocumentReader extends LeafReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this become a top level class? Then it could have a more indicative name and be tested separately?
server/src/main/java/org/elasticsearch/index/mapper/IndexTimeScriptContext.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/ScriptParams.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/ScriptParams.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/IndexTimeScriptContext.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
Show resolved
Hide resolved
I don't know the answer yet, but I am wondering if there are ways to detect this without introducing a distinction between runtime fields and concrete fields in |
Latest reworking:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did another review round and left some more comments, I like the direction, it's getting simpler and hopefully we can simplify this even further, but it's close.
// does not attempt to do any analysis on text fields. It also supports stored | ||
// fields where MemoryIndex does not. It is used to back index-time scripts that | ||
// reference field data and stored fields from a document that has not yet been | ||
// indexed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding this! any reason why it isn't javadocs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it was originally on a private class and I didn't convert it when I moved it to a top-level one. Will update!
@@ -61,7 +68,7 @@ ParsedDocument parseDocument(SourceToParse source, | |||
source, | |||
parser); | |||
validateStart(parser); | |||
internalParseDocument(mappingLookup.getMapping().getRoot(), metadataFieldsMappers, context, parser); | |||
internalParseDocument(mappingLookup, metadataFieldsMappers, context, parser); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think that you can revert this change now, the root object should be enough as an argument.
server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
Show resolved
Hide resolved
return null; | ||
} | ||
assert compiler != null; | ||
return type.compiler(name).apply(script.get(), compiler); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking that the type can already return the MapperScript, no need for the function, that gets applied straightaway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I would make the method not abstract and have the default impl throw exception.
@@ -120,6 +149,10 @@ public NumberFieldMapper build(ContentPath contentPath) { | |||
} | |||
} | |||
|
|||
public interface MapperScript { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add javadocs? his interface will be shared throughout the different implementations later, so maybe it should be extracted from here already? I am thinking that given we don't care about auto-boxing, we can type it.
import static org.hamcrest.Matchers.containsString; | ||
import static org.hamcrest.Matchers.equalTo; | ||
|
||
public class CalculatedFieldTests extends MapperServiceTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that we are not using the calculated terminology anywhere else at this point, maybe we should not use it here either. Do we need a specific test class for these tests or could they fit in the ordinary mapper tests? What tests will we add as we support more types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of these tests are generic 'does this work at all' tests, and some are more specific to the impls. I think it will become clearer which are which when we add other field types.
server/src/test/java/org/elasticsearch/index/mapper/NumberFieldMapperTests.java
Show resolved
Hide resolved
@@ -49,6 +49,11 @@ protected void registerParameters(ParameterChecker checker) throws IOException { | |||
m -> assertTrue(((HistogramFieldMapper)m).ignoreMalformed())); | |||
} | |||
|
|||
@Override | |||
protected boolean supportsScripts() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we unify naming between this and allowsIndexTimeScript
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is not actually about index time scripts, it's about access to field data via a lookup. Maybe it should be supportsLookup
instead?
|
||
// field scripts can be called both by executeIndexTimeScripts() and via the document reader, | ||
// so to ensure that we don't run field scripts multiple times we guard them with | ||
// this one-time executor class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be javadocs?
/** | ||
* Returns an index time script for this field mapper, or {@code null} if none is defined | ||
*/ | ||
public IndexTimeScript getIndexTimeScript() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish that we did not have to go from MapperScript to IndexTimeScript, and rather reuse the same concept. I still wonder if this could be void executeScript(SearchLookup, LeafReaderContext, ParseContext)
? We could make the notion of scripted field known to FieldMapper, let MappingLookup collect all mappers that have a script declared, then each one of those has the execute method called. That way you can also ensure the same behaviour once you add this functionality to other mappers?
This suggestion goes against another one I made on making OneTimeFieldExecutor implement IndexTimeScript. MAybe with this suggestion IndexTimeScript could go away and we would have to see what to do with the one time executor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the tricky part here is that MapperScripts can be executed both at index time and at fetch time (for the value fetcher) and those two contexts have entirely different parameter requirements and so different signatures. I think I can get rid of OneTimeFieldExecutor entirely and just have an anonymous consumer within DocumentParser, and I may be able to unify these two as well by having a class with two methods one it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea I figured why MapperScript is there, and maybe that should be the only abstraction of a script for this functionality. I don't see the value of the indirection that IndexTimeScript causes now, as it could be just a method exposed directly by the mapper. Maybe you can give that a try and see what happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to have some way of knowing whether or not a FieldMapper has a script on it, so that we can correctly intercept calls to those fields from the index-time SearchLookup and allow scripted fields to refer to each other. The advantage of having this extra indirection is that it supplies this directly - if the IndexTimeScript is null, then the field doesn't have one. If we move this to FieldMapper#executeScript then we need to add an additional hasIndexTimeScript
boolean, which seems like the wrong trade-off to me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make FieldMapper script aware, more mappers will support it anyways, and FieldMapper already has getIndexTimeScript anyways for that purpose, plus executing the script itself. Would that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I have updated things:
- IndexTimeScript is gone
- FieldMapper has two methods,
hasIndexTimeScript
andexecuteIndexTimeScript
@Override | ||
public int advance(int target) { | ||
return 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might expose ourselves to subtle bugs by not implementing the DocIdSetIterator contract correctly. Let's make sure that docID()
returns -1 when the iterator is not positioned and make nextDoc
/advance
return NO_MORE_DOCS
when appropriate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for all the iterations @romseygeek I left some super minor comments but this looks ready to me. The only thing that is left for us to settle on is the default behaviour when an error occurs running a script: reject or ignore? I know that @jimczi had an opinion on this. I am good with strict by default, and lenient when specifically configured. And maybe we should re-evaluate once we add the same handling to runtime fields.
|
||
/** | ||
* A LeafReader over a lucene document that exposes doc values and stored fields. | ||
* Note that unlike lucene's MemoryIndex implementation, this holds no state and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make LeafReader and MemoryIndex links? :)
@@ -167,6 +175,14 @@ FieldTypeLookup fieldTypesLookup() { | |||
return fieldTypeLookup; | |||
} | |||
|
|||
FieldTypeLookup indexTimeLookup() { | |||
return indexTimeLookup; | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it still bugs me slightly that we end up exposing this. Thinking of alternatives, we could move a portion of ParseContext#executeIndexTimeScripts to MappingLookup and rather expose a method that returns a Map<String, Consumer> but I am not convinced this is a good idea either ;) maybe you have other ideas
@@ -108,9 +129,17 @@ public Builder docValues(boolean hasDocValues) { | |||
return this; | |||
} | |||
|
|||
private FieldValues<Number> mapperScript() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename the method, looks like a leftover?
@@ -1100,8 +1210,27 @@ protected void parseCreateField(ParseContext context) throws IOException { | |||
} | |||
} | |||
|
|||
@Override | |||
public boolean hasIndexTimeScript() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove indexTime from the name of these two methods. There is only one script in a mapper anyways.
} | ||
|
||
@Override | ||
public void executeIndexTimeScript(SearchLookup searchLookup, LeafReaderContext readerContext, ParseContext parseContext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would add the docId as an argument here, even if it will always be zero, just because the other side is the one that knows about what type of reader there is and that it holds a single document.
/** | ||
* @return whether this field mapper uses a script to generate its values | ||
*/ | ||
public boolean hasIndexTimeScript() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see how having two methods is not fantastic, and why you had done it differently before. I had envisioned script as a member of FieldMapper directly, but we are going to see if that is possible once we add support for script to other mappers. The type of the consumer will make it possibly harder to share the impl but we'll see.
I am happy though with the execute method, I find it much clearer than returning an executor like we had before, because it is evident what it does and easier to trace.
double[] vs = s.values(); | ||
for (int i = 0; i < s.count(); i++) { | ||
consumer.accept(vs[i]); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked before if the bit from runForDoc till the end can be a new method exposed by the Script class, in the effort of consolidating how field script classes expose their values. Do you have thoughts on this? We could also do it later
@@ -54,7 +54,9 @@ private CacheKey() {} | |||
private final Map<String, ObjectMapper> objectMappers; | |||
private final boolean hasNested; | |||
private final FieldTypeLookup fieldTypeLookup; | |||
private final FieldTypeLookup indexTimeLookup; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we have some docs around what this is for and why? I know that I will look at it in a few months and wonder...but maybe just reading "this resolves all of the fields besides runtime fields etc." would help me
import static org.hamcrest.Matchers.containsString; | ||
import static org.hamcrest.Matchers.equalTo; | ||
|
||
public class IndexTimeScriptTests extends MapperServiceTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ on the new name ;)
|
||
public class IndexTimeScriptTests extends MapperServiceTestCase { | ||
|
||
public void testCalculatedFieldLength() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: update the method name here too?
This commit adds a script parameter to long and double fields that makes it possible to calculate a value for these fields at index time. It uses the same script context as the equivalent runtime fields, and allows for multiple index-time scripted fields to cross-refer while still checking for indirection loops.
This commit adds a script parameter to long and double fields that makes it possible to calculate a value for these fields at index time. It uses the same script context as the equivalent runtime fields, and allows for multiple index-time scripted fields to cross-refer while still checking for indirection loops.
|
||
`on_script_error`:: | ||
|
||
Defines what to do if the script defined by the `script` parameter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to be leaving a comment so late -- was just looking how this turned out! I was wondering if we considered re-using the ignore_malformed
boolean flag here instead of introducing a new parameter? That feels simple and consistent to me, and avoids introducing a new enum parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks Julie, good question. It all started from "let's support some ignore_malformed behaviour" but then we decided to make it specific to scripts especially as we were not convinced that the two would have the same default values, as far as I recall. Another idea could be to support the new parameter as part of the definition of the script:
"field" : {
"script" : {
"source" : "here goes the script",
"on_error" : "continue|fail"
}
}
There is still time to change this if we want to, so let's discuss this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did discuss re-using ignore_malformed
but there are some important differences in the semantics of the two parameters. In particular, ignore_malformed
is specifically targeted at handling input data that cannot be converted into a number; because the scripts are typed, there is no way that they can return a malformed object, and so the same logic doesn't really apply. In addition, a script could throw an error for any number of reasons (eg logic errors, array index out of bounds exceptions) and ignore_malformed
feels too specific to cover all of these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nice that we are reusing the _ignored
meta field though. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks everyone for the context. I agree the name would feel a bit off, maybe if it had been called ignore_parse_errors
we could have reused it!
`script`:: | ||
|
||
If this parameter is set, then the field will index values generated | ||
by this script, rather than reading the values directly from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rather
suggests that values that are set on the document would always be ignored, is it correct? I wonder if we should make it more explicit.
Do we have tests for this case (I couldn't find them from a quick look but I might have just missed them), as well as for the case when the script computes values based on content of the document under the same field name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We reject documents that have a value for the field, if the field declares a script. We should probably clarify that in the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to explicitly documenting this, I will open a followup
Pinging @elastic/es-search (Team:Search) |
This commit adds a
script
parameter tolong
anddouble
fields that makesit possible to calculate a value for these fields at index time. It uses the same
script context as the equivalent runtime fields, and allows for multiple index-time
scripted feels to cross-refer while still checking for indirection loops.