Store arrays offsets for keyword fields natively with synthetic source #113757
Conversation
The keyword doc values field gets an extra binary doc values field that encodes the order in which array values were specified at index time. This also captures duplicate values. This is stored as an offset-to-ordinal array that gets vint encoded into the binary doc values field. The additional storage required for this will likely be minimized by elastic#112416 (zstd compression for binary doc values).

For example, in case of the following string array for a keyword field: ["c", "b", "a", "c"]. Sorted set doc values: ["a", "b", "c"] with ordinals 0, 1 and 2. The offset array will be: [2, 1, 0, 2]

Limitations:
* only support for keyword field mapper
* multi level leaf arrays are flattened, for example: [[b], [c]] -> [b, c]
* empty arrays ([]) are not recorded
* arrays are always synthesized as one type; in case of a keyword field, [1, 2] gets synthesized as ["1", "2"]

These limitations can be addressed, but some require more complexity and/or additional storage.
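The offset-to-ordinal mapping described above can be sketched in a few lines. This is not the Elasticsearch implementation — the class and method names here are hypothetical, and a real implementation would work against Lucene's SortedSetDocValues rather than a TreeSet — but it shows how the example array maps onto ordinals:

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical sketch: derive the offset array (one ordinal per original
// array position, duplicates included) from the sorted set of unique values.
class OffsetSketch {
    static int[] offsets(String[] values) {
        // Sorted set doc values keep each distinct value once, in sorted order.
        String[] sorted = new TreeSet<>(Arrays.asList(values)).toArray(new String[0]);
        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            // The offset for position i is the ordinal of the i-th original value.
            result[i] = Arrays.binarySearch(sorted, values[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // ["c", "b", "a", "c"] -> sorted set ["a", "b", "c"] -> offsets [2, 1, 0, 2]
        System.out.println(Arrays.toString(offsets(new String[] { "c", "b", "a", "c" })));
    }
}
```

Synthesizing the source then amounts to replaying the offset array against the sorted set, which reproduces both the original order and the duplicate "c".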
public int getArrayValueCount(String field) {
    if (numValuesByField.containsKey(field)) {
        return numValuesByField.get(field) + 1;
Why +1?
numValuesByField returns the last offset; to get the count, the returned value needs to be incremented by one. numValuesByField should be named something else.
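The off-by-one discussed here can be made concrete with a small standalone sketch (the names mirror the diff, but the class below is illustrative, not the PR's actual code): the map stores the last offset recorded for a field, so the count is that offset plus one.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the counting semantics under discussion:
// the map holds the LAST offset handed out per field, not a count.
class OffsetCounter {
    private final Map<String, Integer> lastOffsetByField = new HashMap<>();

    void recordOffset(String field) {
        // First value gets offset 0; each subsequent value increments.
        lastOffsetByField.merge(field, 0, (prev, ignored) -> prev + 1);
    }

    int getArrayValueCount(String field) {
        Integer last = lastOffsetByField.get(field);
        // The "+1" converts the last zero-based offset into a value count.
        return last == null ? 0 : last + 1;
    }
}
```

Renaming the map to something like lastOffsetByField, as suggested, makes the "+1" self-explanatory.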
    }

public void recordOffset(String fieldName, String value) {
    int count = numValuesByField.compute(fieldName, (s, integer) -> integer == null ? 0 : ++integer);
Related to comment above, maybe 1 here instead of 0?
count is the wrong name here. It is more like the next offset to be used.
    ord++;
}

logger.info("values=" + values);
maybe debug?
    ords[i] = dv.nextOrd();
}

logger.info("ords=" + Arrays.toString(ords));
I guess you are using info just for debugging while in draft.
Yes, this was for debugging purposes. This will be removed.
This is a very powerful idea. Some thoughts:
- When we generalize this, we need to think how to fit it into existing code nicely (leave poor DocumentParser alone). I was thinking lately that maybe DocumentParser can produce events like "parsing array", "parsing object", "parsing value" and then we can subscribe to such events and do our thing. Didn't dive too deep into this though.
- I wonder if we can have a byte or two at the beginning of such encoding that can carry meta information. An example would be an "empty array" flag or "single element array".
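The event-subscription idea floated above could take roughly the following shape. None of these types exist in Elasticsearch — ParseListener, EventRecorder, and the event methods are all hypothetical, sketched only to illustrate how a parser could emit events that offset-tracking code subscribes to:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener interface: a parser would invoke these callbacks
// as it walks the document, instead of offset logic living in DocumentParser.
interface ParseListener {
    void onArrayStart(String field);
    void onValue(String field, String value);
    void onArrayEnd(String field);
}

// A trivial subscriber that records the event stream; a real subscriber
// could record per-field offsets here without touching the parser itself.
class EventRecorder implements ParseListener {
    final List<String> events = new ArrayList<>();

    public void onArrayStart(String field) { events.add("start:" + field); }
    public void onValue(String field, String value) { events.add("value:" + field + "=" + value); }
    public void onArrayEnd(String field) { events.add("end:" + field); }
}
```

The appeal of this design is that DocumentParser stays agnostic: any number of features (offset tracking, ignored-source capture) could subscribe without the parser knowing about them.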
    }
}

public void processOffsets(DocumentParserContext context) throws IOException {
Should this be called from postParse? It is possible that a field is indexed multiple times in one document with object arrays.
I am surprised randomized tests don't complain about this.
👍 I think postParse is a better place to invoke this process offsets logic.
Thanks for taking a look @lkts!
I think we already have something like this via the
Good point, I think we can have an info byte where this kind of information can be encoded. I was thinking something similar earlier, but left this out of this draft PR in order to keep it simple for demonstration purposes.
@salvatore-campagna and @lkts I made a few changes and keyword field mapper now overwrites
With the last two commits the disk usage for the offset fields is much lower:
Thank you for iterating on this.
}

boolean indexed = indexValue(context, value);
if (offsetsFieldName != null && context.isImmediateParentAnArray() && context.getRecordedSource() == false) {
+1
@@ -1707,7 +1707,7 @@ public void testSyntheticSourceKeepArrays() throws IOException {
     SyntheticSourceExample example = syntheticSourceSupportForKeepTests(shouldUseIgnoreMalformed()).example(1);
     DocumentMapper mapperAll = createSytheticSourceMapperService(mapping(b -> {
         b.startObject("field");
-        b.field("synthetic_source_keep", randomFrom("arrays", "all")); // Both options keep array source.
+        b.field("synthetic_source_keep", randomFrom("all")); // Only option all keeps array source.
Should this change only be done for keyword field?
This is not needed even for keyword? All values should work.
I think we can make this change only for the keyword field mapper, given that other field types still fall back to ignored source if source keep is set to arrays.
I pushed: bbee160
💔 Backport failed
You can use sqren/backport to manually backport by running
… source (#122997) Backporting elastic#113757 to 8.x branch.

The keyword doc values field gets an extra sorted doc values field that encodes the order in which array values were specified at index time. This also captures duplicate values. This is stored as an offset-to-ordinal array that gets zigzag vint encoded into a sorted doc values field.

For example, in case of the following string array for a keyword field: ["c", "b", "a", "c"]. Sorted set doc values: ["a", "b", "c"] with ordinals 0, 1 and 2. The offset array will be: [2, 1, 0, 2]

Null values are also supported. For example ["c", "b", null, "c"] results in sorted set doc values: ["b", "c"] with ordinals 0 and 1. The offset array will be: [1, 0, -1, 1]

Empty arrays are also supported by encoding a zigzag vint array of zero elements.

Limitations:
* currently only doc values based array support for keyword field mapper
* multi level leaf arrays are flattened, for example: [[b], [c]] -> [b, c]
* arrays are always synthesized as one type; in case of a keyword field, [1, 2] gets synthesized as ["1", "2"]

These limitations can be addressed, but some require more complexity and/or additional storage.

With this PR, keyword field arrays will no longer be stored in ignored source; instead, array offsets are kept track of in an adjacent sorted doc values field. This only applies if index.mapping.synthetic_source_keep is set to arrays (default for logsdb).
Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
Backporting elastic#122999 to 8.x branch. Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
…ce (#123405) * [8.x] Store arrays offsets for ip fields natively with synthetic source. Backporting #122999 to 8.x branch. Follow up of #113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source. * [CI] Auto commit changes from spotless. Co-authored-by: elasticsearchmachine <[email protected]>
The keyword doc values field gets an extra sorted doc values field that encodes the order in which array values were specified at index time. This also captures duplicate values. This is stored as an offset-to-ordinal array that gets zigzag vint encoded into a sorted doc values field.
For example, in case of the following string array for a keyword field: ["c", "b", "a", "c"].
Sorted set doc values: ["a", "b", "c"] with ordinals 0, 1 and 2. The offset array will be: [2, 1, 0, 2]
Null values are also supported. For example ["c", "b", null, "c"] results in sorted set doc values: ["b", "c"] with ordinals 0 and 1. The offset array will be: [1, 0, -1, 1]
Empty arrays are also supported by encoding a zigzag vint array of zero elements.
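Zigzag encoding is what lets the -1 null marker stay compact under vint encoding: it interleaves negative and non-negative integers so small magnitudes, of either sign, map to small unsigned values. A minimal sketch of the standard transform (not taken from the PR's code, but the same scheme Lucene and protobuf use):

```java
// Standard zigzag transform for 32-bit ints:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
// so a -1 (null marker) costs one byte as a vint, unlike raw -1
// whose two's-complement form would need the maximum vint width.
class ZigZag {
    static int encode(int v) {
        // Arithmetic shift replicates the sign bit across all 32 bits.
        return (v << 1) ^ (v >> 31);
    }

    static int decode(int z) {
        // Logical shift recovers the magnitude; the low bit restores the sign.
        return (z >>> 1) ^ -(z & 1);
    }
}
```

Applied to the null example above, the offset array [1, 0, -1, 1] would be written as the zigzag values [2, 0, 1, 2], each fitting in a single vint byte.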
Limitations:
* currently only doc values based array support for keyword field mapper
* multi level leaf arrays are flattened, for example: [[b], [c]] -> [b, c]
* arrays are always synthesized as one type; in case of a keyword field, [1, 2] gets synthesized as ["1", "2"]

These limitations can be addressed, but some require more complexity and/or additional storage.
With this PR, keyword field arrays will no longer be stored in ignored source; instead, array offsets are kept track of in an adjacent sorted doc values field. This only applies if index.mapping.synthetic_source_keep is set to arrays (default for logsdb).