Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store arrays offsets for keyword fields natively with synthetic source #113757

Merged
merged 72 commits into from
Feb 20, 2025

Conversation

martijnvg
Copy link
Member

@martijnvg martijnvg commented Sep 30, 2024

The keyword doc values field gets an extra sorted doc values field, that encodes the order of how array values were specified at index time. This also captures duplicate values. This is stored in an offset to ordinal array that gets zigzag vint encoded into a sorted doc values field.

For example, in case of the following string array for a keyword field: ["c", "b", "a", "c"].
Sorted set doc values: ["a", "b", "c"] with ordinals: 0, 1 and 2. The offset array will be: [2, 1, 0, 2]

Null values are also supported. For example ["c", "b", null, "c"] results into sorted set doc values: ["b", "c"] with ordinals: 0 and 1. The offset array will be: [1, 0, -1, 1]

Empty arrays are also supported by encoding a zigzag vint array of zero elements.

Limitations:

  • currently only doc values based array support for keyword field mapper.
  • multi level leaf arrays are flattened. For example: [[b], [c]] -> [b, c]
  • arrays are always synthesized as one type. In case of keyword field, [1, 2] gets synthesized as ["1", "2"].

These limitations can be addressed, but some require more complexity and or additional storage.

With this PR, keyword field array will no longer be stored in ignored source, but array offsets are kept track of in an adjacent sorted doc value field. This only applies if index.mapping.synthetic_source_keep is set to arrays (default for logsdb).

The keyword doc values field gets an extra binary doc values field, that encodes the order of how array values were specified at index time. This also captures duplicate values.

This is stored in an offset to ordinal array that gets vint encoded into the binary doc values field. The additional storage required for this will likely be minimized with elastic#112416 (zstd compression for binary doc values)

In case of the following string array for a keyword field: ["c", "b", "a", "c"].
Sorted set doc values: ["a", "b", "c"] with ordinals: 0, 1 and 2. The offset array will be: [2, 1, 0, 2]

Limitations:
* only support for keyword field mapper.
* multi level leaf arrays are flattened. For example: [[b], [c]] -> [b, c]
* empty arrays ([]) are not recorded
* arrays are always synthesized as one type. In case of keyword field, [1, 2] gets synthesized as ["1", "2"].

These limitations can be addressed, but some require more complexity and or additional storage.
@martijnvg martijnvg added >non-issue :StorageEngine/Mapping The storage related side of mappings labels Sep 30, 2024

public int getArrayValueCount(String field) {
if (numValuesByField.containsKey(field)) {
return numValuesByField.get(field) + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why +1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numValuesByField returns the last offset, in order to get count that returned value needs to be incremented by one.

numValuesByField should be names something else.

}

public void recordOffset(String fieldName, String value) {
int count = numValuesByField.compute(fieldName, (s, integer) -> integer == null ? 0 : ++integer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to comment above, maybe 1 here instead of 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count is the wrong name here. It is like the next offset to be used.

ord++;
}

logger.info("values=" + values);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe debug?

ords[i] = dv.nextOrd();
}

logger.info("ords=" + Arrays.toString(ords));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you are using info just for debugging while in draft.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was for debugging purposes. This will be removed.

Copy link
Contributor

@lkts lkts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very powerful idea. Some thoughts:

  • When we generalize this, we need to think how to fit it into existing code nicely (leave poor DocumentParser alone). I was thinking lately that maybe DocumentParser can produce events like "parsing array", "parsing object", "parsing value" and then we can subscribe to such events and do our thing. Didn't dive too deep into this though.
  • I wonder if we can have a byte or two in the beginning of such encoding that can carry meta information. An example would be an "empty array" flag or "single element array".

}
}

public void processOffsets(DocumentParserContext context) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be called from postParse? It is possible that a field is indexed multiple times in one document with object arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised randomized tests don't complain about this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I think postParse is a better place to invoke this process offsets logic.

@martijnvg
Copy link
Member Author

martijnvg commented Oct 4, 2024

Thanks for taking a look @lkts!

I was thinking lately that maybe DocumentParser can produce events like "parsing array", "parsing object", "parsing value" and then we can subscribe to such events and do our thing. Didn't dive too deep into this though.

I think we already have something like this via the FieldMapper#parsesArrayValue() flag. I initially tried using this, because I don't like introducing more complexity in DocumentParser. However it didn't work out in all cases. I recall that tests using copy to failed. For some reason field that overwrite that method and return true, are not taken into account with copy to. Maybe we should make field mappers that chose to overwrite FieldMapper#parsesArrayValue() work with copy_to correctly first.

I wonder if we can have a byte or two in the beginning of such encoding that can carry meta information. An example would be an "empty array" flag or "single element array".

Good point, I think we can have an info byte where this kind of information can be encoded. I was thinking something similar earlier, but left this out of this draft PR in order to keep it simple for demonstration purposes.

@martijnvg
Copy link
Member Author

@salvatore-campagna and @lkts I made a few changes and keyword field mapper now overwrites parsesArrayValue(), so that it can parse arrays. This allows minimizing changes in DocumentParser.

@martijnvg
Copy link
Member Author

With the last two commits the disk usage for the offset fields is much lower:

 field name (from different shard)                                                   | Size     | Relative to total shard size 
apache.access.remote_addresses.offsets                                               | 3.0b     | 0.00%
event.category.offsets                                                               | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
event.category.offsets                                                               | 3.0b     | 0.00%
event.type.offsets                                                                   | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
tags.offsets                                                                         | 5.1mb    | 0.03%
log.flags.offsets                                                                    | 4.0mb    | 0.02%
event.category.offsets                                                               | 94.4kb   | 0.00%
event.type.offsets                                                                   | 94.4kb   | 0.00%
host.ip.offsets                                                                      | 30.9kb   | 0.00%
host.mac.offsets                                                                     | 30.9kb   | 0.00%
tags.offsets                                                                         | 2.5mb    | 0.03%
log.flags.offsets                                                                    | 1.7mb    | 0.02%
event.category.offsets                                                               | 41.1kb   | 0.00%
event.type.offsets                                                                   | 41.1kb   | 0.00%
host.ip.offsets                                                                      | 17.9kb   | 0.00%
host.mac.offsets                                                                     | 17.9kb   | 0.00%
log.flags.offsets                                                                    | 1.7mb    | 0.26%
event.type.offsets                                                                   | 17.1kb   | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
log.flags.offsets                                                                    | 1017.0b  | 0.00%
event.category.offsets                                                               | 3.0b     | 0.00%
event.type.offsets                                                                   | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
event.category.offsets                                                               | 3.0b     | 0.00%
event.type.offsets                                                                   | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
log.flags.offsets                                                                    | 3.0b     | 0.00%
tags.offsets                                                                         | 29.1kb   | 0.00%
event.category.offsets                                                               | 3.0b     | 0.00%
event.type.offsets                                                                   | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
nginx.access.remote_ip_list.offsets                                                  | 3.0b     | 0.00%
related.user.offsets                                                                 | 3.0b     | 0.00%
tags.offsets                                                                         | 25.7kb   | 0.01%
event.category.offsets                                                               | 3.0b     | 0.00%
event.type.offsets                                                                   | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
log.flags.offsets                                                                    | 24.2kb   | 0.01%
event.category.offsets                                                               | 17.2kb   | 0.01%
event.type.offsets                                                                   | 17.2kb   | 0.01%
related.user.offsets                                                                 | 17.2kb   | 0.01%
host.mac.offsets                                                                     | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
redis.slowlog.args.offsets                                                           | 24.5mb   | 0.91%
host.mac.offsets                                                                     | 3.0b     | 0.00%
related.user.offsets                                                                 | 3.8mb    | 0.43%
event.category.offsets                                                               | 2.3mb    | 0.26%
event.type.offsets                                                                   | 1.1mb    | 0.13%
host.mac.offsets                                                                     | 435.7kb  | 0.05%
tags.offsets                                                                         | 93.7kb   | 0.01%
related.hosts.offsets                                                                | 3.0b     | 0.00%
host.mac.offsets                                                                     | 376.3kb  | 0.02%
tags.offsets                                                                         | 22.9kb   | 0.00%

Copy link
Contributor

@lkts lkts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for iterating on this.

}

boolean indexed = indexValue(context, value);
if (offsetsFieldName != null && context.isImmediateParentAnArray() && context.getRecordedSource() == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -1707,7 +1707,7 @@ public void testSyntheticSourceKeepArrays() throws IOException {
SyntheticSourceExample example = syntheticSourceSupportForKeepTests(shouldUseIgnoreMalformed()).example(1);
DocumentMapper mapperAll = createSytheticSourceMapperService(mapping(b -> {
b.startObject("field");
b.field("synthetic_source_keep", randomFrom("arrays", "all")); // Both options keep array source.
b.field("synthetic_source_keep", randomFrom("all")); // Only option all keeps array source.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this change only be done for keyword field?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed even for keyword? All values should work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make this change only for keyword field mapper. Given that other field types still fallback to ignored source if source keep is set to arrays.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed: bbee160

@martijnvg martijnvg added the auto-backport Automatically create backport pull requests when merged label Feb 20, 2025
@martijnvg martijnvg enabled auto-merge (squash) February 20, 2025 07:13
@martijnvg martijnvg merged commit 43665f0 into elastic:main Feb 20, 2025
17 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 113757

martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 20, 2025
… source

Backporting elastic#113757 to 8.x branch.

The keyword doc values field gets an extra sorted doc values field, that encodes the order of how array values were specified at index time. This also captures duplicate values. This is stored in an offset to ordinal array that gets zigzag vint encoded into a sorted doc values field.

For example, in case of the following string array for a keyword field: ["c", "b", "a", "c"].
Sorted set doc values: ["a", "b", "c"] with ordinals: 0, 1 and 2. The offset array will be: [2, 1, 0, 2]

Null values are also supported. For example ["c", "b", null, "c"] results into sorted set doc values: ["b", "c"] with ordinals: 0 and 1. The offset array will be: [1, 0, -1, 1]

Empty arrays are also supported by encoding a zigzag vint array of zero elements.

Limitations:

currently only doc values based array support for keyword field mapper.
multi level leaf arrays are flattened. For example: [[b], [c]] -> [b, c]
arrays are always synthesized as one type. In case of keyword field, [1, 2] gets synthesized as ["1", "2"].
These limitations can be addressed, but some require more complexity and or additional storage.

With this PR, keyword field array will no longer be stored in ignored source, but array offsets are kept track of in an adjacent sorted doc value field. This only applies if index.mapping.synthetic_source_keep is set to arrays (default for logsdb).
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 20, 2025
Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
elasticsearchmachine pushed a commit that referenced this pull request Feb 20, 2025
… source (#122997)

Backporting #113757 to 8.x branch.

The keyword doc values field gets an extra sorted doc values field, that encodes the order of how array values were specified at index time. This also captures duplicate values. This is stored in an offset to ordinal array that gets zigzag vint encoded into a sorted doc values field.

For example, in case of the following string array for a keyword field: ["c", "b", "a", "c"].
Sorted set doc values: ["a", "b", "c"] with ordinals: 0, 1 and 2. The offset array will be: [2, 1, 0, 2]

Null values are also supported. For example ["c", "b", null, "c"] results into sorted set doc values: ["b", "c"] with ordinals: 0 and 1. The offset array will be: [1, 0, -1, 1]

Empty arrays are also supported by encoding a zigzag vint array of zero elements.

Limitations:

currently only doc values based array support for keyword field mapper.
multi level leaf arrays are flattened. For example: [[b], [c]] -> [b, c]
arrays are always synthesized as one type. In case of keyword field, [1, 2] gets synthesized as ["1", "2"].
These limitations can be addressed, but some require more complexity and or additional storage.

With this PR, keyword field array will no longer be stored in ignored source, but array offsets are kept track of in an adjacent sorted doc value field. This only applies if index.mapping.synthetic_source_keep is set to arrays (default for logsdb).
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 21, 2025
Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 21, 2025
Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
martijnvg added a commit that referenced this pull request Feb 25, 2025
…22999)

Follow up of #113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 25, 2025
Backporting elastic#122999 to 8.x branch.

Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
elasticsearchmachine added a commit that referenced this pull request Feb 25, 2025
…ce (#123405)

* [8.x] Store arrays offsets for ip fields natively with synthetic source

Backporting #122999 to 8.x branch.

Follow up of #113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.

* [CI] Auto commit changes from spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >enhancement :StorageEngine/Mapping The storage related side of mappings Team:StorageEngine test-full-bwc Trigger full BWC version matrix tests v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants