Store arrays offsets for keyword fields natively with synthetic source #113757

martijnvg · 2024-09-30T06:50:42Z

The keyword doc values field gets an additional sorted doc values field that encodes the order in which array values were specified at index time, including duplicate values. The order is stored as an offset-to-ordinal array that is zigzag vint encoded into the sorted doc values field.

For example, take the following string array for a keyword field: ["c", "b", "a", "c"].
The sorted set doc values are ["a", "b", "c"] with ordinals 0, 1 and 2, so the offset array is [2, 1, 0, 2].
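
To make the example concrete, here is a minimal sketch of how an offset array could be derived from the values of one array. It is not the code in this PR: `OffsetArraySketch` and its methods are hypothetical names, and it assumes plain Java string ordering matches the doc values ordering (true for ASCII values like the ones above).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
final class OffsetArraySketch {

    /** Distinct non-null values in sorted order, similar to what sorted set doc values hold. */
    static List<String> sortedDistinct(List<String> values) {
        TreeSet<String> distinct = new TreeSet<>();
        for (String value : values) {
            if (value != null) {
                distinct.add(value);
            }
        }
        return new ArrayList<>(distinct);
    }

    /** For each array position, the ordinal of the value at that position, or -1 for null. */
    static int[] offsets(List<String> values) {
        List<String> sorted = sortedDistinct(values);
        int[] result = new int[values.size()];
        for (int i = 0; i < values.size(); i++) {
            String value = values.get(i);
            // The ordinal is the index of the value in the sorted distinct list.
            result[i] = value == null ? -1 : Collections.binarySearch(sorted, value);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [2, 1, 0, 2]
        System.out.println(Arrays.toString(offsets(Arrays.asList("c", "b", "a", "c"))));
    }
}
```

Duplicates need no special handling here: both occurrences of "c" simply map to the same ordinal.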

Null values are also supported. For example, ["c", "b", null, "c"] results in sorted set doc values ["b", "c"] with ordinals 0 and 1. The offset array will be: [1, 0, -1, 1]
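
Going the other way, synthesizing the original array comes down to a lookup per offset. The following is a rough sketch under the assumption that the sorted values and the decoded offset array are already in memory. `ArrayRebuildSketch` is a hypothetical name and not part of the actual synthetic source loader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class ArrayRebuildSketch {

    /** Rebuilds the array in index-time order; an offset of -1 stands for a null element. */
    static List<String> rebuild(List<String> sortedValues, int[] offsets) {
        List<String> result = new ArrayList<>(offsets.length);
        for (int ord : offsets) {
            result.add(ord == -1 ? null : sortedValues.get(ord));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [c, b, null, c]
        System.out.println(rebuild(Arrays.asList("b", "c"), new int[] {1, 0, -1, 1}));
    }
}
```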

Empty arrays are also supported by encoding a zigzag vint array of zero elements.
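
As an illustration of the encoding, below is a self-contained sketch of zigzag vint encoding for an offset array. The exact byte layout used by this PR is not spelled out above, so the layout here (a vint length prefix followed by one zigzag vint per offset) is an assumption, and `ZigZagVIntSketch` is a hypothetical class that uses only the JDK rather than the Elasticsearch or Lucene stream classes.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical helper, for illustration only. Assumed layout: vint(length), then zigzag vints.
final class ZigZagVIntSketch {

    /** Maps signed ints to unsigned ones so that small negatives like -1 stay small. */
    static int zigZagEncode(int value) {
        return (value << 1) ^ (value >> 31);
    }

    static int zigZagDecode(int value) {
        return (value >>> 1) ^ -(value & 1);
    }

    /** Writes 7 bits per byte, low bits first; the high bit marks that more bytes follow. */
    private static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads one vint starting at pos[0] and advances pos[0] past it. */
    private static int readVInt(byte[] bytes, int[] pos) {
        int result = 0;
        int shift = 0;
        while (true) {
            byte b = bytes[pos[0]++];
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    static byte[] encode(int[] offsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, offsets.length);
        for (int offset : offsets) {
            writeVInt(out, zigZagEncode(offset));
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes) {
        int[] pos = {0};
        int[] offsets = new int[readVInt(bytes, pos)];
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = zigZagDecode(readVInt(bytes, pos));
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Prints [4, 2, 0, 1, 2]: the length, then the zigzag values of 1, 0, -1, 1.
        System.out.println(Arrays.toString(encode(new int[] {1, 0, -1, 1})));
        // An empty array encodes as just the length: prints [0]
        System.out.println(Arrays.toString(encode(new int[0])));
    }
}
```

Zigzag is what keeps the -1 used for nulls cheap: without it, a negative int would take five bytes as a vint.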

Limitations:

* Currently, only the keyword field mapper supports doc-values-based array offsets.
* Multi-level leaf arrays are flattened. For example: [[b], [c]] -> [b, c]
* Arrays are always synthesized as one type. For a keyword field, [1, 2] gets synthesized as ["1", "2"].

These limitations can be addressed, but some require more complexity and/or additional storage.
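
The flattening and single-type limitations can be pictured with a small sketch. It only mimics the visible effect on a parsed value tree and is not how the mapper is implemented. `LeafArrayFlattenSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class LeafArrayFlattenSketch {

    /** Flattens nested leaf arrays and renders every element as a keyword (string) value. */
    static List<String> flattenAsKeywords(List<?> input) {
        List<String> result = new ArrayList<>();
        collect(input, result);
        return result;
    }

    private static void collect(Object node, List<String> result) {
        if (node instanceof List) {
            // Nesting is lost: children are appended to the same flat list.
            for (Object child : (List<?>) node) {
                collect(child, result);
            }
        } else {
            // The original JSON type is lost: 1 becomes "1".
            result.add(node == null ? null : String.valueOf(node));
        }
    }

    public static void main(String[] args) {
        // Prints [b, c]
        System.out.println(flattenAsKeywords(Arrays.asList(Arrays.asList("b"), Arrays.asList("c"))));
        // Prints [1, 2], where both elements are now strings
        System.out.println(flattenAsKeywords(Arrays.asList(1, 2)));
    }
}
```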

With this PR, keyword field arrays are no longer stored in ignored source. Instead, array offsets are tracked in an adjacent sorted doc values field. This only applies if index.mapping.synthetic_source_keep is set to arrays (the default for logsdb).

The keyword doc values field gets an extra binary doc values field, that encodes the order of how array values were specified at index time. This also captures duplicate values. This is stored in an offset to ordinal array that gets vint encoded into the binary doc values field. The additional storage required for this will likely be minimized with elastic#112416 (zstd compression for binary doc values).

In case of the following string array for a keyword field: ["c", "b", "a", "c"]. Sorted set doc values: ["a", "b", "c"] with ordinals: 0, 1 and 2. The offset array will be: [2, 1, 0, 2]

Limitations:

* only support for keyword field mapper.
* multi level leaf arrays are flattened. For example: [[b], [c]] -> [b, c]
* empty arrays ([]) are not recorded
* arrays are always synthesized as one type. In case of keyword field, [1, 2] gets synthesized as ["1", "2"].

These limitations can be addressed, but some require more complexity and or additional storage.

salvatore-campagna · 2024-09-30T07:55:00Z

server/src/main/java/org/elasticsearch/index/mapper/DocumentParserContext.java

+
+    public int getArrayValueCount(String field) {
+        if (numValuesByField.containsKey(field)) {
+            return numValuesByField.get(field) + 1;


numValuesByField returns the last offset; to get the count, the returned value needs to be incremented by one.

numValuesByField should be named something else.

salvatore-campagna · 2024-09-30T07:57:28Z

server/src/main/java/org/elasticsearch/index/mapper/DocumentParserContext.java

+    }
+
+    public void recordOffset(String fieldName, String value) {
+        int count = numValuesByField.compute(fieldName, (s, integer) -> integer == null ? 0 : ++integer);


Related to the comment above: maybe 1 here instead of 0?

count is the wrong name here. It is more like the next offset to be used.
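
To spell out the naming point from both threads, here is a small sketch of the same bookkeeping with names that match what the map holds. The class and method names are hypothetical, and returning 0 for a field without recorded values is an assumption, since that branch of getArrayValueCount is not shown in the diff.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical renaming of the bookkeeping discussed above, for illustration only.
final class OffsetCounterSketch {

    // Holds the last offset handed out per field, not the number of values.
    private final Map<String, Integer> lastOffsetByField = new HashMap<>();

    /** Returns the offset to use for the next value of the field: 0, 1, 2, ... */
    int nextOffset(String field) {
        return lastOffsetByField.compute(field, (name, last) -> last == null ? 0 : last + 1);
    }

    /** The number of values seen so far is the last offset plus one. */
    int arrayValueCount(String field) {
        Integer last = lastOffsetByField.get(field);
        return last == null ? 0 : last + 1; // assumption: 0 when nothing was recorded
    }

    public static void main(String[] args) {
        OffsetCounterSketch counter = new OffsetCounterSketch();
        counter.nextOffset("tags");
        counter.nextOffset("tags");
        // Two offsets (0 and 1) were handed out, so this prints 2
        System.out.println(counter.arrayValueCount("tags"));
    }
}
```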

salvatore-campagna · 2024-09-30T07:59:22Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+            ord++;
+        }
+
+        logger.info("values=" + values);


maybe debug?

salvatore-campagna · 2024-09-30T08:01:17Z

...rc/main/java/org/elasticsearch/index/mapper/SortedSetDocValuesSyntheticFieldLoaderLayer.java

+                    ords[i] = dv.nextOrd();
+                }
+
+                logger.info("ords=" + Arrays.toString(ords));


I guess you are using info just for debugging while in draft.

Yes, this was for debugging purposes. This will be removed.

lkts

This is a very powerful idea. Some thoughts:

When we generalize this, we need to think about how to fit it into the existing code nicely (leave poor DocumentParser alone). I was thinking lately that maybe DocumentParser can produce events like "parsing array", "parsing object", "parsing value" and then we can subscribe to such events and do our thing. Didn't dive too deep into this though.
I wonder if we can have a byte or two in the beginning of such encoding that can carry meta information. An example would be an "empty array" flag or "single element array".

lkts · 2024-10-03T21:00:26Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+        }
+    }
+
+    public void processOffsets(DocumentParserContext context) throws IOException {


Should this be called from postParse? It is possible that a field is indexed multiple times in one document with object arrays.

I am surprised randomized tests don't complain about this.

👍 I think postParse is a better place to invoke this process offsets logic.

martijnvg · 2024-10-04T08:52:07Z

Thanks for taking a look @lkts!

I was thinking lately that maybe DocumentParser can produce events like "parsing array", "parsing object", "parsing value" and then we can subscribe to such events and do our thing. Didn't dive too deep into this though.

I think we already have something like this via the FieldMapper#parsesArrayValue() flag. I initially tried using this, because I don't like introducing more complexity in DocumentParser. However, it didn't work out in all cases. I recall that tests using copy_to failed. For some reason, fields that override that method and return true are not taken into account with copy_to. Maybe we should first make field mappers that choose to override FieldMapper#parsesArrayValue() work correctly with copy_to.

I wonder if we can have a byte or two in the beginning of such encoding that can carry meta information. An example would be an "empty array" flag or "single element array".

Good point, I think we can have an info byte where this kind of information can be encoded. I was thinking something similar earlier, but left this out of this draft PR in order to keep it simple for demonstration purposes.

martijnvg · 2024-10-09T13:23:39Z

@salvatore-campagna and @lkts I made a few changes and the keyword field mapper now overrides parsesArrayValue(), so that it can parse arrays itself. This allows minimizing changes in DocumentParser.

martijnvg · 2025-02-18T19:37:27Z

With the last two commits the disk usage for the offset fields is much lower:

 field name (from different shard)                                                   | Size     | Relative to total shard size 
apache.access.remote_addresses.offsets                                               | 3.0b     | 0.00%
event.category.offsets                                                               | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
event.category.offsets                                                               | 3.0b     | 0.00%
event.type.offsets                                                                   | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
tags.offsets                                                                         | 5.1mb    | 0.03%
log.flags.offsets                                                                    | 4.0mb    | 0.02%
event.category.offsets                                                               | 94.4kb   | 0.00%
event.type.offsets                                                                   | 94.4kb   | 0.00%
host.ip.offsets                                                                      | 30.9kb   | 0.00%
host.mac.offsets                                                                     | 30.9kb   | 0.00%
tags.offsets                                                                         | 2.5mb    | 0.03%
log.flags.offsets                                                                    | 1.7mb    | 0.02%
event.category.offsets                                                               | 41.1kb   | 0.00%
event.type.offsets                                                                   | 41.1kb   | 0.00%
host.ip.offsets                                                                      | 17.9kb   | 0.00%
host.mac.offsets                                                                     | 17.9kb   | 0.00%
log.flags.offsets                                                                    | 1.7mb    | 0.26%
event.type.offsets                                                                   | 17.1kb   | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
log.flags.offsets                                                                    | 1017.0b  | 0.00%
event.category.offsets                                                               | 3.0b     | 0.00%
event.type.offsets                                                                   | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
event.category.offsets                                                               | 3.0b     | 0.00%
event.type.offsets                                                                   | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
log.flags.offsets                                                                    | 3.0b     | 0.00%
tags.offsets                                                                         | 29.1kb   | 0.00%
event.category.offsets                                                               | 3.0b     | 0.00%
event.type.offsets                                                                   | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
nginx.access.remote_ip_list.offsets                                                  | 3.0b     | 0.00%
related.user.offsets                                                                 | 3.0b     | 0.00%
tags.offsets                                                                         | 25.7kb   | 0.01%
event.category.offsets                                                               | 3.0b     | 0.00%
event.type.offsets                                                                   | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
log.flags.offsets                                                                    | 24.2kb   | 0.01%
event.category.offsets                                                               | 17.2kb   | 0.01%
event.type.offsets                                                                   | 17.2kb   | 0.01%
related.user.offsets                                                                 | 17.2kb   | 0.01%
host.mac.offsets                                                                     | 3.0b     | 0.00%
host.mac.offsets                                                                     | 3.0b     | 0.00%
redis.slowlog.args.offsets                                                           | 24.5mb   | 0.91%
host.mac.offsets                                                                     | 3.0b     | 0.00%
related.user.offsets                                                                 | 3.8mb    | 0.43%
event.category.offsets                                                               | 2.3mb    | 0.26%
event.type.offsets                                                                   | 1.1mb    | 0.13%
host.mac.offsets                                                                     | 435.7kb  | 0.05%
tags.offsets                                                                         | 93.7kb   | 0.01%
related.hosts.offsets                                                                | 3.0b     | 0.00%
host.mac.offsets                                                                     | 376.3kb  | 0.02%
tags.offsets                                                                         | 22.9kb   | 0.00%

lkts

Thank you for iterating on this.

lkts · 2025-02-18T21:40:45Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+        }
+
+        boolean indexed = indexValue(context, value);
+        if (offsetsFieldName != null && context.isImmediateParentAnArray() && context.getRecordedSource() == false) {


lkts · 2025-02-18T21:43:51Z

test/framework/src/main/java/org/elasticsearch/index/mapper/MapperTestCase.java

@@ -1707,7 +1707,7 @@ public void testSyntheticSourceKeepArrays() throws IOException {
        SyntheticSourceExample example = syntheticSourceSupportForKeepTests(shouldUseIgnoreMalformed()).example(1);
        DocumentMapper mapperAll = createSytheticSourceMapperService(mapping(b -> {
            b.startObject("field");
-            b.field("synthetic_source_keep", randomFrom("arrays", "all"));  // Both options keep array source.
+            b.field("synthetic_source_keep", randomFrom("all"));  // Only option all keeps array source.


Should this change only be done for keyword field?

This is not needed even for keyword? All values should work.

I think we can make this change only for the keyword field mapper, given that other field types still fall back to ignored source if source keep is set to arrays.

I pushed: bbee160

…rrays

elasticsearchmachine · 2025-02-20T08:22:00Z

💔 Backport failed

Status	Branch	Result
❌	8.x	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 113757

Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.

martijnvg added >non-issue :StorageEngine/Mapping The storage related side of mappings labels Sep 30, 2024

elasticsearchmachine added the v9.0.0 label Sep 30, 2024

martijnvg added 3 commits September 30, 2024 08:58

spotless

acf4d09

Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…

dca77d7

…rrays

iter

49efe26

salvatore-campagna reviewed Sep 30, 2024

martijnvg added 5 commits September 30, 2024 16:27

Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…

f5e3d5a

…rrays

iter

59010c3

Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…

a5198ae

…rrays

spotless

9b4aa5f

Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…

12d30c5

…rrays

lkts reviewed Oct 3, 2024

martijnvg added 9 commits October 8, 2024 13:35

Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…

2ae8d83

…rrays

parsesArrayValue approach

fc0e627

fix multi fields

a111f94

iter

14c2ddd

Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…

ba9e513

…rrays

do not handle copy_to for now

007afd3

cleanup

194b4ca

move ValueXContentParser

52c0db4

adjust expected json element type

8cc5b46

martijnvg added 5 commits October 13, 2024 16:48

Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…

dc9db8a

…rrays

iter

6e03aca

fixed mistake

674f03e

Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…

acfaa55

…rrays

Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…

0d90234

…rrays

martijnvg added 2 commits February 18, 2025 15:35

Store offsets in sorted doc values field instead of binary doc values…

37634b9

… field

applied feedback

09c6a0e

lkts approved these changes Feb 18, 2025

martijnvg and others added 9 commits February 19, 2025 11:50

Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…

60f45f2

…rrays

iter testSyntheticSourceKeepArrays() test

bbee160

add index version check

4e6265f

iter test

405edf4

[CI] Auto commit changes from spotless

7c7b3a3

Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…

cfe5b56

…rrays

Merge remote-tracking branch 'es/main' into synthetic_source_encode_a…

8049206

…rrays

cleanup supportStoringArrayOffsets(...) method

3fcb461

renamed test suites

5b1f80b

martijnvg added the auto-backport Automatically create backport pull requests when merged label Feb 20, 2025

martijnvg enabled auto-merge (squash) February 20, 2025 07:13

martijnvg merged commit 43665f0 into elastic:main Feb 20, 2025
17 checks passed

elasticsearchmachine added the backport pending label Feb 20, 2025

martijnvg mentioned this pull request Feb 20, 2025

[8.x] Store arrays offsets for keyword fields natively with synthetic source #122997

Merged

martijnvg mentioned this pull request Feb 20, 2025

Store arrays offsets for ip fields natively with synthetic source #122999

Merged

martijnvg removed the backport pending label Feb 21, 2025

martijnvg added a commit that referenced this pull request Feb 25, 2025

Store arrays offsets for ip fields natively with synthetic source (#1…

6c55099

…22999) Follow up of #113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.

martijnvg mentioned this pull request Feb 25, 2025

[8.x] Store arrays offsets for ip fields natively with synthetic source #123405

Merged
