Preserve null dictionary values in interleave and concat kernels #7144

Merged

Conversation

kawadakk (Contributor)

Which issue does this PR close?

Closes #6302.

Rationale for this change

Fixing a bug.

What changes are included in this PR?

`arrow_select::dictionary::merge_dictionary_values` now preserves null values from input dictionaries. This is necessary because keys can reference null values. Without this change, the entries of `MergedDictionaries::key_mappings` corresponding to null values would be left unset, causing `concat` and `interleave` to remap every element referencing them to whatever value happened to be at index 0, producing erroneous results.

The handling of null keys (physical nulls) is unchanged.
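
As an illustration, here is a minimal sketch of the failure mode being fixed, using illustrative data of my own choosing (this is not a test from this PR): a dictionary whose values contain a null referenced by its keys, concatenated with itself.

```rust
use std::sync::Arc;

use arrow_array::types::Int32Type;
use arrow_array::{Array, DictionaryArray, Int32Array, StringArray};
use arrow_select::concat::concat;

fn main() {
    // Dictionary whose *values* contain a null referenced by the keys:
    // values = ["a", null], keys = [0, 1, 1] => logically ["a", null, null].
    let values = StringArray::from(vec![Some("a"), None]);
    let keys = Int32Array::from(vec![0, 1, 1]);
    let dict = DictionaryArray::<Int32Type>::try_new(keys, Arc::new(values)).unwrap();

    let concatenated = concat(&[&dict, &dict]).unwrap();
    let result = concatenated
        .as_any()
        .downcast_ref::<DictionaryArray<Int32Type>>()
        .unwrap();

    // Resolve each key against the merged dictionary values.
    let merged = result
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .unwrap();
    let logical: Vec<Option<&str>> = result
        .keys()
        .iter()
        .map(|key| match key {
            Some(k) if merged.is_valid(k as usize) => Some(merged.value(k as usize)),
            _ => None,
        })
        .collect();

    // With the fix, nulls referenced by keys survive the merge; previously
    // they could come back as the value stored at index 0 ("a").
    assert_eq!(
        logical,
        vec![Some("a"), None, None, Some("a"), None, None]
    );
}
```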

Are there any user-facing changes?

This function internally computes value masks describing which values
from input dictionaries should remain in the output. Values never
referenced by keys are considered redundant. Null values were
considered redundant, but they are now preserved as of this commit.

This change is necessary because keys can reference null values. Before
this commit, the entries of `MergedDictionaries::key_mappings`
corresponding to null values were left unset. This caused `concat` and
`interleave` to remap all elements referencing them to whatever value
happened to be at index 0, producing an erroneous result.
test(select): add test case `concat::test_string_dictionary_array_nulls_in_values`

This test case passes dictionary arrays containing null values (but no
null keys) to `concat`.
test(select): add test case `interleave::test_interleave_dictionary_nulls`

This test case passes two dictionary arrays each containing null values
or keys to `interleave`.
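
A rough sketch of the kind of input that second test exercises (the data and assertion here are illustrative assumptions, not the actual test added in this PR): one input has a null dictionary value referenced by a key, the other has a physically null key.

```rust
use std::sync::Arc;

use arrow_array::types::Int32Type;
use arrow_array::{Array, DictionaryArray, Int32Array, StringArray};
use arrow_select::interleave::interleave;

fn main() {
    // One input has a null dictionary *value* referenced by a key ...
    let dict_a = DictionaryArray::<Int32Type>::try_new(
        Int32Array::from(vec![0, 1]), // key 1 points at the null value
        Arc::new(StringArray::from(vec![Some("x"), None])),
    )
    .unwrap();

    // ... the other has a physically null *key*.
    let dict_b = DictionaryArray::<Int32Type>::try_new(
        Int32Array::from(vec![Some(0), None]),
        Arc::new(StringArray::from(vec![Some("y")])),
    )
    .unwrap();

    // Pick elements as (array_index, element_index) pairs.
    let result = interleave(&[&dict_a, &dict_b], &[(0, 1), (1, 0), (1, 1), (0, 0)]).unwrap();

    // Logically the result is [null, "y", null, "x"]; the element taken from
    // dict_a's null value must stay null instead of being remapped to index 0.
    let validity: Vec<bool> = result.logical_nulls().unwrap().iter().collect();
    assert_eq!(validity, vec![false, true, false, true]);
}
```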
@github-actions bot added the arrow (Changes to the arrow crate) label on Feb 17, 2025
@tustvold (Contributor)

Have you run the benchmarks to confirm what impact this has?

If it regresses performance, one option might be to compute the output null mask from the logical nulls of the inputs.
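
For reference, a minimal sketch of that alternative for the `concat` case; `concatenated_logical_nulls` is a hypothetical helper name, not an existing arrow-rs API, and the real `concat` internals would differ.

```rust
use arrow_array::{Array, ArrayRef};
use arrow_buffer::{BooleanBufferBuilder, NullBuffer};

/// Hypothetical helper (not an existing arrow-rs API): derive the output null
/// mask for `concat` from the logical nulls of the inputs, rather than
/// preserving null values inside the merged dictionary.
fn concatenated_logical_nulls(inputs: &[ArrayRef]) -> Option<NullBuffer> {
    let total: usize = inputs.iter().map(|a| a.len()).sum();
    let mut builder = BooleanBufferBuilder::new(total);
    let mut any_null = false;
    for array in inputs {
        match array.logical_nulls() {
            Some(nulls) => {
                any_null = true;
                // Append this input's validity bits (true = valid).
                for valid in nulls.iter() {
                    builder.append(valid);
                }
            }
            None => builder.append_n(array.len(), true),
        }
    }
    any_null.then(|| NullBuffer::new(builder.finish()))
}
```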

refactor(select): add type alias for `Interner` bucket

Addresses `clippy::type-complexity`.
@kawadakk (Contributor, Author)

kawadakk commented Feb 17, 2025

@tustvold I tried running `cargo bench --bench concatenate_kernel`, but the results fluctuate between "regressed" and "improved" due to measurement noise on my machine. I'd say it has no significant impact, positive or negative.

concat str_dict 1024    time:   [1.6693 µs 1.6702 µs 1.6714 µs]
                        change: [-2.0002% -1.8167% -1.5443%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

concat str_dict_sparse 1024
                        time:   [4.6988 µs 4.7002 µs 4.7017 µs]
                        change: [+1.3933% +1.4438% +1.4888%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

concat str nulls 1024   time:   [3.8924 µs 3.8939 µs 3.8956 µs]
                        change: [-0.4394% -0.3634% -0.2873%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

Theoretically, this can improve performance because `merge_dictionary_values` no longer calls `DictionaryArray::logical_nulls`, which could build a new boolean array. On the other hand, `merge_dictionary_values` will no longer skip over null values, but I think it would be pathological for a dictionary array to contain many null values.

@kawadakk (Contributor, Author)

Here's the result of `cargo bench --bench interleave_kernels` (with irrelevant test cases omitted; CPU frequencies are now locked to 2 GHz):

interleave dict(20, 0.0) 100 [0..100, 100..230, 450..1000]
                        time:   [4.2027 µs 4.2035 µs 4.2043 µs]
                        change: [-1.0518% -1.0119% -0.9742%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  7 (7.00%) high mild

interleave dict(20, 0.0) 400 [0..100, 100..230, 450..1000]
                        time:   [10.136 µs 10.142 µs 10.148 µs]
                        change: [-0.7627% -0.6643% -0.5668%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000]
                        time:   [23.133 µs 23.155 µs 23.177 µs]
                        change: [+0.3312% +0.4051% +0.4769%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
                        time:   [23.069 µs 23.092 µs 23.115 µs]
                        change: [-0.9958% -0.9284% -0.8580%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

interleave dict_sparse(20, 0.0) 100 [0..100, 100..230, 450..1000]
                        time:   [4.2292 µs 4.2316 µs 4.2340 µs]
                        change: [-0.7215% -0.6577% -0.5871%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

interleave dict_sparse(20, 0.0) 400 [0..100, 100..230, 450..1000]
                        time:   [10.266 µs 10.284 µs 10.303 µs]
                        change: [+1.6792% +1.8107% +1.9508%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000]
                        time:   [23.093 µs 23.104 µs 23.116 µs]
                        change: [+0.4904% +0.5596% +0.6294%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
                        time:   [23.207 µs 23.224 µs 23.241 µs]
                        change: [-0.6322% -0.4640% -0.2996%] (p = 0.00 < 0.05)
                        Change within noise threshold.

interleave dict_distinct 100
                        time:   [3.4552 µs 3.4567 µs 3.4581 µs]
                        change: [-4.0231% -3.7210% -3.4131%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

interleave dict_distinct 1024
                        time:   [3.4545 µs 3.4556 µs 3.4567 µs]
                        change: [+1.0769% +1.1664% +1.2568%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

interleave dict_distinct 2048
                        time:   [3.3727 µs 3.3753 µs 3.3784 µs]
                        change: [-1.9316% -1.8500% -1.7623%] (p = 0.00 < 0.05)
                        Performance has improved.

@tustvold (Contributor) left a comment

This looks good to me.

To check my understanding: previously the logic correctly used the logical null mask to determine which values to intern; however, this was insufficient because the remapping logic was unaware of this and would blindly remap the keys that referred to null values to arbitrary new values.

This PR allows the interner to intern a null value, thereby reserving a new key in the output dictionary for this purpose.

@tustvold tustvold merged commit 46afdd8 into apache:main Feb 19, 2025
26 checks passed
@kawadakk kawadakk deleted the patch/fix-dictionary-merge-preserve-null-values branch February 19, 2025 10:22
friendlymatthew pushed a commit to friendlymatthew/arrow-rs that referenced this pull request Feb 21, 2025
Preserve null dictionary values in interleave and concat kernels (apache#7144)

* fix(select): preserve null values in `merge_dictionary_values`

This function internally computes value masks describing which values
from input dictionaries should remain in the output. Values never
referenced by keys are considered redundant. Null values were
considered redundant, but they are now preserved as of this commit.

This change is necessary because keys can reference null values. Before
this commit, the entries of `MergedDictionaries::key_mappings`
corresponding to null values were left unset. This caused `concat` and
`interleave` to remap all elements referencing them to whatever value
happened to be at index 0, producing an erroneous result.

* test(select): add test case `concat::test_string_dictionary_array_nulls_in_values`

This test case passes dictionary arrays containing null values (but no
null keys) to `concat`.

* test(select): add test case `interleave::test_interleave_dictionary_nulls`

This test case passes two dictionary arrays each containing null values
or keys to `interleave`.

* refactor(select): add type alias for `Interner` bucket

Addresses `clippy::type-complexity`.
Labels
arrow Changes to the arrow crate

Successfully merging this pull request may close these issues:

Concatenating dictionaries that have nulls in them yields strange results

2 participants