Preserve null dictionary values in interleave and concat kernels #7144

Merged

Conversation

kawadakk (Contributor)

Which issue does this PR close?

Closes #6302.

Rationale for this change

Fixing a bug.

What changes are included in this PR?

`arrow_select::dictionary::merge_dictionary_values` now preserves null values from input dictionaries. This is necessary because keys can reference null values. Without this change, the entries of `MergedDictionaries::key_mappings` corresponding to null values would be left unset, causing `concat` and `interleave` to remap every element referencing them to whatever value happened to be at index 0, producing erroneous results.

The handling of null keys (physical nulls) is unchanged.
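
As an illustration, here is a minimal sketch of the failure mode being fixed, using illustrative data of my own choosing (this is not a test from this PR): a dictionary whose values contain a null referenced by its keys, concatenated with itself.

```rust
use std::sync::Arc;

use arrow_array::types::Int32Type;
use arrow_array::{Array, DictionaryArray, Int32Array, StringArray};
use arrow_select::concat::concat;

fn main() {
    // Dictionary whose *values* contain a null referenced by the keys:
    // values = ["a", null], keys = [0, 1, 1] => logically ["a", null, null].
    let values = StringArray::from(vec![Some("a"), None]);
    let keys = Int32Array::from(vec![0, 1, 1]);
    let dict = DictionaryArray::<Int32Type>::try_new(keys, Arc::new(values)).unwrap();

    let concatenated = concat(&[&dict, &dict]).unwrap();
    let result = concatenated
        .as_any()
        .downcast_ref::<DictionaryArray<Int32Type>>()
        .unwrap();

    // Resolve each key against the merged dictionary values.
    let merged = result
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .unwrap();
    let logical: Vec<Option<&str>> = result
        .keys()
        .iter()
        .map(|key| match key {
            Some(k) if merged.is_valid(k as usize) => Some(merged.value(k as usize)),
            _ => None,
        })
        .collect();

    // With the fix, nulls referenced by keys survive the merge; previously
    // they could come back as the value stored at index 0 ("a").
    assert_eq!(
        logical,
        vec![Some("a"), None, None, Some("a"), None, None]
    );
}
```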

Are there any user-facing changes?

This function internally computes value masks describing which values
from input dictionaries should remain in the output. Values never
referenced by keys are considered redundant. Null values were
considered redundant, but they are now preserved as of this commit.

This change is necessary because keys can reference null values. Before
this commit, the entries of `MergedDictionaries::key_mappings`
corresponding to null values were left unset. This caused `concat` and
`interleave` to remap all elements referencing them to whatever value
happened to be at index 0, producing an erroneous result.
test(select): add test case `concat::test_string_dictionary_array_nulls_in_values`

This test case passes dictionary arrays containing null values (but no
null keys) to `concat`.
test(select): add test case `interleave::test_interleave_dictionary_nulls`

This test case passes two dictionary arrays each containing null values
or keys to `interleave`.
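
A rough sketch of the kind of input that second test exercises (the data and assertion here are illustrative assumptions, not the actual test added in this PR): one input has a null dictionary value referenced by a key, the other has a physically null key.

```rust
use std::sync::Arc;

use arrow_array::types::Int32Type;
use arrow_array::{Array, DictionaryArray, Int32Array, StringArray};
use arrow_select::interleave::interleave;

fn main() {
    // One input has a null dictionary *value* referenced by a key ...
    let dict_a = DictionaryArray::<Int32Type>::try_new(
        Int32Array::from(vec![0, 1]), // key 1 points at the null value
        Arc::new(StringArray::from(vec![Some("x"), None])),
    )
    .unwrap();

    // ... the other has a physically null *key*.
    let dict_b = DictionaryArray::<Int32Type>::try_new(
        Int32Array::from(vec![Some(0), None]),
        Arc::new(StringArray::from(vec![Some("y")])),
    )
    .unwrap();

    // Pick elements as (array_index, element_index) pairs.
    let result = interleave(&[&dict_a, &dict_b], &[(0, 1), (1, 0), (1, 1), (0, 0)]).unwrap();

    // Logically the result is [null, "y", null, "x"]; the element taken from
    // dict_a's null value must stay null instead of being remapped to index 0.
    let validity: Vec<bool> = result.logical_nulls().unwrap().iter().collect();
    assert_eq!(validity, vec![false, true, false, true]);
}
```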
@github-actions bot added the arrow (Changes to the arrow crate) label on Feb 17, 2025
@tustvold (Contributor)

Have you run the benchmarks to confirm what impact this has?

If it regresses performance, one option might be to compute the output null mask from the logical nulls of the inputs.
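
For reference, a minimal sketch of that alternative for the `concat` case; `concatenated_logical_nulls` is a hypothetical helper name, not an existing arrow-rs API, and the real `concat` internals would differ.

```rust
use arrow_array::{Array, ArrayRef};
use arrow_buffer::{BooleanBufferBuilder, NullBuffer};

/// Hypothetical helper (not an existing arrow-rs API): derive the output null
/// mask for `concat` from the logical nulls of the inputs, rather than
/// preserving null values inside the merged dictionary.
fn concatenated_logical_nulls(inputs: &[ArrayRef]) -> Option<NullBuffer> {
    let total: usize = inputs.iter().map(|a| a.len()).sum();
    let mut builder = BooleanBufferBuilder::new(total);
    let mut any_null = false;
    for array in inputs {
        match array.logical_nulls() {
            Some(nulls) => {
                any_null = true;
                // Append this input's validity bits (true = valid).
                for valid in nulls.iter() {
                    builder.append(valid);
                }
            }
            None => builder.append_n(array.len(), true),
        }
    }
    any_null.then(|| NullBuffer::new(builder.finish()))
}
```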

refactor(select): add type alias for `Interner` bucket

Addresses `clippy::type-complexity`.
@kawadakk (Contributor, Author)

kawadakk commented Feb 17, 2025

@tustvold I tried running `cargo bench --bench concatenate_kernel`, but the results fluctuate between "regressed" and "improved" due to measurement noise on my machine. I'd say it has no significant impact, positive or negative.

concat str_dict 1024    time:   [1.6693 µs 1.6702 µs 1.6714 µs]
                        change: [-2.0002% -1.8167% -1.5443%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

concat str_dict_sparse 1024
                        time:   [4.6988 µs 4.7002 µs 4.7017 µs]
                        change: [+1.3933% +1.4438% +1.4888%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

concat str nulls 1024   time:   [3.8924 µs 3.8939 µs 3.8956 µs]
                        change: [-0.4394% -0.3634% -0.2873%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

Theoretically, this can improve performance because `merge_dictionary_values` no longer calls `DictionaryArray::logical_nulls`, which could build a new boolean array. On the other hand, `merge_dictionary_values` will no longer skip over null values, but I think it would be pathological for a dictionary array to contain many null values.

@kawadakk (Contributor, Author)

Here's the result of `cargo bench --bench interleave_kernels` (with irrelevant test cases omitted; CPU frequencies are now locked to 2 GHz):

interleave dict(20, 0.0) 100 [0..100, 100..230, 450..1000]
                        time:   [4.2027 µs 4.2035 µs 4.2043 µs]
                        change: [-1.0518% -1.0119% -0.9742%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  7 (7.00%) high mild

interleave dict(20, 0.0) 400 [0..100, 100..230, 450..1000]
                        time:   [10.136 µs 10.142 µs 10.148 µs]
                        change: [-0.7627% -0.6643% -0.5668%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000]
                        time:   [23.133 µs 23.155 µs 23.177 µs]
                        change: [+0.3312% +0.4051% +0.4769%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
                        time:   [23.069 µs 23.092 µs 23.115 µs]
                        change: [-0.9958% -0.9284% -0.8580%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

interleave dict_sparse(20, 0.0) 100 [0..100, 100..230, 450..1000]
                        time:   [4.2292 µs 4.2316 µs 4.2340 µs]
                        change: [-0.7215% -0.6577% -0.5871%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

interleave dict_sparse(20, 0.0) 400 [0..100, 100..230, 450..1000]
                        time:   [10.266 µs 10.284 µs 10.303 µs]
                        change: [+1.6792% +1.8107% +1.9508%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000]
                        time:   [23.093 µs 23.104 µs 23.116 µs]
                        change: [+0.4904% +0.5596% +0.6294%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
                        time:   [23.207 µs 23.224 µs 23.241 µs]
                        change: [-0.6322% -0.4640% -0.2996%] (p = 0.00 < 0.05)
                        Change within noise threshold.

interleave dict_distinct 100
                        time:   [3.4552 µs 3.4567 µs 3.4581 µs]
                        change: [-4.0231% -3.7210% -3.4131%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

interleave dict_distinct 1024
                        time:   [3.4545 µs 3.4556 µs 3.4567 µs]
                        change: [+1.0769% +1.1664% +1.2568%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

interleave dict_distinct 2048
                        time:   [3.3727 µs 3.3753 µs 3.3784 µs]
                        change: [-1.9316% -1.8500% -1.7623%] (p = 0.00 < 0.05)
                        Performance has improved.

@tustvold (Contributor) left a comment

This looks good to me.

To check my understanding: previously the logic correctly used the logical null mask to determine which values to intern; however, this was insufficient because the remapping logic was unaware of this and would blindly remap the keys that referred to null values to arbitrary new values.

This PR allows the interner to intern a null value, thereby reserving a new key in the output dictionary for this purpose.

@tustvold tustvold merged commit 46afdd8 into apache:main Feb 19, 2025
26 checks passed
@kawadakk kawadakk deleted the patch/fix-dictionary-merge-preserve-null-values branch February 19, 2025 10:22
friendlymatthew pushed a commit to friendlymatthew/arrow-rs that referenced this pull request Feb 21, 2025
Preserve null dictionary values in interleave and concat kernels (apache#7144)

* fix(select): preserve null values in `merge_dictionary_values`

This function internally computes value masks describing which values
from input dictionaries should remain in the output. Values never
referenced by keys are considered redundant. Null values were
considered redundant, but they are now preserved as of this commit.

This change is necessary because keys can reference null values. Before
this commit, the entries of `MergedDictionaries::key_mappings`
corresponding to null values were left unset. This caused `concat` and
`interleave` to remap all elements referencing them to whatever value
happened to be at index 0, producing an erroneous result.

* test(select): add test case `concat::test_string_dictionary_array_nulls_in_values`

This test case passes dictionary arrays containing null values (but no
null keys) to `concat`.

* test(select): add test case `interleave::test_interleave_dictionary_nulls`

This test case passes two dictionary arrays each containing null values
or keys to `interleave`.

* refactor(select): add type alias for `Interner` bucket

Addresses `clippy::type-complexity`.
Labels
arrow Changes to the arrow crate

Successfully merging this pull request may close these issues:

Concatenating dictionaries that have nulls in them yields strange results

2 participants