Preserve null dictionary values in `interleave` and `concat` kernels #7144
This function internally computes value masks describing which values from the input dictionaries should remain in the output. Values never referenced by keys are considered redundant. Null values used to be considered redundant as well, but they are preserved as of this commit. This change is necessary because keys can reference null values. Before this commit, the entries of `MergedDictionaries::key_mappings` corresponding to null values were left unset. This caused `concat` and `interleave` to remap all elements referencing them to whatever value was at index 0, producing an erroneous result.
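The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the arrow-rs code: `Dict`, `merge_values`, `concat_decoded`, `example_inputs`, and the `keep_nulls` flag are hypothetical names, the model keeps every non-null value instead of only the referenced ones, and it uses a linear search where the real code uses an interner.

```rust
// Simplified model (NOT the arrow-rs implementation): a dictionary array is
// a list of optional keys plus a list of optional values.
type Value = Option<&'static str>;

struct Dict {
    keys: Vec<Option<usize>>,
    values: Vec<Value>,
}

/// Merge the values of all inputs and build, per input, a table that maps
/// old keys to new keys. `keep_nulls` is a hypothetical switch between the
/// old behavior (null values skipped) and the new one (null values kept).
fn merge_values(dicts: &[Dict], keep_nulls: bool) -> (Vec<Value>, Vec<Vec<usize>>) {
    let mut merged: Vec<Value> = Vec::new();
    let mut mappings = Vec::new();
    for d in dicts {
        // Entries that are never assigned keep the default 0.
        let mut mapping = vec![0usize; d.values.len()];
        for (old, v) in d.values.iter().enumerate() {
            if v.is_none() && !keep_nulls {
                continue; // old behavior: the mapping entry stays unset
            }
            let new = match merged.iter().position(|m| m == v) {
                Some(i) => i,
                None => {
                    merged.push(*v);
                    merged.len() - 1
                }
            };
            mapping[old] = new;
        }
        mappings.push(mapping);
    }
    (merged, mappings)
}

/// Concatenate the inputs and decode every element to its logical value.
fn concat_decoded(dicts: &[Dict], keep_nulls: bool) -> Vec<Value> {
    let (merged, mappings) = merge_values(dicts, keep_nulls);
    let mut out = Vec::new();
    for (d, mapping) in dicts.iter().zip(&mappings) {
        for k in d.keys.iter().copied() {
            out.push(k.and_then(|k| merged[mapping[k]]));
        }
    }
    out
}

/// Two dictionaries with null values but no null keys.
fn example_inputs() -> Vec<Dict> {
    vec![
        Dict { keys: vec![Some(0), Some(1)], values: vec![Some("a"), None] },
        Dict { keys: vec![Some(0), Some(1)], values: vec![Some("b"), None] },
    ]
}

fn main() {
    let dicts = example_inputs();
    // Old behavior: keys pointing at null values are remapped to index 0.
    println!("{:?}", concat_decoded(&dicts, false)); // [Some("a"), Some("a"), Some("b"), Some("a")]
    // New behavior: null values keep their own slot in the merged dictionary.
    println!("{:?}", concat_decoded(&dicts, true)); // [Some("a"), None, Some("b"), None]
}
```

In this model, the second and fourth elements are logically null in the inputs, and only the `keep_nulls = true` path reproduces that in the output.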
…ls_in_values` This test case passes dictionary arrays containing null values (but no null keys) to `concat`.
…ulls` This test case passes two dictionary arrays each containing null values or keys to `interleave`.
Have you run the benchmarks to confirm what impact this has? If it regresses performance, one option might be to compute the output null mask from the logical nulls of the inputs.
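For readers unfamiliar with the term, the "logical nulls" of a dictionary array can be sketched as follows. This is a minimal model under the same simplification of keys and values as plain slices; `logical_nulls` here is a hypothetical free function written for illustration, not the arrow-rs API.

```rust
/// An element is logically null if its key is null (a physical null) or
/// if its key points at a null dictionary value.
fn logical_nulls(keys: &[Option<usize>], values: &[Option<&str>]) -> Vec<bool> {
    keys.iter()
        .map(|k| match k {
            None => true,
            Some(i) => values[*i].is_none(),
        })
        .collect()
}

fn main() {
    let values = [Some("a"), None];
    let keys = [Some(0), Some(1), None];
    // The second element references a null value, the third has a null key.
    println!("{:?}", logical_nulls(&keys, &values)); // [false, true, true]
}
```

Under the alternative suggested in the comment, such per-input masks would be combined into the null mask of the output, so null values would not need a slot in the merged dictionary.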
Addresses `clippy::type-complexity`.
@tustvold I tried running
Theoretically, this can improve performance because
Here's the result of
This looks good to me.
To check my understanding: previously, the logic correctly used the logical null mask to determine which values to intern. However, this was insufficient, because the remapping logic was unaware of it and would blindly remap the keys that referred to null values to arbitrary new values.
This PR allows the interner to intern a null value, thereby reserving a new key in the output dictionary for this purpose.
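The idea of interning a null value can be sketched as follows. This is an illustrative model only: `NullAwareInterner` is a hypothetical type built on a `HashMap`, and it makes no claim about how the `Interner` in arrow-rs is structured internally.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an interner: maps each distinct value,
/// including `None`, to a key in the output dictionary.
#[derive(Default)]
struct NullAwareInterner {
    lookup: HashMap<Option<String>, usize>,
    values: Vec<Option<String>>,
}

impl NullAwareInterner {
    /// Return the output key for `value`, reserving a new one on first sight.
    fn intern(&mut self, value: Option<&str>) -> usize {
        let owned = value.map(str::to_owned);
        if let Some(&key) = self.lookup.get(&owned) {
            return key;
        }
        let key = self.values.len();
        self.values.push(owned.clone());
        self.lookup.insert(owned, key);
        key
    }
}

fn main() {
    let mut interner = NullAwareInterner::default();
    let a = interner.intern(Some("a"));
    let null_1 = interner.intern(None);
    let b = interner.intern(Some("b"));
    let null_2 = interner.intern(None);
    // Both null values share one reserved key in the output dictionary.
    println!("{a} {null_1} {b} {null_2}"); // 0 1 2 1
    println!("{:?}", interner.values); // [Some("a"), None, Some("b")]
}
```

Because the null value has a key of its own, every input key that referenced a null value has a well-defined target during remapping.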
…pache#7144)
* fix(select): preserve null values in `merge_dictionary_values`
  This function internally computes value masks describing which values from the input dictionaries should remain in the output. Values never referenced by keys are considered redundant. Null values used to be considered redundant as well, but they are preserved as of this commit. This change is necessary because keys can reference null values. Before this commit, the entries of `MergedDictionaries::key_mappings` corresponding to null values were left unset. This caused `concat` and `interleave` to remap all elements referencing them to whatever value was at index 0, producing an erroneous result.
* test(select): add test case `concat::test_string_dictionary_array_nulls_in_values`
  This test case passes dictionary arrays containing null values (but no null keys) to `concat`.
* test(select): add test case `interleave::test_interleave_dictionary_nulls`
  This test case passes two dictionary arrays each containing null values or keys to `interleave`.
* refactor(select): add type alias for `Interner` bucket
  Addresses `clippy::type-complexity`.
Which issue does this PR close?
Closes #6302.
Rationale for this change
Fixing a bug.
What changes are included in this PR?
`arrow_select::dictionary::merge_dictionary_values` now preserves null values from input dictionaries. This is necessary because keys can reference null values. Without this change, the entries of `MergedDictionaries::key_mappings` corresponding to null values would be left unset, causing `concat` and `interleave` to remap all elements referencing them to whatever value was at index 0, producing an erroneous result.
The handling of null keys (physical nulls) is unchanged.
Are there any user-facing changes?