Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix support for empty strings for Dictionary and RLE encodings #3938

Merged
merged 7 commits into from
Mar 8, 2023

Conversation

ypatia
Copy link
Member

@ypatia ypatia commented Mar 2, 2023

This is a problem found in VCF by @gspowley where if using a string dimension compressed with a dictionary filter, we get an error if the value of the string dimension is always an empty string ("") :

terminate called after throwing an instance of 'tiledb::TileDBError'
  what():  [TileDB::Task] Error: Caught std::exception: Failed decompressing dictionary-encoded strings; empty input arguments.

The failure would not occur when reading an array that was written using a string dimension data buffer that contained N entries with value 0, but only when writing using an empty string dimension data buffer .

I also added support for encoding empty strings with RLE by removing a check (this was previously done for Dictionary in https://github.com/TileDB-Inc/TileDB/pull/3493/files .

Note that in order to fix this issue I had to correct the on disk serialization of dictionaries, so the fix won't work for existing arrays written in the so far non-working way.


TYPE: BUG
DESC: Fix support for empty strings for Dictionary and RLE encodings

@ypatia ypatia requested a review from Shelnutt2 March 2, 2023 15:26
@shortcut-integration
Copy link

This pull request has been linked to Shortcut Story #25823: Error when reading an annotation VCF.

@ypatia ypatia requested a review from gspowley March 2, 2023 15:26
@ypatia ypatia force-pushed the yt/ch25823/support_for_empty_strings_dict_rle branch from c80b8dc to 7c431fa Compare March 2, 2023 15:44
Copy link
Contributor

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only had one coffee and have only ever read this code, but I'm pretty sure the only bug here was that we were checking if output is empty or not. That's a broken assumption (that output must have size > 0) when all strings were empty during encoding.

To be specific, the encoded dictionary and data in the case of all empty strings should be represented by a dictionary of 0x00, and the input buffer of encoded strings would be a single 0x00 (zero valued byte) per string.

@@ -68,7 +68,7 @@ void DictEncoding::decompress(
const uint8_t word_id_size,
span<std::byte> output,
span<uint64_t> output_offsets) {
if (input.empty() || output.empty() || word_id_size == 0) {
if (input.empty() || word_id_size == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having just read the code, I'm pretty sure the removal of checking output.empty() here and in dict_compressor.h:186 is the entire fix for the issue.

Copy link
Member Author

@ypatia ypatia Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, there are 2 reasons those 2 checks are not enough:

  • In deserialize_dictionary, if I have an empty string as entry, I am not properly incrementing the index to the serialized_dictionary (in_index) in dict_compressor.h:261 and I am accessing out of bounds. I special-cased for str_len = 0 there and reverted the rest of the changes, can you check if you like that solution or you have a better idea on how to handle that?
  • In decompress I still need the special case for output.empty() in dict_compressor.h:193 because otherwise I get a SIGABRT because I am violating a contract of the implementation of span we are using that assumes that when writing to a span at some index idx : TCB_SPAN_EXPECT(idx < size()); , and in case of an empty output we get a failing 0 < 0 .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My fix is still not correct for the all empty strings case because of the last special case I mentioned where I exit if output.empty(), because I don't reconstruct the offsets_buffer. If I do reconstruct them I get a heap buffer overflow though, I am still fighting it to understand why.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, that span behavior is odd since we're indexing with a zero length array. How does this approach look to you?

https://github.com/TileDB-Inc/TileDB/compare/pd/ch25823/support_for_empty_strings_dict_rle

Locally it passes the new test with --enable-debug --enable-assertions on the bootstrap line. I haven't run a full test suite with it though.

@davisp
Copy link
Contributor

davisp commented Mar 3, 2023

I should explain that encoding to make sure I'm not just a single coffee awake.

For the encoded dictionary, its a single 0x00 byte because all string lengths are less than 255 bytes (becuase they're all zero). Thus reading the dictionary, we read a single byte for a length, which is zero. Because its the first string in the dictionary, its "string id" is 0.

For the encoded strings that are all empty, we end up with the an array of 0x00 bytes that all represent empty strings. The reason here is that we know we had fewer than 255 encoded strings, thus the data type for each encoded string is a single byte. So the logic is we read a single byte, the byte contains the word id (0x00 for zero, which means the first string in the encoded dictionary) which is an empty string, which means we copy zero bytes to the output buffer and increment our decoding position by one (the size of our encoded word_id data type).

@ypatia ypatia force-pushed the yt/ch25823/support_for_empty_strings_dict_rle branch from 6140287 to 5885399 Compare March 6, 2023 13:12
@ypatia ypatia force-pushed the yt/ch25823/support_for_empty_strings_dict_rle branch from e0a1d25 to 76bbfbd Compare March 7, 2023 13:57
I was suddenly worried that memcpy with a zero length specified isn't
guaranteed to not dereference either the input or output buffers.
Copy link
Contributor

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

@gspowley gspowley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ypatia and @davisp! This fixes the issue in VCF.

@ypatia ypatia merged commit 1307810 into dev Mar 8, 2023
@ypatia ypatia deleted the yt/ch25823/support_for_empty_strings_dict_rle branch March 8, 2023 14:48
github-actions bot pushed a commit that referenced this pull request Mar 8, 2023
This is a problem found in VCF where if using a string dimension compressed with a dictionary filter, we get an error if the value of the string dimension is always an empty string ("") :

"terminate called after throwing an instance of 'tiledb::TileDBError'
  what():  [TileDB::Task] Error: Caught std::exception: Failed decompressing dictionary-encoded strings; empty input arguments."

The failure would not occur when reading an array that was written using a string dimension data buffer that contained N entries with value 0, but only when writing using an empty string dimension data buffer .

I also added support for encoding empty strings with RLE by removing a check (this was previously done for Dictionary in https://github.com/TileDB-Inc/TileDB/pull/3493/files .

Co-authored-by: Paul J. Davis <[email protected]>
github-actions bot pushed a commit that referenced this pull request Mar 8, 2023
This is a problem found in VCF where if using a string dimension compressed with a dictionary filter, we get an error if the value of the string dimension is always an empty string ("") :

"terminate called after throwing an instance of 'tiledb::TileDBError'
  what():  [TileDB::Task] Error: Caught std::exception: Failed decompressing dictionary-encoded strings; empty input arguments."

The failure would not occur when reading an array that was written using a string dimension data buffer that contained N entries with value 0, but only when writing using an empty string dimension data buffer .

I also added support for encoding empty strings with RLE by removing a check (this was previously done for Dictionary in https://github.com/TileDB-Inc/TileDB/pull/3493/files .

Co-authored-by: Paul J. Davis <[email protected]>
ypatia added a commit that referenced this pull request Mar 9, 2023
#3945)

This is a problem found in VCF where if using a string dimension compressed with a dictionary filter, we get an error if the value of the string dimension is always an empty string ("") :

"terminate called after throwing an instance of 'tiledb::TileDBError'
  what():  [TileDB::Task] Error: Caught std::exception: Failed decompressing dictionary-encoded strings; empty input arguments."

The failure would not occur when reading an array that was written using a string dimension data buffer that contained N entries with value 0, but only when writing using an empty string dimension data buffer .

I also added support for encoding empty strings with RLE by removing a check (this was previously done for Dictionary in https://github.com/TileDB-Inc/TileDB/pull/3493/files .

Co-authored-by: Ypatia Tsavliri <[email protected]>
Co-authored-by: Paul J. Davis <[email protected]>
ypatia added a commit that referenced this pull request Mar 9, 2023
#3944)

This is a problem found in VCF where if using a string dimension compressed with a dictionary filter, we get an error if the value of the string dimension is always an empty string ("") :

"terminate called after throwing an instance of 'tiledb::TileDBError'
  what():  [TileDB::Task] Error: Caught std::exception: Failed decompressing dictionary-encoded strings; empty input arguments."

The failure would not occur when reading an array that was written using a string dimension data buffer that contained N entries with value 0, but only when writing using an empty string dimension data buffer .

I also added support for encoding empty strings with RLE by removing a check (this was previously done for Dictionary in https://github.com/TileDB-Inc/TileDB/pull/3493/files .

Co-authored-by: Ypatia Tsavliri <[email protected]>
Co-authored-by: Paul J. Davis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants