Fix support for empty strings for Dictionary and RLE encodings #3938

ypatia · 2023-03-02T15:26:38Z

This fixes a problem found in VCF by @gspowley: when a string dimension is compressed with a dictionary filter, we get an error if every value of the string dimension is an empty string (""):

terminate called after throwing an instance of 'tiledb::TileDBError'
  what():  [TileDB::Task] Error: Caught std::exception: Failed decompressing dictionary-encoded strings; empty input arguments.

The failure does not occur when reading an array that was written from a string dimension data buffer containing N entries with value 0, only when the array was written from an empty string dimension data buffer.

I also added support for encoding empty strings with RLE by removing a check (this was previously done for Dictionary in https://github.com/TileDB-Inc/TileDB/pull/3493/files).
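As a rough illustration of why an empty string is a perfectly valid RLE value, here is a minimal sketch of run-length encoding over strings. It is not TileDB's RLE implementation or byte layout; `Runs`, `rle_encode` and `rle_decode` are hypothetical names, and the runs are kept as in-memory pairs rather than serialized bytes.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory form of an RLE stream: (run length, value) pairs.
using Runs = std::vector<std::pair<uint64_t, std::string>>;

// Collapse consecutive equal strings into runs. An empty string is treated
// like any other value: it has a zero-byte payload but still counts.
Runs rle_encode(const std::vector<std::string>& input) {
  Runs runs;
  for (const std::string& s : input) {
    if (!runs.empty() && runs.back().second == s) {
      ++runs.back().first;  // extend the current run
    } else {
      runs.emplace_back(1, s);  // start a new run
    }
  }
  return runs;
}

// Expand each run back into `count` copies of its value.
std::vector<std::string> rle_decode(const Runs& runs) {
  std::vector<std::string> out;
  for (const auto& [count, value] : runs) {
    out.insert(out.end(), count, value);
  }
  return out;
}
```

With three empty strings as input, the sketch yields a single run of length 3 whose value has zero bytes, and decoding restores the three empty strings. A check that rejects zero-length data would refuse this otherwise valid input.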

~~Note that in order to fix this issue I had to correct the on disk serialization of dictionaries, so the fix won't work for existing arrays written in the so far non-working way.~~

TYPE: BUG
DESC: Fix support for empty strings for Dictionary and RLE encodings

shortcut-integration · 2023-03-02T15:26:44Z

This pull request has been linked to Shortcut Story #25823: Error when reading an annotation VCF.

davisp

I've only had one coffee and have only ever read this code, but I'm pretty sure the only bug here was that we were checking if output is empty or not. That's a broken assumption (that output must have size > 0) when all strings were empty during encoding.

To be specific, the encoded dictionary and data in the case of all empty strings should be represented by a dictionary of 0x00, and the input buffer of encoded strings would be a single 0x00 (zero valued byte) per string.

davisp · 2023-03-03T15:45:21Z

tiledb/sm/compressors/dict_compressor.cc

@@ -68,7 +68,7 @@ void DictEncoding::decompress(
    const uint8_t word_id_size,
    span<std::byte> output,
    span<uint64_t> output_offsets) {
-  if (input.empty() || output.empty() || word_id_size == 0) {
+  if (input.empty() || word_id_size == 0) {


Having just read the code, I'm pretty sure the removal of checking output.empty() here and in dict_compressor.h:186 is the entire fix for the issue.

So, there are 2 reasons those 2 checks are not enough:

In deserialize_dictionary, if an entry is an empty string, I am not properly incrementing the index into the serialized_dictionary (in_index) in dict_compressor.h:261, so I end up accessing out of bounds. I special-cased str_len = 0 there and reverted the rest of the changes. Can you check whether you like that solution, or whether you have a better idea for handling it?

In decompress I still need the special case for output.empty() in dict_compressor.h:193. Without it I get a SIGABRT, because I violate a contract of the span implementation we are using: writing to a span at some index idx asserts TCB_SPAN_EXPECT(idx < size());, and with an empty output that becomes a failing 0 < 0.
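To make the first point concrete, here is a minimal sketch of parsing a length-prefixed dictionary where the read index always advances past the length prefix, including for empty entries. It assumes a simplified layout of one length byte followed by that many bytes per entry; `parse_dictionary` is a hypothetical helper, not the actual deserialize_dictionary.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Sketch only: entries are [1-byte length][length bytes], back to back.
std::vector<std::string> parse_dictionary(
    const std::vector<uint8_t>& serialized) {
  std::vector<std::string> words;
  std::size_t in_index = 0;
  while (in_index < serialized.size()) {
    const std::size_t str_len = serialized[in_index];
    // Always step over the length prefix, even when str_len == 0.
    // Skipping this step for empty entries desynchronizes the reader.
    in_index += 1;
    if (str_len > serialized.size() - in_index) {
      throw std::runtime_error("truncated dictionary");
    }
    const char* begin =
        reinterpret_cast<const char*>(serialized.data()) + in_index;
    words.emplace_back(begin, str_len);  // zero-length entries are allowed
    in_index += str_len;
  }
  return words;
}
```

Under these assumptions, a serialized dictionary consisting of the single byte 0x00 parses to one entry, the empty string, without reading past the end of the buffer.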

My fix is still not correct for the all-empty-strings case: with the last special case I mentioned, I exit early if output.empty(), so I never reconstruct the offsets_buffer. If I do reconstruct the offsets, though, I get a heap buffer overflow, and I am still trying to understand why.
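One way to think about the all-empty case is that the offsets must always be written while the data buffer is only touched for non-empty words. The sketch below illustrates that idea with `std::vector` buffers instead of the span type used in the codebase; `dict_decode` and its parameters are hypothetical and this is not the actual decompress implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Sketch only: `output` and `output_offsets` are pre-sized by the caller.
// `output` may legitimately have size 0 when every decoded word is empty.
// Returns the number of data bytes written.
std::size_t dict_decode(
    const std::vector<uint8_t>& word_ids,
    const std::vector<std::string>& dict,
    std::vector<std::byte>& output,
    std::vector<uint64_t>& output_offsets) {
  std::size_t out_pos = 0;
  for (std::size_t i = 0; i < word_ids.size(); ++i) {
    const std::string& word = dict.at(word_ids[i]);
    // The offset is recorded for every word, empty or not.
    output_offsets.at(i) = out_pos;
    if (!word.empty()) {
      if (word.size() > output.size() - out_pos) {
        throw std::out_of_range("output buffer too small");
      }
      // Only touch the data buffer when there is something to copy, so an
      // empty buffer is never indexed and memcpy is never called with n == 0.
      std::memcpy(output.data() + out_pos, word.data(), word.size());
      out_pos += word.size();
    }
  }
  return out_pos;
}
```

With a dictionary holding only the empty string and three word ids of 0, this writes offsets 0, 0, 0 and leaves the zero-sized data buffer untouched.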

Ah, yeah, that span behavior is odd since we're indexing with a zero length array. How does this approach look to you?

https://github.com/TileDB-Inc/TileDB/compare/pd/ch25823/support_for_empty_strings_dict_rle

Locally it passes the new test with --enable-debug --enable-assertions on the bootstrap line. I haven't run a full test suite with it though.

davisp · 2023-03-03T17:37:29Z

I should explain that encoding to make sure I'm not just a single coffee awake.

For the encoded dictionary, it's a single 0x00 byte because all string lengths are less than 255 bytes (because they're all zero). Thus, reading the dictionary, we read a single byte for a length, which is zero. Because it's the first string in the dictionary, its "string id" is 0.

For the encoded strings that are all empty, we end up with an array of 0x00 bytes that each represent an empty string. The reason is that we know we had fewer than 255 encoded strings, so the data type for each encoded string is a single byte. So the logic is: we read a single byte; the byte contains the word id (0x00 for zero, meaning the first string in the encoded dictionary), which is an empty string; so we copy zero bytes to the output buffer and advance our decoding position by one (the size of our encoded word_id data type).
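The encoding described above can be sketched as follows. This is a simplified illustration, not the actual DictEncoding code: it assumes one-byte length prefixes and one-byte word ids (few enough unique strings, all short enough, for both to fit in a byte), and `DictEncoded` and `dict_encode` are hypothetical names.

```cpp
#include <cstdint>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical container for the two encoded streams.
struct DictEncoded {
  std::vector<uint8_t> dictionary;  // [1-byte length][bytes] per unique string
  std::vector<uint8_t> word_ids;    // one 1-byte id per input string
};

// Sketch only: assumes fewer than 256 unique strings, each under 256 bytes.
DictEncoded dict_encode(const std::vector<std::string_view>& input) {
  DictEncoded out;
  std::unordered_map<std::string_view, uint8_t> ids;
  for (std::string_view s : input) {
    auto it = ids.find(s);
    if (it == ids.end()) {
      // First time we see this string: assign the next id and append
      // its length prefix and bytes to the dictionary.
      const uint8_t id = static_cast<uint8_t>(ids.size());
      it = ids.emplace(s, id).first;
      out.dictionary.push_back(static_cast<uint8_t>(s.size()));
      out.dictionary.insert(out.dictionary.end(), s.begin(), s.end());
    }
    out.word_ids.push_back(it->second);
  }
  return out;
}
```

Encoding three empty strings with this sketch gives a dictionary of a single 0x00 byte and word ids 0x00 0x00 0x00, which matches the representation described above.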

davisp

+1

gspowley

Thanks @ypatia and @davisp! This fixes the issue in VCF.

This is a problem found in VCF where if using a string dimension compressed with a dictionary filter, we get an error if the value of the string dimension is always an empty string (""):

terminate called after throwing an instance of 'tiledb::TileDBError'
  what():  [TileDB::Task] Error: Caught std::exception: Failed decompressing dictionary-encoded strings; empty input arguments.

The failure would not occur when reading an array that was written using a string dimension data buffer that contained N entries with value 0, but only when writing using an empty string dimension data buffer.

I also added support for encoding empty strings with RLE by removing a check (this was previously done for Dictionary in https://github.com/TileDB-Inc/TileDB/pull/3493/files).

Co-authored-by: Paul J. Davis <[email protected]>

ypatia requested a review from Shelnutt2 March 2, 2023 15:26

ypatia requested a review from gspowley March 2, 2023 15:26

Fix checks and dict for empty strings

7c431fa

ypatia force-pushed the yt/ch25823/support_for_empty_strings_dict_rle branch from c80b8dc to 7c431fa March 2, 2023 15:44

Fix rle tests failing CI, but not dictionary ones

c0fe06f

davisp requested changes Mar 3, 2023

Address @davisp's review comments

5885399

ypatia force-pushed the yt/ch25823/support_for_empty_strings_dict_rle branch from 6140287 to 5885399 March 6, 2023 13:12

ypatia and others added 3 commits March 6, 2023 17:05

Reconstruct offsets for the all empty string case

72925e5

Test commit

363cd1a

Fix RLE for empty strings

76bbfbd

ypatia force-pushed the yt/ch25823/support_for_empty_strings_dict_rle branch from e0a1d25 to 76bbfbd March 7, 2023 13:57

Don't rely on memcpy no-op

a8fed71

I was suddenly worried that memcpy with a zero length specified isn't guaranteed to not dereference either the input or output buffers.

davisp approved these changes Mar 7, 2023

gspowley approved these changes Mar 7, 2023

ypatia added the backport release-2.14, backport release-2.15, and backport release-2.13 labels and removed the backport release-2.13 label Mar 8, 2023

ypatia merged commit 1307810 into dev Mar 8, 2023

ypatia deleted the yt/ch25823/support_for_empty_strings_dict_rle branch March 8, 2023 14:48

github-actions bot mentioned this pull request Mar 8, 2023

[Backport release-2.14] Fix support for empty strings for Dictionary and RLE encodings #3944

Merged

github-actions bot mentioned this pull request Mar 8, 2023

[Backport release-2.15] Fix support for empty strings for Dictionary and RLE encodings #3945

Merged
