Fix support for empty strings for Dictionary and RLE encodings #3938
This pull request has been linked to Shortcut Story #25823: Error when reading an annotation VCF.
I've only had one coffee and have only ever read this code, but I'm pretty sure the only bug here was that we were checking if output is empty or not. That's a broken assumption (that output must have size > 0) when all strings were empty during encoding.
To be specific, the encoded dictionary and data in the case of all empty strings should be represented by a dictionary of 0x00, and the input buffer of encoded strings would be a single 0x00 (zero valued byte) per string.
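To make that encoding concrete, here is a minimal sketch of a toy dictionary encoder. The layout is an assumption for illustration only (a 1-byte length prefix per dictionary entry and a 1-byte word id per string), and `ToyEncoded` and `toy_dict_encode` are hypothetical names, not TileDB's actual serialization or API.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical toy format (NOT TileDB's real on-disk layout):
//   dictionary: for each unique word, a 1-byte length followed by its bytes
//   ids:        one 1-byte word id per input string
// The toy only supports fewer than 256 unique words, each shorter than 256 bytes.
struct ToyEncoded {
  std::vector<uint8_t> dictionary;
  std::vector<uint8_t> ids;
};

ToyEncoded toy_dict_encode(const std::vector<std::string_view>& input) {
  ToyEncoded enc;
  std::map<std::string, uint8_t> word_ids;
  for (std::string_view s : input) {
    // The next free id is the current number of known words.
    auto [it, inserted] = word_ids.try_emplace(
        std::string(s), static_cast<uint8_t>(word_ids.size()));
    if (inserted) {
      // New word: append its length prefix, then its bytes (none if empty).
      enc.dictionary.push_back(static_cast<uint8_t>(s.size()));
      enc.dictionary.insert(enc.dictionary.end(), s.begin(), s.end());
    }
    enc.ids.push_back(it->second);
  }
  return enc;
}
```

Under these assumptions, encoding N empty strings yields a dictionary holding the single byte 0x00 and an id buffer of N zero bytes, while the decoded data itself has size 0. That is why a non-empty `output` cannot be a precondition for decompression.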
@@ -68,7 +68,7 @@ void DictEncoding::decompress(
     const uint8_t word_id_size,
     span<std::byte> output,
     span<uint64_t> output_offsets) {
-  if (input.empty() || output.empty() || word_id_size == 0) {
+  if (input.empty() || word_id_size == 0) {
Having just read the code, I'm pretty sure the removal of checking `output.empty()` here and in `dict_compressor.h:186` is the entire fix for the issue.
So, there are 2 reasons those 2 checks are not enough:
- In `deserialize_dictionary`, if I have an empty string as an entry, I am not properly incrementing the index into the `serialized_dictionary` (`in_index`) in `dict_compressor.h:261`, and I am accessing out of bounds. I special-cased for `str_len = 0` there and reverted the rest of the changes. Can you check whether you like that solution, or do you have a better idea for how to handle that?
- In `decompress` I still need the special case for `output.empty()` in `dict_compressor.h:193`, because otherwise I get a `SIGABRT`: I am violating a contract of the `span` implementation we are using, which checks `TCB_SPAN_EXPECT(idx < size());` when writing to a `span` at some index `idx`, and in the case of an empty `output` we get a failing `0 < 0`.
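The first point can be sketched as follows. This is an illustration under an assumed layout (each entry is a 1-byte length followed by that many bytes), and `toy_deserialize_dictionary` is a hypothetical helper, not the function in `dict_compressor.h`. The idea is that the read index always steps past the length prefix, so a zero-length entry still consumes input.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical layout: each entry is a 1-byte length followed by that many bytes.
std::vector<std::string> toy_deserialize_dictionary(
    const std::vector<uint8_t>& serialized) {
  std::vector<std::string> dict;
  std::size_t in_index = 0;
  while (in_index < serialized.size()) {
    const std::size_t str_len = serialized[in_index];
    // Always step past the length prefix, even when str_len == 0.
    in_index += 1;
    // Reject entries that claim more bytes than remain in the buffer.
    if (str_len > serialized.size() - in_index) {
      throw std::runtime_error("truncated dictionary entry");
    }
    const auto first =
        serialized.begin() + static_cast<std::ptrdiff_t>(in_index);
    const auto last = first + static_cast<std::ptrdiff_t>(str_len);
    dict.emplace_back(first, last);  // an empty range yields an empty string
    in_index += str_len;
  }
  return dict;
}
```

With this shape there is no separate branch for `str_len == 0`: the empty range simply produces an empty string, and the bounds check runs before any byte is read.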
My fix is still not correct for the all-empty-strings case because of the last special case I mentioned, where I exit if `output.empty()`: I don't reconstruct the `offsets_buffer`. If I do reconstruct the offsets I get a heap buffer overflow though; I am still fighting it to understand why.
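One way to picture a decompress loop that still rebuilds the offsets when the data buffer is empty is sketched below. It is not the code in this PR: `toy_decompress` is a hypothetical function that uses `std::vector` and 1-byte word ids as simplifying assumptions. It writes every offset, and it only touches `output` when the word has a non-zero length.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical sketch: `output` and `output_offsets` are preallocated by the
// caller. `output` may legitimately have size 0 when every string is empty.
bool toy_decompress(const std::vector<uint8_t>& ids,
                    const std::vector<std::string>& dict,
                    std::vector<char>& output,
                    std::vector<uint64_t>& output_offsets) {
  if (ids.empty() || output_offsets.size() < ids.size()) {
    return false;
  }
  std::size_t out_pos = 0;
  for (std::size_t i = 0; i < ids.size(); ++i) {
    if (ids[i] >= dict.size()) {
      return false;  // word id does not exist in the dictionary
    }
    const std::string& word = dict[ids[i]];
    // The offset is recorded for every string, including empty ones.
    output_offsets[i] = out_pos;
    if (word.empty()) {
      continue;  // nothing to copy, so `output` is never indexed
    }
    if (word.size() > output.size() - out_pos) {
      return false;  // would write past the end of `output`
    }
    std::memcpy(output.data() + out_pos, word.data(), word.size());
    out_pos += word.size();
  }
  return true;
}
```

Skipping the copy for empty words keeps an empty `output` from ever being indexed, and the explicit capacity check turns a would-be overflow into a reported failure.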
Ah, yeah, that span behavior is odd since we're indexing with a zero length array. How does this approach look to you?
https://github.com/TileDB-Inc/TileDB/compare/pd/ch25823/support_for_empty_strings_dict_rle
Locally it passes the new test with `--enable-debug --enable-assertions` on the bootstrap line. I haven't run a full test suite with it though.
I should explain that encoding to make sure I'm not just a single coffee awake. For the encoded dictionary, it's a single ... For the encoded strings that are all empty, we end up with an array of ...
I was suddenly worried that memcpy with a zero length specified isn't guaranteed to not dereference either the input or output buffers.
+1
This is a problem found in VCF: when using a string dimension compressed with a dictionary filter, we get an error if the value of the string dimension is always an empty string (""): "terminate called after throwing an instance of 'tiledb::TileDBError' what(): [TileDB::Task] Error: Caught std::exception: Failed decompressing dictionary-encoded strings; empty input arguments." The failure would not occur when reading an array that was written using a string dimension data buffer that contained N entries with value 0, but only when writing using an empty string dimension data buffer. I also added support for encoding empty strings with RLE by removing a check (this was previously done for Dictionary in https://github.com/TileDB-Inc/TileDB/pull/3493/files). Co-authored-by: Paul J. Davis
This is a problem found in VCF by @gspowley: when using a string dimension compressed with a dictionary filter, we get an error if the value of the string dimension is always an empty string (""): "terminate called after throwing an instance of 'tiledb::TileDBError' what(): [TileDB::Task] Error: Caught std::exception: Failed decompressing dictionary-encoded strings; empty input arguments."
The failure would not occur when reading an array that was written using a string dimension data buffer that contained N entries with value 0, but only when writing using an empty string dimension data buffer.
I also added support for encoding empty strings with RLE by removing a check (this was previously done for Dictionary in https://github.com/TileDB-Inc/TileDB/pull/3493/files).
Note that in order to fix this issue I had to correct the on-disk serialization of dictionaries, so the fix won't work for existing arrays written in the previously non-working way.
TYPE: BUG
DESC: Fix support for empty strings for Dictionary and RLE encodings