Scan Delete Support Part 2: introduce DeleteFileManager skeleton. Use in ArrowReader #950
+220
−43
Second part of delete file read support. See #630.
This PR provides the basis for delete file support within `ArrowReader`.

`DeleteFileManager` is introduced, in skeleton form. Full implementation of its behaviour will be submitted in follow-up PRs.

`DeleteFileManager` is responsible for loading and parsing positional and equality delete files from `FileIO`. Once delete files for a task have been loaded and parsed, `ArrowReader::process_file_scan_task` uses the resulting `DeleteFileManager` in two places:

- `DeleteFileManager::get_positional_delete_indexes_for_data_file` is passed a data file path and returns an `Option<RoaringTreemap>` (previously `Option<Vec<usize>>`) containing the indices of all rows that are positionally deleted in that data file (or `None` if there are none).
- `DeleteFileManager::build_delete_predicate` is invoked with the schema from the file scan task. It returns an `Option<BoundPredicate>` representing the filter predicate derived from all applicable equality deletes: each is transformed into a predicate, the predicates are logically joined into a single predicate, and the result is bound to the schema (or `None` if there are no applicable equality deletes).

This PR integrates the skeleton of the `DeleteFileManager` into `ArrowReader::process_file_scan_task`, extending the `RowFilter` and `RowSelection` logic to take into account any `RowFilter` that results from equality deletes and any `RowSelection` that results from positional deletes.

Updates:
- Changed `DeleteFileManager` so that `get_positional_delete_indexes_for_data_file` returns a `RoaringTreemap` rather than a `Vec<usize>`. This was based on @liurenjie1024's recommendation in a comment on the v1 PR. It makes a lot of sense from a performance perspective and made it easier to implement `ArrowReader::build_deletes_row_selection` in the follow-up PR to this one, Scan Delete Support Part 3: `ArrowReader::build_deletes_row_selection` implementation #951.

Potential further enhancements:
- Keep the `DeleteFileManager` in the `ArrowReader` rather than per-task, so that delete files that apply to more than one task don't end up getting loaded and parsed twice.
- Use `ObjectCache` to ensure that the results of loading and parsing the same files persist across scans.
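
For readers unfamiliar with how positional deletes feed into a row selection, here is a minimal, std-only sketch of the idea. It is not the code in this PR or in #951: `Run` and `deletes_to_runs` are hypothetical names, `Run` stands in for a select/skip selector type, and a `BTreeSet<u64>` stands in for the `RoaringTreemap` of deleted row indices. It assumes the total row count of the data file is known.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a select/skip run over consecutive rows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Run {
    Select(u64),
    Skip(u64),
}

/// Convert the positionally deleted row indices of one data file into
/// alternating select/skip runs covering rows `0..total_rows`.
///
/// `deleted` is an assumption standing in for the bitmap of deleted rows;
/// indices at or beyond `total_rows` are ignored.
fn deletes_to_runs(deleted: &BTreeSet<u64>, total_rows: u64) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut next_row = 0u64; // first row not yet covered by any run

    // BTreeSet iterates in ascending order, so `pos >= next_row` always holds.
    for &pos in deleted.range(..total_rows) {
        if pos > next_row {
            // Live rows between the previous delete and this one.
            runs.push(Run::Select(pos - next_row));
        }
        // Extend the current skip run if this delete is adjacent to the
        // previous one, otherwise start a new skip run.
        match runs.last_mut() {
            Some(Run::Skip(n)) => *n += 1,
            _ => runs.push(Run::Skip(1)),
        }
        next_row = pos + 1;
    }

    // Any rows after the last delete are all live.
    if next_row < total_rows {
        runs.push(Run::Select(total_rows - next_row));
    }
    runs
}
```

For example, with rows 1, 2 and 5 deleted in an 8-row file, the sketch yields select 1, skip 2, select 2, skip 1, select 2. Representing the result as runs rather than per-row flags keeps the selection compact when deletes are sparse or clustered.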