Scan Delete Support Part 2: introduce DeleteFileManager skeleton. Use in ArrowReader #950

Open. Wants to merge 2 commits into base: main

@sdd sdd commented Feb 8, 2025

Second part of delete file read support. See #630.

This PR provides the basis for delete file support within ArrowReader.

DeleteFileManager is introduced, in skeleton form. Full implementation of its behaviour will be submitted in follow-up PRs.

DeleteFileManager is responsible for loading and parsing positional and equality delete files from FileIO. Once delete files for a task have been loaded and parsed, ArrowReader::process_file_scan_task uses the resulting DeleteFileManager in two places:

  • DeleteFileManager::get_positional_delete_indexes_for_data_file is passed a data file path and returns an Option<RoaringTreeMap> containing the indices of all rows that are positionally deleted in that data file (or None if there are none)
  • DeleteFileManager::build_delete_predicate is invoked with the schema from the file scan task. It returns an Option<BoundPredicate>: each applicable equality delete is transformed into a predicate, the predicates are logically joined into a single predicate, and the result is bound to the schema (or None if there are no applicable equality deletes)
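
To make the shape of these two calls concrete, here is a minimal, std-only sketch. It is not the code in this PR: `DeleteFileManagerSketch`, the `Predicate` enum, the `add_*` helpers and the `&[&str]` schema are all hypothetical stand-ins, a `BTreeSet<u64>` stands in for the roaring treemap, and joining the equality deletes as an AND of "not equal" terms is an assumption about how the predicates are combined.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for the bitmap of positionally deleted row indices.
type DeletedPositions = BTreeSet<u64>;

/// Hypothetical stand-in for a bound predicate.
#[derive(Debug, Clone, PartialEq)]
enum Predicate {
    /// Keep rows whose `column` differs from `value`.
    NotEq { column: String, value: i64 },
    /// Keep rows for which every inner predicate holds.
    And(Vec<Predicate>),
}

/// Hypothetical sketch of the manager: deletes are added by hand here,
/// whereas the real manager would load and parse them from delete files.
#[derive(Default)]
struct DeleteFileManagerSketch {
    positional: HashMap<String, DeletedPositions>,
    equality: Vec<(String, i64)>,
}

impl DeleteFileManagerSketch {
    fn add_positional_delete(&mut self, data_file_path: &str, pos: u64) {
        self.positional
            .entry(data_file_path.to_string())
            .or_default()
            .insert(pos);
    }

    fn add_equality_delete(&mut self, column: &str, value: i64) {
        self.equality.push((column.to_string(), value));
    }

    /// Returns the deleted positions for one data file, or None if there are none.
    fn get_positional_delete_indexes_for_data_file(
        &self,
        data_file_path: &str,
    ) -> Option<&DeletedPositions> {
        self.positional
            .get(data_file_path)
            .filter(|positions| !positions.is_empty())
    }

    /// Turns every equality delete whose column is in the schema into a
    /// predicate and joins them, or returns None if none apply.
    fn build_delete_predicate(&self, schema_columns: &[&str]) -> Option<Predicate> {
        let parts: Vec<Predicate> = self
            .equality
            .iter()
            .filter(|(column, _)| schema_columns.contains(&column.as_str()))
            .map(|(column, value)| Predicate::NotEq {
                column: column.clone(),
                value: *value,
            })
            .collect();
        if parts.is_empty() {
            None
        } else {
            Some(Predicate::And(parts))
        }
    }
}
```

Both methods return None when nothing applies, so a caller can skip the extra filtering work entirely for data files that have no deletes.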

This PR integrates the skeleton of the DeleteFileManager into ArrowReader::process_file_scan_task, extending the RowFilter and RowSelection logic to take into account any RowFilter that results from equality deletes and any RowSelection that results from positional deletes.
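
As a rough illustration of how positional deletes can become a row selection, the sketch below converts a sorted list of deleted row indices into alternating select/skip runs. `Run` is a local stand-in that only mirrors the general shape of a row selector (a row count plus a skip flag), and `runs_from_deletes` and `push_run` are hypothetical helpers, not functions from this PR or from the parquet crate.

```rust
/// Local stand-in for one run of a row selection.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Appends a run, merging it into the previous run when both have the same kind.
fn push_run(runs: &mut Vec<Run>, row_count: usize, skip: bool) {
    if row_count == 0 {
        return;
    }
    if let Some(last) = runs.last_mut() {
        if last.skip == skip {
            last.row_count += row_count;
            return;
        }
    }
    runs.push(Run { row_count, skip });
}

/// Builds select/skip runs covering `total_rows` rows.
/// `deleted` is assumed to be sorted ascending; duplicates are tolerated
/// and positions at or beyond `total_rows` are ignored.
fn runs_from_deletes(deleted: &[usize], total_rows: usize) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut next = 0usize; // first row not yet covered by a run
    for &pos in deleted {
        if pos >= total_rows {
            break;
        }
        if pos < next {
            continue; // duplicate of a position already handled
        }
        push_run(&mut runs, pos - next, false); // surviving rows before the delete
        push_run(&mut runs, 1, true); // the deleted row itself
        next = pos + 1;
    }
    push_run(&mut runs, total_rows - next, false); // surviving tail
    runs
}
```

For example, deleting rows 1, 2 and 5 from an 8-row file gives: select 1, skip 2, select 2, skip 1, select 2. A selection like this would still need to be combined with any selection that already exists for the task.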

Updates:

Potential further enhancements:

  • Instantiate and store the DeleteFileManager in the ArrowReader rather than per-task so that delete files that apply to more than one task don't end up getting loaded and parsed twice
  • Go one step further and move loading of delete files, and parsing of positional delete files, into ObjectCache to ensure that loading and parsing of the same files persists across scans
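
The first enhancement amounts to a load-once cache keyed by delete file path. A minimal std-only sketch of that idea follows; `DeleteFileCache`, `get_or_load` and the `Vec<u64>` payload are hypothetical names chosen for illustration and say nothing about how ObjectCache or the real manager would do it.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for the parsed contents of one positional delete file.
type ParsedDeletes = Arc<Vec<u64>>;

/// Hypothetical cache shared by all tasks of a reader.
#[derive(Default)]
struct DeleteFileCache {
    entries: Mutex<HashMap<String, ParsedDeletes>>,
}

impl DeleteFileCache {
    /// Returns the cached result for `path`, calling `load` only on a miss.
    /// The lock is held while loading, which keeps the sketch simple but
    /// would serialize loads in a real reader.
    fn get_or_load<F>(&self, path: &str, load: F) -> ParsedDeletes
    where
        F: FnOnce(&str) -> Vec<u64>,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(path) {
            return Arc::clone(hit);
        }
        let parsed = Arc::new(load(path));
        entries.insert(path.to_string(), Arc::clone(&parsed));
        parsed
    }
}
```

Two tasks that reference the same delete file then share a single `Arc` instead of each parsing the file again.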

@sdd sdd changed the title feat: introduce DeleteFileManager skeleton. Use in ArrowReader feat: introduce DeleteFileManager skeleton. Use in ArrowReader Feb 8, 2025
@sdd sdd force-pushed the feat/introduce-delete-file-manager branch 3 times, most recently from 6cbf041 to 4c0c7f9 Compare February 8, 2025 14:07
sdd added 2 commits February 10, 2025 09:00
* refactor: only pass row groups metadata rather than entire
  parquet metadata to . This
  makes it easier to test  as
  we don't need to mock up a full
@sdd sdd force-pushed the feat/introduce-delete-file-manager branch from 9d47546 to 4c2ef08 Compare February 10, 2025 09:00
sdd commented Feb 10, 2025

@liurenjie1024, @Xuanwo, @Fokko - this is ready for review when any of you get a chance. Thanks! :-)

@sdd sdd changed the title feat: introduce DeleteFileManager skeleton. Use in ArrowReader Scan Delete Support Part 2: introduce DeleteFileManager skeleton. Use in ArrowReader Feb 21, 2025