Core: Rename DeleteFileHolder to PendingDeleteFile / Optimize duplicate data/delete file detection #11254

Merged: 3 commits, Oct 15, 2024

Contributor

@nastra nastra commented Oct 4, 2024

depends on #11158

@nastra nastra force-pushed the delete-file-holder-improvements branch from 4d42f18 to 6d26455 Compare October 14, 2024 17:46
@github-actions github-actions bot removed the NESSIE label Oct 14, 2024
private final Map<PartitionSpec, List<DataFile>> newDataFilesBySpec = Maps.newHashMap();
private final DataFileSet newDataFiles = DataFileSet.create();
private final DeleteFileSet newDeleteFiles = DeleteFileSet.create();
private final Map<PartitionSpec, DataFileSet> newDataFilesBySpec = Maps.newHashMap();
Contributor

Hmm... When did we start using PartitionSpec as keys? This makes all operations more expensive. We always used Integer when indexing by specs, like PartitionMap or even newDeleteFilesBySpec below.

Contributor Author

looks like this was introduced with #9860

Contributor Author

I can follow up on this in a separate PR and change it to Map<Integer, DataFileSet>

Contributor

Yep, I think we should. Thanks!
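
For illustration, the indexing pattern discussed in this thread can be sketched roughly as follows. This is a minimal, self-contained sketch and not the Iceberg code: `FileRef` and `indexBySpecId` are hypothetical stand-ins, and the cost argument (hashing and comparing an `Integer` is cheaper than doing the same for a full spec object) is an assumption about what "more expensive" refers to in the review comment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpecIdIndexSketch {
  // Hypothetical stand-in for a data file; the real DataFile interface is much richer.
  static final class FileRef {
    final String path;
    final int specId;

    FileRef(String path, int specId) {
      this.path = path;
      this.specId = specId;
    }
  }

  // Groups files by integer spec ID, so each map operation hashes and compares
  // an Integer instead of a full spec object.
  static Map<Integer, List<FileRef>> indexBySpecId(List<FileRef> files) {
    Map<Integer, List<FileRef>> bySpecId = new HashMap<>();
    for (FileRef file : files) {
      bySpecId.computeIfAbsent(file.specId, id -> new ArrayList<>()).add(file);
    }
    return bySpecId;
  }

  public static void main(String[] args) {
    List<FileRef> files = List.of(
        new FileRef("a.parquet", 0),
        new FileRef("b.parquet", 1),
        new FileRef("c.parquet", 0));
    Map<Integer, List<FileRef>> bySpecId = indexBySpecId(files);
    System.out.println(bySpecId.get(0).size() + " file(s) for spec 0");
    System.out.println(bySpecId.get(1).size() + " file(s) for spec 1");
  }
}
```

When the spec object itself is needed later, it would have to be resolved from the ID through whatever lookup the caller already has; that step is left out of the sketch.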

@nastra nastra force-pushed the delete-file-holder-improvements branch from bf02252 to f26cb9b Compare October 15, 2024 07:18
@nastra nastra marked this pull request as ready for review October 15, 2024 07:21
@nastra nastra requested a review from aokolnychyi October 15, 2024 07:21
@nastra nastra force-pushed the delete-file-holder-improvements branch from f26cb9b to db770dc Compare October 15, 2024 08:18
RollingManifestWriter<DeleteFile> writer = newRollingDeleteManifestWriter(spec);

try (RollingManifestWriter<DeleteFile> closableWriter = writer) {
for (DeleteFileHolder file : files) {
for (DeleteFile file : files) {
Preconditions.checkState(
Contributor

Question: Should we use checkState or checkArgument? Also, any chance we can shorten the error message to stay on 1 line?

Contributor Author

I'm fine either way. My reasoning for using checkState initially was that something internal must have gone wrong to reach that state, but I guess it's also possible that some other internal code just passes delete files that aren't a PendingDeleteFile.
Also, getting this onto a single line is hard, since the full line has 113 chars and I don't know what to omit from the error msg to make it fit into 100 chars.
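
To make the checkState versus checkArgument trade-off concrete, here is a small sketch. It assumes the two checks behave like Guava's `Preconditions` methods of the same name, where `checkArgument` throws `IllegalArgumentException` and `checkState` throws `IllegalStateException`. The local helpers, the stand-in types, and the error message are all hypothetical and are not taken from the PR.

```java
public class PreconditionSketch {
  // Stand-in types for illustration only; these are not the real Iceberg classes.
  interface DeleteFile {}

  static final class PendingDeleteFile implements DeleteFile {}

  static final class OtherDeleteFile implements DeleteFile {}

  // Hypothetical message, kept short enough to fit on one line.
  static final String MSG = "Invalid delete file: must be PendingDeleteFile";

  // Local helper mirroring checkArgument: the caller passed a bad value.
  static void checkArgument(boolean ok, String message) {
    if (!ok) {
      throw new IllegalArgumentException(message);
    }
  }

  // Local helper mirroring checkState: the object reached an invalid state.
  static void checkState(boolean ok, String message) {
    if (!ok) {
      throw new IllegalStateException(message);
    }
  }

  // Treats a wrong file type as a caller error.
  static void requireAsArgument(DeleteFile file) {
    checkArgument(file instanceof PendingDeleteFile, MSG);
  }

  // Treats a wrong file type as an internal invariant violation.
  static void requireAsState(DeleteFile file) {
    checkState(file instanceof PendingDeleteFile, MSG);
  }

  public static void main(String[] args) {
    try {
      requireAsArgument(new OtherDeleteFile());
    } catch (IllegalArgumentException e) {
      System.out.println("argument check failed: " + e.getMessage());
    }
    try {
      requireAsState(new OtherDeleteFile());
    } catch (IllegalStateException e) {
      System.out.println("state check failed: " + e.getMessage());
    }
  }
}
```

The condition and message are identical in both variants; only the exception type differs, which is what tells a reader whether the caller or the internal bookkeeping is considered at fault.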

Contributor

@aokolnychyi aokolnychyi left a comment

LGTM. Thanks, @nastra!

@nastra nastra force-pushed the delete-file-holder-improvements branch from db770dc to 371e472 Compare October 15, 2024 17:12
Contributor Author

nastra commented Oct 15, 2024

thanks @aokolnychyi for the review

@nastra nastra merged commit 33b33f3 into apache:main Oct 15, 2024
49 checks passed
@nastra nastra deleted the delete-file-holder-improvements branch October 15, 2024 18:02
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
2 participants