Add codepath for computing buckets without int conversion #326
Variable for tracking if all the buckets were empty
We need this in the first place because there's no way to know whether we're writing out an empty dataframe unless we persist it, or write it out, check the metadata, and then overwrite on the next iteration.
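A minimal sketch of the tracking idea, under stated assumptions: `write_batch` and `write_all_batches` are hypothetical names, plain lists of rows stand in for dataframes, and JSON lines files stand in for the parquet output. It is not the PR's implementation; it only shows a flag that stays true until some write reports a non-zero row count.

```python
import json
import tempfile
from pathlib import Path


def write_batch(rows, path):
    """Hypothetical stand-in for a lazy write.

    Returns the number of rows written, which in the real pipeline
    would only be known after the write (e.g. from file metadata).
    """
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return len(rows)


def write_all_batches(batches, out_dir):
    """Write every batch and report whether all of them were empty."""
    out_dir = Path(out_dir)
    all_empty = True  # flips to False once any batch contains rows
    for i, rows in enumerate(batches):
        n_written = write_batch(rows, out_dir / f"part.{i}.jsonl")
        if n_written > 0:
            all_empty = False
    return all_empty


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(write_all_batches([[], [{"id": 1}]], tmp))
```

The caller can then use the returned flag to decide whether there is any result worth reading back.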
Should this return `None` or an empty `DocumentDataset` with no IDs?
I prefer returning `None`. Empty `DocumentDataset`s might lead to unexplained errors downstream that are tougher to debug and understand. Happy to hear counterpoints.

One thing that comes up from this is that I might update `examples/FuzzyDedup.py` to handle the case where the returned result is `None`.
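For illustration, a caller could guard against a `None` result roughly as follows. This is a sketch under assumptions, not the code in `examples/FuzzyDedup.py`: `find_duplicates` and `remove_duplicates` are hypothetical helpers, documents are a plain dict, and exact text matching stands in for fuzzy matching.

```python
from typing import Dict, Optional


def find_duplicates(docs: Dict[str, str]) -> Optional[Dict[str, int]]:
    """Hypothetical stand-in for a duplicate finder.

    Maps each duplicated document id to a group id, or returns None
    when no duplicates were found (exact match only, for illustration).
    """
    by_text: Dict[str, list] = {}
    for doc_id, text in docs.items():
        by_text.setdefault(text, []).append(doc_id)

    result: Dict[str, int] = {}
    dup_groups = [ids for ids in by_text.values() if len(ids) > 1]
    for group_id, ids in enumerate(dup_groups):
        for doc_id in ids:
            result[doc_id] = group_id
    return result or None


def remove_duplicates(docs: Dict[str, str]) -> Dict[str, str]:
    """Keep one document per duplicate group; handle the None case."""
    duplicates = find_duplicates(docs)
    if duplicates is None:
        # Nothing to remove: return the input unchanged.
        return dict(docs)

    kept: Dict[str, str] = {}
    seen_groups = set()
    for doc_id, text in docs.items():
        group = duplicates.get(doc_id)
        if group is None:
            kept[doc_id] = text  # not part of any duplicate group
        elif group not in seen_groups:
            seen_groups.add(group)  # first member of the group is kept
            kept[doc_id] = text
    return kept
```

Checking for `None` explicitly keeps the "no duplicates" case visible at the call site instead of letting an empty result flow into later steps.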
Makes sense, but then for `Sequential` I think we might want to handle that behavior too?
I haven't seen `Sequential` being used directly with `FuzzyDuplicates`, since the results cannot be processed downstream by any of the other modules without first filtering out the duplicates. I'm not sure how to handle this use case. But longer term, we would probably want to add a `FuzzyDeduplicate` class that calls `FuzzyDuplicates` and also handles removal.
This logic can probably be simplified by using a global metadata file when writing out the parquet dataset (`write_metadata_file=True`). However, this had some issues in 24.10 (rapidsai/cudf#17177) and is only fixed in 24.12. Will open an issue to simplify this method once that's merged in.