Add codepath for computing buckets without int conversion #326
Conversation
nemo_curator/modules/fuzzy_dedup.py (outdated)

import shutil
...
shutil.rmtree(write_path)
Not for this PR, but just a highlight from our Google Docs convo: this is a good place to leverage fsspec.
Agreed. Decided to go this route for now (since other places also use shutil). Aligned that the refactor to be more remote-friendly should leverage fsspec utilities where possible.
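For reference, a minimal sketch of what an fsspec-based cleanup could look like (write_path is a placeholder mirroring the variable in the diff; this refactor is explicitly out of scope for this PR):

```python
import fsspec

write_path = "./_intermediate_buckets.parquet"  # placeholder path

# Resolve the filesystem from the path so the same call works for local
# paths as well as remote ones like s3:// or gs://, instead of assuming
# a local disk the way shutil.rmtree does.
fs, resolved_path = fsspec.core.url_to_fs(write_path)
if fs.exists(resolved_path):
    fs.rm(resolved_path, recursive=True)
```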
shutil.rmtree(write_path)
...
return are_buckets_empty
Variable for tracking if all the buckets were empty
)
# Only check if buckets written so far are empty
if are_buckets_empty:
    are_buckets_empty = check_empty_buckets(write_path)
The reason we need to do this in the first place is that there's no way to know whether we're writing out an empty dataframe unless we persist it, or write it out, check the metadata, and then overwrite on the next iteration.
ds = dataset(bucket_path, format="parquet")
for fragment in ds.get_fragments():
    if fragment.metadata.num_rows > 0:
        return False
This logic can probably be simplified by using a global metadata file when writing out the parquet dataset (write_metadata_file=True). However, this had some issues in 24.10 (rapidsai/cudf#17177) and is only fixed in 24.12. Will open an issue to simplify this method once that's merged in.
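For context, a rough sketch of how the per-fragment loop could be simplified once a global metadata file is written alongside the dataset (this assumes a _metadata file at the dataset root and is not what this PR implements):

```python
import os

import pyarrow.parquet as pq


def check_empty_buckets(bucket_path):
    # With write_metadata_file=True, the aggregate row count is available
    # from the dataset-level _metadata file without opening each fragment.
    metadata = pq.read_metadata(os.path.join(bucket_path, "_metadata"))
    return metadata.num_rows == 0
```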
print(
    f"Stage{stage_num}: No potential duplicate documents found during LSH"
)
return None
Should this return None or an empty DocumentDataset with no IDs?
I prefer returning None. Empty DocumentDatasets might lead to unexplained errors downstream that could be tougher to debug/understand. Happy to hear counterpoints.
One thing that comes out of this is that I might update examples/FuzzyDedup.py to handle the case where the result returned is None.
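Something along these lines is what the updated example could look like (a sketch only; fuzzy_dup is assumed to be an already-configured FuzzyDuplicates module and output_dir a path chosen by the caller):

```python
duplicates = fuzzy_dup(dataset)

if duplicates is None:
    # FuzzyDuplicates returns None when LSH finds no potential duplicates,
    # so there is nothing to filter out downstream.
    print("No duplicates found; skipping the removal step")
else:
    # Proceed with the usual handling of the returned duplicate ids.
    duplicates.df.to_parquet(output_dir, write_index=False)
```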
Makes sense, but then for Sequential I think we might want to handle that behavior too?
I haven't seen Sequential being used directly with FuzzyDuplicates, since the results cannot be processed downstream by any of the other modules without using them to filter out the duplicates first. I'm not sure how to handle this use case. But longer term, we would probably want to add a FuzzyDeduplicate class that calls FuzzyDuplicates and also handles removal.
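For illustration, a very rough sketch of the kind of wrapper being described (the class name and the removal logic are hypothetical and not part of this PR; pandas-backed dask dataframes are assumed for brevity):

```python
from nemo_curator.datasets import DocumentDataset


class FuzzyDeduplicate:
    """Hypothetical wrapper: find fuzzy duplicates, then remove them."""

    def __init__(self, fuzzy_duplicates, id_field="id"):
        # fuzzy_duplicates: an already configured FuzzyDuplicates module.
        self.fuzzy_duplicates = fuzzy_duplicates
        self.id_field = id_field

    def __call__(self, dataset: DocumentDataset) -> DocumentDataset:
        duplicates = self.fuzzy_duplicates(dataset)
        if duplicates is None:
            # No candidate duplicates were found; return the input unchanged.
            return dataset
        # NOTE: a real implementation would keep one document per duplicate
        # group; dropping every flagged id keeps this sketch short.
        duplicate_ids = duplicates.df[self.id_field].compute()
        filtered = dataset.df[~dataset.df[self.id_field].isin(duplicate_ids)]
        return DocumentDataset(filtered)
```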
Mostly looks good to me. I have two nitpicks, but the logic seems good.
LGTM, thanks!
* Add codepath for computing buckets without int conversion
* Refactor write logic into its own method
* Update cli script
* Add tests
* Update docs
* Update fuzzy_deduplication example
* Address reviews
* update docs
* Update arg name in tests

Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Rucha Apte <[email protected]>
* add changes from #389
* add scripts files
* add changes from #326
* run black
* re add ParallelScoreFilter
* remove _MapBuckets and _Shuffle from nemo_curator path
* update api doc
* add changes from #445
* Add changes from #478
* final nits

Signed-off-by: Sarah Yurick <[email protected]>
Description
This PR has 2 enhancements:
- Adds a codepath for computing buckets without int conversion.
- Changes to map_buckets and the following steps in the fp-check path.

Usage
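A minimal usage sketch (the module and config class names follow the library's fuzzy dedup API, but the specific config fields shown here, including false_positive_check, are assumptions and may differ from the final API in this PR):

```python
import dask_cudf

from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.datasets import DocumentDataset

# Illustrative config; setting false_positive_check=False is assumed to
# exercise the new bucket codepath that skips the int-id conversion.
config = FuzzyDuplicatesConfig(
    cache_dir="./fuzzy_cache",
    id_field="id",
    text_field="text",
    false_positive_check=False,
)

dataset = DocumentDataset(dask_cudf.read_parquet("./input_data"))
fuzzy_dup = FuzzyDuplicates(config=config)
duplicates = fuzzy_dup(dataset)  # may be None when no candidates are found
```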
Checklist