Add codepath for computing buckets without int conversion #326
Conversation
nemo_curator/modules/fuzzy_dedup.py (outdated)

import shutil
...
shutil.rmtree(write_path)
Not for this PR, but just a highlight from our Google Docs convo: this is a good place to leverage fsspec.
Agreed. Decided to go this route for now (since other places also use shutil). Aligned that the refactor to be more remote-friendly should leverage fsspec utilities where possible.
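For reference, a minimal sketch of what an fsspec-based cleanup could look like (write_path is a placeholder mirroring the variable in the diff; this refactor is explicitly out of scope for this PR):

```python
import fsspec

write_path = "./_intermediate_buckets.parquet"  # placeholder path

# Resolve the filesystem from the path so the same call works for local
# paths as well as remote ones like s3:// or gs://, instead of assuming
# a local disk the way shutil.rmtree does.
fs, resolved_path = fsspec.core.url_to_fs(write_path)
if fs.exists(resolved_path):
    fs.rm(resolved_path, recursive=True)
```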
shutil.rmtree(write_path)
...
return are_buckets_empty
Variable for tracking if all the buckets were empty
)
# Only check if buckets written so far are empty
if are_buckets_empty:
    are_buckets_empty = check_empty_buckets(write_path)
The reason we need to do this in the first place is that there's no way to know whether we're writing out an empty dataframe unless we persist it, or write it out, check the metadata, and then overwrite on the next iteration.
ds = dataset(bucket_path, format="parquet")
for fragment in ds.get_fragments():
    if fragment.metadata.num_rows > 0:
        return False
This logic can probably be simplified by using a global metadata file when writing out the parquet dataset (write_metadata_file=True). However, this had some issues in 24.10 (rapidsai/cudf#17177) and is only fixed in 24.12. Will open an issue to simplify this method once that's merged in.
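For context, a rough sketch of how the per-fragment loop could be simplified once a global metadata file is written alongside the dataset (this assumes a _metadata file at the dataset root and is not what this PR implements):

```python
import os

import pyarrow.parquet as pq


def check_empty_buckets(bucket_path):
    # With write_metadata_file=True, the aggregate row count is available
    # from the dataset-level _metadata file without opening each fragment.
    metadata = pq.read_metadata(os.path.join(bucket_path, "_metadata"))
    return metadata.num_rows == 0
```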
print(
    f"Stage{stage_num}: No potential duplicate documents found during LSH"
)
return None
Should this return None or an empty DocumentDataset with no IDs?
I prefer returning None. Empty DocumentDatasets might lead to unexplained errors downstream that could be tougher to debug/understand. Happy to hear counterpoints.
One thing that comes out of this is that I might update examples/FuzzyDedup.py to handle the case where the result returned is None.
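Something along these lines is what the updated example could look like (a sketch only; fuzzy_dup is assumed to be an already-configured FuzzyDuplicates module and output_dir a path chosen by the caller):

```python
duplicates = fuzzy_dup(dataset)

if duplicates is None:
    # FuzzyDuplicates returns None when LSH finds no potential duplicates,
    # so there is nothing to filter out downstream.
    print("No duplicates found; skipping the removal step")
else:
    # Proceed with the usual handling of the returned duplicate ids.
    duplicates.df.to_parquet(output_dir, write_index=False)
```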
Makes sense, but then for Sequential I think we might want to handle that behavior too?
I haven't seen Sequential being used directly with FuzzyDuplicates, since the results cannot be processed downstream by any of the other modules without using them to filter out the duplicates first. I'm not sure how to handle this use case. But longer term, we would probably want to add a FuzzyDeduplicate class that calls FuzzyDuplicates and also handles removal.
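For illustration, a very rough sketch of the kind of wrapper being described (the class name and the removal logic are hypothetical and not part of this PR; pandas-backed dask dataframes are assumed for brevity):

```python
from nemo_curator.datasets import DocumentDataset


class FuzzyDeduplicate:
    """Hypothetical wrapper: find fuzzy duplicates, then remove them."""

    def __init__(self, fuzzy_duplicates, id_field="id"):
        # fuzzy_duplicates: an already configured FuzzyDuplicates module.
        self.fuzzy_duplicates = fuzzy_duplicates
        self.id_field = id_field

    def __call__(self, dataset: DocumentDataset) -> DocumentDataset:
        duplicates = self.fuzzy_duplicates(dataset)
        if duplicates is None:
            # No candidate duplicates were found; return the input unchanged.
            return dataset
        # NOTE: a real implementation would keep one document per duplicate
        # group; dropping every flagged id keeps this sketch short.
        duplicate_ids = duplicates.df[self.id_field].compute()
        filtered = dataset.df[~dataset.df[self.id_field].isin(duplicate_ids)]
        return DocumentDataset(filtered)
```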
Mostly looks good to me. I have two nitpicks, but the logic seems good.
LGTM, thanks!
* Add codepath for computing buckets without int conversion
* Refactor write logic into its own method
* Update cli script
* Add tests
* Update docs
* Update fuzzy_deduplication example
* Address reviews
* update docs
* Update arg name in tests

Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Rucha Apte <[email protected]>
* add changes from #389
* add scripts files
* add changes from #326
* run black
* re add ParallelScoreFilter
* remove _MapBuckets and _Shuffle from nemo_curator path
* update api doc
* add changes from #445
* Add changes from #478
* final nits

Signed-off-by: Sarah Yurick <[email protected]>
Description
This PR has 2 enhancements:
- Adds a codepath for computing buckets without int conversion.
- Changes to map_buckets and the following steps in the fp-check path.

Usage
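A minimal usage sketch (the module and config class names follow the library's fuzzy dedup API, but the specific config fields shown here, including false_positive_check, are assumptions and may differ from the final API in this PR):

```python
import dask_cudf

from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.datasets import DocumentDataset

# Illustrative config; setting false_positive_check=False is assumed to
# exercise the new bucket codepath that skips the int-id conversion.
config = FuzzyDuplicatesConfig(
    cache_dir="./fuzzy_cache",
    id_field="id",
    text_field="text",
    false_positive_check=False,
)

dataset = DocumentDataset(dask_cudf.read_parquet("./input_data"))
fuzzy_dup = FuzzyDuplicates(config=config)
duplicates = fuzzy_dup(dataset)  # may be None when no candidates are found
```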
Checklist