Fuzzy dedup error if partition wise indices do not start from 0 #48

ayushdg · 2024-05-02T23:17:46Z

Describe the bug

By default, when reading from json/parquet files without a specified index, Curator typically reads each partition with an index ranging from 0 to len(partition). For dataframes where this is not the case, fuzzy dedup might fail.

Steps/Code to reproduce bug

A reproducer is in the tests of #46. The root cause seems to be coming from

NeMo-Curator/nemo_curator/utils/fuzzy_dedup_utils/merge_utils.py

Line 161 in fe9fd6f

left_df["_partitions"] = global_partitioning_index % parts_per_bucket_batch

where the lhs df might have different indices but the rhs starts from 0, resulting in a misaligned assignment.
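The mechanism can be illustrated with a minimal pandas sketch. This is not the Curator code path: it assumes plain pandas rather than the dataframe library Curator actually uses, and the data, index values, and `parts_per_bucket_batch = 4` are hypothetical stand-ins chosen only to show what happens when the two sides of such an assignment carry different indices.

```python
import pandas as pd

# Hypothetical stand-in for one partition whose index does not start at 0.
left_df = pd.DataFrame({"text": ["a", "b", "c"]}, index=[3, 4, 5])

# Hypothetical stand-in for a partitioning index that starts at 0.
global_partitioning_index = pd.Series([10, 11, 12])
parts_per_bucket_batch = 4

# Assigning a Series aligns on index labels. No labels match here,
# so every value in the new column ends up missing.
misaligned = left_df.copy()
misaligned["_partitions"] = global_partitioning_index % parts_per_bucket_batch

# Assigning the underlying array is positional and bypasses alignment.
# This is one possible workaround, not necessarily the fix the project adopts.
aligned = left_df.copy()
aligned["_partitions"] = (
    global_partitioning_index % parts_per_bucket_batch
).to_numpy()
```

In pandas the mismatch silently produces missing values; other dataframe backends may raise an error instead, which would match the failure described here.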

Expected behavior

No errors

ayushdg added the bug label May 2, 2024

ayushdg self-assigned this Jan 22, 2025

ayushdg mentioned this issue Jan 29, 2025

Consecutive execution of fuzzy deduplication on different columns fails with errors #501

Open
