Describe the bug
By default, when reading from JSON/Parquet files without a specified index, Curator typically gives each partition an index ranging from 0 up to len(partition). For dataframes whose index does not follow this pattern, fuzzy dedup might fail.
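To make the index condition concrete, here is a minimal pandas-only sketch. It does not use Curator; `has_default_index` is a hypothetical helper written for illustration, and the `reset_index` step is an assumed workaround, not a confirmed fix. It also assumes the default range excludes the upper bound, i.e. 0 through len(partition) - 1.

```python
import pandas as pd


def has_default_index(partition: pd.DataFrame) -> bool:
    """Hypothetical helper: True if the index is exactly 0, 1, ..., len(partition) - 1."""
    return partition.index.equals(pd.RangeIndex(len(partition)))


# A partition as it might look right after reading a file: index 0..len-1.
fresh = pd.DataFrame({"id": ["a", "b", "c"], "text": ["x", "y", "z"]})

# The same data after filtering out a row: the index now has a gap (0, 2).
filtered = fresh[fresh["text"] != "y"]

# Assumed workaround: rebuild the default index before running deduplication.
repaired = filtered.reset_index(drop=True)

print(has_default_index(fresh), has_default_index(filtered), has_default_index(repaired))
```

A partition like `filtered` is the kind of input described above: its index no longer matches the default range, which is the case where fuzzy dedup might fail.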
Steps/Code to reproduce bug
A reproducer is in the #46 tests. The root cause seems to come from:
NeMo-Curator/nemo_curator/utils/fuzzy_dedup_utils/merge_utils.py
Line 161 in fe9fd6f
Expected behavior
Fuzzy dedup completes without errors, regardless of the dataframe's index.