Can datasets remove duplicated rows? #2514
Comments
Hi! For now this is probably the best option. Do you know of any deduplication method that works on arbitrarily big datasets without filling up RAM?
Yes, I'd like to work on this feature once I'm done with #2500, but first I have to do some research and see whether the implementation would be too complex. In the meantime, maybe this lib can help. However, note that this lib operates directly on pyarrow tables and relies only on …
Great if this can be done. Thanks!! Not sure if you are asking me; in any case, I don't know of any, unfortunately :( In practice, if the data is really large we normally do it with Spark (just for info; I understand this may not be useful for developing this library).
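For reference, the Spark route mentioned above usually boils down to a single `dropDuplicates` call. A minimal sketch (the paths and column name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup").getOrCreate()

# Placeholder input path and column name
df = spark.read.parquet("path/to/corpus.parquet")

# Exact deduplication on one column; dropDuplicates() without arguments
# compares full rows instead.
df_dedup = df.dropDuplicates(["col1"])

df_dedup.write.mode("overwrite").parquet("path/to/corpus_dedup.parquet")
```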
Hello, I'm also interested in this feature. Could we use a similar trick to the one above, but with a better hashing algorithm like SHA? We could also use a Bloom filter; should we care a lot about collisions in this case?
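A minimal sketch of the SHA idea, reusing the `map`/`filter` pattern from the snippets below; the `"text"` column is a made-up example. `hashlib.sha256` gives a stable digest across processes, unlike Python's built-in `hash()`:

```python
import hashlib
from datasets import Dataset

def sha256_hash(example):
    # Stable content digest (Python's built-in hash() is salted per process).
    return {"hash": hashlib.sha256(example["text"].encode("utf-8")).hexdigest()}

seen = set()

def is_first_occurrence(example):
    # Keep only the first row carrying each digest.
    if example["hash"] in seen:
        return False
    seen.add(example["hash"])
    return True

ds = Dataset.from_dict({"text": ["apple", "orange", "apple", "pear"]})
deduped = ds.map(sha256_hash).filter(is_first_occurrence)
# Note: the shared `seen` set only works single-process (num_proc=1).
```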
For reference, we can get a solution fairly easily if we assume that we can hold all unique values in memory:

```python
from typing import Any
from datasets import Dataset
from itertools import cycle
from functools import partial

memory = set()

def is_unique(elem: Any, column: str, memory: set) -> bool:
    if elem[column] in memory:
        return False
    else:
        memory.add(elem[column])
        return True

# Example dataset
ds = Dataset.from_dict({"col1": [sent for i, sent in zip(range(10), cycle(["apple", "orange", "pear"]))],
                        "col2": [i % 5 for i in range(10)]})

# Drop duplicates in `ds` on "col1"
ds2 = ds.filter(partial(is_unique, column="col1", memory=memory))
```

Of course, we can improve the API so that we can introduce …
An approach that works assuming you can hold all the unique document hashes in memory:

```python
from datasets import load_dataset

def get_hash(example):
    """Get hash of content field."""
    return {"hash": hash(example["content"])}  # can use any hashing function here

def check_uniques(example, uniques):
    """Check if current hash is still in set of unique hashes and remove if true."""
    if example["hash"] in uniques:
        uniques.remove(example["hash"])
        return True
    else:
        return False

ds = load_dataset("some_dataset")
ds = ds.map(get_hash)
uniques = set(ds.unique("hash"))
ds_filter = ds.filter(check_uniques, fn_kwargs={"uniques": uniques})
```

If the …
@lvwerra hey, could you tell me how reliable this deduplication method is? I am currently using the same deduplication strategy to deduplicate a large text corpus to pretrain LLMs (~11B to 20B). I just needed to make sure this strategy would be fine on large datasets for LLM pretraining.
Hi @StephennFernandes, I'm also trying to pretrain an LLM and need to do deduplication for my dataset.
Hey @Manel-Hik, the following is a simpler yet really effective deduplication script that I have used in the past. Given that I had a limited training corpus for the languages I wanted to train on, I relied on this code: https://huggingface.co/datasets/Finnish-NLP/mc4_fi_cleaned/blob/main/deduplicate.py

For more robust and stronger deduplication, refer to this newly released Hugging Face repo: https://github.com/huggingface/datatrove
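Not datatrove's actual API, but for readers who want to see what fuzzy (near-duplicate) deduplication looks like in principle, here is a small sketch using the third-party `datasketch` library with MinHash + LSH; the documents and the 0.8 similarity threshold are made up for illustration:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",   # near-duplicate of the first
    "an entirely different sentence about deduplication",
]

lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for i, doc in enumerate(docs):
    m = minhash(doc)
    if lsh.query(m):       # similar to something we already kept -> skip
        continue
    lsh.insert(str(i), m)
    kept.append(doc)

print(kept)  # the second document is likely dropped as a near-duplicate of the first
```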
Thanks a lot! Sure, I will check it, @StephennFernandes.
Hi, are there any updates? Thanks!
Update July 2024: PyArrow now supports …, so if we want to move in this direction we can :) Is that something we want to do? Would be happy to contribute.
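The comment above leaves the exact PyArrow feature unnamed. As one possible illustration of Arrow-level deduplication, here is a sketch that groups a table by the key column and keeps one value per group, assuming a recent PyArrow release where the `"first"` group-by aggregation is available (older releases could substitute `"min"`):

```python
import pyarrow as pa

table = pa.table({
    "col1": ["apple", "orange", "apple", "pear"],
    "col2": [1, 2, 3, 4],
})

# One output row per distinct "col1" value; the aggregated column comes back
# as "col2_first". Row order and which duplicate survives are not guaranteed.
deduped = table.group_by("col1").aggregate([("col2", "first")])
print(deduped.to_pydict())
```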
Is your feature request related to a problem? Please describe.
I find myself relying more and more on datasets just to do all the preprocessing. One thing, however: for removing duplicated rows, I couldn't find out how, and I am always converting datasets to pandas to do that.
Describe the solution you'd like
Have a "remove duplicated rows" functionality.
Describe alternatives you've considered
Convert the dataset to pandas, remove duplicates, and convert back... (see the sketch below).
Additional context
no
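For reference, a minimal sketch of the pandas round-trip workaround described under "alternatives" above; the column names are placeholders:

```python
from datasets import Dataset

ds = Dataset.from_dict({"col1": ["apple", "orange", "apple"], "col2": [1, 2, 3]})

# Round-trip through pandas: drop duplicate rows, then convert back.
df = ds.to_pandas().drop_duplicates(subset=["col1"])   # or drop_duplicates() for full-row duplicates
ds_dedup = Dataset.from_pandas(df, preserve_index=False)

print(ds_dedup.to_dict())  # {'col1': ['apple', 'orange'], 'col2': [1, 2]}
```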