
Can datasets remove duplicated rows? #2514

Open
liuxinglan opened this issue Jun 17, 2021 · 12 comments

Labels
enhancement New feature or request

Comments

liuxinglan commented Jun 17, 2021

Is your feature request related to a problem? Please describe.
I find myself relying more and more on datasets to do all my preprocessing. One thing I couldn't figure out, however, is how to remove duplicated rows, so I always end up converting the dataset to pandas to do that.

Describe the solution you'd like
A "remove duplicated rows" functionality.

Describe alternatives you've considered
Convert the dataset to pandas, remove the duplicates, and convert back (see the sketch below)...

Additional context
None.
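
For reference, a minimal sketch of the pandas round-trip workaround (the column names are just for illustration):

from datasets import Dataset

# Toy dataset with one exact duplicate row
ds = Dataset.from_dict({"text": ["a", "b", "a"], "label": [0, 1, 0]})

# Round-trip through pandas to drop exact duplicates
df = ds.to_pandas()
deduped = Dataset.from_pandas(df.drop_duplicates(), preserve_index=False)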

liuxinglan added the enhancement (New feature or request) label on Jun 17, 2021
lhoestq (Member) commented Jun 18, 2021

Hi! For now this is probably the best option.
We might add a feature like this in the future as well.

Do you know any deduplication method that works on arbitrarily big datasets without filling up RAM?
Otherwise we could do the deduplication in memory like pandas, but I feel like this is going to be limiting for some cases.

mariosasko (Collaborator) commented

Yes, I'd like to work on this feature once I'm done with #2500, but first I have to do some research and see whether the implementation would be too complex.

In the meantime, maybe this lib can help. However, note that this lib operates directly on pyarrow tables and relies only on hash to find duplicates (e.g. -1 and -2 have the same hash in Python 3, so this lib will treat them as duplicates), which doesn't make much sense.
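
As a quick illustration of the hash collision mentioned above (CPython-specific behavior):

# In CPython, -1 is reserved internally as an error sentinel for hash functions,
# so hash(-1) is remapped to -2 and collides with hash(-2).
assert hash(-1) == hash(-2) == -2
assert -1 != -2  # distinct values, identical hashes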

liuxinglan (Author) commented

> Hi! For now this is probably the best option.
> We might add a feature like this in the future as well.
>
> Do you know any deduplication method that works on arbitrarily big datasets without filling up RAM?
> Otherwise we could do the deduplication in memory like pandas, but I feel like this is going to be limiting for some cases.

Great if this can be done. Thanks!!

Not sure if you are asking me. In any case, I don't know of any, unfortunately :( In practice, if the data is really large we normally do it with Spark (just for info; I understand this is not directly useful for developing this library).

Dref360 (Contributor) commented Oct 7, 2021

Hello,

I'm also interested in this feature.
Has there been progress on this issue?

Could we use a similar trick to the one above, but with a better hashing algorithm like SHA (sketched below)?

We could also use a Bloom filter; should we worry much about collisions in this case?
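
A minimal sketch of what a SHA-based variant could look like, assuming the deduplication key is a single text column (the column name "text" is just an example):

import hashlib
from datasets import Dataset

def add_sha256(example):
    # Content-based fingerprint: unlike Python's built-in hash(), collisions are
    # practically impossible and the digest is stable across runs and processes.
    digest = hashlib.sha256(example["text"].encode("utf-8")).hexdigest()
    return {"sha256": digest}

ds = Dataset.from_dict({"text": ["apple", "orange", "apple"]})
ds = ds.map(add_sha256)

seen = set()
def first_occurrence(example):
    # Keep only the first row for each fingerprint
    if example["sha256"] in seen:
        return False
    seen.add(example["sha256"])
    return True

deduped = ds.filter(first_occurrence)  # single-process so the set is shared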

Dref360 (Contributor) commented Nov 6, 2021

For reference, we can get a solution fairly easily if we assume we can hold all the unique values in memory.

from datasets import Dataset
from itertools import cycle
from functools import partial
from typing import Any

memory = set()

def is_unique(elem: Any, column: str, memory: set) -> bool:
    # Keep the row only if its value in `column` has not been seen before
    if elem[column] in memory:
        return False
    else:
        memory.add(elem[column])
        return True

# Example dataset
ds = Dataset.from_dict({"col1": [sent for i, sent in zip(range(10), cycle(["apple", "orange", "pear"]))],
                        "col2": [i % 5 for i in range(10)]})

# Drop duplicates in `ds` on "col1"
ds2 = ds.filter(partial(is_unique, column="col1", memory=memory))

Of course, we can improve the API so that we can introduce Dataset.drop_duplicates (a rough sketch of what that could look like is below).
For the parallel version, we could use a shared-memory set.
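
For illustration only, a hypothetical drop_duplicates helper built on the filter approach above might look like this (the name and signature are assumptions, not an existing datasets API):

from datasets import Dataset

def drop_duplicates(ds: Dataset, column: str) -> Dataset:
    # Hypothetical helper: keep the first row for each distinct value in `column`.
    # Holds every seen value in memory; single-process only.
    seen = set()

    def _is_first(example: dict) -> bool:
        value = example[column]
        if value in seen:
            return False
        seen.add(value)
        return True

    return ds.filter(_is_first)

# Hypothetical usage: ds2 = drop_duplicates(ds, "col1")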

lvwerra (Member) commented Dec 2, 2021

An approach that works assuming you can hold all the unique document hashes in memory:

from datasets import load_dataset

def get_hash(example):
    """Get hash of content field."""
    return {"hash": hash(example["content"])}  # can use any hashing function here

def check_uniques(example, uniques):
    """Check if the current hash is still in the set of unique hashes, and remove it if so."""
    if example["hash"] in uniques:
        uniques.remove(example["hash"])
        return True
    else:
        return False

ds = load_dataset("some_dataset", split="train")
ds = ds.map(get_hash)
uniques = set(ds.unique("hash"))
ds_filter = ds.filter(check_uniques, fn_kwargs={"uniques": uniques})

If the uniques could be stored in Arrow, then no additional memory would be used at all, but I don't know if this is possible.

StephennFernandes commented
@lvwerra hey, could you tell me how reliable this deduplication method is? I am currently using the same deduplication strategy to deduplicate a large text corpus for pretraining LLMs (~11B to 20B). I just needed to make sure this strategy would be fine on large datasets for LLM pretraining.

Manel-Hik commented
Hi @StephennFernandes, I'm also trying to pretrain an LLM and need to deduplicate my dataset. Which method did you apply, please?

StephennFernandes commented Jan 25, 2024

Hey @Manel-Hik

The following is a simpler yet really effective deduplication script that I have used in the past.

Given that I had a limited training corpus for the languages I wanted to train on, I relied on this code: https://huggingface.co/datasets/Finnish-NLP/mc4_fi_cleaned/blob/main/deduplicate.py

For more robust and stronger deduplication, refer to this newly released Hugging Face repo: https://github.com/huggingface/datatrove

Manel-Hik commented
Thanks a lot. Sure, I will check it, @StephennFernandes.

fzyzcjy (Contributor) commented Mar 28, 2024

Hi, are there any updates? Thanks!

Dref360 (Contributor) commented Jul 19, 2024

Update July 2024

PyArrow now supports first/last aggregations, which would allow us to implement this functionality (link; a rough sketch is below).

So if we want to move in this direction, we can :) Is that something we want to do? I would be happy to contribute.
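
For illustration, a minimal sketch of deduplicating on a key column with PyArrow's group_by and the "first" aggregation (assuming a recent PyArrow version that supports it; column names are just examples):

import pyarrow as pa

table = pa.table({"col1": ["apple", "orange", "apple", "pear"],
                  "col2": [1, 2, 3, 4]})

# One row per distinct "col1"; every other column keeps its first value.
# Note: with multi-threaded execution, "first" is not guaranteed to follow
# the original row order.
deduped = table.group_by("col1").aggregate([("col2", "first")])
# Resulting columns: "col1" and "col2_first"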
