
Can datasets remove duplicated rows? #2514

Open
liuxinglan opened this issue Jun 17, 2021 · 12 comments

Labels
enhancement New feature or request

Comments

liuxinglan commented Jun 17, 2021

Is your feature request related to a problem? Please describe.
I find myself relying more and more on datasets to do all my preprocessing. One thing I couldn't figure out, however, is how to remove duplicated rows, so I always end up converting the dataset to pandas to do that.

Describe the solution you'd like
A "remove duplicated rows" functionality.

Describe alternatives you've considered
Convert the dataset to pandas, remove the duplicates, and convert back (see the sketch below)...

Additional context
None.
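
For reference, a minimal sketch of the pandas round-trip workaround (the column names are just for illustration):

from datasets import Dataset

# Toy dataset with one exact duplicate row
ds = Dataset.from_dict({"text": ["a", "b", "a"], "label": [0, 1, 0]})

# Round-trip through pandas to drop exact duplicates
df = ds.to_pandas()
deduped = Dataset.from_pandas(df.drop_duplicates(), preserve_index=False)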

liuxinglan added the enhancement (New feature or request) label on Jun 17, 2021
lhoestq (Member) commented Jun 18, 2021

Hi! For now this is probably the best option.
We might add a feature like this in the future as well.

Do you know any deduplication method that works on arbitrarily big datasets without filling up RAM?
Otherwise we could do the deduplication in memory like pandas, but I feel like this is going to be limiting for some cases.

mariosasko (Collaborator) commented

Yes, I'd like to work on this feature once I'm done with #2500, but first I have to do some research and see whether the implementation would be too complex.

In the meantime, maybe this lib can help. However, note that this lib operates directly on pyarrow tables and relies only on hash to find duplicates (e.g. -1 and -2 have the same hash in Python 3, so this lib will treat them as duplicates), which doesn't make much sense.
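
As a quick illustration of the hash collision mentioned above (CPython-specific behavior):

# In CPython, -1 is reserved internally as an error sentinel for hash functions,
# so hash(-1) is remapped to -2 and collides with hash(-2).
assert hash(-1) == hash(-2) == -2
assert -1 != -2  # distinct values, identical hashes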

liuxinglan (Author) commented

> Hi! For now this is probably the best option.
> We might add a feature like this in the future as well.
>
> Do you know any deduplication method that works on arbitrarily big datasets without filling up RAM?
> Otherwise we could do the deduplication in memory like pandas, but I feel like this is going to be limiting for some cases.

Great if this can be done. Thanks!!

Not sure if you are asking me. In any case, I don't know of any, unfortunately :( In practice, if the data is really large we normally do it with Spark (just for info; I understand this is not directly useful for developing this library).

Dref360 (Contributor) commented Oct 7, 2021

Hello,

I'm also interested in this feature.
Has there been progress on this issue?

Could we use a similar trick to the one above, but with a better hashing algorithm like SHA (sketched below)?

We could also use a Bloom filter; should we worry much about collisions in this case?
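
A minimal sketch of what a SHA-based variant could look like, assuming the deduplication key is a single text column (the column name "text" is just an example):

import hashlib
from datasets import Dataset

def add_sha256(example):
    # Content-based fingerprint: unlike Python's built-in hash(), collisions are
    # practically impossible and the digest is stable across runs and processes.
    digest = hashlib.sha256(example["text"].encode("utf-8")).hexdigest()
    return {"sha256": digest}

ds = Dataset.from_dict({"text": ["apple", "orange", "apple"]})
ds = ds.map(add_sha256)

seen = set()
def first_occurrence(example):
    # Keep only the first row for each fingerprint
    if example["sha256"] in seen:
        return False
    seen.add(example["sha256"])
    return True

deduped = ds.filter(first_occurrence)  # single-process so the set is shared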

Dref360 (Contributor) commented Nov 6, 2021

For reference, we can get a solution fairly easily if we assume we can hold all the unique values in memory.

from datasets import Dataset
from itertools import cycle
from functools import partial
from typing import Any

memory = set()

def is_unique(elem: Any, column: str, memory: set) -> bool:
    # Keep the row only if its value in `column` has not been seen before
    if elem[column] in memory:
        return False
    else:
        memory.add(elem[column])
        return True

# Example dataset
ds = Dataset.from_dict({"col1": [sent for i, sent in zip(range(10), cycle(["apple", "orange", "pear"]))],
                        "col2": [i % 5 for i in range(10)]})

# Drop duplicates in `ds` on "col1"
ds2 = ds.filter(partial(is_unique, column="col1", memory=memory))

Of course, we can improve the API so that we can introduce Dataset.drop_duplicates (a rough sketch of what that could look like is below).
For the parallel version, we could use a shared-memory set.
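
For illustration only, a hypothetical drop_duplicates helper built on the filter approach above might look like this (the name and signature are assumptions, not an existing datasets API):

from datasets import Dataset

def drop_duplicates(ds: Dataset, column: str) -> Dataset:
    # Hypothetical helper: keep the first row for each distinct value in `column`.
    # Holds every seen value in memory; single-process only.
    seen = set()

    def _is_first(example: dict) -> bool:
        value = example[column]
        if value in seen:
            return False
        seen.add(value)
        return True

    return ds.filter(_is_first)

# Hypothetical usage: ds2 = drop_duplicates(ds, "col1")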

lvwerra (Member) commented Dec 2, 2021

An approach that works assuming you can hold all the unique document hashes in memory:

from datasets import load_dataset

def get_hash(example):
    """Get hash of content field."""
    return {"hash": hash(example["content"])}  # can use any hashing function here

def check_uniques(example, uniques):
    """Check if the current hash is still in the set of unique hashes, and remove it if so."""
    if example["hash"] in uniques:
        uniques.remove(example["hash"])
        return True
    else:
        return False

ds = load_dataset("some_dataset", split="train")
ds = ds.map(get_hash)
uniques = set(ds.unique("hash"))
ds_filter = ds.filter(check_uniques, fn_kwargs={"uniques": uniques})

If the uniques could be stored in Arrow, then no additional memory would be used at all, but I don't know if this is possible.

StephennFernandes commented
@lvwerra hey, could you tell me how reliable this deduplication method is? I am currently using the same deduplication strategy to deduplicate a large text corpus for pretraining LLMs (~11B to 20B). I just needed to make sure this strategy would be fine on large datasets for LLM pretraining.

Manel-Hik commented
Hi @StephennFernandes, I'm also trying to pretrain an LLM and need to deduplicate my dataset. Which method did you apply, please?

StephennFernandes commented Jan 25, 2024

Hey @Manel-Hik

The following is a simpler yet really effective deduplication script that I have used in the past.

Given that I had a limited training corpus for the languages I wanted to train on, I relied on this code: https://huggingface.co/datasets/Finnish-NLP/mc4_fi_cleaned/blob/main/deduplicate.py

For more robust and stronger deduplication, refer to this newly released Hugging Face repo: https://github.com/huggingface/datatrove

Manel-Hik commented
Thanks a lot. Sure, I will check it, @StephennFernandes.

fzyzcjy (Contributor) commented Mar 28, 2024

Hi, are there any updates? Thanks!

Dref360 (Contributor) commented Jul 19, 2024

Update July 2024

PyArrow now supports first/last aggregations, which would allow us to implement this functionality (link; a rough sketch is below).

So if we want to move in this direction, we can :) Is that something we want to do? I would be happy to contribute.
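
For illustration, a minimal sketch of deduplicating on a key column with PyArrow's group_by and the "first" aggregation (assuming a recent PyArrow version that supports it; column names are just examples):

import pyarrow as pa

table = pa.table({"col1": ["apple", "orange", "apple", "pear"],
                  "col2": [1, 2, 3, 4]})

# One row per distinct "col1"; every other column keeps its first value.
# Note: with multi-threaded execution, "first" is not guaranteed to follow
# the original row order.
deduped = table.group_by("col1").aggregate([("col2", "first")])
# Resulting columns: "col1" and "col2_first"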
