Create `Cache` class for exact, fuzzy, and semantic deduplication #384

sarahyurick · 2024-11-19T22:21:06Z

TODO:

Exact deduplication files
Semantic deduplication files
Fuzzy deduplication files
Tutorials folder

Signed-off-by: Sarah Yurick <[email protected]>

sarahyurick

This PR allows the user to use Cache(cache_dir=...) instead of having to specify the cache_dir at every step of the deduplication pipelines.

However, users who prefer to keep using the modules the same way as before do not have to use Cache. Both approaches work; if a deduplication module is given its own cache_dir, that value takes precedence.
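A minimal sketch of the resolution order described above, assuming a singleton `Cache` that stores one `cache_dir`. The class body and the helper `resolve_cache_dir` are illustrative assumptions, not the code in `nemo_curator/cache.py`.

```python
from typing import Optional


class Cache:
    """Illustrative singleton holding one shared cache directory."""

    _instance: Optional["Cache"] = None

    def __new__(cls, cache_dir: Optional[str] = None):
        # Create the single instance on first use.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.cache_dir = None
        # A later call with an explicit path updates the shared value.
        if cache_dir is not None:
            cls._instance.cache_dir = cache_dir
        return cls._instance

    def get_cache_directory(self) -> Optional[str]:
        return self.cache_dir


def resolve_cache_dir(module_cache_dir: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: a module-level cache_dir wins over the global one."""
    if module_cache_dir is not None:
        return module_cache_dir
    return Cache().get_cache_directory()


Cache(cache_dir="/tmp/shared_cache")
print(resolve_cache_dir())                 # falls back to the global Cache
print(resolve_cache_dir("/tmp/override"))  # module-level value takes precedence
```

With this shape, a pipeline sets the directory once and each step either inherits it or overrides it locally.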

The only breaking changes are to the semantic deduplication parameters; see below.

sarahyurick · 2025-01-23T22:10:22Z

nemo_curator/modules/semantic_dedup/clusteringmodel.py

-        clustering_output_dir: str,
+        cache_dir: Optional[str] = None,
+        clustering_save_loc: str = "clustering_results",


We deprecate clustering_output_dir in favor of cache_dir and clustering_save_loc for ClusteringModel, which matches the logic in the SemDedup class.
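As a rough sketch of how the two new parameters could combine, assuming the output location is `cache_dir` joined with `clustering_save_loc`. The helper name `resolve_save_dir` is hypothetical and not part of NeMo Curator.

```python
import os
from typing import Optional


def resolve_save_dir(cache_dir: Optional[str], save_loc: str) -> str:
    """Join the cache root with a step-specific subdirectory (assumed behavior)."""
    if cache_dir is None:
        raise ValueError("cache_dir must be set, either directly or via a global Cache")
    return os.path.join(cache_dir, save_loc)


# Before: ClusteringModel(..., clustering_output_dir="/data/cache/clustering_results")
# After:  ClusteringModel(..., cache_dir="/data/cache")  # default clustering_save_loc
print(resolve_save_dir("/data/cache", "clustering_results"))
```

Under that assumption, the same pattern would apply to `embeddings_save_loc` in the other semantic deduplication classes.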

sarahyurick · 2025-01-23T22:11:39Z

nemo_curator/modules/semantic_dedup/embeddings.py

-        embedding_output_dir: str,
+        cache_dir: Optional[str] = None,
+        embeddings_save_loc: str = "embeddings",


We deprecate embedding_output_dir in favor of cache_dir and embeddings_save_loc for EmbeddingCreator, which matches the logic in the SemDedup class.

sarahyurick · 2025-01-23T22:12:36Z

nemo_curator/modules/semantic_dedup/semanticclusterleveldedup.py

-            emb_by_clust_dir (str): Directory containing embeddings by cluster.
-            sorted_clusters_dir (str): Directory containing sorted clusters.


We deprecate emb_by_clust_dir and sorted_clusters_dir in favor of cache_dir and embeddings_save_loc for SemanticClusterLevelDedup, which matches the logic in the SemDedup class.

Maghoumi · 2025-02-18T18:22:04Z

Thanks so much for working on this change. What I like about this now is that it gives users the option to either use the same cache directory for anything that requires caching, or provide a specific directory if they don't want to re-use the same cache.

The cache class implementation is functional but not thread-safe. I don't think that's a blocking problem for this PR.

I didn't run the samples/tutorials, but I assume the change has been thoroughly verified?
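For context on the thread-safety remark: a common way to guard singleton creation is to take a lock around the instance check. The sketch below is a generic pattern with a hypothetical class name `LockedCache`; it is not the implementation in this PR.

```python
import threading
from typing import Optional


class LockedCache:
    """Generic lock-guarded singleton (illustrative only)."""

    _instance: Optional["LockedCache"] = None
    _lock = threading.Lock()

    def __new__(cls, cache_dir: Optional[str] = None):
        # Hold the lock for both the check and the write so two threads cannot
        # each create an instance or interleave updates to cache_dir.
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance.cache_dir = None
            if cache_dir is not None:
                cls._instance.cache_dir = cache_dir
            return cls._instance


def worker(results: list) -> None:
    results.append(LockedCache())


results: list = []
threads = [threading.Thread(target=worker, args=(results,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread received the same object, so only one distinct id is counted.
print(len({id(obj) for obj in results}))
```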

sarahyurick · 2025-02-18T18:26:57Z

Thanks! Yes, I tried to make as few breaking changes as possible. The examples and tutorials should all reflect those changes.

sarahyurick · 2025-02-21T20:34:24Z

config/sem_dedup_config.yaml

-seed: 1234
 max_iter: 100
 kmeans_with_cos_dist: false

 # Semdedup configuration
 which_to_keep: "hard"
-largest_cluster_size_to_process: 100000


IIRC, I removed seed and largest_cluster_size_to_process from the semantic dedupe parameters because I couldn't find them being used anywhere.

sarahyurick added 3 commits November 19, 2024 14:20

add global cache variable and use it for exact dedup (769e2ea)
global cache for semdedup (b77139c)
run black and modify pytest (337cec8)

sarahyurick changed the title ~~Global cache variable for exact, fuzzy, and semantic deduplication~~ Global cache_dir variable for exact, fuzzy, and semantic deduplication Nov 19, 2024

sarahyurick commented Nov 20, 2024

docs/user-guide/semdedup.rst (outdated, resolved)
nemo_curator/modules/semantic_dedup.py (outdated, resolved)
nemo_curator/modules/semantic_dedup.py (outdated, resolved)

sarahyurick and others added 6 commits November 19, 2024 16:13

update image notebook (6d55d8c)
Merge branch 'main' into global_cache_dir (622912b)
save fuzzy dedup progress (4cb26d5)
save progress (b001622)
update remaining docs (0c14626)
run black (7486459)

sarahyurick added the gpuci Run GPU CI/CD on PR label Nov 20, 2024

sarahyurick marked this pull request as ready for review November 20, 2024 23:27

Maghoumi reviewed Nov 25, 2024

nemo_curator/cache.py (outdated, resolved)

sarahyurick added 5 commits December 6, 2024 15:06

Merge branch 'main' into global_cache_dir (053f312)
Merge branch 'main' into global_cache_dir (1b1ba30)
Merge branch 'main' into global_cache_dir (4b12651)
Merge branch 'main' into global_cache_dir (4160471)
Merge branch 'main' into global_cache_dir (8a22ace)

sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Dec 23, 2024

Merge branch 'main' into global_cache_dir (5e9bef1)

sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Jan 3, 2025

sarahyurick added 4 commits January 21, 2025 12:54

Merge remote-tracking branch 'upstream/main' into global_cache_dir (d823a0b)
re-add get_cache_directory changes (0890fb0)
create Cache singleton class (8fd79fb)
update exact_dedup (0d7b969)

sarahyurick marked this pull request as draft January 22, 2025 00:54

sarahyurick changed the title ~~Global cache_dir variable for exact, fuzzy, and semantic deduplication~~ Create Cache class for exact, fuzzy, and semantic deduplication Jan 22, 2025

sarahyurick added 5 commits January 22, 2025 13:04

add semdedup functionality with Cache (2c1a435)
add semdedup_example script (f0ff2ce)
Cache singleton option for fuzzy dedup (a379893)
run black (67f609c)
fix tutorials (8693177)

sarahyurick commented Jan 23, 2025

sarahyurick marked this pull request as ready for review January 23, 2025 22:22

sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Jan 23, 2025

sarahyurick requested a review from Maghoumi January 23, 2025 22:23

Merge branch 'main' into global_cache_dir (c296cc7)

sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Jan 29, 2025

Maghoumi approved these changes Feb 18, 2025

sarahyurick and others added 3 commits February 18, 2025 14:35

Merge branch 'main' into global_cache_dir (510347c)
run black (0635ebf)
import assert_eq (a229857)

sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Feb 18, 2025

fix semdedup test (30ec409)

sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Feb 19, 2025

Merge branch 'main' into global_cache_dir (1a63468)

sarahyurick commented Feb 21, 2025
