Create Cache class for exact, fuzzy, and semantic deduplication #384

Open
wants to merge 30 commits into
base: main

sarahyurick
Collaborator

@sarahyurick sarahyurick commented Nov 19, 2024

TODO:

  • Exact deduplication files
  • Semantic deduplication files
  • Fuzzy deduplication files
  • Tutorials folder

@sarahyurick sarahyurick changed the title from "Global cache variable for exact, fuzzy, and semantic deduplication" to "Global cache_dir variable for exact, fuzzy, and semantic deduplication" Nov 19, 2024
sarahyurick and others added 6 commits November 19, 2024 16:13
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added the gpuci Run GPU CI/CD on PR label Nov 20, 2024
@sarahyurick sarahyurick marked this pull request as ready for review November 20, 2024 23:27
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Dec 23, 2024
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Jan 3, 2025
@sarahyurick sarahyurick marked this pull request as draft January 22, 2025 00:54
@sarahyurick sarahyurick changed the title from "Global cache_dir variable for exact, fuzzy, and semantic deduplication" to "Create Cache class for exact, fuzzy, and semantic deduplication" Jan 22, 2025
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Collaborator Author

@sarahyurick sarahyurick left a comment

This PR allows the user to use Cache(cache_dir=...) instead of having to specify the cache_dir at every step of the deduplication pipelines.

However, the user does not have to use Cache if they want to keep using the modules the same way as before. Both ways work; if the user specifies the cache_dir field directly in a deduplication module, that value takes precedence over the one set through Cache.
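
To illustrate the precedence described above, here is a minimal sketch. It is not the implementation in this PR: the singleton-style `Cache` below is a hypothetical stand-in, and `resolve_cache_dir` is a helper name invented for the example.

```python
from typing import Optional


class Cache:
    """Hypothetical stand-in for the Cache class discussed in this PR."""

    _instance: Optional["Cache"] = None

    def __new__(cls, cache_dir: Optional[str] = None):
        # Share one instance so every module sees the same cache_dir.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.cache_dir = None
        # Only overwrite the stored value when a directory is actually given.
        if cache_dir is not None:
            cls._instance.cache_dir = cache_dir
        return cls._instance


def resolve_cache_dir(module_cache_dir: Optional[str] = None) -> Optional[str]:
    """Assumed rule: an explicit per-module cache_dir wins over Cache."""
    if module_cache_dir is not None:
        return module_cache_dir
    return Cache().cache_dir


# Set the shared directory once, then resolve it with and without an override.
Cache(cache_dir="/tmp/shared_cache")
shared = resolve_cache_dir()
override = resolve_cache_dir("/tmp/fuzzy_only")
```

In this sketch, `shared` falls back to the directory stored in `Cache`, while `override` keeps the directory passed directly to the module.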

The only hard changes are to the semantic deduplication fields; see below.

Comment on lines -56 to +58
clustering_output_dir: str,
cache_dir: Optional[str] = None,
clustering_save_loc: str = "clustering_results",
Collaborator Author

We deprecate clustering_output_dir in favor of cache_dir and clustering_save_loc for ClusteringModel, which matches the logic in the SemDedup class.
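
As a rough sketch of what this change could mean for callers, the helper below derives the clustering output path from `cache_dir` and `clustering_save_loc`, and warns if the old argument is still passed. The function name, the `os.path.join` rule, and the fallback behavior for `clustering_output_dir` are assumptions made for illustration, not code from this PR.

```python
import os
import warnings
from typing import Optional


def resolve_clustering_dir(
    cache_dir: Optional[str] = None,
    clustering_save_loc: str = "clustering_results",
    clustering_output_dir: Optional[str] = None,  # hypothetical deprecated alias
) -> str:
    """Return the directory where clustering results would be written."""
    if clustering_output_dir is not None:
        # Assumed behavior: still honor the old argument, but warn about it.
        warnings.warn(
            "clustering_output_dir is deprecated; "
            "use cache_dir and clustering_save_loc instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return clustering_output_dir
    if cache_dir is None:
        raise ValueError("cache_dir must be provided")
    # Assumed rule: results live in a subdirectory of the cache directory.
    return os.path.join(cache_dir, clustering_save_loc)


new_style = resolve_clustering_dir(cache_dir="/tmp/semdedup_cache")
```

The same pattern would apply to `embedding_output_dir` with `embeddings_save_loc`.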

Comment on lines -117 to +118
embedding_output_dir: str,
cache_dir: Optional[str] = None,
embeddings_save_loc: str = "embeddings",
Collaborator Author

We deprecate embedding_output_dir in favor of cache_dir and embeddings_save_loc for EmbeddingCreator, which matches the logic in the SemDedup class.

Comment on lines -54 to -55
emb_by_clust_dir (str): Directory containing embeddings by cluster.
sorted_clusters_dir (str): Directory containing sorted clusters.
Collaborator Author

We deprecate emb_by_clust_dir and sorted_clusters_dir in favor of cache_dir and embeddings_save_loc for SemanticClusterLevelDedup, which matches the logic in the SemDedup class.

@sarahyurick sarahyurick marked this pull request as ready for review January 23, 2025 22:22
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Jan 23, 2025
@sarahyurick sarahyurick requested a review from Maghoumi January 23, 2025 22:23
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Jan 29, 2025
@Maghoumi
Collaborator

Thanks so much for working on this change. What I like about this now is that it gives users the option to either use the same cache directory for anything that requires caching, or provide a specific directory if they don't want to re-use the same cache.

The cache class implementation is functional but not thread-safe. I don't think that's a blocking problem for this PR.
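
For reference, one common way to make a singleton-style cache thread-safe is to guard instance creation and updates with a lock. The sketch below assumes the cache is a singleton holding a single `cache_dir` attribute; `ThreadSafeCache` is a hypothetical name and not the class in this PR.

```python
import threading
from typing import Optional


class ThreadSafeCache:
    """Hypothetical lock-guarded variant of a singleton cache."""

    _instance: Optional["ThreadSafeCache"] = None
    _lock = threading.Lock()

    def __new__(cls, cache_dir: Optional[str] = None):
        # The lock serializes both creation and updates, so two threads
        # cannot each create their own instance or interleave writes.
        with cls._lock:
            if cls._instance is None:
                instance = super().__new__(cls)
                instance.cache_dir = None
                cls._instance = instance
            if cache_dir is not None:
                cls._instance.cache_dir = cache_dir
            return cls._instance
```

The lock only protects the in-process object; it does nothing to coordinate concurrent writes to files inside the cache directory.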

I didn't run the samples/tutorials, but I assume the change has been thoroughly verified?

@sarahyurick
Collaborator Author

Thanks! Yes, I tried to make as few breaking changes as possible. The examples and tutorials should all reflect those changes.

sarahyurick and others added 3 commits February 18, 2025 14:35
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Feb 18, 2025
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Feb 19, 2025
Comment on lines -14 to -20
seed: 1234
max_iter: 100
kmeans_with_cos_dist: false

# Semdedup configuration
which_to_keep: "hard"
largest_cluster_size_to_process: 100000
Collaborator Author

IIRC, I removed seed and largest_cluster_size_to_process from the semantic dedupe parameters because I couldn't find them being used anywhere.

Labels
gpuci Run GPU CI/CD on PR