-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create Cache
class for exact, fuzzy, and semantic deduplication
#384
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
cache_dir
variable for exact, fuzzy, and semantic deduplication
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
cache_dir
variable for exact, fuzzy, and semantic deduplicationCache
class for exact, fuzzy, and semantic deduplication
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR allows the user to use Cache(cache_dir=...)
instead of having to specify the cache_dir
at every step of the deduplication pipelines.
However, the user does not have to use Cache
if they want to keep using the modules the same way as before. Both ways work, with preference if the user specifies the cache_dir
field in a deduplication module.
The only hard changes are with the semantic deduplication fields, see below.
clustering_output_dir: str, | ||
cache_dir: Optional[str] = None, | ||
clustering_save_loc: str = "clustering_results", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We deprecate clustering_output_dir
in favor of cache_dir
and clustering_save_loc
for ClusteringModel
, which matches the logic in the SemDedup
class.
embedding_output_dir: str, | ||
cache_dir: Optional[str] = None, | ||
embeddings_save_loc: str = "embeddings", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We deprecate embedding_output_dir
in favor of cache_dir
and embeddings_save_loc
for EmbeddingCreator
, which matches the logic in the SemDedup
class.
emb_by_clust_dir (str): Directory containing embeddings by cluster. | ||
sorted_clusters_dir (str): Directory containing sorted clusters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We deprecate emb_by_clust_dir
and sorted_clusters_dir
in favor of cache_dir
and embeddings_save_loc
for SemanticClusterLevelDedup
, which matches the logic in the SemDedup
class.
Signed-off-by: Sarah Yurick <[email protected]>
Thanks so much for working on this change. What I like about this now is that it gives users the option to either use the same cache directory for anything that requires caching, or provide a specific directory if they don't want to re-use the same cache. The cache class implementation is functional but not thread-safe. I don't think that's a blocking problem for this PR. I didn't run the samples/tutorials, but I assume the change has been thoroughly verified? |
Thanks! Yes, I tried to make as few breaking changes as possible. The examples and tutorials should all reflect those changes. |
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
seed: 1234 | ||
max_iter: 100 | ||
kmeans_with_cos_dist: false | ||
|
||
# Semdedup configuration | ||
which_to_keep: "hard" | ||
largest_cluster_size_to_process: 100000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, I removed seed
and largest_cluster_size_to_process
from the semantic dedupe parameters because I couldn't find them being used anywhere.
TODO: