[REVIEW] Fix Sem Dedup #478

Merged
merged 7 commits into from
Jan 13, 2025

Conversation

VibhuJawa
Collaborator

@VibhuJawa VibhuJawa commented Jan 9, 2025

Description

This PR fixes semdedup errors that I was seeing on my machine.

@VibhuJawa VibhuJawa added the gpuci Run GPU CI/CD on PR label Jan 9, 2025
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Jan 9, 2025
@VibhuJawa VibhuJawa marked this pull request as ready for review January 9, 2025 21:50
@VibhuJawa VibhuJawa changed the title [WIP] Fix Sem Dedup [REVIEW] Fix Sem Dedup Jan 13, 2025
@VibhuJawa VibhuJawa merged commit 7cfda44 into NVIDIA:main Jan 13, 2025
5 checks passed
sarahyurick added a commit to sarahyurick/NeMo-Curator that referenced this pull request Jan 15, 2025
Signed-off-by: Sarah Yurick <[email protected]>
sarahyurick added a commit that referenced this pull request Jan 17, 2025
* add changes from #389

Signed-off-by: Sarah Yurick <[email protected]>

* add scripts files

Signed-off-by: Sarah Yurick <[email protected]>

* add changes from #326

Signed-off-by: Sarah Yurick <[email protected]>

* run black

Signed-off-by: Sarah Yurick <[email protected]>

* re add ParallelScoreFilter

Signed-off-by: Sarah Yurick <[email protected]>

* remove _MapBuckets and _Shuffle from nemo_curator path

Signed-off-by: Sarah Yurick <[email protected]>

* update api doc

Signed-off-by: Sarah Yurick <[email protected]>

* add changes from #445

Signed-off-by: Sarah Yurick <[email protected]>

* Add changes from #478

Signed-off-by: Sarah Yurick <[email protected]>

* final nits

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Labels
gpuci Run GPU CI/CD on PR
3 participants