gneissweb_classification #974

Merged: 12 commits, Jan 30, 2025

Conversation

ran-iwamoto
Contributor

Why are these changes needed?

This PR adds a transform for gneissweb classification using a fastText classifier.

Related issue number (if any).

issue #924

@ran-iwamoto
Contributor Author

Please check @shahrokhDaijavad @touma-I

Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@shahrokhDaijavad
Member

@touma-I and @ran-iwamoto I reviewed and modified the README file. It looks good now. I also successfully tested both the Python and Ray versions of the notebook (after creating my own HF token and using it in the notebooks). CI/CD is failing above because test-language-gneissweb_classification.yml is missing.

@Swanand-Kadhe

One point for discussion: I was wondering whether we want to call this the gneissweb classification transform. It is really a more generic transform for fastText classification. We will use it with the fastText models used in GneissWeb to create the GneissWeb recipe notebook, but it can also be used outside of GneissWeb. For instance, one can reproduce the fastText classification step of the DCLM dataset using this transform with the appropriate fastText model. It would be good to know what others think. Many thanks!

@shahrokhDaijavad
Member

Thanks, @Swanand-Kadhe. Your point is well taken. However, we discussed the naming of this transform at length in #924, and the name gneissweb_classification was chosen until we come up with a better one. The name fastText classification is too broad for what we have now.

Signed-off-by: Maroun Touma <[email protected]>
Member

@shahrokhDaijavad left a comment

See inline comment.

The sentence "The prefix gcls is short name for Gneissweb CLaSsification." has been added.
@Swanand-Kadhe

Thanks a lot, @shahrokhDaijavad, for pointing me to the discussion in #924. I should have looked at it earlier. Sorry about that.

@shahrokhDaijavad
Member

Thank you, @ran-iwamoto, for making the latest changes. Let's wait for the other reviewers also to look before we approve and merge.

cmadam
cmadam previously requested changes Jan 29, 2025
Collaborator

@cmadam left a comment

Great implementation. However, I would suggest using the *_cli_param names instead of the *_key names as configuration keys in the transform, as highlighted in the two comments below.
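
To make the suggestion concrete, here is a minimal sketch of the naming pattern, assuming the usual convention of a bare `*_key` name and a prefixed `*_cli_param` name built from it. The `gcls` prefix comes from this thread; the parameter name `model_file_name`, its default value, and the helper functions are hypothetical and are not taken from the PR's code.

```python
import argparse

# Prefix for this transform's parameters ("gcls" is the short name mentioned in this PR).
short_name = "gcls"
cli_prefix = f"{short_name}_"

# Hypothetical parameter, for illustration only.
model_file_name_key = "model_file_name"                            # bare name
model_file_name_cli_param = f"{cli_prefix}{model_file_name_key}"   # "gcls_model_file_name"


def build_parser() -> argparse.ArgumentParser:
    """Expose the parameter on the command line under its prefixed name."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        f"--{model_file_name_cli_param}",
        type=str,
        default="model.bin",
        help="Name of the model file to load (hypothetical parameter)",
    )
    return parser


def parse_config(argv: list[str]) -> dict:
    """Return the parsed arguments as a plain dict keyed by the prefixed names."""
    return vars(build_parser().parse_args(argv))


def read_model_file_name(config: dict) -> str:
    # The transform looks the value up with the *_cli_param name, so the key in the
    # config dict matches exactly what the user typed on the command line.
    return config[model_file_name_cli_param]
```

With this layout, `read_model_file_name(parse_config(["--gcls_model_file_name", "lid.bin"]))` returns `"lid.bin"`, and no separate step is needed to strip the prefix before the transform reads its configuration.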

@shahrokhDaijavad
Member

@ran-iwamoto I pushed a few changes to your branch to finalize the parameter names and configuration in the README and in the comment sections of the two notebooks.

@shahrokhDaijavad
Member

@BishwaBhatta This transform has been tested only with the facebook/fasttext-language-identification model.bin.
Once the models that were used for the GneissWeb fastText classification are uploaded to HF or a similar public place, we should test it again.

@touma-I self-requested a review January 30, 2025 17:35
Member

@shahrokhDaijavad left a comment

LGTM

@touma-I dismissed cmadam’s stale review January 30, 2025 20:45

Change has been implemented. Thanks.

@touma-I merged commit 6c83d27 into IBM:dev Jan 30, 2025
7 checks passed
5 participants