Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BPE tokenizer #389

Merged
merged 1 commit into from
Oct 21, 2022
Merged

Conversation

chenmoneygithub
Copy link
Contributor

This PR is a rework on #303.

Recreate the PR instead of direct editing for clear remote-local tracking.

@mattdangerw
Copy link
Member

Thanks! Will take a look!

One note, it might be nice to add Jesse as a co-author on the commit, he did some incredible work on this and we should make sure to credit it.

https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors#creating-co-authored-commits-on-the-command-line

@chenmoneygithub
Copy link
Contributor Author

chenmoneygithub commented Oct 14, 2022

@mattdangerw Definitely! Will add in the next commit.

Copy link
Member

@mattdangerw mattdangerw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Took a high level pass, left some suggestions for overall readability (particularly splitting out a util file for the BPE algo itself).

keras_nlp/tokenizers/__init__.py Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer_test.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Show resolved Hide resolved
Copy link
Member

@mattdangerw mattdangerw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! Left some docstring comments to fix up. Still plenty of follow ups to do, but I think landing and iterating is a good plan.

Let's make sure to fix the co-author so github properly registers this.

Let's also open some follow up issues once this is in:

  • Any outstanding places we know our output differs should for sure be tracked in an issue.
  • An issue for removing the tf.function we shouldn't force compile when running eagerly.
  • An issue for adding some testing that does not require network (and works off small simplified vocabs).

Thanks!

keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Show resolved Hide resolved
keras_nlp/tokenizers/byte_pair_tokenizer.py Outdated Show resolved Hide resolved
keras_nlp/tokenizers/__init__.py Show resolved Hide resolved
Add more test cases.

Co-authored-by: jessechancy <[email protected]>

add merge file

Make cache a tf module

Delete testdata

address comments

address comments

fix docstring

fix docstring
@chenmoneygithub chenmoneygithub merged commit da490c9 into keras-team:master Oct 21, 2022
mattdangerw pushed a commit to mattdangerw/keras-hub that referenced this pull request Nov 10, 2022
Add more test cases.

Co-authored-by: jessechancy <[email protected]>

add merge file

Make cache a tf module

Delete testdata

address comments

address comments

fix docstring

fix docstring
mattdangerw pushed a commit that referenced this pull request Nov 10, 2022
Add more test cases.

Co-authored-by: jessechancy <[email protected]>

add merge file

Make cache a tf module

Delete testdata

address comments

address comments

fix docstring

fix docstring
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants