Commit
Add option to use fast HF tokenizer. (#482)
* Add option to use fast HF tokenizer
* Hand merge tests from PR #205
* test_inferencer_with_fast_bert_tokenizer
* test_fast_bert_tokenizer
* test_fast_bert_tokenizer_strip_accents
* test_fast_electra_tokenizer
* Fix OOM issue of CI
  - set num_processes=0 for Inferencer
* Extend test for fast tokenizer
  - electra
  - roberta
* test_fast_tokenizer for more model types
  - electra
  - roberta
* Fix tokenize_with_metadata
* Split tokenizer tests
* Fix pytest params bug in test_tok
* Fix fast tokenizer usage
* add missing newline eof
* Add test fast tok. doc_callif.
* Remove RobertaTokenizerFast
* Fix Tokenizer load and save.
* Fix typo
* Improve test test_embeddings_extraction
  - add shape assert
  - fix embedding assert
* Docstring for fast tokenizers improved
* tokenizer_args docstring
* Extend test_embeddings_extraction to fast tok.
* extend test_ner with fast tok.
* fix sample_to_features_ner for fast tokenizer
* temp fix for is_pretokenized until fixed upstream
* Make use of fast tokenizer possible + fix bug in offset calculation
* Make fast tokenization possible with NER, LM and QA
* Change error messages
* Add tests
* update error messages, comments and truncation arg in tokenizer

Co-authored-by: Malte Pietsch <[email protected]>
Co-authored-by: Bogdan Kostić <[email protected]>
1 parent dd3945d, commit 435f3ee
Showing 10 changed files with 420 additions and 93 deletions.