Releases: tensorflow/text
2.4.0-b0
Release 2.4.0-b0
Please note that this is a pre-release and meant to run with TF v2.3.x. We wanted to give access to some of the features we were adding to 2.4.x, but did not want to wait for the TF release.
Major Features and Improvements
- Released our first TF Hub module for Chinese segmentation! See the hub module page for more information, including instructions on how to use the model.
- Added Splitter/SplitterWithOffsets abstract base classes. These are meant to replace the current Tokenizer/TokenizerWithOffsets base classes. The Tokenizer base classes will continue to work and will implement these new Splitter base classes. The reasoning behind the change is to prevent confusion when future splitting operations that also use this interface do not tokenize into words (sentences, subwords, etc.).
- With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that offset_end is a positional value rather than a length.
- Added new HubModuleSplitter that helps handle ragged tensor input and outputs for hub modules which implement the Splitter class.
- Added new SplitMergeFromLogitsTokenizer, a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.
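The offset terminology and the logits-based splitting described above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper name split_merge_from_logits, the two-value logit layout (first value "split", second value "merge"), the whitespace handling, and the use of character rather than byte offsets are all assumptions made for this example. Note that each end offset is the position just past the token, not the token's length.

```python
def split_merge_from_logits(text, logits):
    """Split `text` using one (split, merge) logit pair per character.

    Hypothetical helper for illustration only; not part of tensorflow_text.
    Returns (tokens, starts, ends); each end is an exclusive position.
    """
    if len(text) != len(logits):
        raise ValueError("need exactly one logit pair per character")
    tokens, starts, ends = [], [], []
    for i, (ch, (split_logit, merge_logit)) in enumerate(zip(text, logits)):
        if ch.isspace():
            # Assumption: whitespace always ends a token and is dropped.
            continue
        starts_new_token = (
            not tokens                    # first non-space character
            or ends[-1] != i              # previous character was whitespace
            or split_logit > merge_logit  # the model prefers a split here
        )
        if starts_new_token:
            tokens.append(ch)
            starts.append(i)
            ends.append(i + 1)
        else:
            tokens[-1] += ch  # merge into the current token
            ends[-1] = i + 1
    return tokens, starts, ends


# Made-up logits: the last character merges into the one before it.
text = "我爱北京"
logits = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
tokens, starts, ends = split_merge_from_logits(text, logits)
print(tokens, starts, ends)
```

In this example the last token spans positions 2 to 4, so its end offset is 4 while its length is 2, which is the distinction the "end" terminology is meant to make clear.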
Bug Fixes and Other Changes
- Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.
- Add dep on tensorflow_hub in pip_package/setup.py
- Add filegroup BUILD target for test_data segmentation Hub module.
- Extend documentation for class HubModuleSplitter.
- Read SP model file in bytes mode in tests.
Thanks to our Contributors
2.3.0
Release 2.3.0
Major Features and Improvements
- Added UnicodeCharacterTokenizer
- Tokenizers are now tf.Modules and can be saved from within Keras layers.
Bug Fixes and Other Changes
- Allow wordpiece_tokenizer to output int32 tokens natively.
- Tracks the Sentencepiece model resource via a TrackableResource.
- oss-segmenter:
  - Fix end-offset error in split_merge_tokenizer_kernel.
- TensorFlow text python ops wordshape:
  - More comprehensive emoji handling
- Other:
  - Unref lookup_table in wordpiece_kernel fixing a possible memory leak.
  - Add missing LICENSE file for third_party/tensorflow_text/core/kernels.
  - Add normalize kernels test.
  - Fix Sentencepiece tests.
  - Add some metric logs to tokenizers.
  - Fix documentation formatting for SplitMergeTokenizer
  - Bug fix: make sure tokenize() method does not ignore itself.
  - Improve logging efficiency.
  - Update tf.text's regression test model for model server. Without the asserts, errors are erroneously swallowed by tensorflow. I also added tf.unicode_script test just to ensure that ICU is working correctly from within model server.
  - Add the ability to define a user-defined destination directory to make testing easier.
  - Fix typo in documentation of BertTokenizer
  - Clarify docstring of UnicodeScriptTokenizer about splitting on space
  - Add executable flag to the run_build.sh script.
  - Clarify docstring of WordpieceTokenizer on unknown_token:
  - Update protobuf library and point HEAD to build on tf 2.3.0-rc0
Thanks to our Contributors
2.3.0-rc1
Release 2.3.0-rc1
Major Features and Improvements
- Added UnicodeCharacterTokenizer
Bug Fixes and Other Changes
- oss-segmenter:
  - Fix end-offset error in split_merge_tokenizer_kernel.
- TensorFlow text python ops wordshape:
  - More comprehensive emoji handling
- Other:
  - Unref lookup_table in wordpiece_kernel fixing a possible memory leak.
  - Add missing LICENSE file for third_party/tensorflow_text/core/kernels.
  - Add normalize kernels test.
  - Add some metric logs to tokenizers.
  - Fix documentation formatting for SplitMergeTokenizer
  - Bug fix: make sure tokenize() method does not ignore itself.
  - Improve logging efficiency.
  - Update tf.text's regression test model for model server. Without the asserts, errors are erroneously swallowed by tensorflow. I also added tf.unicode_script test just to ensure that ICU is working correctly from within model server.
  - Add the ability to define a user-defined destination directory to make testing easier.
  - Fix typo in documentation of BertTokenizer
  - Clarify docstring of UnicodeScriptTokenizer about splitting on space
  - Add executable flag to the run_build.sh script.
  - Clarify docstring of WordpieceTokenizer on unknown_token:
  - Update protobuf library and point HEAD to build on tf 2.3.0-rc0
Thanks to our Contributors
2.2.1
2.2.0 release
Release 2.2
Major Features and Improvements
Breaking Changes
Bug Fixes and Other Changes
- Update version
Thanks to our Contributors
v2.2.0-rc2
Bug fixes
- Force macOS builds to target OS X 10.9 so they can be installed on a wider range of macOS versions.
v2.2.0-rc1
Release 2.2.0-rc1
Major Features and Improvements
- Add op for solving max-spanning-tree (MST) problems. The code here is intended for NLP applications, but attempts to remain agnostic to particular NLP tasks (such as dependency parsing).
- Add max_spanning_tree_gradient.
- Add support for 'preserve_unused_tokens' options in BertTokenizer.
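As a rough illustration of the max-spanning-tree problem in a dependency-parsing setting, the sketch below finds the best single-rooted tree over a tiny graph by brute-force enumeration. This is not the algorithm the op uses (enumeration is exponential and only practical for a handful of nodes). The score layout, where scores[t][s] is the score of an arc from s to t and scores[t][t] is the score of choosing t as the root, is an assumption made for this example, as are the helper names.

```python
from itertools import product


def is_single_rooted_tree(heads):
    """True if `heads` encodes one tree; heads[t] == t marks the root."""
    roots = [t for t, h in enumerate(heads) if h == t]
    if len(roots) != 1:
        return False
    for start in range(len(heads)):
        node, seen = start, set()
        # Follow head pointers; every node must reach the root without a cycle.
        while heads[node] != node:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


def brute_force_max_spanning_tree(scores):
    """Return (heads, total_score) of the highest-scoring single-rooted tree."""
    n = len(scores)
    best_heads, best_score = None, float("-inf")
    for heads in product(range(n), repeat=n):
        if not is_single_rooted_tree(heads):
            continue
        total = sum(scores[t][heads[t]] for t in range(n))
        if total > best_score:
            best_heads, best_score = list(heads), total
    return best_heads, best_score


# Made-up scores for a 3-node graph.
scores = [
    [5, 1, 1],  # node 0: root score 5, arcs from nodes 1 and 2 score 1
    [4, 0, 2],  # node 1: arc from node 0 scores 4
    [1, 6, 0],  # node 2: arc from node 1 scores 6
]
heads, total = brute_force_max_spanning_tree(scores)
print(heads, total)  # [0, 0, 1] 15
```

Here node 0 is selected as the root, node 1 attaches to node 0, and node 2 attaches to node 1, which is the kind of head assignment a dependency parser would read off the result.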
Bug Fixes and Other Changes
- Documentation updates.
- Reorganize the BUILD file for keras layers.
- Update model server testing. The test script now generates a model that integrates into tf serving's testing infra.
- Remove unneeded heavy dependencies in regex_split library.
- Turn TF text's ConstrainedSequence implementations into standalone callable functions.
- Fix bug in ViterbiAnalysis computation triggered when not using transition_weights.
- Remove testing_utils run_tf_function, which is now enabled by default.
- Update patch params to work with Bazel >=1.0.0
- Remove circular dependencies by removing submodule imports from ragged package.
- Prevent tf.Text from breaking when ragged_ops.py is not released in TF.
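For readers unfamiliar with the Viterbi-related fix above, the following generic sketch shows what decoding with and without transition weights can look like. It is a textbook Viterbi decoder in plain Python, not the library's ViterbiAnalysis code; the function name viterbi_decode and the treatment of missing transition weights as all zeros are assumptions made for this example.

```python
def viterbi_decode(scores, transition_weights=None):
    """Return the highest-scoring state sequence.

    scores[step][state] is the score of being in `state` at `step`.
    transition_weights[prev][cur] is added when moving prev -> cur.
    Assumption: when no transition weights are given, all are zero.
    """
    num_states = len(scores[0])
    if transition_weights is None:
        transition_weights = [[0.0] * num_states for _ in range(num_states)]

    best = list(scores[0])  # best score of any path ending in each state
    backpointers = []
    for step_scores in scores[1:]:
        new_best, pointers = [], []
        for cur in range(num_states):
            prev = max(
                range(num_states),
                key=lambda p: best[p] + transition_weights[p][cur],
            )
            new_best.append(
                best[prev] + transition_weights[prev][cur] + step_scores[cur]
            )
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(num_states), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(viterbi_decode(scores))                                  # [0, 1, 0]
print(viterbi_decode(scores, [[0.0, -5.0], [-5.0, 0.0]]))      # [0, 0, 0]
```

Without transition weights the decoder simply follows the best per-step state; with a penalty on switching states, it stays in state 0 throughout.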
Thanks to our Contributors
This release contains contributions from many people at Google, as well as:
Hyunwoo Cho
v2.1.1
v2.1.0-rc0
Major Updates
- Added SplitMergeTokenizer.
- Add support for token offsets to BertTokenizer.
Minor Updates
- Give BertTokenizer ability to read in a vocab file directly.
- Migrate from std::string to tensorflow::tstring.
- Many build script improvements.
- Update ToDense layer with ragged support attribute.
Bug Fixes
- Update SentencePiece to inherit from TokenizerWithOffsets.
- Fix ICU data linking issue.