Fixes GitHub issue #158. This replaces the ICU character class for whitespace.

If we want to support these, we need to add the RE2_USE_ICU build flag and link in ICU to the regex ops. I have a working patch, but am not convinced it is worth submitting.

PiperOrigin-RevId: 279791264
broken committed Nov 11, 2019
1 parent 84c324a commit 9bfece7
Showing 2 changed files with 2 additions and 4 deletions.
1 change: 0 additions & 1 deletion tensorflow_text/BUILD
@@ -96,7 +96,6 @@ py_library(
 ":tokenization",
 ":unicode_script_tokenizer",
 ":wordpiece_tokenizer",
-":wordshape_ops",
 # python:array_ops tensorflow dep,
 # python:dtypes tensorflow dep,
 # python:math_ops tensorflow dep,
5 changes: 2 additions & 3 deletions tensorflow_text/python/ops/bert_tokenizer.py
@@ -28,11 +28,10 @@
 from tensorflow_text.python.ops.tokenization import Tokenizer
 from tensorflow_text.python.ops.tokenization import TokenizerWithOffsets
 from tensorflow_text.python.ops.wordpiece_tokenizer import WordpieceTokenizer
-from tensorflow_text.python.ops.wordshape_ops import WordShape


 _DELIM_REGEX = [
-WordShape.IS_WHITESPACE.value,
+r"\s+",
 r"|".join([
 r"[!-/]",
 r"[:-@]",
@@ -54,7 +53,7 @@

 _DELIM_REGEX_PATTERN = "|".join(_DELIM_REGEX)
 _KEEP_DELIM_NO_WHITESPACE = copy.deepcopy(_DELIM_REGEX)
-_KEEP_DELIM_NO_WHITESPACE.remove(WordShape.IS_WHITESPACE.value)
+_KEEP_DELIM_NO_WHITESPACE.remove(r"\s+")

 _KEEP_DELIM_NO_WHITESPACE_PATTERN = "|".join(_KEEP_DELIM_NO_WHITESPACE)

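The list handling in `bert_tokenizer.py` can be illustrated with a minimal, standalone sketch. It is not the library's code: the delimiter list is shortened to the two punctuation ranges visible in the diff, the names drop the leading underscore, `split_keeping_punctuation` is a hypothetical helper, and Python's `re` module stands in for the RE2-based regex ops. Note that `\s` in Python matches Unicode whitespace for `str` patterns, whereas RE2's `\s` covers ASCII whitespace only, so behavior on non-ASCII spaces will differ.

```python
import copy
import re

# Shortened, illustrative delimiter list: the whitespace entry introduced by
# this commit plus two ASCII punctuation ranges from the diff.
DELIM_REGEX = [
    r"\s+",
    r"|".join([
        r"[!-/]",
        r"[:-@]",
    ]),
]

# Pattern used to split text on whitespace and punctuation.
DELIM_REGEX_PATTERN = "|".join(DELIM_REGEX)

# Copy the list and drop the whitespace entry by value. The string passed to
# remove() must equal the first list entry exactly, which is why both lines
# change together in the diff.
KEEP_DELIM_NO_WHITESPACE = copy.deepcopy(DELIM_REGEX)
KEEP_DELIM_NO_WHITESPACE.remove(r"\s+")
KEEP_DELIM_NO_WHITESPACE_PATTERN = "|".join(KEEP_DELIM_NO_WHITESPACE)


def split_keeping_punctuation(text):
    """Hypothetical helper: split on delimiters, keep punctuation, drop whitespace."""
    tokens = []
    # A capturing group makes re.split return the delimiters as well.
    for piece in re.split("(" + DELIM_REGEX_PATTERN + ")", text):
        if not piece:
            continue
        is_delim = re.fullmatch(DELIM_REGEX_PATTERN, piece) is not None
        is_kept = re.fullmatch(KEEP_DELIM_NO_WHITESPACE_PATTERN, piece) is not None
        if is_delim and not is_kept:
            continue  # whitespace delimiter: discard
        tokens.append(piece)
    return tokens


print(split_keeping_punctuation("hello, world!"))
```

Because `remove()` matches by value, the entry in the list and the argument to `remove()` have to stay in sync; a mismatch raises `ValueError` at import time.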
