Random Deletion Layer #214
Changes from 25 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,153 @@ | ||
# Copyright 2022 The KerasNLP Authors | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# https://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
import tensorflow as tf | ||
from keras import backend | ||
from tensorflow import keras | ||
|
||
|
||
class RandomDeletion(keras.layers.Layer): | ||
"""Augments input by randomly deleting words. | ||
|
||
Args: | ||
rate: A float between 0 and 1 (both inclusive). The probability of each word being chosen for deletion. | ||
max_deletions: The maximum number of words to delete from each input row. | ||
seed: A seed for the random number generator. | ||
|
||
Examples: | ||
|
||
Word level usage | ||
>>> tf.random.get_global_generator().reset_from_seed(30) | ||
Review comment: after we make the minor randomness changes suggested below, can we do
>>> tf.random.set_seed(30) | ||
>>> inputs = tf.strings.split(["Hey I like", "Keras and Tensorflow"]) | ||
>>> augmenter = keras_nlp.layers.RandomDeletion(rate=0.4, | ||
... max_deletions=1, seed=42) | ||
>>> augmented = augmenter(inputs) | ||
>>> tf.strings.reduce_join(augmented, separator=" ", axis=-1) | ||
<tf.Tensor: shape=(2,), dtype=string, | ||
numpy=array([b'Hey I', b'and Tensorflow'], dtype=object)> | ||
|
||
Character level usage | ||
>>> tf.random.get_global_generator().reset_from_seed(30) | ||
>>> tf.random.set_seed(30) | ||
>>> inputs = tf.strings.unicode_split(["Hey Dude", "Speed Up"], "UTF-8") | ||
>>> augmenter = keras_nlp.layers.RandomDeletion(rate=0.4, | ||
Review comment: I think you can show these examples without max_deletions, just to keep things simple.
... max_deletions=1, seed=42) | ||
>>> augmented = augmenter(inputs) | ||
>>> tf.strings.reduce_join(augmented, axis=-1) | ||
<tf.Tensor: shape=(2,), dtype=string, | ||
numpy=array([b'Hey Dde', b'Sped Up'], dtype=object)> | ||
""" | ||
|
||
def __init__(self, rate, max_deletions, seed=None, name=None, **kwargs): | ||
# Check dtype and provide a default. | ||
if "dtype" not in kwargs or kwargs["dtype"] is None: | ||
kwargs["dtype"] = tf.int32 | ||
else: | ||
dtype = tf.dtypes.as_dtype(kwargs["dtype"]) | ||
if not dtype.is_integer and dtype != tf.string: | ||
raise ValueError( | ||
"Output dtype must be an integer type or a string. " | ||
f"Received: dtype={dtype}" | ||
) | ||
|
||
super().__init__(name=name, **kwargs) | ||
self.rate = rate | ||
self.max_deletions = max_deletions | ||
self.seed = seed | ||
self._random_generator = backend.RandomGenerator(seed) | ||
Review comment: just use this from the keras import,
Reply: There is some weird behaviour here, if I do this I get
Reply: RandomGenerator is not exposed as a public API: https://github.com/keras-team/keras/blob/v2.9.0/keras/backend.py#L1823 There is actually an API in TF: tf.random.Generator (https://www.tensorflow.org/api_docs/python/tf/random/Generator). Can we use that one?
|
||
if self.rate > 1 or self.rate < 0: | ||
raise ValueError( | ||
"Rate must be between 0 and 1 (both inclusive). " | ||
f"Received: rate={rate}" | ||
) | ||
|
||
def call(self, inputs): | ||
"""Augments input by randomly deleting words. | ||
Args: | ||
inputs: A tensor or nested tensor of strings to augment. | ||
Returns: | ||
A tensor or nested tensor of augmented strings. | ||
""" | ||
|
||
isString = False | ||
if isinstance(inputs, str): | ||
inputs = [inputs] | ||
isString = True | ||
|
||
scalar_input = inputs.shape.rank == 0 | ||
if scalar_input: | ||
inputs = tf.expand_dims(inputs, 0) | ||
|
||
positions_flat = tf.range(tf.size(inputs.flat_values)) | ||
positions = inputs.with_flat_values(positions_flat) | ||
|
||
# Figure out how many we are going to select. | ||
word_counts = tf.cast(inputs.row_lengths(), "float32") | ||
num_to_select = tf.random.stateless_binomial( | ||
shape=tf.shape(word_counts), | ||
seed=tf.random.get_global_generator().make_seeds()[:, 0], | ||
Review comment: can't we do
Reply: @mattdangerw I don't think we can. It returns None right now unless I set the rng_type, but that option is not available in the latest TF release; it was added after that release.
counts=word_counts, | ||
probs=self.rate, | ||
) | ||
num_to_select = tf.math.minimum(num_to_select, self.max_deletions) | ||
num_to_select = tf.cast(num_to_select, "int64") | ||
|
||
# Shuffle and trim to items that are going to be selected. | ||
def _shuffle_and_trim(x): | ||
positions, top_n = x | ||
shuffled = tf.random.shuffle( | ||
positions, seed=self._random_generator.make_legacy_seed() | ||
) | ||
return shuffled[:top_n] | ||
|
||
selected_for_mask = tf.map_fn( | ||
_shuffle_and_trim, | ||
(positions, num_to_select), | ||
fn_output_signature=tf.RaggedTensorSpec( | ||
ragged_rank=positions.ragged_rank - 1, dtype=positions.dtype | ||
), | ||
) | ||
selected_for_mask.flat_values.set_shape([None]) | ||
|
||
# Construct the mask, which is a boolean RaggedTensor. | ||
# Scatter 0's to positions that have been selected for deletion. | ||
update_values = tf.zeros_like(selected_for_mask.flat_values, "int32") | ||
update_indices = selected_for_mask.flat_values | ||
update_indices = tf.expand_dims(update_indices, -1) | ||
update_indices = tf.cast(update_indices, "int32") | ||
mask_flat = tf.ones_like(inputs.flat_values, dtype="int32") | ||
mask_flat = tf.tensor_scatter_nd_update( | ||
mask_flat, update_indices, update_values | ||
) | ||
mask = tf.cast(inputs.with_flat_values(mask_flat), "bool") | ||
|
||
inputs = tf.ragged.boolean_mask(inputs, mask) | ||
|
||
if scalar_input: | ||
inputs = tf.squeeze(inputs, 0) | ||
if isString: | ||
inputs = inputs[0] | ||
return inputs | ||
|
||
def get_config(self): | ||
config = super().get_config() | ||
config.update( | ||
{ | ||
"rate": self.rate, | ||
"max_deletions": self.max_deletions, | ||
"seed": self.seed, | ||
} | ||
) | ||
return config |
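The `call` method above follows four steps: draw a binomial count per row, cap it at `max_deletions`, shuffle the positions and keep the first few, then mask those positions out. As a rough illustration only, the same steps can be sketched in plain Python. The function name `random_delete` and the use of the standard `random` module are assumptions made for this sketch; it is not the layer's implementation and will not reproduce the layer's seeded outputs.

```python
import random


def random_delete(rows, rate, max_deletions, seed=None):
    """Delete up to `max_deletions` randomly chosen items from each row.

    Hypothetical plain-Python stand-in for the TensorFlow layer above.
    """
    if not 0 <= rate <= 1:
        raise ValueError(
            "Rate must be between 0 and 1 (both inclusive). "
            f"Received: rate={rate}"
        )
    rng = random.Random(seed)
    augmented = []
    for row in rows:
        # Binomial draw: each item counts toward deletion with probability `rate`.
        num_to_select = sum(rng.random() < rate for _ in row)
        # Cap the number of deletions for this row.
        num_to_select = min(num_to_select, max_deletions)
        # Shuffle the positions and keep the first `num_to_select` of them.
        positions = list(range(len(row)))
        rng.shuffle(positions)
        selected = set(positions[:num_to_select])
        # Mask out the selected positions, preserving the order of the rest.
        augmented.append(
            [item for i, item in enumerate(row) if i not in selected]
        )
    return augmented


rows = [s.split() for s in ["Hey I like", "Keras and Tensorflow"]]
augmented = random_delete(rows, rate=0.4, max_deletions=1, seed=42)
print([" ".join(row) for row in augmented])
```

Passing rows of characters instead of rows of words gives character-level deletion with no other change, which mirrors the two docstring examples.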
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
# Copyright 2022 The KerasNLP Authors | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# https://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
"""Tests for Random Deletion Layer.""" | ||
|
||
import tensorflow as tf | ||
|
||
from keras_nlp.layers import random_deletion | ||
|
||
|
||
class RandomDeletionTest(tf.test.TestCase): | ||
def test_shape_and_output_from_word_deletion(self): | ||
tf.random.get_global_generator().reset_from_seed(30) | ||
tf.random.set_seed(30) | ||
inputs = ["Hey I like", "Keras and Tensorflow"] | ||
split = tf.strings.split(inputs) | ||
augmenter = random_deletion.RandomDeletion( | ||
rate=0.4, max_deletions=1, seed=42 | ||
) | ||
augmented = augmenter(split) | ||
output = tf.strings.reduce_join(augmented, separator=" ", axis=-1) | ||
self.assertAllEqual(output.shape, tf.convert_to_tensor(inputs).shape) | ||
exp_output = [b"Hey I", b"and Tensorflow"] | ||
for i in range(output.shape[0]): | ||
self.assertAllEqual(output[i], exp_output[i]) | ||
|
||
def test_shape_and_output_from_character_deletion(self): | ||
tf.random.get_global_generator().reset_from_seed(30) | ||
tf.random.set_seed(30) | ||
inputs = ["Hey I like", "Keras and Tensorflow"] | ||
split = tf.strings.unicode_split(inputs, "UTF-8") | ||
augmenter = random_deletion.RandomDeletion( | ||
rate=0.4, max_deletions=1, seed=42 | ||
) | ||
augmented = augmenter(split) | ||
output = tf.strings.reduce_join(augmented, axis=-1) | ||
self.assertAllEqual(output.shape, tf.convert_to_tensor(inputs).shape) | ||
exp_output = [b"HeyI like", b"Keras ad Tensorflow"] | ||
for i in range(output.shape[0]): | ||
self.assertAllEqual(output[i], exp_output[i]) | ||
|
||
def test_get_config_and_from_config(self): | ||
|
||
augmenter = random_deletion.RandomDeletion( | ||
rate=0.4, max_deletions=1, seed=42 | ||
) | ||
|
||
expected_config_subset = {"seed": 42, "max_deletions": 1, "rate": 0.4} | ||
|
||
config = augmenter.get_config() | ||
|
||
self.assertEqual(config, {**config, **expected_config_subset}) | ||
|
||
restored_augmenter = random_deletion.RandomDeletion.from_config( | ||
config, | ||
) | ||
|
||
self.assertEqual( | ||
restored_augmenter.get_config(), | ||
{**config, **expected_config_subset}, | ||
) |
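The two output tests above pin exact strings to one seed, so they depend on the random backend behaving identically across versions. A seed-independent check could complement them. The sketch below is plain Python with hypothetical helper names (`is_subsequence`, `check_deletion_invariants`); it checks already-computed rows rather than calling the layer, so converting the layer's ragged output to Python lists is left as an assumption.

```python
def is_subsequence(shorter, longer):
    """Return True if `shorter` can be obtained by deleting items from `longer`."""
    remaining = iter(longer)
    # `item in remaining` advances the iterator, so relative order is respected.
    return all(item in remaining for item in shorter)


def check_deletion_invariants(original_rows, augmented_rows, max_deletions):
    """Assert properties that should hold for any seed."""
    assert len(original_rows) == len(augmented_rows)
    for original, augmented in zip(original_rows, augmented_rows):
        # No more than `max_deletions` items were removed from this row.
        assert 0 <= len(original) - len(augmented) <= max_deletions
        # Surviving items keep their relative order and nothing new appears.
        assert is_subsequence(augmented, original)


original = [["Hey", "I", "like"], ["Keras", "and", "Tensorflow"]]
# Expected outputs taken from the word-level test above.
augmented = [["Hey", "I"], ["and", "Tensorflow"]]
check_deletion_invariants(original, augmented, max_deletions=1)
```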
I think before we ship this we should decide whether we want to leave room for a separate character deletion layer, or whether we would want to handle that through attributes on this layer.
If a character deletion layer would be separate, we should probably call this "RandomWordDeletion" or something.
Agree here, we should consider the scalability.
My question is: how much change would be required to make a character-level deletion layer?
@chenmoneygithub I did discuss it with Matt here after his initial comment. We're now thinking of having them as 2 separate layers but would love to hear your thoughts on this!
I am fine with both! I raised the question because I want to check whether it is possible to have a BaseClass and only do small customization on WordDelete and CharacterDelete.
@chenmoneygithub That does seem like a more efficient design choice, but I'm not really sure about it. I'll get back to you if I find a good way to do that.
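One way the base-class idea from this thread might look, as a minimal plain-Python sketch: the shared class owns the validation and the deletion logic, and each subclass only says how to split and join. The class names follow the thread's `WordDelete` and `CharacterDelete`; `RandomDeletionBase`, `split`, and `join` are hypothetical names, and the standard `random` module stands in for TensorFlow's random ops.

```python
import random


class RandomDeletionBase:
    """Hypothetical shared base: subclasses define only `split` and `join`."""

    def __init__(self, rate, max_deletions, seed=None):
        if not 0 <= rate <= 1:
            raise ValueError(
                "Rate must be between 0 and 1 (both inclusive). "
                f"Received: rate={rate}"
            )
        self.rate = rate
        self.max_deletions = max_deletions
        self._rng = random.Random(seed)

    def split(self, text):
        raise NotImplementedError

    def join(self, units):
        raise NotImplementedError

    def __call__(self, text):
        units = self.split(text)
        # Binomial draw over the units, capped at `max_deletions`.
        num_to_delete = sum(self._rng.random() < self.rate for _ in units)
        num_to_delete = min(num_to_delete, self.max_deletions)
        # Pick distinct positions to drop, then keep the rest in order.
        to_delete = set(self._rng.sample(range(len(units)), num_to_delete))
        return self.join(
            [unit for i, unit in enumerate(units) if i not in to_delete]
        )


class WordDelete(RandomDeletionBase):
    def split(self, text):
        return text.split()

    def join(self, units):
        return " ".join(units)


class CharacterDelete(RandomDeletionBase):
    def split(self, text):
        return list(text)

    def join(self, units):
        return "".join(units)


print(WordDelete(rate=0.4, max_deletions=1, seed=42)("Keras and Tensorflow"))
print(CharacterDelete(rate=0.4, max_deletions=1, seed=42)("Speed Up"))
```

In this shape the per-layer customization is two short methods, which is the "small customization" the thread asks about. Whether the same split holds for ragged tensor inputs is not settled here.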