Random Deletion Layer #214

Merged
merged 41 commits into from
Jul 27, 2022
Changes from 9 commits
bf3219f
Random Deletion Working
aflah02 Apr 29, 2022
66c0de7
Added to init
aflah02 Apr 29, 2022
d2508dc
WOrking
aflah02 May 3, 2022
9e2e243
Merge branch 'keras-team:master' into RandomDeletionLayer
aflah02 May 3, 2022
b48dc5a
Working
aflah02 May 23, 2022
02ce27d
Current Status
aflah02 May 26, 2022
f137256
Working Layer More Tests to be Added
aflah02 May 31, 2022
c7dcb8a
Fixed Scalar Case
aflah02 May 31, 2022
515a1d3
Added Comments
aflah02 May 31, 2022
6acfb62
Minor Fixes
aflah02 Jun 3, 2022
68fcae0
Major Refactors and Fixes, ToDo - Docs, Tests
aflah02 Jun 8, 2022
058d572
Fixed Shape Issues for Scalar Lists
aflah02 Jun 8, 2022
7a57339
Finalized Tests and DocString
aflah02 Jun 9, 2022
9cf9a2a
Ran Stylers Added More Descriptive DocString
aflah02 Jun 9, 2022
ce10e2d
Fixed Failing Docstring Tests
aflah02 Jun 9, 2022
a1ab88e
Removed Map Call and Unsupported Test
aflah02 Jun 10, 2022
17e2365
Shape Fixes
aflah02 Jun 10, 2022
b5cfe45
Working
aflah02 Jun 29, 2022
839a770
Working
aflah02 Jun 29, 2022
ec2d4ed
Changing Parent Class
aflah02 Jun 30, 2022
dbdd690
Changes
aflah02 Jul 1, 2022
fd856ad
Formatter Ran
aflah02 Jul 1, 2022
292353b
Merge branch 'keras-team:master' into RandomDeletionLayer
aflah02 Jul 5, 2022
9c447dc
Finalized
aflah02 Jul 5, 2022
35cf31a
Merge branch 'RandomDeletionLayer' of https://github.com/aflah02/kera…
aflah02 Jul 5, 2022
b26dd24
Addresed Review Comments
aflah02 Jul 6, 2022
983bfb3
Fornatter
aflah02 Jul 6, 2022
4813f1f
Added new Tests
aflah02 Jul 6, 2022
86bafbe
Fan Formatter
aflah02 Jul 6, 2022
46eda29
Skip Works
aflah02 Jul 12, 2022
906614a
New Randomness
aflah02 Jul 12, 2022
20df2b2
All Testing Done
aflah02 Jul 20, 2022
0c8fdc5
Review Changes
aflah02 Jul 20, 2022
3d87d14
Addressed all Review Comments
aflah02 Jul 20, 2022
5ee888a
Merge branch 'keras-team:master' into RandomDeletionLayer
aflah02 Jul 20, 2022
a77880e
Copy edits for docstrings
mattdangerw Jul 22, 2022
9c1904c
Finishes
aflah02 Jul 25, 2022
b87394d
Merge branch 'keras-team:master' into RandomDeletionLayer
aflah02 Jul 25, 2022
81ba7bf
Changed Tokenizer Import
aflah02 Jul 25, 2022
0da28dc
Addressed Reviews
aflah02 Jul 26, 2022
d91a1bb
Fix typo
mattdangerw Jul 27, 2022
1 change: 1 addition & 0 deletions keras_nlp/layers/__init__.py
@@ -16,6 +16,7 @@
from keras_nlp.layers.mlm_head import MLMHead
from keras_nlp.layers.mlm_mask_generator import MLMMaskGenerator
from keras_nlp.layers.position_embedding import PositionEmbedding
from keras_nlp.layers.random_deletion import RandomDeletion
from keras_nlp.layers.sine_position_encoding import SinePositionEncoding
from keras_nlp.layers.token_and_position_embedding import (
TokenAndPositionEmbedding,
137 changes: 137 additions & 0 deletions keras_nlp/layers/random_deletion.py
@@ -0,0 +1,137 @@
# Copyright 2022 The KerasNLP Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
from typing import Any
from typing import Dict

import tensorflow as tf
from tensorflow import keras


class RandomDeletion(keras.layers.Layer):
Member

I think before we ship this we should decide if we want to leave room for a separate character deletion layer, or if we would want to do that as attributes on this layer.

If a character deletion layer would be separate, we should probably call this "RandomWordDeletion" or something.

Contributor

Agreed, we should consider the scalability.

My question: how much change would be required to make a character-level deletion layer?

Collaborator Author

@chenmoneygithub I did discuss it with Matt here after his initial comment. We're now thinking of having them as 2 separate layers but would love to hear your thoughts on this!

Contributor

I am fine with both! I raised the question because I want to check whether it is possible to have a BaseClass and only do small customizations in WordDelete and CharacterDelete.
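
A rough sketch of what such a base class could look like, in plain Python rather than TensorFlow ops. The names `RandomDeletionBase`, `split`, `join`, and the `seed` argument are hypothetical and not part of this PR; `WordDelete` and `CharacterDelete` are the placeholder names from the comment above. The subclasses only decide how text is split into units and joined back together.

```python
import random


class RandomDeletionBase:
    """Hypothetical base class: subclasses only define split() and join()."""

    def __init__(self, probability, max_deletions, seed=None):
        self.probability = probability
        self.max_deletions = max_deletions
        self._rng = random.Random(seed)

    def split(self, text):
        raise NotImplementedError

    def join(self, units):
        raise NotImplementedError

    def __call__(self, text):
        units = self.split(text)
        # Mark each unit for deletion independently with the given probability.
        candidates = [
            i for i in range(len(units)) if self._rng.random() < self.probability
        ]
        # Enforce the cap by sampling max_deletions of the marked units.
        if len(candidates) > self.max_deletions:
            candidates = self._rng.sample(candidates, self.max_deletions)
        drop = set(candidates)
        return self.join([u for i, u in enumerate(units) if i not in drop])


class WordDelete(RandomDeletionBase):
    def split(self, text):
        return text.split()

    def join(self, units):
        return " ".join(units)


class CharacterDelete(RandomDeletionBase):
    def split(self, text):
        return list(text)

    def join(self, units):
        return "".join(units)
```

With `probability=1` and `max_deletions=1`, `WordDelete` removes exactly one word, which matches the docstring example in this diff. A real layer would still need graph-compatible ops in place of the Python lists used here.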

Collaborator Author

@chenmoneygithub That does seem like a more efficient design choice, but I'm not really sure about it. I'll get back to you if I find a good way to do that.

"""Augments input by randomly deleting words
Member

Period at end of sentence.

Also, we should probably have a little more of a description in a separate paragraph that describes the flow of computation, e.g. split words, delete words, re-form words.
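
To make that three-step flow concrete, here is a minimal plain-Python sketch. The helper name `delete_words` and the explicit `keep_mask` argument are illustrative assumptions. The mask follows the same convention as the diff, where `True` means the word is kept; in the layer itself the mask comes from `tf.random.uniform(...) > self.probability`.

```python
def delete_words(text, keep_mask):
    # 1. Split the string into words on whitespace.
    words = text.split()
    # 2. Delete every word whose mask entry is False.
    kept = [word for word, keep in zip(words, keep_mask) if keep]
    # 3. Re-form the string by joining the surviving words with spaces.
    return " ".join(kept)
```

The layer in this diff performs the same steps with `tf.strings.split`, `tf.ragged.boolean_mask`, and `tf.strings.reduce_join`.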


Args:
probability: probability of a word being chosen for deletion
Contributor

Let's be consistent about capitalization after ":". We also need to note the type in the arg comment.

max_deletions: The maximum number of words to replace

Examples:

Basic usage.
>>> augmenter = keras_nlp.layers.RandomDeletion(
... probability = 1,
... max_deletions = 1,
... )
>>> augmenter(["dog dog dog dog dog"])
<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'dog dog dog dog'],
dtype=object)>
"""

def __init__(self, probability, max_deletions, **kwargs) -> None:
# Check dtype and provide a default.
if "dtype" not in kwargs or kwargs["dtype"] is None:
kwargs["dtype"] = tf.int32
else:
dtype = tf.dtypes.as_dtype(kwargs["dtype"])
if not dtype.is_integer and dtype != tf.string:
raise ValueError(
"Output dtype must be an integer type of a string. "
Contributor

of => or

f"Received: dtype={dtype}"
)

super().__init__(**kwargs)
self.probability = probability
self.max_deletions = max_deletions

def call(self, inputs):
"""Augments input by randomly deleting words
Contributor

period at end.


Args:
inputs: A tensor or nested tensor of strings to augment.

Returns:
A tensor or nested tensor of augmented strings.
"""
# If input is a simple String convert it into a list
isString = False
if isinstance(inputs, str):
inputs = [inputs]
isString = True
# If input is not a tensor convert it into a tensor
if not isinstance(inputs, (tf.Tensor, tf.RaggedTensor)):
inputs = tf.convert_to_tensor(inputs)
inputs = tf.cast(inputs, tf.string)

def _map_fn(inputs):
scalar_input = inputs.shape.rank == 0
if scalar_input:
inputs = tf.expand_dims(inputs, 0)
ragged_words = tf.strings.split(inputs)
# Get the row splits for the ragged tensor
row_splits = ragged_words.row_splits.numpy()
mask = (
tf.random.uniform(ragged_words.flat_values.shape)
> self.probability
Contributor

The format looks a bit strange. Is that formatted by black?

Collaborator Author

@chenmoneygithub Yup this was generated by black

)
# Iterate to check for any cases where deletions exceed the maximum
for i in range(len(row_splits) - 1):
Member

I think we can clean this whole for loop up, though we will need to think a bit more about exactly how.

One place for inspiration is probably the op code for tf text's RandomItemSelector, which also selects a number of items based on a probability with a max cap.

I'm a little skeptical that this would function trace: you are doing a lot of looping and calling .numpy(), and the latter definitely won't work in a compiled context.

Contributor

Reading about RandomItemSelector, I think briefly what it does is to:

  1. Calculate how many to select, let's call it N.
  2. shuffle the list/array/tensor's index array.
  3. Pick the first N elements from index array, then use tf.gather to get the actual selected elements.
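
The three steps above can be sketched in plain Python as follows. This is only an illustration of the selection logic, not the RandomItemSelector implementation and not graph-compatible code. The function names are hypothetical, and how N is computed here (one random draw per item, capped at `max_deletions`) is an assumption. Since this layer deletes the selected items, the final step keeps the complement instead of gathering the selection.

```python
import random


def select_indices_to_delete(num_items, probability, max_deletions, rng=None):
    rng = rng or random.Random()
    # 1. Decide how many items to select (N): one draw per item, then
    #    cap the count at max_deletions. (Assumed way of computing N.)
    num_hits = sum(rng.random() < probability for _ in range(num_items))
    n = min(num_hits, max_deletions)
    # 2. Shuffle the index array.
    indices = list(range(num_items))
    rng.shuffle(indices)
    # 3. Take the first N shuffled indices as the selection.
    return sorted(indices[:n])


def gather_remaining(items, deleted):
    # Keep every item whose index was not selected for deletion.
    deleted = set(deleted)
    return [item for i, item in enumerate(items) if i not in deleted]
```

Because the cap is applied to the count before any indices are chosen, no second pass is needed to undo excess deletions, which is what the nested loops in this diff currently do.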

Collaborator Author

Thanks for sharing this @mattdangerw, and for the concise summary @chenmoneygithub. This seems like a much smoother way to do it; I will incorporate this idea in my code. Just to clarify, do I need to cite this code file in the code as "inspired by" or something of that sort?

mask_range = mask[row_splits[i] : row_splits[i + 1]]
mask_range_list = tf.unstack(mask_range)
# Get the number of deletions
FalseCount = tf.reduce_sum(
tf.cast(tf.equal(mask_range, False), tf.int32)
)
if FalseCount > self.max_deletions:
y, idx, _ = tf.unique_with_counts(mask_range)
false_ind = 0 if not y[0] else 1
False_idxs = []
for j in range(len(idx)):
if idx[j] == false_ind:
False_idxs.append(j)
# While deletions exceed the maximum, randomly convert some
# to true
while len(False_idxs) > self.max_deletions:
rand_idx = random.randrange(len(False_idxs))
mask_range_list[False_idxs[rand_idx]] = True
False_idxs.pop(rand_idx)
mask_list = tf.unstack(mask)
mask_list[
row_splits[i] : row_splits[i + 1]
] = mask_range_list
mask = tf.stack(mask_list)
mask = ragged_words.with_flat_values(mask)
deleted = tf.ragged.boolean_mask(ragged_words, mask)
deleted = tf.strings.reduce_join(deleted, axis=-1, separator=" ")
if scalar_input:
deleted = tf.squeeze(deleted, 0)
return deleted

if isinstance(inputs, tf.Tensor):
inputs = tf.map_fn(
_map_fn,
inputs,
)
if isString:
inputs = inputs[0]
return inputs

def get_config(self) -> Dict[str, Any]:
Member

I think we can leave the return annotation off. Let's stick to optional annotations for simple types only.

config = super().get_config()
config.update(
{
"probability": self.probability,
"max_deletions": self.max_deletions,
}
)
return config
62 changes: 62 additions & 0 deletions keras_nlp/layers/random_deletion_test.py
@@ -0,0 +1,62 @@
# Copyright 2022 The KerasNLP Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for Transformer Decoder."""

import tensorflow as tf

from keras_nlp.layers import random_deletion


class RandomDeletionTest(tf.test.TestCase):
    def test_shape_with_scalar(self):
        augmenter = random_deletion.RandomDeletion(
            probability=0.5, max_deletions=3
        )
        input = ["Running Around"]
        output = augmenter(input)
        self.assertAllEqual(output.shape, tf.convert_to_tensor(input).shape)

    def test_shape_with_nested(self):
        augmenter = random_deletion.RandomDeletion(
            probability=0.5, max_deletions=3
        )
        input = [
            ["dog dog dog dog dog", "I Like CATS"],
            ["I Like to read comics", "Shinobis and Samurais"],
        ]
        output = augmenter(input)
        self.assertAllEqual(output.shape, tf.convert_to_tensor(input).shape)

    def test_get_config_and_from_config(self):
        augmenter = random_deletion.RandomDeletion(
            probability=0.5, max_deletions=3
        )

        config = augmenter.get_config()

        expected_config_subset = {
            "probability": 0.5,
            "max_deletions": 3,
        }

        self.assertEqual(config, {**config, **expected_config_subset})

        restored_augmenter = random_deletion.RandomDeletion.from_config(
            config,
        )

        self.assertEqual(
            restored_augmenter.get_config(),
            {**config, **expected_config_subset},
        )