added top k search util #232

Merged: 6 commits into keras-team:master on Jun 24, 2022

Conversation

jessechancy (Contributor):

Follow up from random search

@mattdangerw (Member) left a comment:

Looks good! Just a few comments

token_probability_fn,
prompt,
max_length,
k=10,
Member:

10 is fine, but how did we come up with this number?

Contributor:

I think this number should be related to the vocab size, so maybe let's make it a required arg?

Contributor Author:

Do you mean vocab_size should be another argument, or that k should be a required arg?

Member:

Talked on chat. Let's make k required (no default). No need to take in vocab_size, that can continue to be inferred.
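For reference, a minimal sketch of what the agreed-upon signature might look like (argument names are taken from the snippets in this review; the exact ordering and defaults in the merged code may differ):

def top_k_search(
    token_probability_fn,
    prompt,
    max_length,
    k,
    seed=None,
    end_token_id=None,
    pad_token_id=0,
):
    ...

With no default, callers have to pick k explicitly, while the vocabulary size continues to be inferred from the last axis of the predictions returned by token_probability_fn.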

pad_token_id=0,
):
"""
Text generation utility based on top k sampling.
Member:

"top-k", here and elsewhere in the docstring.

Contributor Author:

edited

input_is_1d = prompt.shape.rank == 1
if input_is_1d:
    prompt = prompt[tf.newaxis, :]
i = prompt.shape[1]
Member:

I think we are doing both shape[1] and shape[-1] to read the last axis? This reads as confusing; let's choose one and be consistent. Unless there is a case where these are different?

Contributor Author:

Edited to use shape[1] in both places. The output should strictly be [batch_size, vocab_size] for the pred and [batch_size, length] for the prompt.
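A small illustration of the shape convention described above, with a hypothetical toy prompt just to make the axes concrete:

import tensorflow as tf

# prompt is [batch_size, length] and pred is [batch_size, vocab_size], so for
# these rank-2 tensors shape[1] and shape[-1] refer to the same axis.
prompt = tf.constant([[1, 5, 9], [2, 4, 6]])  # shape (2, 3)
assert prompt.shape[1] == prompt.shape[-1] == 3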


Args:
    token_probability_fn: a callable, which takes in input_sequence
        and outputs the probability distribution of the next token.
Member:

Here and elsewhere in this file, we probably should mention that this function should return the unnormalized logits and not softmax probabilities, right?

Contributor:

This is actually one confusing part where we should make a decision: currently there is no such enforcement on the return type, shall we add it?

Contributor Author:

The return type should be probabilities here for this to work; if not, I would need to add a softmax over it.

# If k is greater than the vocabulary size, use the entire vocabulary.
k = min(k, pred.shape[-1])
# Filter out top k tokens.
sorted_pred, sorted_indices = tf.math.top_k(pred, k=k, sorted=True)
Member:

Why do we need sorted here? tf.random.categorical doesn't need a sort order. We just need to make sure we gather the correct indices from the top_k call, which you are already doing.

Contributor Author:

Yep doesn't need sorted, edited
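A rough, self-contained sketch of the unsorted version (toy values; in the utility itself pred comes from token_probability_fn and k from the caller):

import tensorflow as tf

# Hypothetical probabilities for a batch of 2 over a vocabulary of 5 tokens.
pred = tf.constant([[0.1, 0.4, 0.2, 0.2, 0.1],
                    [0.5, 0.1, 0.1, 0.2, 0.1]])
k = 2
# Keep only the top-k probabilities and their vocabulary indices; no sort needed.
top_k_pred, top_k_indices = tf.math.top_k(pred, k=k, sorted=False)
# tf.random.categorical samples from unnormalized log-probabilities.
sampled = tf.random.categorical(tf.math.log(top_k_pred), num_samples=1)
# Map the sampled positions back to the original vocabulary ids.
next_token = tf.gather(top_k_indices, sampled, batch_dims=1)  # shape (2, 1)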

@chenmoneygithub (Contributor) left a comment:

Mainly looks good!


outputs = top_k_search(
    token_probability_fn, inputs, k=2, max_length=max_length, seed=42
)
# Random sampling result with seed 42
Contributor:

top-k search result

Contributor Author:

Edited

rtol=0.2,
)

def test_assert_top_k_generation_is_correct(self):
Contributor:

The test is to assert only top-k tokens can appear, but the name does not suggest so. Let's rename to something like test_only_choose_from_top_k_tokens

Contributor Author:

edited
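For illustration, a hedged sketch of what the renamed test could look like (the toy distribution and exact assertion are assumptions, not the merged test): with all probability mass on a known set of two tokens and k=2, every sampled token must come from that set.

def test_only_choose_from_top_k_tokens(self):
    # Toy distribution: only tokens 0 and 1 carry probability mass.
    def token_probability_fn(inputs):
        batch_size = tf.shape(inputs)[0]
        prob = tf.constant([[0.6, 0.4, 0.0, 0.0]])
        return tf.repeat(prob, batch_size, axis=0)

    prompt = tf.constant([[0], [1]])
    outputs = top_k_search(
        token_probability_fn, prompt, k=2, max_length=8
    )
    # Every generated token must be in the top-2 set {0, 1}.
    self.assertTrue(tf.reduce_all(outputs < 2))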

@mattdangerw (Member) left a comment:

Couple comments!


i = prompt.shape[1]
while i < max_length:
    # If the prompt has reached our desired length, exit while loop.
    pred = token_fn(prompt)  # changed from: pred = token_probability_fn(prompt)
Member:

Do we need to wrap in another function? This whole change is a little hard to read. I would find it simpler to just add a block here.

pred = token_probability_fn(prompt)
if from_logits:
    pred = tf.keras.activations.softmax(pred)

Contributor Author:

makes sense, edited

def token_probability_fn(inputs):
    return model(inputs)[:, -1, :]

prompt = tf.random.uniform(shape=[5, 5], maxval=VOCAB_SIZE, dtype=tf.int64)
Member:

This is kind of a weird prompt to show. Who would want to generate sequences after 5 totally random tokens?

Maybe we should do something like

BATCH_SIZE = 8
VOCAB_SIZE = 10
FEATURE_SIZE = 16
START_ID=1
END_ID=2

...

prompt = tf.fill((BATCH_SIZE, 1), START_ID)

keras_nlp.utils.top_k_search(
        token_probability_fn,
        prompt,
        k=10,
        max_length=10,
        end_token_id=END_ID)

We may want to update other examples if they have the same problem.

Member:

Also, to be clear, the ellipsis is me being lazy, not suggesting we put that in the docstring :)

Contributor Author:

edited docstring
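Filled out, the updated docstring example might look roughly like this (the toy model definition is an assumption wrapped around the snippet suggested above):

import tensorflow as tf
import keras_nlp

BATCH_SIZE = 8
VOCAB_SIZE = 10
FEATURE_SIZE = 16
START_ID = 1
END_ID = 2

# Assumed toy model mapping token ids to a next-token probability distribution.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, FEATURE_SIZE),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])

def token_probability_fn(inputs):
    return model(inputs)[:, -1, :]

prompt = tf.fill((BATCH_SIZE, 1), START_ID)

keras_nlp.utils.top_k_search(
    token_probability_fn,
    prompt,
    k=10,
    max_length=10,
    end_token_id=END_ID,
)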

@mattdangerw (Member) left a comment:

Thanks! Last few nits!


# Print the generated sequence (token ids).
keras_nlp.utils.greedy_search(
    token_probability_fn,
    prompt,
    max_length=10,
    end_token_id=0,)     # removed line
    end_token_id=END_ID  # added line
Member:

nit: trailing comma

token_probability_fn,
prompt,
max_length=10,
end_token_id=0,)     # removed line
end_token_id=END_ID  # added line
Member:

trailing comma

prompt,
max_length=10,
k=4,
end_token_id=END_ID
Member:

trailing comma

prompt: a list or a Tensor, can be 1D or 2D, the initial tokens to
    append generated tokens.
max_length: int. The max length of generated text.
from_logits: bool. Indicates whether `token_probability_fn` outputs
Member:

You document this but didn't actually add it? Should we just leave it off, since for greedy search it is a no-op?

Contributor Author:

Yeah, I added it at first and realised greedy search isn't affected; will remove it from the docstring.
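A quick illustration of why from_logits is a no-op for greedy search: softmax is monotonic, so the argmax over the logits and over the corresponding probabilities selects the same token.

import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0]])
probs = tf.keras.activations.softmax(logits)
# Both reduce to token 0 here; normalization never changes the argmax.
tf.debugging.assert_equal(tf.argmax(logits, axis=-1), tf.argmax(probs, axis=-1))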

@chenmoneygithub merged commit 31674a1 into keras-team:master on Jun 24, 2022