
Add BLEU Score #222

Merged
merged 18 commits on Jul 11, 2022
Changes from 1 commit
Add references
abheesht17 committed Jun 29, 2022
commit 5ddcfa74106e207301b2abaf092904f505063b5e
40 changes: 22 additions & 18 deletions keras_nlp/metrics/bleu.py
@@ -52,39 +52,43 @@ class Bleu(keras.metrics.Metric):

This class implements the BLEU metric. BLEU is generally used to evaluate
Member:

We should probably mention more prominently that this will replicate sacrebleu by default, but can be used with other tokenizers e.g. for other languages.

machine translation systems. Succinctly put, in BLEU score, we count the
- number of matching n-grams in the candidate translation to n-grams in the
- reference text. We find the "clipped count" of matching n-grams so as to not
- give a high score to a reference, prediction pair with repeated tokens.
- Secondly, BLEU score tends to reward shorter predictions more, which is why
- a brevity penalty is applied to penalise short predictions.
+ number of matching n-grams in the candidate translation and the reference
+ text. We find the "clipped count" of matching n-grams so as to not
+ give a high score to a (reference, prediction) pair with redundant, repeated
+ tokens. Secondly, BLEU score tends to reward shorter predictions more, which
+ is why a brevity penalty is applied to penalise short predictions.
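
To make the clipped-count and brevity-penalty mechanics described above concrete, here is a minimal pure-Python sketch for a single (references, prediction) pair. It is illustrative only, not the tensor-based implementation in this PR; the helper names (`_ngram_counts`, `simple_bleu`) and the exact add-one smoothing detail are assumptions.

```python
import collections
import math


def _ngram_counts(tokens, n):
    """Multiset of n-grams of order `n` in a list of tokens."""
    return collections.Counter(
        tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)
    )


def simple_bleu(references, prediction, max_order=4, smooth=False):
    """BLEU for a single tokenized prediction against one or more references."""
    if not prediction:
        return 0.0

    precisions = []
    for n in range(1, max_order + 1):
        pred_counts = _ngram_counts(prediction, n)
        # Clipped count: credit each predicted n-gram at most as many times
        # as it appears in any single reference.
        max_ref_counts = collections.Counter()
        for ref in references:
            max_ref_counts |= _ngram_counts(ref, n)
        clipped = sum(
            min(count, max_ref_counts[gram]) for gram, count in pred_counts.items()
        )
        total = sum(pred_counts.values())
        if smooth:
            # Add-one smoothing in the spirit of Lin & Och (2004).
            precisions.append((clipped + 1.0) / (total + 1.0))
        else:
            precisions.append(clipped / total if total > 0 else 0.0)

    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)

    # Brevity penalty: penalise predictions shorter than the closest reference.
    closest_ref_len = min(
        (len(ref) for ref in references),
        key=lambda length: abs(length - len(prediction)),
    )
    if len(prediction) >= closest_ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - closest_ref_len / len(prediction))
    return geo_mean * brevity_penalty
```

For corpus-level BLEU, the clipped and total n-gram counts (and the prediction and reference lengths) would be summed over all samples before the division, which is the micro-averaging point discussed for the `variant` argument further down.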

Note on input shapes:
For `y_true` and `y_pred`, this class supports scalar values and batch
inputs of shapes `()`, `(batch_size,)` and `(batch_size, 1)`.

Args:
- tokenizer: callable. A function that takes a string `tf.Tensor` (of
- any shape), and tokenizes the strings in the tensor. This function
- should use TensorFlow graph ops. If the tokenizer is not specified,
- the default tokenizer (`"tokenizer_13a"` present in the SacreBLEU
- package) will be used.
+ tokenizer: callable. A function that takes a string `tf.RaggedTensor`
Member:
What happens if you pass a tokenizer layer here, will that work? Say byte tokenizer for simplicity.

Collaborator Author:
Hmmm, it won't work with byte tokeniser because we use tensor_to_string_list in the code. Do you want me to change that?

Member:
I think we should either support our tokenizers or rename this argument to something else.

Tokenizer means something specific in our library now; if we use that name but don't support our tokenizer class, that is a bad look.

+ (of any shape), and tokenizes the strings in the tensor. This
+ function should use TensorFlow graph ops. If the tokenizer is not
Contributor:
Is it necessary? If people are not interested in using model.evaluate(), can they just run it in pure eager mode?

Collaborator Author:
True, but we call the tokeniser after converting the inputs to tensors. So, we have to use TF ops here such as tf.strings.regex_replace().

+ specified, the default tokenizer is used. The default tokenizer
+ replicates the behaviour of SacreBLEU's `"tokenizer_13a"` tokenizer
+ (https://github.com/mjpost/sacrebleu/blob/v2.1.0/sacrebleu/tokenizers/tokenizer_13a.py).
max_order: int. The maximum n-gram order to use. For example, if
`max_order` is set to 3, unigrams, bigrams, and trigrams will be
considered. Defaults to 4.
smooth: bool. Whether to apply Lin et al. 2004 smoothing to the BLEU
Member:
Can we describe this better? Lin et al. 2004 with a period in the middle of the docstring does not read very well. Also please add to reference section.

score. Defaults to False.
variant: string. Either `"corpus_bleu"` or `"sentence_bleu"`. The former
Member:
It seems like corpus bleu is the better option here? I see that sacrebleu exposes methods for both of these, but does not seem to document the sentence one. Huggingface looks like it might not even have an option for this (is that true?).

I guess I'm wondering if it might make sense to not even expose this, and wait till someone asks for the sentence option.

Collaborator Author:
After doing a survey, this is what I found:

Conclusion: I think the expectation is that, if users want to compute the Sentence BLEU score, they can do so by passing one sample at a time and averaging over the returned scores.

Some additional notes
However, another point to note is that HF provides two options with all its metrics:

  • .compute() - user can pass one sample at a time, get the BLEU scores, and average over them for computing Sentence BLEU.
  • .add_batch() - will compute the Corpus BLEU score across all samples across batches.

We use Keras metrics similarly to the add_batch() function. So, if the user wants to compute the Sentence BLEU score, he/she/they will have to re-initialise the metric for every sample. PyTorch Ignite metrics also work similarly to the add_batch function, which is why they have provided an option for macro/micro-averaging. So, I am just wondering whether HF and NLTK do not provide explicit options to macro-average the BLEU scores because the user can average the BLEU scores themselves. But with Ignite, the user can't do that without re-initialising before every sample, which is why an option has been provided.

- computes the micro-average precision, which is equivalent to
- passing all samples (across batches) all at once. In other words,
- summing the numerators and denominators for each
- hypothesis-reference(s) pairs before the division (in order to
- calculate the precision). The latter is the macro-average BLEU score
- , which means that it computes the per sample BLEU score and
- averages it. Defaults to `"corpus_bleu"`.
+ computes micro-average precision, which is equivalent to passing all
+ samples (across batches) all at once. In other words, summing the
+ numerators and denominators for each hypothesis-reference(s) pairs
+ before the division (in order to calculate precision). The latter is
+ the macro-averaged BLEU score which means that it computes the BLEU
+ score for every sample separately and averages over these scores.
+ Defaults to `"corpus_bleu"`.
dtype: string or tf.dtypes.Dtype. Precision of metric computation. If
not specified, it defaults to tf.float32.
name: string. Name of the metric instance.
**kwargs: Other keyword arguments.

References:
- [Papineni et al., 2002](https://aclanthology.org/P02-1040/)
"""

def __init__(
@@ -302,7 +306,7 @@ def aggregate_sentence_bleu(
smooth=False,
):
"""Computes the per-sample BLEU score and returns the aggregate of
- all samples. Uses Python ops.
+ BLEU scores over all samples. Uses Python ops.

Args:
reference_corpus: list of lists of references for each
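
On the `variant` discussion above: micro-averaging (corpus BLEU) sums the clipped and total n-gram counts over all samples before dividing, whereas macro-averaging (sentence BLEU) computes one BLEU score per sample and averages them. For example, per-sample precisions of 2/4 and 6/6 micro-average to 8/10 = 0.8 but macro-average to (0.5 + 1.0) / 2 = 0.75. Below is a hedged sketch of the per-sample re-initialisation workaround mentioned in the thread; it assumes the `Bleu` class from this PR is exported as `keras_nlp.metrics.Bleu` and that `update_state()`/`result()` behave like standard Keras metrics.

```python
import tensorflow as tf

import keras_nlp  # assumes the Bleu metric from this PR is available here


def macro_average_bleu(references, predictions, **bleu_kwargs):
    """Average per-sample BLEU scores by re-initialising the metric per sample."""
    scores = []
    for ref, pred in zip(references, predictions):
        metric = keras_nlp.metrics.Bleu(**bleu_kwargs)  # fresh state per sample
        metric.update_state(tf.constant([ref]), tf.constant([pred]))
        scores.append(float(metric.result()))
    return sum(scores) / len(scores)


# Example (hypothetical):
# refs = ["the cat sat on the mat", "he read the book"]
# preds = ["the cat sat on mat", "he read a book"]
# print(macro_average_bleu(refs, preds, max_order=2))
```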
3 changes: 3 additions & 0 deletions keras_nlp/metrics/rouge_base.py
@@ -48,6 +48,9 @@ class RougeBase(keras.metrics.Metric):
not specified, it defaults to tf.float32.
name: string. Name of the metric instance.
**kwargs: Other keyword arguments.

References:
- [Lin et al., 2004](https://aclanthology.org/W04-1013/)
"""

def __init__(
3 changes: 3 additions & 0 deletions keras_nlp/metrics/rouge_l.py
@@ -38,6 +38,9 @@ class RougeL(RougeBase):
name: string. Name of the metric instance.
**kwargs: Other keyword arguments.

References:
- [Lin et al., 2004](https://aclanthology.org/W04-1013/)

Examples:

1. Various Input Types.
3 changes: 3 additions & 0 deletions keras_nlp/metrics/rouge_n.py
@@ -40,6 +40,9 @@ class RougeN(RougeBase):
name: string. Name of the metric instance.
**kwargs: Other keyword arguments.

References:
- [Lin et al., 2004](https://aclanthology.org/W04-1013/)

Examples:

1. Various Input Types.