Add BLEU Score #222
@@ -52,39 +52,43 @@ class Bleu(keras.metrics.Metric):
This class implements the BLEU metric. BLEU is generally used to evaluate
machine translation systems. Succinctly put, in BLEU score, we count the
number of matching n-grams in the candidate translation and the reference
text. We find the "clipped count" of matching n-grams so as to not give a
high score to a (reference, prediction) pair with redundant, repeated
tokens. Secondly, n-gram precision on its own tends to reward shorter
predictions, which is why a brevity penalty is applied to penalise short
predictions.
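For illustration, a rough plain-Python sketch of the two ideas above, clipped n-gram counts and the brevity penalty. This is not the implementation in this PR: it uses the shortest reference length for the brevity penalty and applies no smoothing.

```python
import collections
import math


def ngram_counts(tokens, n):
    # Count every n-gram of order n in a token list.
    return collections.Counter(
        tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)
    )


def toy_bleu(references, prediction, max_order=4):
    # references: list of token lists; prediction: a token list.
    log_precisions = []
    for n in range(1, max_order + 1):
        pred_counts = ngram_counts(prediction, n)
        # "Clipped count": each predicted n-gram is credited at most as many
        # times as it appears in any single reference.
        max_ref_counts = collections.Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(
            min(count, max_ref_counts[gram])
            for gram, count in pred_counts.items()
        )
        total = max(sum(pred_counts.values()), 1)
        log_precisions.append(
            math.log(clipped / total) if clipped else float("-inf")
        )
    # Brevity penalty: penalise predictions shorter than the reference.
    ref_len = min(len(ref) for ref in references)
    pred_len = len(prediction)
    brevity_penalty = (
        1.0 if pred_len >= ref_len else math.exp(1.0 - ref_len / max(pred_len, 1))
    )
    # Geometric mean of n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_order)


print(
    toy_bleu(
        references=[["the", "cat", "sat", "on", "the", "mat"]],
        prediction=["the", "cat", "the", "cat", "on", "the", "mat"],
    )
)
```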

Note on input shapes:
For `y_true` and `y_pred`, this class supports scalar values and batch
inputs of shapes `()`, `(batch_size,)` and `(batch_size, 1)`.
Args:
    tokenizer: callable. A function that takes a string `tf.RaggedTensor`
        (of any shape), and tokenizes the strings in the tensor. This
        function should use TensorFlow graph ops. If the tokenizer is not
        specified, the default tokenizer is used. The default tokenizer
        replicates the behaviour of SacreBLEU's `"tokenizer_13a"` tokenizer
        (https://github.com/mjpost/sacrebleu/blob/v2.1.0/sacrebleu/tokenizers/tokenizer_13a.py).

Review thread on the `tokenizer` argument (see the sketch after this thread):
- "What happens if you pass a tokenizer layer here, will that work? Say byte tokenizer for simplicity."
- "Hmmm, it won't work with byte tokeniser because we use …"
- "I think we should either support our tokenizers or rename this argument to something else. Tokenizer means something specific in our library now; if we use that name but don't support our tokenizer class, that is a bad look."
- "We do support our tokenisers. I've added a unit test here: https://github.com/keras-team/keras-nlp/blob/0b6ebfafe2a819bf39061d07f6382d4f0727d55e/keras_nlp/metrics/bleu_test.py#L105"

Review thread on the TensorFlow graph ops requirement:
- "Is it necessary? If people are not interested in using …"
- "True, but we call the tokeniser after converting the inputs to tensors. So, we have to use TF ops here such as …"
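To make the thread above concrete, a small usage sketch for the `tokenizer` argument. It assumes the constructor shown in this diff and that a KerasNLP tokenizer layer (a callable built on TensorFlow graph ops) is accepted, as the linked unit test suggests; treat it as a sketch rather than the documented API.

```python
import tensorflow as tf
import keras_nlp

# Default: no tokenizer passed, SacreBLEU "tokenizer_13a"-style behaviour.
bleu_default = keras_nlp.metrics.Bleu()


# A custom callable built from TensorFlow graph ops, e.g. a whitespace split.
def whitespace_tokenizer(inputs):
    return tf.strings.split(inputs)


bleu_whitespace = keras_nlp.metrics.Bleu(tokenizer=whitespace_tokenizer)

# One of the library's own tokenizers, as exercised in the linked test.
bleu_byte = keras_nlp.metrics.Bleu(tokenizer=keras_nlp.tokenizers.ByteTokenizer())
```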
    max_order: int. The maximum n-gram order to use. For example, if
        `max_order` is set to 3, unigrams, bigrams, and trigrams will be
        considered. Defaults to 4.
    smooth: bool. Whether to apply Lin et al. 2004 smoothing to the BLEU
        score. Defaults to False.

Review comment on `smooth`:
- "Can we describe this better? Lin et al. 2004 with a period in the middle of the docstring does not read very well. Also please add to reference section."
    variant: string. Either `"corpus_bleu"` or `"sentence_bleu"`. The former
        computes micro-average precision, which is equivalent to passing all
        samples (across batches) all at once. In other words, summing the
        numerators and denominators for each hypothesis-reference(s) pair
        before the division (in order to calculate precision). The latter is
        the macro-averaged BLEU score, which means that it computes the BLEU
        score for every sample separately and averages over these scores.
        Defaults to `"corpus_bleu"`.

Review thread on `variant` (see the sketch after the argument list):
- "It seems like corpus bleu is the better option here? I see that sacrebleu exposes methods for both of these, but does not seem to document the sentence one. Huggingface looks like it might not even have an option for this (is that true?). I guess I'm wondering if it might make sense to not even expose this, and wait till someone asks for the sentence option."
- "After doing a survey, this is what I found: … Conclusion: I think the expectation is that, if users want to compute the Sentence BLEU score, they can do so by passing one sample at a time, and averaging over the returned scores. Some additional notes: We use Keras metrics similar to the …"
    dtype: string or tf.dtypes.Dtype. Precision of metric computation. If
        not specified, it defaults to tf.float32.
    name: string. Name of the metric instance.
    **kwargs: Other keyword arguments.
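As a sketch of the distinction the `variant` argument draws, the toy computation below contrasts micro-averaging (pool numerators and denominators over all samples, then divide, the `"corpus_bleu"` style) with macro-averaging (score each sample, then average, the `"sentence_bleu"` style). It uses clipped unigram precision only, purely for illustration, and is not the metric's implementation.

```python
import collections


def unigram_precision_parts(reference, prediction):
    # Return (clipped matches, total predicted unigrams) for one sample.
    ref_counts = collections.Counter(reference)
    pred_counts = collections.Counter(prediction)
    matches = sum(
        min(count, ref_counts[token]) for token, count in pred_counts.items()
    )
    return matches, len(prediction)


samples = [
    (["hello", "world"], ["hello", "world"]),    # 2 matches out of 2
    (["a", "dog"], ["a", "cat", "ran", "by"]),   # 1 match  out of 4
]
parts = [unigram_precision_parts(ref, pred) for ref, pred in samples]

# Micro-average ("corpus_bleu"-style): sum numerators and denominators
# across samples before dividing.
micro = sum(m for m, _ in parts) / sum(t for _, t in parts)

# Macro-average ("sentence_bleu"-style): score each sample, then average.
macro = sum(m / t for m, t in parts) / len(parts)

print(micro)  # 3 / 6 = 0.5
print(macro)  # (1.0 + 0.25) / 2 = 0.625
```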

References:
    - [Papineni et al., 2002](https://aclanthology.org/P02-1040/)
"""

def __init__(
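For orientation, a hedged usage sketch of the metric itself. It assumes the constructor arguments shown in this diff and the standard `keras.metrics.Metric` interface (`update_state` / `result` / `reset_state`) that the class inherits; exact argument handling may differ in the final API.

```python
import tensorflow as tf
import keras_nlp

bleu = keras_nlp.metrics.Bleu(max_order=4, smooth=False)

# Batch 1: string tensors of shape (batch_size,).
references_1 = tf.constant(["the cat sat on the mat", "he read the book"])
translations_1 = tf.constant(["the cat sat on the mat", "he read a book"])
bleu.update_state(references_1, translations_1)

# Batch 2: state accumulates across batches (the "corpus_bleu" behaviour).
references_2 = tf.constant(["it is raining today"])
translations_2 = tf.constant(["it is raining"])
bleu.update_state(references_2, translations_2)

print(bleu.result())  # corpus-level BLEU over both batches
bleu.reset_state()    # start a fresh corpus
```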
@@ -302,7 +306,7 @@ def aggregate_sentence_bleu(
    smooth=False,
):
    """Computes the per-sample BLEU score and returns the aggregate of
    BLEU scores over all samples. Uses Python ops.

    Args:
        reference_corpus: list of lists of references for each
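Related to the per-sample aggregation this helper performs, and to the suggestion in the `variant` thread above, a hedged sketch of obtaining a macro-averaged (sentence-level) score from the metric by scoring one sample at a time and averaging. It assumes the same constructor and `keras.metrics.Metric` interface as the earlier sketches.

```python
import tensorflow as tf
import keras_nlp

references = ["the cat sat on the mat", "he read the book"]
translations = ["the cat sat on a mat", "he read a book"]

# Score each (reference, translation) pair separately, then average: this is
# the macro-averaged ("sentence_bleu"-style) aggregation.
per_sample_scores = []
for ref, pred in zip(references, translations):
    bleu = keras_nlp.metrics.Bleu()
    bleu.update_state(tf.constant([ref]), tf.constant([pred]))
    per_sample_scores.append(float(bleu.result()))

print(sum(per_sample_scores) / len(per_sample_scores))
```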
Review comment:
- "We should probably mention more prominently that this will replicate sacrebleu by default, but can be used with other tokenizers e.g. for other languages."