The BLEU Evaluation Metric
The BLEU metric is widely used in machine translation tasks. This article summarizes how BLEU is computed and how to use the common tools that compute it.
Definition
- BLEU (short for Bilingual Evaluation Understudy) is a metric for evaluating machine translation; the original paper is BLEU: a Method for Automatic Evaluation of Machine Translation
- The BLEU algorithm essentially measures how similar two sentences are
- BLEU has several variants: depending on the n-gram order it splits into multiple metrics, the common ones being BLEU-1, BLEU-2, BLEU-3 and BLEU-4, where n-gram means a sequence of n consecutive words. BLEU-1 measures word-level accuracy, while higher-order BLEU measures sentence fluency
Computation
The rough procedure for computing BLEU is: build the n-gram models of the candidate sentence and the reference sentence, count how many n-grams match, and compute the match ratio: \[ \frac{\text{number of n-grams matched between candidate and reference}}{\text{number of n-grams in the candidate}} \] An example:
candidate: It is a nice day today
reference: Today is a nice day
Using 1-gram matching:

```
candidate: {it, is, a, nice, day, today}
reference: {today, is, a, nice, day}
```

{today, is, a, nice, day} match, so the match ratio is 5/6.

Using 2-gram matching:

```
candidate: {it is, is a, a nice, nice day, day today}
reference: {today is, is a, a nice, nice day}
```

{is a, a nice, nice day} match, so the match ratio is 3/5.

Using 3-gram matching:

```
candidate: {it is a, is a nice, a nice day, nice day today}
reference: {today is a, is a nice, a nice day}
```

{is a nice, a nice day} match, so the match ratio is 2/4.

Using 4-gram matching:

```
candidate: {it is a nice, is a nice day, a nice day today}
reference: {today is a nice, is a nice day}
```

{is a nice day} matches, so the match ratio is 1/3.
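The four matching steps above can be sketched in a few lines of Python (a minimal illustration using multiset intersection; the helper name `ngrams` is my own):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

candidate = "it is a nice day today".split()
reference = "today is a nice day".split()

for n in range(1, 5):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum((cand & ref).values())  # multiset intersection = matched n-grams
    print(f"{n}-gram match ratio: {matched}/{sum(cand.values())}")
```

This prints the ratios 5/6, 3/5, 2/4 and 1/3 derived above.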
The matched n-gram counts are then modified (clipped) so that the score reflects how often each word actually occurs in the reference text, rather than rewarding a candidate that simply generates many plausible translation words. An example:

candidate: the the the the
reference: The cat is standing on the ground

With plain 1-gram matching the match ratio would be 1, which is clearly unreasonable. The fix is to replace the raw occurrence count of a word with the minimum of its counts in the candidate and in the reference: \[ \operatorname{count}_{k}=\min ({c}_{k}, {s}_{k}) \] where \(k\) indexes the \(k\)-th word appearing in the machine translation (candidate), \(c_{k}\) is the number of times that word occurs in the machine translation, and \(s_{k}\) the number of times it occurs in the human reference.
With this, the BLEU formula can be defined. First, some notation:
- the human references are denoted \(s_{i,j}\), where \(j \in \mathrm{M}\) and \(\mathrm{M}\) is the number of reference translations per sentence
- the machine translations are denoted \(c_{i}\), where \(i \in \mathrm{E}\) and \(\mathrm{E}\) is the total number of translated sentences
- \(n\) is the length of the word groups (n-grams) considered, and \(k\) indexes the \(k\)-th such word group
- \(h_{k}(c_{i})\) is the number of times the \(k\)-th word group occurs in translation \(c_{i}\)
- \(h_{k}(s_{i,j})\) is the number of times the \(k\)-th word group occurs in reference \(s_{i,j}\)
This yields the formula for each n-gram precision: \[ P_{n}=\frac{\sum_{i}^{\mathrm{E}} \sum_{k}^{\mathrm{K}} \min\left(h_{k}(c_{i}), \max_{j \in \mathrm{M}} h_{k}(s_{i,j})\right)}{\sum_{i}^{\mathrm{E}} \sum_{k}^{\mathrm{K}} h_{k}(c_{i})} \] The first sum runs over all translated sentences, since a computation may involve several of them; the second sum runs over all n-grams of one translated sentence. \(\max_{j \in \mathrm{M}} h_{k}(s_{i,j})\) is the count of the \(k\)-th word group in whichever of the \(\mathrm{M}\) references for the \(i\)-th translation contains it most often.
The n-gram match ratio tends to improve as the sentence gets shorter. To avoid this, BLEU introduces a length penalty factor, the Brevity Penalty, into the final score: \[ B P=\left\{\begin{array}{lll} 1 & \text { if } & l_{c}>l_{s} \\ e^{1-\frac{l_{s}}{l_{c}}} & \text { if } & l_{c} \le l_{s} \end{array}\right. \] where \(l_{c}\) is the length of the machine translation and \(l_{s}\) the effective reference length; when there are multiple references, the one whose length is closest to the translation is chosen. When the translation is longer than the reference, the penalty factor is 1, i.e. no penalty; the factor only applies when the translation is shorter than the reference.
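The brevity penalty can be sketched as below; the tie-breaking rule (on equal distance, prefer the shorter reference) is an assumption of this sketch, matching nltk's behavior as far as I know:

```python
import math

def brevity_penalty(cand_len, ref_lens):
    # effective reference length: the one closest to the candidate (ties -> shorter)
    ref_len = min(ref_lens, key=lambda r: (abs(r - cand_len), r))
    return 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)

print(brevity_penalty(18, [16, 18, 16]))  # 1.0 (l_c == l_s gives e^0 = 1)
print(brevity_penalty(9, [12]))           # e^(1 - 12/9) ≈ 0.72
```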
The final BLEU formula: to balance the contributions of the different orders, a weighted sum of the per-order statistics is taken. In general \(N\) is 4, i.e. at most 4-gram precision is counted, and \(\boldsymbol{W}_{n}\) is \(1/N\), i.e. uniform weights. The final formula is: \[ B L E U=B P \times \exp \left(\sum_{n=1}^{N} \boldsymbol{W}_{n} \log P_{n}\right) \]
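Putting the pieces together, the whole formula can be sketched in plain Python (uniform weights, no smoothing, so every \(P_n\) must be non-zero; an illustration, not a replacement for nltk or sacrebleu):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        max_ref = Counter()          # max_j h_k(s_ij): clip against the best reference
        for ref in references:
            max_ref |= ngrams(ref, n)
        p_n = sum((cand & max_ref).values()) / sum(cand.values())
        log_p += math.log(p_n) / max_n   # W_n = 1/N
    # effective reference length: closest to the candidate length
    ref_len = min((len(r) for r in references),
                  key=lambda r: (abs(r - len(candidate)), r))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(log_p)

print(bleu("it is a nice day today".split(), ["today is a nice day".split()]))
# (5/6 * 3/5 * 2/4 * 1/3) ** 0.25 ≈ 0.537, with BP = 1 since the candidate is longer
```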
Computation tools
nltk
Computing individual BLEU, i.e. the BLEU of a single n-gram order:

```python
from nltk.translate.bleu_score import sentence_bleu
sentence1 = "it is a guide to action which ensures that the military always obeys the commands of the party"
sentence2 = "it is a guide to action that ensures that the military will forever heed party commands"
sentence3 = "it is the guiding principle which guarantees the military forces always being under the command of the party"
sentence4 = "it is the practical guide for the army always to heed the directions of the party"
candidate = list(sentence1.split(" "))
reference = [list(sentence2.split(" ")), list(sentence3.split(" ")), list(sentence4.split(" "))]
print('Individual 1-gram: {}'.format(sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))))
print('Individual 2-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0, 1, 0, 0))))
print('Individual 3-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0, 0, 1, 0))))
print('Individual 4-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0, 0, 0, 1))))
# Individual 1-gram: 0.9444444444444444
# Individual 2-gram: 0.5882352941176471
# Individual 3-gram: 0.4375
# Individual 4-gram: 0.26666666666666666
```

- Computing \(P_{1}\):
| word | candidate | reference 1 | reference 2 | reference 3 | \(\max_{j \in \mathrm{M}}h(s)\) | \(\min(h(c), \max_{j \in \mathrm{M}}h(s))\) |
| --- | --- | --- | --- | --- | --- | --- |
| it | 1 | 1 | 1 | 1 | 1 | 1 |
| is | 1 | 1 | 1 | 1 | 1 | 1 |
| a | 1 | 1 | 0 | 0 | 1 | 1 |
| guide | 1 | 1 | 0 | 1 | 1 | 1 |
| to | 1 | 1 | 0 | 1 | 1 | 1 |
| action | 1 | 1 | 0 | 0 | 1 | 1 |
| which | 1 | 0 | 1 | 0 | 1 | 1 |
| ensures | 1 | 1 | 0 | 0 | 1 | 1 |
| that | 1 | 2 | 0 | 0 | 2 | 1 |
| the | 3 | 1 | 3 | 3 | 3 | 3 |
| military | 1 | 1 | 1 | 0 | 1 | 1 |
| always | 1 | 0 | 1 | 1 | 1 | 1 |
| obeys | 1 | 0 | 0 | 0 | 0 | 0 |
| commands | 1 | 1 | 0 | 0 | 1 | 1 |
| of | 1 | 0 | 1 | 1 | 1 | 1 |
| party | 1 | 1 | 1 | 1 | 1 | 1 |

\[ P_{1}=\frac{1+1+1+1+1+1+1+1+1+3+1+1+0+1+1+1}{1+1+1+1+1+1+1+1+1+3+1+1+1+1+1+1}=\frac{17}{18}=0.9444444444444444 \]
- Computing \(P_{2}\):

| word | candidate | reference 1 | reference 2 | reference 3 | \(\max_{j \in \mathrm{M}}h(s)\) | \(\min(h(c), \max_{j \in \mathrm{M}}h(s))\) |
| --- | --- | --- | --- | --- | --- | --- |
| ensures that | 1 | 1 | 0 | 0 | 1 | 1 |
| guide to | 1 | 1 | 0 | 0 | 1 | 1 |
| which ensures | 1 | 0 | 0 | 0 | 0 | 0 |
| obeys the | 1 | 0 | 0 | 0 | 0 | 0 |
| commands of | 1 | 0 | 0 | 0 | 0 | 0 |
| that the | 1 | 1 | 0 | 0 | 1 | 1 |
| a guide | 1 | 1 | 0 | 0 | 1 | 1 |
| of the | 1 | 0 | 1 | 1 | 1 | 1 |
| always obeys | 1 | 0 | 0 | 0 | 0 | 0 |
| the commands | 1 | 0 | 0 | 0 | 0 | 0 |
| to action | 1 | 1 | 0 | 0 | 1 | 1 |
| the party | 1 | 0 | 0 | 1 | 1 | 1 |
| is a | 1 | 1 | 0 | 0 | 1 | 1 |
| action which | 1 | 0 | 0 | 0 | 0 | 0 |
| It is | 1 | 1 | 1 | 1 | 1 | 1 |
| military always | 1 | 0 | 0 | 0 | 0 | 0 |
| the military | 1 | 1 | 1 | 0 | 1 | 1 |

\[ P_{2}=\frac{10}{17}=0.5882352941176471 \]
- Computing \(P_{3}\):

| word | candidate | reference 1 | reference 2 | reference 3 | \(\max_{j \in \mathrm{M}}h(s)\) | \(\min(h(c), \max_{j \in \mathrm{M}}h(s))\) |
| --- | --- | --- | --- | --- | --- | --- |
| ensures that the | 1 | 1 | 0 | 0 | 1 | 1 |
| which ensures that | 1 | 0 | 0 | 0 | 0 | 0 |
| action which ensures | 1 | 0 | 0 | 0 | 0 | 0 |
| a guide to | 1 | 1 | 0 | 0 | 1 | 1 |
| military always obeys | 1 | 0 | 0 | 0 | 0 | 0 |
| the commands of | 1 | 0 | 0 | 0 | 0 | 0 |
| commands of the | 1 | 0 | 0 | 0 | 0 | 0 |
| to action which | 1 | 0 | 0 | 0 | 0 | 0 |
| the military always | 1 | 0 | 0 | 0 | 0 | 0 |
| obeys the commands | 1 | 0 | 0 | 0 | 0 | 0 |
| It is a | 1 | 1 | 0 | 0 | 1 | 1 |
| of the party | 1 | 0 | 0 | 1 | 1 | 1 |
| is a guide | 1 | 1 | 0 | 0 | 1 | 1 |
| that the military | 1 | 1 | 0 | 0 | 1 | 1 |
| always obeys the | 1 | 0 | 0 | 0 | 0 | 0 |
| guide to action | 1 | 1 | 0 | 0 | 1 | 1 |

\[ P_{3}=\frac{7}{16}=0.4375 \]
- Computing \(P_{4}\):

| word | candidate | reference 1 | reference 2 | reference 3 | \(\max_{j \in \mathrm{M}}h(s)\) | \(\min(h(c), \max_{j \in \mathrm{M}}h(s))\) |
| --- | --- | --- | --- | --- | --- | --- |
| to action which ensures | 1 | 0 | 0 | 0 | 0 | 0 |
| action which ensures that | 1 | 0 | 0 | 0 | 0 | 0 |
| guide to action which | 1 | 0 | 0 | 0 | 0 | 0 |
| obeys the commands of | 1 | 0 | 0 | 0 | 0 | 0 |
| which ensures that the | 1 | 0 | 0 | 0 | 0 | 0 |
| commands of the party | 1 | 0 | 0 | 0 | 0 | 0 |
| ensures that the military | 1 | 1 | 0 | 0 | 1 | 1 |
| a guide to action | 1 | 1 | 0 | 0 | 1 | 1 |
| always obeys the commands | 1 | 0 | 0 | 0 | 0 | 0 |
| that the military always | 1 | 0 | 0 | 0 | 0 | 0 |
| the commands of the | 1 | 0 | 0 | 0 | 0 | 0 |
| the military always obeys | 1 | 0 | 0 | 0 | 0 | 0 |
| military always obeys the | 1 | 0 | 0 | 0 | 0 | 0 |
| is a guide to | 1 | 1 | 0 | 0 | 1 | 1 |
| It is a guide | 1 | 1 | 0 | 0 | 1 | 1 |

\[ P_{4}=\frac{4}{15}=0.26666666666666666 \]
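The four tables above can also be checked mechanically. Here is a stdlib-only sketch (no nltk required; the helper name `ngrams` is my own) that reproduces the same clipped counts:

```python
from collections import Counter
from fractions import Fraction

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

candidate = ("it is a guide to action which ensures that the military "
             "always obeys the commands of the party").split()
references = [
    "it is a guide to action that ensures that the military will forever heed party commands".split(),
    "it is the guiding principle which guarantees the military forces always being under the command of the party".split(),
    "it is the practical guide for the army always to heed the directions of the party".split(),
]

for n in range(1, 5):
    cand = ngrams(candidate, n)
    max_ref = Counter()
    for ref in references:
        max_ref |= ngrams(ref, n)  # per-n-gram maximum over the references
    p = Fraction(sum((cand & max_ref).values()), sum(cand.values()))
    print(f"P_{n} = {p} = {float(p)}")
# P_1 = 17/18, P_2 = 10/17, P_3 = 7/16, P_4 = 4/15
```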
Computing cumulative BLEU: this weights the individual n-gram orders and computes a weighted geometric mean. A few points to note:
- BLEU-4 does not look only at the 4-gram order; it is the cumulative score from 1-gram to 4-gram, with each of the four orders weighted 25%
- by default (when no weights argument is given), sentence_bleu() and corpus_bleu() both compute the BLEU-4 score
```python
# reference and candidate are as defined in the previous snippet
print('Cumulative 1-gram: {}'.format(sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))))
print('Cumulative 2-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0))))
print('Cumulative 3-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0))))
print('Cumulative 4-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))))
# Cumulative 1-gram: 0.9444444444444444
# Cumulative 2-gram: 0.7453559924999299
# Cumulative 3-gram: 0.6270220769211224
# Cumulative 4-gram: 0.5045666840058485
```

Computing BLEU-1: the translated sentence has length 18, while the reference sentences have lengths 16, 18 and 16. The reference whose length is closest to the translation is chosen, so the brevity penalty is 1, i.e. no penalty.
```python
import math

math.exp(1 * math.log(0.9444444444444444))
# 0.9444444444444444
```

Computing BLEU-2:

```python
math.exp(0.5 * math.log(0.9444444444444444) + 0.5 * math.log(0.5882352941176471))
# 0.7453559924999299
```

Computing BLEU-3:

```python
math.exp(0.33 * math.log(0.9444444444444444) + 0.33 * math.log(0.5882352941176471) + 0.33 * math.log(0.4375))
# 0.6270220769211224
```

Computing BLEU-4:

```python
math.exp(0.25 * math.log(0.9444444444444444) + 0.25 * math.log(0.5882352941176471)
         + 0.25 * math.log(0.4375) + 0.25 * math.log(0.26666666666666666))
# 0.5045666840058485
```
Calling corpus_bleu() to obtain a corpus-level BLEU. Comparing the parameters of sentence_bleu() and corpus_bleu() (pay particular attention to the references and hypothesis parameters of sentence_bleu(), and the list_of_references and hypotheses parameters of corpus_bleu()):

```python
def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
                  smoothing_function=None):
    """
    :param references: reference sentences
    :type references: list(list(str))
    :param hypothesis: a hypothesis sentence
    :type hypothesis: list(str)
    :param weights: weights for unigrams, bigrams, trigrams and so on
    :type weights: list(float)
    :return: The sentence-level BLEU score.
    :rtype: float
    """
    return corpus_bleu([references], [hypothesis], weights, smoothing_function)

references = [["This", "is", "a", "cat"], ["This", "is", "a", "feline"]]
hypothesis = ["This", "is", "cat"]
sentence_bleu(references, hypothesis)

def corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25),
                smoothing_function=None):
    """
    :param list_of_references: a corpus of lists of reference sentences, w.r.t. hypotheses
    :type list_of_references: list(list(list(str)))
    :param hypotheses: a list of hypothesis sentences
    :type hypotheses: list(list(str))
    :param weights: weights for unigrams, bigrams, trigrams and so on
    :type weights: list(float)
    :return: The corpus-level BLEU score.
    :rtype: float
    """
```

Computing a corpus-level BLEU value:
```python
from nltk.translate.bleu_score import corpus_bleu
s1 = "the dog bit the man"
s2 = "the dog had bit the man"
s3 = "it was not unexpected"
s4 = "no one was surprised"
s5 = "the man bit him first"
s6 = "the man had bitten the dog"
s7 = "the dog bit the man"
s8 = "it was not surprising"
s9 = "the man had just bitten him"
candidates = [list(s7.split(" ")), list(s8.split(" ")), list(s9.split(" "))]
references = [
    [list(s1.split(" ")), list(s2.split(" "))],
    [list(s3.split(" ")), list(s4.split(" "))],
    [list(s5.split(" ")), list(s6.split(" "))]
]
print('Corpus BLEU: {}'.format(corpus_bleu(references, candidates)))
# Corpus BLEU: 0.5719285395120957
```

PS: note that computing the BLEU of every single sentence and then averaging is not the same as computing the corpus-level BLEU directly, as shown below:
```python
reference1 = [list(s1.split(" ")), list(s2.split(" "))]
candidate1 = list(s7.split(" "))
reference2 = [list(s3.split(" ")), list(s4.split(" "))]
candidate2 = list(s8.split(" "))
reference3 = [list(s5.split(" ")), list(s6.split(" "))]
candidate3 = list(s9.split(" "))
print('Sentence1 BLEU: ', sentence_bleu(reference1, candidate1))
print('Sentence2 BLEU: ', sentence_bleu(reference2, candidate2))
print('Sentence3 BLEU: ', sentence_bleu(reference3, candidate3))
print('Average Sentence BLEU: ', (sentence_bleu(reference1, candidate1) +
                                  sentence_bleu(reference2, candidate2) +
                                  sentence_bleu(reference3, candidate3)) / 3)
# Sentence1 BLEU:  1.0
# Sentence2 BLEU:  8.636168555094496e-78
# Sentence3 BLEU:  6.562069055463047e-78
# Average Sentence BLEU:  0.3333333333333333
```

The correct method, shown below, is to add up the numerators and denominators of each sentence's i-gram precisions to obtain four overall independent precisions (\(P_{i}, i \in \{1,2,3,4\}\)), and then apply the formula. In particular, when computing the BP penalty factor, the translation length is the sum of the lengths of all translated sentences, and the reference length is the sum of the lengths of the references closest in length to each corresponding translation: \[ B L E U=B P \times \exp \left(\sum_{n=1}^{4} 0.25 \log P_{n}\right) \]
```python
import math
from collections import Counter
from nltk.translate.bleu_score import modified_precision

# references and candidates are as defined in the corpus_bleu example above
p_numerators = Counter()
p_denominators = Counter()
for refs, hyp in zip(references, candidates):
    for i in range(1, 5):
        p_i = modified_precision(refs, hyp, i)
        p_numerators[i] += p_i.numerator
        p_denominators[i] += p_i.denominator
print(p_numerators, p_denominators)
# Counter({1: 13, 2: 8, 3: 5, 4: 2}) Counter({1: 15, 2: 12, 3: 9, 4: 6})
res = 0
for i in range(1, 5):
    res += 0.25 * math.log(p_numerators[i] / p_denominators[i])
# the brevity penalty is 1 in this example
print(math.exp(res))
# 0.5719285395120957

list(zip(references, candidates))
# [([['the', 'dog', 'bit', 'the', 'man'],
#    ['the', 'dog', 'had', 'bit', 'the', 'man']],
#   ['the', 'dog', 'bit', 'the', 'man']),
#  ([['it', 'was', 'not', 'unexpected'], ['no', 'one', 'was', 'surprised']],
#   ['it', 'was', 'not', 'surprising']),
#  ([['the', 'man', 'bit', 'him', 'first'],
#    ['the', 'man', 'had', 'bitten', 'the', 'dog']],
#   ['the', 'man', 'had', 'just', 'bitten', 'him'])]
```
sacrebleu
Computing sentence BLEU:

```python
import sacrebleu
sentence1 = "it is a guide to action which ensures that the military always obeys the commands of the party"
sentence2 = "it is a guide to action that ensures that the military will forever heed party commands"
sentence3 = "it is the guiding principle which guarantees the military forces always being under the command of the party"
sentence4 = "it is the practical guide for the army always to heed the directions of the party"
bleu = sacrebleu.sentence_bleu(sentence1, [sentence2, sentence3, sentence4])
print("Sentence BLEU: ", bleu)
# Sentence BLEU:  BLEU = 50.46 94.4/58.8/43.8/26.7 (BP = 1.000 ratio = 1.000 hyp_len = 18 ref_len = 18)
```

Computing corpus BLEU:

```python
refs = [['the dog bit the man', 'it was not unexpected', 'the man bit him first'],
        ['the dog had bit the man', 'no one was surprised', 'the man had bitten the dog']]
sys = ['the dog bit the man', "it was not surprising", 'the man had just bitten him']
bleu = sacrebleu.corpus_bleu(sys, refs)
print("Corpus BLEU: ", bleu)
# Corpus BLEU:  BLEU = 57.19 86.7/66.7/55.6/33.3 (BP = 1.000 ratio = 1.000 hyp_len = 15 ref_len = 15)
```
multi-bleu.perl: evaluating with multi-bleu requires the sentences to be tokenized beforehand, which means the score multi-bleu reports depends on how the tokenizer splits words.

mteval-v14.pl: this script has a standard tokenizer built in, so raw sentences can be fed in directly without prior tokenization; the value it computes is the same as the one computed by sacrebleu.