The BLEU Evaluation Metric

The BLEU metric is widely used in machine translation tasks. This article summarizes how BLEU is computed and how to use the common tools that implement it.

Definition

  1. BLEU (Bilingual Evaluation Understudy) is an automatic evaluation metric for machine translation, introduced in the paper BLEU: a Method for Automatic Evaluation of Machine Translation
  2. At its core, the BLEU algorithm measures how similar two sentences are
  3. BLEU has several variants depending on the n-gram order used; the common ones are BLEU-1, BLEU-2, BLEU-3, and BLEU-4, where n-gram means a sequence of n consecutive words. BLEU-1 measures word-level accuracy, while higher-order BLEU scores capture sentence fluency
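To make the n-gram notion concrete, here is a minimal sketch (plain Python, not tied to any library) that extracts the n-grams of a tokenized sentence:

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it is a nice day today".split()
print(ngrams(tokens, 1))  # six unigrams, as 1-tuples
print(ngrams(tokens, 2))  # five bigrams: ('it', 'is'), ('is', 'a'), ...
```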

Computation

  1. The overall procedure for computing BLEU is:

    • Build the n-gram sets of the candidate and reference sentences, count the matches, and compute the matching degree \[ \frac{\text{number of n-grams matched between candidate and reference}}{\text{number of n-grams in the candidate}} \]

      Example:

      candidate: It is a nice day today

      reference: Today is a nice day

      • 1-gram matching

        candidate: {it, is, a, nice, day, today}
        reference: {today, is, a, nice, day}

        {today, is, a, nice, day} match, so the matching degree is 5/6

      • 2-gram matching

        candidate: {it is, is a, a nice, nice day, day today}
        reference: {today is, is a, a nice, nice day}

        {is a, a nice, nice day} match, so the matching degree is 3/5

      • 3-gram matching

        candidate: {it is a, is a nice, a nice day, nice day today}
        reference: {today is a, is a nice, a nice day}

        {is a nice, a nice day} match, so the matching degree is 2/4

      • 4-gram matching

        candidate: {it is a nice, is a nice day, a nice day today}
        reference: {today is a nice, is a nice day}

        {is a nice day} matches, so the matching degree is 1/3
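The four matching degrees above can be reproduced with a short standard-library sketch that counts matched n-grams (it uses clipped counts, which coincide with plain set matching here since no n-gram repeats):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

candidate = "it is a nice day today".lower().split()
reference = "today is a nice day".lower().split()

for n in range(1, 5):
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    print(f"{n}-gram: {matched}/{sum(cand.values())}")
# 1-gram: 5/6, 2-gram: 3/5, 3-gram: 2/4, 4-gram: 1/3
```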

    • Clip the matched n-gram counts so that the score reflects how often each n-gram actually appears in the reference, rather than rewarding a candidate that simply repeats plausible translation words many times

      Example:

      candidate: the the the the

      reference: The cat is standing on the ground

      With plain 1-gram matching the matching degree would be 1, which is clearly unreasonable, so the way word occurrences are counted needs to be refined

      Instead of counting raw occurrences, count each word at most as many times as it appears in the reference, as follows, \[ \operatorname{count}_{k}=\min ({c}_{k}, {s}_{k}) \] where \(k\) indexes the words occurring in the machine translation (candidate), \(c_{k}\) is the number of times word \(k\) occurs in the candidate, and \(s_{k}\) is the number of times it occurs in the human reference.
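The clipped count can be sketched with the running example (note that the tokens are lowercased first, an assumption the example above makes implicitly when it treats every "the" as a match):

```python
from collections import Counter

candidate = "the the the the".lower().split()
reference = "The cat is standing on the ground".lower().split()

c = Counter(candidate)   # candidate counts: {'the': 4}
s = Counter(reference)   # 'the' appears twice in the reference
clipped = {w: min(n, s[w]) for w, n in c.items()}
precision = sum(clipped.values()) / len(candidate)
print(clipped, precision)  # {'the': 2} 0.5  -- instead of the unreasonable 1.0
```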

      With this in place, the BLEU formula can be defined. First, some notation:

      • \(s_{j}\) denotes a human reference, where \(j \in \mathrm{M}\) and \(\mathrm{M}\) is the number of references
      • \(c_{i}\) denotes a machine translation, where \(i \in \mathrm{E}\) and \(\mathrm{E}\) is the number of translations
      • \(n\) is the n-gram order, and \(k\) indexes the n-grams of that order
      • \(h_{k}(c_{i})\) is the number of times the \(k\)-th n-gram occurs in translation \(c_{i}\)
      • \(h_{k}(s_{i,j})\) is the number of times the \(k\)-th n-gram occurs in reference \(s_{i,j}\)

      The precision for each n-gram order is then \[ P_{n}=\frac{\sum_{i}^{\mathrm{E}} \sum_{k}^{\mathrm{K}} \min(h_{k}(c_{i}), \max_{j \in \mathrm{M}}h_{k}(s_{i,j})) } {\sum_{i}^{\mathrm{E}} \sum_{k}^{\mathrm{K}} h_{k}(c_{i})} \] The first sum runs over all translated sentences, since there may be several; the second sum runs over all n-grams of one translated sentence. \(\max_{j \in \mathrm{M}}h_{k}(s_{i,j})\) is the count of the \(k\)-th n-gram in whichever of the \(\mathrm{M}\) references for sentence \(i\) contains it most often

    • The n-gram matching degree tends to improve as sentences get shorter. To counteract this, BLEU multiplies the final score by a brevity penalty \[ BP=\left\{\begin{array}{lll} 1 & \text { if } & l_{c}>l_{s} \\ e^{1-\frac{l_{s}}{l_{c}}} & \text { if } & l_{c} \leq l_{s} \end{array}\right. \] where \(l_{c}\) is the length of the machine translation and \(l_{s}\) is the effective reference length; when there are multiple references, the one whose length is closest to the translation's is used. When the translation is longer than the reference, the penalty factor is 1, i.e. no penalty; the penalty is only computed when the translation is shorter than the reference.
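The brevity penalty can be sketched directly from this definition (the closest-length tie-break, preferring the shorter reference on ties, follows nltk's convention and is an assumption here):

```python
import math

def brevity_penalty(c_len, ref_lens):
    # effective reference length: the one closest to the candidate length
    # (ties broken toward the shorter reference, as nltk does)
    s_len = min(ref_lens, key=lambda r: (abs(r - c_len), r))
    return 1.0 if c_len > s_len else math.exp(1 - s_len / c_len)

print(brevity_penalty(18, [16, 18, 16]))  # 1.0 -- no penalty, lengths match
print(brevity_penalty(3, [4, 4]))         # exp(1 - 4/3), roughly 0.7165
```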

    • Final BLEU formula

      To balance the contributions of the different n-gram orders, their statistics are combined as a weighted sum. Typically \(N\) is 4, so at most 4-gram precision is counted, and \(\boldsymbol{W}_{n}\) is \(1/N\), i.e. uniform weighting. The final formula is \[ BLEU=BP \times \exp \left(\sum_{n=1}^{N} \boldsymbol{W}_{n} \log P_{n}\right) \]
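Plugging in BP = 1 and the four precisions from the nltk example worked through below (17/18, 10/17, 7/16, 4/15), this formula gives:

```python
import math

p = [17 / 18, 10 / 17, 7 / 16, 4 / 15]  # P1..P4 from the worked example below
bp = 1.0                                # no length penalty in that example
bleu = bp * math.exp(sum(0.25 * math.log(pn) for pn in p))
print(bleu)  # roughly 0.5046, matching the cumulative 4-gram score below
```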

  2. Tools

    • nltk

      • Computing individual BLEU scores, i.e. the score for a single n-gram order

        from nltk.translate.bleu_score import sentence_bleu

        sentence1 = "it is a guide to action which ensures that the military always obeys the commands of the party"
        sentence2 = "it is a guide to action that ensures that the military will forever heed party commands"
        sentence3 = "it is the guiding principle which guarantees the military forces always being under the command of the party"
        sentence4 = "it is the practical guide for the army always to heed the directions of the party"

        candidate = list(sentence1.split(" "))
        reference = [list(sentence2.split(" ")), list(sentence3.split(" ")), list(sentence4.split(" "))]

        print('Individual 1-gram: {}'.format(sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))))
        print('Individual 2-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0, 1, 0, 0))))
        print('Individual 3-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0, 0, 1, 0))))
        print('Individual 4-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0, 0, 0, 1))))

        # Individual 1-gram: 0.9444444444444444
        # Individual 2-gram: 0.5882352941176471
        # Individual 3-gram: 0.4375
        # Individual 4-gram: 0.26666666666666666
        1. Computing \(P_{1}\)

        | 1-gram | candidate | reference 1 | reference 2 | reference 3 | \(\max_{j \in \mathrm{M}}h(s)\) | \(\min(h(c), \max_{j \in \mathrm{M}}h(s))\) |
        |---|---|---|---|---|---|---|
        | it | 1 | 1 | 1 | 1 | 1 | 1 |
        | is | 1 | 1 | 1 | 1 | 1 | 1 |
        | a | 1 | 1 | 0 | 0 | 1 | 1 |
        | guide | 1 | 1 | 0 | 1 | 1 | 1 |
        | to | 1 | 1 | 0 | 1 | 1 | 1 |
        | action | 1 | 1 | 0 | 0 | 1 | 1 |
        | which | 1 | 0 | 1 | 0 | 1 | 1 |
        | ensures | 1 | 1 | 0 | 0 | 1 | 1 |
        | that | 1 | 2 | 0 | 0 | 2 | 1 |
        | the | 3 | 1 | 3 | 3 | 3 | 3 |
        | military | 1 | 1 | 1 | 0 | 1 | 1 |
        | always | 1 | 0 | 1 | 1 | 1 | 1 |
        | obeys | 1 | 0 | 0 | 0 | 0 | 0 |
        | commands | 1 | 1 | 0 | 0 | 1 | 1 |
        | of | 1 | 0 | 1 | 1 | 1 | 1 |
        | party | 1 | 1 | 1 | 1 | 1 | 1 |

        \[ P_{1}=\frac{1+1+1+1+1+1+1+1+1+3+1+1+0+1+1+1}{1+1+1+1+1+1+1+1+1+3+1+1+1+1+1+1}=\frac{17}{18}=0.9444444444444444 \]

        2. Computing \(P_{2}\)

        | 2-gram | candidate | reference 1 | reference 2 | reference 3 | \(\max_{j \in \mathrm{M}}h(s)\) | \(\min(h(c), \max_{j \in \mathrm{M}}h(s))\) |
        |---|---|---|---|---|---|---|
        | ensures that | 1 | 1 | 0 | 0 | 1 | 1 |
        | guide to | 1 | 1 | 0 | 0 | 1 | 1 |
        | which ensures | 1 | 0 | 0 | 0 | 0 | 0 |
        | obeys the | 1 | 0 | 0 | 0 | 0 | 0 |
        | commands of | 1 | 0 | 0 | 0 | 0 | 0 |
        | that the | 1 | 1 | 0 | 0 | 1 | 1 |
        | a guide | 1 | 1 | 0 | 0 | 1 | 1 |
        | of the | 1 | 0 | 1 | 1 | 1 | 1 |
        | always obeys | 1 | 0 | 0 | 0 | 0 | 0 |
        | the commands | 1 | 0 | 0 | 0 | 0 | 0 |
        | to action | 1 | 1 | 0 | 0 | 1 | 1 |
        | the party | 1 | 0 | 0 | 1 | 1 | 1 |
        | is a | 1 | 1 | 0 | 0 | 1 | 1 |
        | action which | 1 | 0 | 0 | 0 | 0 | 0 |
        | it is | 1 | 1 | 1 | 1 | 1 | 1 |
        | military always | 1 | 0 | 0 | 0 | 0 | 0 |
        | the military | 1 | 1 | 1 | 0 | 1 | 1 |

        \[ P_{2}=\frac{10}{17}=0.5882352941176471 \]

        3. Computing \(P_{3}\)

        | 3-gram | candidate | reference 1 | reference 2 | reference 3 | \(\max_{j \in \mathrm{M}}h(s)\) | \(\min(h(c), \max_{j \in \mathrm{M}}h(s))\) |
        |---|---|---|---|---|---|---|
        | ensures that the | 1 | 1 | 0 | 0 | 1 | 1 |
        | which ensures that | 1 | 0 | 0 | 0 | 0 | 0 |
        | action which ensures | 1 | 0 | 0 | 0 | 0 | 0 |
        | a guide to | 1 | 1 | 0 | 0 | 1 | 1 |
        | military always obeys | 1 | 0 | 0 | 0 | 0 | 0 |
        | the commands of | 1 | 0 | 0 | 0 | 0 | 0 |
        | commands of the | 1 | 0 | 0 | 0 | 0 | 0 |
        | to action which | 1 | 0 | 0 | 0 | 0 | 0 |
        | the military always | 1 | 0 | 0 | 0 | 0 | 0 |
        | obeys the commands | 1 | 0 | 0 | 0 | 0 | 0 |
        | it is a | 1 | 1 | 0 | 0 | 1 | 1 |
        | of the party | 1 | 0 | 0 | 1 | 1 | 1 |
        | is a guide | 1 | 1 | 0 | 0 | 1 | 1 |
        | that the military | 1 | 1 | 0 | 0 | 1 | 1 |
        | always obeys the | 1 | 0 | 0 | 0 | 0 | 0 |
        | guide to action | 1 | 1 | 0 | 0 | 1 | 1 |

        \[ P_{3}=\frac{7}{16}=0.4375 \]

        4. Computing \(P_{4}\)

        | 4-gram | candidate | reference 1 | reference 2 | reference 3 | \(\max_{j \in \mathrm{M}}h(s)\) | \(\min(h(c), \max_{j \in \mathrm{M}}h(s))\) |
        |---|---|---|---|---|---|---|
        | to action which ensures | 1 | 0 | 0 | 0 | 0 | 0 |
        | action which ensures that | 1 | 0 | 0 | 0 | 0 | 0 |
        | guide to action which | 1 | 0 | 0 | 0 | 0 | 0 |
        | obeys the commands of | 1 | 0 | 0 | 0 | 0 | 0 |
        | which ensures that the | 1 | 0 | 0 | 0 | 0 | 0 |
        | commands of the party | 1 | 0 | 0 | 0 | 0 | 0 |
        | ensures that the military | 1 | 1 | 0 | 0 | 1 | 1 |
        | a guide to action | 1 | 1 | 0 | 0 | 1 | 1 |
        | always obeys the commands | 1 | 0 | 0 | 0 | 0 | 0 |
        | that the military always | 1 | 0 | 0 | 0 | 0 | 0 |
        | the commands of the | 1 | 0 | 0 | 0 | 0 | 0 |
        | the military always obeys | 1 | 0 | 0 | 0 | 0 | 0 |
        | military always obeys the | 1 | 0 | 0 | 0 | 0 | 0 |
        | is a guide to | 1 | 1 | 0 | 0 | 1 | 1 |
        | it is a guide | 1 | 1 | 0 | 0 | 1 | 1 |

        \[ P_{4}=\frac{4}{15}=0.26666666666666666 \]
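The four tables above can be cross-checked with nltk's modified_precision helper, which the aggregation example later in this article also uses; it returns each \(P_{n}\) as a fraction:

```python
from nltk.translate.bleu_score import modified_precision

candidate = ("it is a guide to action which ensures that the military "
             "always obeys the commands of the party").split()
references = [
    ("it is a guide to action that ensures that the military "
     "will forever heed party commands").split(),
    ("it is the guiding principle which guarantees the military forces "
     "always being under the command of the party").split(),
    ("it is the practical guide for the army always to heed "
     "the directions of the party").split(),
]

for n in range(1, 5):
    p_n = modified_precision(references, candidate, n)
    print(f"P{n} = {p_n.numerator}/{p_n.denominator}")
# P1 = 17/18, P2 = 10/17, P3 = 7/16, P4 = 4/15
```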

      • Computing cumulative BLEU: a weighted geometric mean over the individual n-gram scores. Two points to note:

        1. BLEU-4 does not look at 4-grams alone; it is the cumulative score from 1-gram to 4-gram, with the 1-gram, 2-gram, 3-gram, and 4-gram precisions each weighted 25%
        2. By default (when no weights argument is given), both sentence_bleu() and corpus_bleu() compute the BLEU-4 score
        print('Cumulative 1-gram: {}'.format(sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))))
        print('Cumulative 2-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0))))
        print('Cumulative 3-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0))))
        print('Cumulative 4-gram: {}'.format(sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))))

        # Cumulative 1-gram: 0.9444444444444444
        # Cumulative 2-gram: 0.7453559924999299
        # Cumulative 3-gram: 0.6270220769211224
        # Cumulative 4-gram: 0.5045666840058485
        1. Computing BLEU-1

          The translation has 18 tokens, while the references have 16, 18, and 16; the reference length closest to the translation's is chosen, so the brevity penalty is 1, i.e. no penalty.

          import math
          math.exp(1 * math.log(0.9444444444444444))
          # 0.9444444444444444
        2. Computing BLEU-2

          math.exp(0.5 * math.log(0.9444444444444444) + 0.5 * math.log(0.5882352941176471))
          # 0.7453559924999299
        3. Computing BLEU-3

          math.exp(0.33 * math.log(0.9444444444444444) + 0.33 * math.log(0.5882352941176471) + 0.33 * math.log(0.4375))
          # 0.6270220769211224
        4. Computing BLEU-4

          math.exp(0.25 * math.log(0.9444444444444444) + 0.25 * math.log(0.5882352941176471)
          + 0.25 * math.log(0.4375) + 0.25 * math.log(0.26666666666666666))
          # 0.5045666840058485
      • Computing corpus-level BLEU with corpus_bleu()

        1. Comparing the parameters of sentence_bleu() and corpus_bleu() (note in particular the references and hypothesis parameters of sentence_bleu(), and the list_of_references and hypotheses parameters of corpus_bleu())

          def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
                            smoothing_function=None):
              """
              :param references: reference sentences
              :type references: list(list(str))
              :param hypothesis: a hypothesis sentence
              :type hypothesis: list(str)
              :param weights: weights for unigrams, bigrams, trigrams and so on
              :type weights: list(float)
              :return: The sentence-level BLEU score.
              :rtype: float
              """
              return corpus_bleu([references], [hypothesis], weights, smoothing_function)

          references = [ ["This", "is", "a", "cat"], ["This", "is", "a", "feline"] ]
          hypothesis = ["This", "is", "cat"]
          sentence_bleu(references, hypothesis)

          def corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=None):
              """
              :param list_of_references: a corpus of lists of reference sentences, w.r.t. hypotheses
              :type list_of_references: list(list(list(str)))
              :param hypotheses: a list of hypothesis sentences
              :type hypotheses: list(list(str))
              :param weights: weights for unigrams, bigrams, trigrams and so on
              :type weights: list(float)
              :return: The corpus-level BLEU score.
              :rtype: float
              """
        2. Computing the corpus-level BLEU score

          from nltk.translate.bleu_score import corpus_bleu

          s1 = "the dog bit the man"
          s2 = "the dog had bit the man"

          s3 = "it was not unexpected"
          s4 = "no one was surprised"

          s5 = "the man bit him first"
          s6 = "the man had bitten the dog"

          s7 = "the dog bit the man"
          s8 = "it was not surprising"
          s9 = "the man had just bitten him"

          candidates = [list(s7.split(" ")), list(s8.split(" ")), list(s9.split(" "))]
          references = [
              [list(s1.split(" ")), list(s2.split(" "))],
              [list(s3.split(" ")), list(s4.split(" "))],
              [list(s5.split(" ")), list(s6.split(" "))]
          ]

          print('Corpus BLEU: {}'.format(corpus_bleu(references, candidates)))
          # Corpus BLEU: 0.5719285395120957

          PS: note that computing the BLEU score of each sentence and averaging is not the same as computing the corpus-level BLEU score directly, as shown below,

          reference1 = [list(s1.split(" ")), list(s2.split(" "))]
          candidate1 = list(s7.split(" "))

          reference2 = [list(s3.split(" ")), list(s4.split(" "))]
          candidate2 = list(s8.split(" "))

          reference3 = [list(s5.split(" ")), list(s6.split(" "))]
          candidate3 = list(s9.split(" "))

          print('Sentence1 BLEU: ', sentence_bleu(reference1, candidate1))
          print('Sentence2 BLEU: ', sentence_bleu(reference2, candidate2))
          print('Sentence3 BLEU: ', sentence_bleu(reference3, candidate3))
          print('Average Sentence BLEU: ', (sentence_bleu(reference1, candidate1) +
                                            sentence_bleu(reference2, candidate2) +
                                            sentence_bleu(reference3, candidate3)) / 3)

          # Sentence1 BLEU: 1.0
          # Sentence2 BLEU: 8.636168555094496e-78
          # Sentence3 BLEU: 6.562069055463047e-78
          # Average Sentence BLEU: 0.3333333333333333

          The correct procedure is as follows: add up the numerators and denominators of the per-sentence i-gram precisions to obtain four pooled individual precisions (\(P_{i}, i \in \{1,2,3,4\}\)), then apply the formula. In particular, when computing the BP penalty factor, the translation length is the sum of all translated sentence lengths, and the reference length is the sum of the reference lengths closest to each corresponding translation, \[ BLEU=BP \times \exp \left(\sum_{n=1}^{4} 0.25 \times \log P_{n}\right) \]

          import math
          from collections import Counter
          from nltk.translate.bleu_score import modified_precision

          p_numerators = Counter()
          p_denominators = Counter()
          for refs, hyp in zip(references, candidates):
              for i in range(1, 5):
                  p_i = modified_precision(refs, hyp, i)
                  p_numerators[i] += p_i.numerator
                  p_denominators[i] += p_i.denominator

          print(p_numerators, p_denominators)
          # Counter({1: 13, 2: 8, 3: 5, 4: 2}) Counter({1: 15, 2: 12, 3: 9, 4: 6})
          res = 0
          for i in range(1, 5):
              res += 0.25 * math.log(p_numerators[i] / p_denominators[i])

          # the brevity penalty is 1 in this example
          print(math.exp(res))
          # 0.5719285395120957

          list(zip(references, candidates))
          # [([['the', 'dog', 'bit', 'the', 'man'],
          # ['the', 'dog', 'had', 'bit', 'the', 'man']],
          # ['the', 'dog', 'bit', 'the', 'man']),
          # ([['it', 'was', 'not', 'unexpected'], ['no', 'one', 'was', 'surprised']],
          # ['it', 'was', 'not', 'surprising']),
          # ([['the', 'man', 'bit', 'him', 'first'],
          # ['the', 'man', 'had', 'bitten', 'the', 'dog']],
          # ['the', 'man', 'had', 'just', 'bitten', 'him'])]
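The claim above that the brevity penalty is 1 for this corpus can be verified by pooling lengths as described earlier (a standard-library sketch; the closest-length tie-break, preferring the shorter reference, follows nltk's convention):

```python
import math

candidates = ["the dog bit the man".split(),
              "it was not surprising".split(),
              "the man had just bitten him".split()]
references = [["the dog bit the man".split(), "the dog had bit the man".split()],
              ["it was not unexpected".split(), "no one was surprised".split()],
              ["the man bit him first".split(), "the man had bitten the dog".split()]]

# total candidate length, and total closest-reference length
l_c = sum(len(c) for c in candidates)
l_s = sum(min((len(r) for r in refs), key=lambda rl: (abs(rl - len(c)), rl))
          for refs, c in zip(references, candidates))
bp = 1.0 if l_c > l_s else math.exp(1 - l_s / l_c)
print(l_c, l_s, bp)  # 15 15 1.0
```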
    • sacrebleu

      • Computing sentence BLEU

        import sacrebleu
        sentence1 = "it is a guide to action which ensures that the military always obeys the commands of the party"
        sentence2 = "it is a guide to action that ensures that the military will forever heed party commands"
        sentence3 = "it is the guiding principle which guarantees the military forces always being under the command of the party"
        sentence4 = "it is the practical guide for the army always to heed the directions of the party"
        bleu = sacrebleu.sentence_bleu(sentence1, [sentence2, sentence3, sentence4])
        print("Sentence BLEU: ", bleu)
        # Sentence BLEU: BLEU = 50.46 94.4/58.8/43.8/26.7 (BP = 1.000 ratio = 1.000 hyp_len = 18 ref_len = 18)
      • Computing corpus BLEU

        refs = [['the dog bit the man', 'it was not unexpected', 'the man bit him first'],
                ['the dog had bit the man', 'no one was surprised', 'the man had bitten the dog']]
        sys = ['the dog bit the man', "it was not surprising", 'the man had just bitten him']
        bleu = sacrebleu.corpus_bleu(sys, refs)
        print("Corpus BLEU: ", bleu)
        # Corpus BLEU: BLEU = 57.19 86.7/66.7/55.6/33.3 (BP = 1.000 ratio = 1.000 hyp_len = 15 ref_len = 15)
    • multi-bleu.perl: scoring with multi-bleu requires the sentences to be tokenized beforehand, which means the score it produces depends on how the tokenizer splits the words

    • mteval-v14.pl: this script has a standard tokenizer built in, so sentences can be scored without pre-tokenization; the values it computes match those of sacrebleu
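To see why tokenization matters, compare scoring with and without splitting the final period off (an illustrative sketch using nltk's sentence_bleu with unigram weights):

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "sat", "down", "."]]   # properly tokenized reference
raw_hyp = "the cat sat down.".split()              # naive split leaves 'down.' fused
tok_hyp = "the cat sat down .".split()             # tokenized to match the reference

b_raw = sentence_bleu(reference, raw_hyp, weights=(1,))  # 'down.' never matches
b_tok = sentence_bleu(reference, tok_hyp, weights=(1,))  # perfect match: 1.0
print(b_raw, b_tok)
```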
