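1 Introduction
The article "Comparison of Decoding Methods for the GPT2-Large Model" showed that beam search gives better results than greedy search. This article goes on to compare two other decoding methods: Top-K sampling and Top-p sampling.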
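2 Top-K sampling
Fan et al. (2018) at Facebook introduced a simple but very effective sampling scheme, called Top-K sampling, in their paper "Hierarchical Neural Story Generation". In Top-K sampling, the K most likely next words are selected first, and the probability mass is then redistributed among only those K words. GPT2 adopted this sampling scheme, which is one of the reasons for its success in story generation. In practice, K is usually set to 40 or 50. This experiment uses top_k=50. With "landslide produced by earthquakes" again as the prompt, the generated result is as follows: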
landslide produced by earthquakes. The two parts of the landslide might be linked. "There was a huge change because of the tsunami," says Richard Zimba, a volcanologist at the Woods Hole Oceanographic Institution. "The sea level was a thousand meters higher than it is today. The whole of southern Utah was totally flooded, and the islands were lost." The landslide also set off a series of subsequent landslides and rockfalls that led back to its source. After the tsunami left much of the land in ruin, the landslide's debris washed out to sea. The watery disaster was triggered by the collapse of a massive tectonic plate, a gigantic column of crust that lies beneath the continental United States, Canada, Mexico, and Central America. "This is a classic example of a catastrophic collision," says Michael H. Hodge, a volcano expert at the British Columbia-based National Research Council.
The experiment suggests that, with Top-K sampling, leaving no_repeat_ngram_size unset may produce more coherent content.
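The Top-K step described in this section can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code GPT2 or any library actually runs: the function names top_k_filter and sample_top_k and the toy probabilities are made up for this example.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    # Indices of the k largest probabilities (ties keep their original order).
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    # Tokens outside the top k are dropped; the rest are rescaled to sum to 1.
    return {i: probs[i] / total for i in top}

def sample_top_k(probs, k, rng=random):
    """Draw one token index from the renormalized top-k distribution."""
    filtered = top_k_filter(probs, k)
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

# Toy next-token distribution over a 5-word vocabulary (made-up numbers).
toy_probs = [0.5, 0.3, 0.1, 0.06, 0.04]
kept = top_k_filter(toy_probs, k=2)
print(kept)                        # only tokens 0 and 1 remain
print(sample_top_k(toy_probs, 2))  # always 0 or 1
```

With k=2, the three least likely words can never be sampled, no matter how the probability mass is spread among them.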
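3 Top-p sampling
Top-p sampling builds on Top-K sampling. Instead of sampling only from the K most likely words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p, and then redistributes the probability mass within that set. The size of the word set can therefore grow and shrink dynamically according to the probability distribution of the next word. In practice, top_p is usually set close to 1, commonly 0.9 or above. This experiment uses top_p = 0.97. The result is as follows: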
landslide produced by earthquakes. The two parts of the landslide might be linked. "There was a huge change in the number of earthquakes, particularly on the coast, that produced an enormous amount of volcanic rock. And the amount of molten rock was a huge increase, that was going from the seafloor and then going towards the crust. Then suddenly, there were massive and sudden changes in the type of eruptions and other things. So that there is another link between the two. And if we can figure out exactly what that link is, we can come up with a way of getting rid of the massive amounts of volcanic rock." Another mystery is how the lava was brought to the surface. "When these giant flows start, they actually melt away under pressure, so they can't be brought to the surface. They just rise into the air in the night – this is why the Icelandic glacier is huge.
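To make the "dynamic" size of the Top-p word set concrete, here is a small sketch in plain Python. It is an illustration under simple assumptions, not a library implementation: the function name top_p_filter and both toy distributions are invented for this example, and the cut is made as soon as the cumulative probability reaches or exceeds p.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches p, then renormalize within that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Rescale the surviving probabilities so they sum to 1.
    return {i: probs[i] / cumulative for i in kept}

# Two made-up next-token distributions over a 5-word vocabulary.
peaked = [0.90, 0.05, 0.03, 0.01, 0.01]  # the model is very sure
flat = [0.25, 0.25, 0.20, 0.15, 0.15]    # the model is unsure

peaked_set = top_p_filter(peaked, 0.9)
flat_set = top_p_filter(flat, 0.9)
print(len(peaked_set), len(flat_set))
```

For the same p, the peaked distribution keeps a single word while the flat one keeps the whole vocabulary, which is exactly the behavior a fixed K cannot provide.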
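4 Which method is better
In theory, the decoding methods are often ranked by quality as Top-p > Top-K > Beam > Greedy. In practice, however, Top-p can be combined with Top-K, which rules out very low-ranked words while still allowing some dynamic selection.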
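from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Assumed setup: the gpt2-large checkpoint and the prompt used earlier in this article.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")
input_ids = tokenizer.encode("landslide produced by earthquakes", return_tensors="pt")
# Sample with Top-K and Top-p together (early_stopping only applies to beam search).
outputs = model.generate(
    input_ids,
    max_length=200,
    do_sample=True,
    top_k=50,
    top_p=0.90,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))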
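What combining the two filters does to a single next-word distribution can be sketched without any model. The sketch below assumes that the Top-K cut is applied first and that the Top-p cut is then applied to the renormalized survivors; this ordering is an assumption made for illustration, and the function name top_k_top_p_filter and the toy numbers are hypothetical.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Apply a top-k cut first, then a top-p cut, then renormalize."""
    # Step 1: keep only the top_k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass_k = sum(probs[i] for i in order)
    # Step 2: walk down the survivors until their renormalized
    # cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if cumulative >= top_p:
            break
    # Step 3: rescale the final set so it sums to 1.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Made-up distributions: one with a long tail, one sharply peaked.
long_tail = [0.4, 0.3, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]
peaked = [0.97, 0.01, 0.01, 0.01]

print(sorted(top_k_top_p_filter(long_tail, top_k=5, top_p=0.9)))
print(sorted(top_k_top_p_filter(peaked, top_k=3, top_p=0.9)))
```

On the long-tailed example the Top-K cut removes the three lowest-ranked words and the Top-p cut then trims one more, while on the peaked example only the single dominant word survives.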
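On the other hand, this conclusion is not absolute. Welleck et al. (2019), in their paper "Neural Text Generation with Unlikelihood Training", argue that the tendency of greedy search and beam search to produce repetitive word sequences is caused by the way the model is trained rather than by the decoding method. They also report that, according to human evaluation of the generated sentences, beam search can produce more fluent text than Top-p sampling once the model's training objective is adjusted. Our own experiments are consistent with this. Finally, fluency is necessary, but the content of the text should matter most.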
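5 Closing remarks
Open-ended language generation is a fast-moving research field. Because language generation is probabilistic in nature, there is no one-size-fits-all model or decoding method, and each problem has to be analyzed on its own terms. Let me end this note with a poem: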
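Autumn is the season of harvest
Hand in hand we shall build a better tomorrow
Let us join hands together
And walk together toward a sunlit spring
Our dreams will surely come true
May dreams forever stay by your side and mine
May the flowers of love forever bloom in our hearts
Note: this poem was generated in Chinese by the gpt2-chinese-lyric model; the lines above are an English translation.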