1 Introduction
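Over the past two years, open-ended language generation has matured rapidly, driven by the rise of large Transformer-based language models trained on millions of web pages, such as OpenAI's GPT2. In the article "Open-Ended Text Generation", sentences were produced with the Transformers "text-generation" pipeline. That approach builds on causal language modeling, unlike BERT, which uses masked language modeling. Besides the improved Transformer architecture and large amounts of unsupervised training data, however, choosing a good decoding method is also crucial. This note briefly reviews several decoding strategies and, after extensive testing, narrows down workable ranges for their parameter values. All of these strategies apply to autoregressive language generation, which rests on the following assumption: the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions. At present, GPT2, XLNet, OpenAI-GPT, CTRL, TransfoXL, XLM, Bart, and T5 can all be used for autoregressive language generation.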
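The factorization assumed above can be written out explicitly. The symbols are notation chosen here for illustration, not taken from the note: $w_t$ is the word at step $t$, $W_0$ is the initial context (the prompt), and $T$ is the length of the generated sequence.

```latex
P(w_{1:T} \mid W_0) \;=\; \prod_{t=1}^{T} P\left(w_t \mid w_{1:t-1}, W_0\right),
\qquad w_{1:0} = \emptyset
```

Every decoding method discussed below is a different way of choosing $w_t$ from the factor $P(w_t \mid w_{1:t-1}, W_0)$ at each step.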
2 The GPT2-Large model
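GPT2, developed by OpenAI, is a large Transformer-based language model pre-trained on a large text corpus of more than 8 million high-quality web pages. As a result, it achieves competitive performance with pre-training alone, without fine-tuning. Because GPT2 is an autoregressive language model, it is well suited to language generation tasks. The experiments here use the GPT2-Large model (3.25 GB).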
3 Decoding methods
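Four decoding methods are currently popular: (1) greedy search; (2) beam search; (3) Top-K sampling; and (4) Top-p sampling. The sections below experiment with the first two, along with plain sampling, the basis of the last two.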
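As a rough sketch of how greedy search, beam search, and plain sampling might be selected with the Hugging Face Transformers `generate` method: the note does not show its own code, so the helper names `decoding_kwargs` and `generate`, the `max_length=200` default, and the use of PyTorch tensors are all assumptions made for illustration.

```python
def decoding_kwargs(strategy, num_beams=2):
    """Return keyword arguments for `model.generate` for one decoding strategy.

    Hypothetical helper: the strategy names are chosen for this sketch.
    """
    if strategy == "greedy":
        return {"do_sample": False, "num_beams": 1}
    if strategy == "beam":
        return {"do_sample": False, "num_beams": num_beams, "early_stopping": True}
    if strategy == "sampling":
        # top_k=0 switches off Top-K filtering, leaving plain sampling.
        return {"do_sample": True, "num_beams": 1, "top_k": 0}
    raise ValueError(f"unknown strategy: {strategy}")


def generate(prompt, strategy, max_length=200, **overrides):
    """Generate text from `prompt` with GPT2-Large (max_length is an assumed value)."""
    # Imported lazily so the helper above can be used without the library installed.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
    model = GPT2LMHeadModel.from_pretrained("gpt2-large")
    inputs = tokenizer(prompt, return_tensors="pt")  # assumes PyTorch is installed
    output_ids = model.generate(
        **inputs,
        max_length=max_length,
        pad_token_id=tokenizer.eos_token_id,  # GPT2 has no dedicated pad token
        **decoding_kwargs(strategy, **overrides),
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `generate("landslide produced by earthquakes", "beam", num_beams=4)` would then correspond to one of the settings tried below, although the exact output depends on the library version and on settings the note does not state.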
3.1 Greedy search
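Greedy search simply selects the word with the highest probability as the next word at each step, as shown in the figure below. Starting from the first word "The", the most probable next word is "nice", and the most probable word after that is "woman", giving an overall sequence probability of 0.5*0.4=0.2. The generated sentence is therefore "The nice woman".
The main drawback of greedy search is that it misses high-probability words hidden behind a lower-probability word. In the figure above, "dog" has a probability of 0.4, but "has", which follows "dog", has a probability of 0.9, so "The dog has" has an overall probability of 0.4*0.9=0.36>0.2. Greedy search cannot discover this. Its output also tends to repeat words and sentence structures that have already appeared.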
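The arithmetic above can be checked with a minimal sketch of greedy search over a toy word tree. The probabilities 0.5, 0.4, 0.4, and 0.9 come from the text; every other word and value in the hypothetical `NEXT_WORD` table is an assumed filler, added only so that each distribution sums to 1.

```python
# Toy next-word distributions, keyed by the words generated so far.
# Values for nice/dog/woman/has are from the text; the rest are assumed fillers.
NEXT_WORD = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
}


def greedy_search(start, steps):
    """At every step, append the single most probable next word."""
    sequence, prob = (start,), 1.0
    for _ in range(steps):
        dist = NEXT_WORD[sequence]
        word = max(dist, key=dist.get)
        sequence, prob = sequence + (word,), prob * dist[word]
    return sequence, prob


def sequence_probability(sequence):
    """Multiply the conditional probabilities along a given word sequence."""
    prob = 1.0
    for i in range(1, len(sequence)):
        prob *= NEXT_WORD[sequence[:i]][sequence[i]]
    return prob


greedy_seq, greedy_prob = greedy_search("The", steps=2)
print(" ".join(greedy_seq), round(greedy_prob, 2))             # The nice woman 0.2
print(round(sequence_probability(("The", "dog", "has")), 2))   # 0.36
```

Greedy search commits to "nice" at the first step, so it never evaluates the more probable continuation through "dog".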
Using "landslide produced by earthquakes" as the prompt, greedy search generates the following text:
landslide produced by earthquakes in the region. The study, published in the journal Geophysical Research Letters, found that the earthquake swarm was caused by a series of small earthquakes that occurred in the same area in the last few days. The researchers said that the swarm was triggered by a small earthquake that occurred on the same day as the swarm. "The swarm was produced by a sequence of small, shallow earthquakes that were triggered by the same earthquake swarm," said lead author Dr. David R. Smith, a geophysicist at the University of California, Santa Cruz. In the study, the scientists used a computer model to simulate the earthquake sequence. It was found that a series small earthquakes occurred in a short period of time, which caused the swarm to form. A series of smaller earthquakes occurred over a period of several days, which triggered the swarm, the study said.
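The sentences are quite coherent, but somewhat redundant: several of them restate that the swarm was triggered by a series of small earthquakes.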
3.2 Beam search
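Beam search tries to overcome this drawback of greedy search by keeping the "num_beams" most likely hypotheses at each time step and finally choosing the one with the highest overall probability. Using the same prompt, we observe the results with three settings: num_beams=2, 3, and 4.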
(1) num_beams=2
landslide produced by earthquakes. "We have a lot of work to do to understand how these earthquakes happen and how they can be prevented," he said. "But we have a pretty good idea of what's going on." New study finds that earthquakes are more likely to occur in areas with high seismic activity. More information: "The Role of Seismic Activity in the Occurrence of Large-Scale Seismicity in the United States," Science.
(2) num_beams=3
landslide produced by earthquakes in the Pacific Northwest. "It's a very exciting time to be a geophysicist," he said. "We're getting to the point where we can do a lot of things that we've never been able to do before." New study shows how earthquakes can trigger tsunamis. More information: The paper is available online.
(3) num_beams=4
landslide produced by earthquakes in the Pacific Northwest. "This is the first time we've seen this kind of landslide in the United States," he said. "It's a very rare event." The landslide occurred in the area of Mount St. Helens, a volcano in Washington state that erupted in 1980 and has been a source of concern for scientists for decades. The volcano is located in a seismically active area of the state, and scientists have been concerned about the possibility of an eruption in the future. The U.S. Geological Survey (USGS) has been monitoring the volcano for more than a decade, and the USGS has warned that an eruption could occur at any time. The USGS is also monitoring Mount Rainier, another volcano in the Cascade Range, which erupted in 1883 and is also located in an earthquake-prone area of Washington state.
In these three runs, the generated text becomes more meaningful as num_beams increases.
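To make the mechanism concrete, here is a minimal beam-search sketch over the toy word tree from section 3.1. Only the probabilities 0.5, 0.4, 0.4, and 0.9 come from the text; the remaining entries of the hypothetical `NEXT_WORD` table, including the whole "car" branch, are assumed fillers. It illustrates the idea only and is not the implementation used for the runs above.

```python
# Toy next-word distributions, keyed by the words generated so far.
# Values for nice/dog/woman/has are from the text; the rest are assumed fillers.
NEXT_WORD = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def beam_search(start, steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [((start,), 1.0)]
    for _ in range(steps):
        candidates = []
        for sequence, prob in beams:
            for word, p in NEXT_WORD[sequence].items():
                candidates.append((sequence + (word,), prob * p))
        # Rank all one-word extensions by overall probability and prune.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]  # the hypothesis with the highest overall probability


best_seq, best_prob = beam_search("The", steps=2, num_beams=2)
print(" ".join(best_seq), round(best_prob, 2))  # The dog has 0.36
```

With `num_beams=1` the same function reduces to greedy search and returns "The nice woman" with probability 0.2; with two beams, the "dog" hypothesis survives the first step and wins at the second.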
3.3 Sampling
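Sampling means picking the next word at random according to the conditional probability distribution. In general, pure sampling makes the result non-deterministic, so it is not used on its own in practice. It produces the following result: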
landslide produced by earthquakes. This was a serious development, and scientists were busy working towards a solution in the face of the uncertainties of the problem. By the early 1980s, the problem had been significantly reduced because physicists and engineers have been using a technique called seismoturbation to investigate fault sequences.
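The sampling step itself can be sketched in a few lines. The distribution below reuses the toy probabilities for the word after "The" from section 3.1 (0.5 for "nice" and 0.4 for "dog" are from the text, "car" at 0.1 is an assumed filler), and the names `NEXT_AFTER_THE` and `sample_next_word` are hypothetical.

```python
import random
from collections import Counter

# Toy distribution for the word following "The"; "car" is an assumed filler.
NEXT_AFTER_THE = {"nice": 0.5, "dog": 0.4, "car": 0.1}


def sample_next_word(dist, rng):
    """Draw one word at random, weighted by its conditional probability."""
    words = list(dist)
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_draws = 10_000
counts = Counter(sample_next_word(NEXT_AFTER_THE, rng) for _ in range(n_draws))
frequencies = {w: counts[w] / n_draws for w in NEXT_AFTER_THE}
print(frequencies)
```

Over many draws the frequencies approach the underlying probabilities, but any single draw can pick a low-probability word, which is where the non-determinism noted above comes from.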
4 Conclusion
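This note compared decoding methods on the GPT2-Large model. Of the three methods tried above, beam search produced relatively reasonable results, though still not fully satisfactory ones. The next step is to try Top-K sampling and Top-p sampling.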