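1 Introduction
Summarization is the task of condensing a document or article into a shorter text. It is an important research direction in natural language processing. Early algorithms relied mainly on simple word-frequency statistics of the text (see "Deterministic and Stochastic Processes in Text Summary Generation"). Later algorithms (see "PyTextRank: Automatic Keyword Extraction from Text" and "Combining Six Pretrained Transformers Models") began moving toward large-scale pretrained models. This article continues to explore the use of Transformers for summarization.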
2 Virtual Environment
Transformers V4.9.0
Torch V1.9.0
Spacy V3.1.1
pytextrank 3.2.0
Spyder V5.0.5
Python 3.8.10
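Before running the examples, it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch, not part of the original setup: the helper name `check_versions` is hypothetical, and the package names are assumed to be the usual PyPI distribution names.

```python
from importlib import metadata

# Versions listed in this section (assumed PyPI distribution names).
EXPECTED = {
    "transformers": "4.9.0",
    "torch": "1.9.0",
    "spacy": "3.1.1",
    "pytextrank": "3.2.0",
}


def check_versions(expected):
    """Map each package name to (expected version, installed version or None)."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed)
    return report


# Print one line per package so mismatches are easy to spot.
for name, (wanted, installed) in check_versions(EXPECTED).items():
    status = "ok" if installed == wanted else "differs"
    print(f"{name}: expected {wanted}, installed {installed} ({status})")
```

A differing version does not necessarily break the examples, but it is the first thing to check if the output differs from what is shown here.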
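3 The Pipeline Approach
Transformers provides many pipelines that simplify common tasks. In "Question Answering with Transformers", the "question-answering" pipeline was used to answer questions. Here we use another pipeline, "summarization", to generate a summary.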
from transformers import pipeline
summarizer = pipeline("summarization")
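The default model behind this pipeline is a BART model fine-tuned on the CNN/Daily Mail dataset, a dataset prepared for summarization that consists of long news articles. The passage to summarize is an excerpt from a paper: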
ARTICLE = "Chuquicamata, one of the largest open pit copper mines and the second deepest open-pit mine in the world, is located 1,650km north of Santiago, Chile. The mine, popularly known as Chuqui, has been operating since 1910. The century-old copper mine is owned and operated by Codelco and forms part of the company’s Codelco Norte division, which includes the Radomiro Tomic (RT) mine found on the same mineralised system. A new underground mine is being developed to access the ore body situated beneath the present open pit mine. The conceptual engineering for the underground Chuquicamata mine began in 2007 and was finalised in March 2009. The project obtained the environmental authorisation in September 2010. The new underground mine, scheduled begin operations in 2019, will comprise of four production levels, a 7.5km main access tunnel, five clean air injection ramps, and two air-extraction shafts."
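The call that runs the pipeline on the article is not shown in the original. A minimal sketch of that step might look like the following. The helper name `summarize` is hypothetical, and the `max_length` and `min_length` values are illustrative assumptions, not the settings the author used.

```python
def summarize(text, summarizer, max_length=130, min_length=30):
    """Run a Transformers summarization pipeline and return the summary string."""
    # The pipeline returns a list with one dict per input text;
    # the generated summary is stored under the "summary_text" key.
    result = summarizer(
        text,
        max_length=max_length,  # upper bound on summary length, in tokens (assumed value)
        min_length=min_length,  # lower bound on summary length, in tokens (assumed value)
        do_sample=False,        # deterministic decoding instead of sampling
    )
    return result[0]["summary_text"]


# Usage with the objects defined above (downloads the default model on first run):
# print(summarize(ARTICLE, summarizer))
```

Passing the pipeline in as an argument keeps the helper independent of which summarization model is loaded.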
The pipeline returns the following summary:
Chuquicamata, one of the largest open pit copper mines in the world, is located 1,650km north of Santiago, Chile . The century-old mine is owned and operated by Codelco and forms part of the company’s Radomiro Tomic (RT) division.
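4 The Model and Tokenizer Approach
Although the pipeline approach is simple, the default pipeline model is not accurate enough, as noted in "Reading Comprehension Question Answering: A Stronger Pretrained BERT Model". To improve accuracy, we load a model and its tokenizer explicitly for the summarization task. We use T5 (t5-base), Google's encoder-decoder model. Like the BART model above, it has seen the CNN/Daily Mail data: T5 was pretrained on a multi-task mixture of datasets that includes CNN/Daily Mail.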
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
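The original does not show how the summary is generated once the model and tokenizer are loaded. The steps (prefix the text with the task name, tokenize, generate, decode) can be sketched as below. The helper name `t5_summarize` is hypothetical, and the generation parameters are illustrative assumptions rather than the author's settings.

```python
T5_PREFIX = "summarize: "  # T5 selects the task from a text prefix


def t5_summarize(text, model, tokenizer, max_input_tokens=512):
    """Summarize text with a T5-style seq2seq model and its tokenizer."""
    # Prepend the task prefix and truncate the input to the given token limit.
    inputs = tokenizer(
        T5_PREFIX + text,
        return_tensors="pt",
        max_length=max_input_tokens,
        truncation=True,
    )
    # Beam search with length limits; these values are assumptions for illustration.
    output_ids = model.generate(
        inputs["input_ids"],
        max_length=150,
        min_length=40,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True,
    )
    # Decode the first returned sequence, dropping special tokens such as <pad> and </s>.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage with the objects loaded above:
# print(t5_summarize(ARTICLE, model, tokenizer))
```

Unlike the pipeline, this route exposes the tokenization and decoding steps, so the input length and the search strategy can be tuned directly.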
The model generates the following summary:
Chuquicamata is one of the largest open pit copper mines and the second deepest open-pit mine in the world. the century-old copper mine is owned and operated by Codelco and forms part of the company's Codelco Norte division. new underground mine is being developed to access the ore body situated beneath the present open pit mine.