Longformer: A Transformer for Question Answering over Long Texts


1 Introduction

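With the help of deep learning models, human-level prediction from data has improved considerably. Question answering, a subfield of NLP, is popular because answering questions from reference documents has a wide range of applications. The usual approach relies on a dataset made up of an input text, a query, and the span of the input text that contains the answer to the query.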

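The Transformer architecture was proposed in the paper "Attention Is All You Need". The encoder encodes the input text, and the decoder processes that encoding to capture the contextual information behind the sequence. Every encoder and decoder layer in the stack uses an attention mechanism that relates each input token to every other input token and weighs their relevance; the decoder then generates the output sequence. In this way the attention mechanism dynamically highlights and captures the features of the input text.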
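To make the weighting step concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the learned query, key, and value projections and the multiple heads of a real Transformer layer are omitted, and the function names and toy vectors are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights); weights[i][j] is how strongly token i
    attends to token j.
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every token to q, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is the relevance-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


if __name__ == "__main__":
    # Toy self-attention: three 2-dimensional token vectors (made-up values).
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    outputs, weights = attention(x, x, x)
    for row in weights:
        print([round(w, 3) for w in row])
```

Because every token is scored against every other token, the cost of this step grows quadratically with sequence length.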

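BERT (Bidirectional Encoder Representations from Transformers), from Google Research, is based on the Transformer architecture. It was pretrained on BookCorpus, a dataset of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables, and headers). The most popular question answering dataset is SQuAD, which contains about 150,000 questions in version 2.0. The articles "阅读理解回答问题(Question Answering)---一个更强的BERT预训练模型" and "Transformers之问题对答(Question Answering)" used three pretrained models in total. All three are Transformer-based models (two BERT, one ALBERT) that, as their names indicate, were fine-tuned on SQuAD or SQuAD-derived data: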

(1) mrm8488/bert-multi-cased-finetuned-xquadv1

(2) bert-large-uncased-whole-word-masking-finetuned-squad

(3) ktrapeznikov/albert-xlarge-v2-squad-v2

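However, all of these models accept sequences of at most 512 tokens. Processing a longer text that exceeds this limit raises an error, so for long texts we must use Longformer instead of a standard BERT model.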

2 The Longformer Model

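Longformer outperforms most other models on long-document tasks and markedly reduces memory and time complexity. The pretrained model used here for long texts is valhalla/longformer-base-4096-finetuned-squadv1, which was fine-tuned on SQuAD v1 and can process sequences of up to 4096 tokens. Repeated testing showed that the corresponding text size should not exceed about 19k.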
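The complexity reduction comes from Longformer's sliding-window attention, where each token attends only to a fixed-size neighborhood rather than to the whole sequence. The sketch below counts attended token pairs for both schemes. It is an illustration under stated assumptions: the window size of 512 is assumed here rather than taken from this article, the function names are hypothetical, and Longformer's additional global attention on selected tokens is left out.

```python
def full_attention_pairs(n):
    """Number of (query, key) pairs when every token attends to every token."""
    return n * n


def sliding_window_pairs(n, window):
    """Number of (query, key) pairs when each token attends only to itself
    and to at most window // 2 neighbors on each side."""
    half = window // 2
    total = 0
    for i in range(n):
        lo = max(0, i - half)       # window is clipped at the sequence start
        hi = min(n - 1, i + half)   # and at the sequence end
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    window = 512  # assumed window size, for illustration only
    for n in (512, 4096):
        full = full_attention_pairs(n)
        local = sliding_window_pairs(n, window)
        print(f"n={n}: full={full}, sliding-window={local}, ratio={local / full:.3f}")
```

The full count grows with the square of the sequence length, while the windowed count grows roughly as length times window size, which is what makes 4096-token inputs practical.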

3 Test

A document, test3.txt, was used for the experiment, and the program answered the questions put to it.
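The article does not show the program itself. The following is a minimal sketch of how such an experiment could be run with the Hugging Face transformers library and PyTorch, not the author's actual script. The helper names are hypothetical, and reading the "19k" limit as a character count is an assumption.

```python
from pathlib import Path

MODEL_NAME = "valhalla/longformer-base-4096-finetuned-squadv1"
MAX_TOKENS = 4096        # sequence limit of the model named above
MAX_TEXT_SIZE = 19_000   # assumption: the "19k" limit interpreted as characters


def load_context(path, max_size=MAX_TEXT_SIZE):
    """Read the document and reject inputs above the empirical size limit."""
    text = Path(path).read_text(encoding="utf-8")
    if len(text) > max_size:
        raise ValueError(f"text has {len(text)} characters, limit is {max_size}")
    return text


def answer_question(question, context, tokenizer, model):
    """Return the highest-scoring answer span as a string."""
    import torch  # lazy import: only needed when a model is actually run

    # Truncate only the context (second segment) if the pair is too long.
    encoding = tokenizer(question, context, return_tensors="pt",
                         truncation="only_second", max_length=MAX_TOKENS)
    with torch.no_grad():
        outputs = model(**encoding)
    start = int(torch.argmax(outputs.start_logits))
    end = int(torch.argmax(outputs.end_logits))
    # An empty slice (end < start) decodes to an empty answer.
    answer_ids = encoding["input_ids"][0][start:end + 1]
    return tokenizer.decode(answer_ids, skip_special_tokens=True).strip()


def main(path="test3.txt"):
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
    model.eval()
    context = load_context(path)
    questions = [
        "What is the numerical modelling approaches for rock slope analysis?",
        "What is the most popular codes for rock slope analysis?",
    ]
    for question in questions:
        print(question, "->", answer_question(question, context, tokenizer, model))


# Only run the experiment when the input document is actually present.
if __name__ == "__main__" and Path("test3.txt").exists():
    main()
```

Taking the argmax of the start and end logits independently is the simplest decoding rule; it can occasionally produce an end position before the start, in which case this sketch returns an empty string.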

(1) Question: What is the numerical modelling approaches for rock slope analysis?

Answer: finite element and the finite difference methods

(2) Question: What is the most popular codes for rock slope analysis?

Answer: Phase2 and RS3

(3) Question: Where is Chuquicamata mine?

Answer: major joints sets, shear planes and fault planes

(4) Question: What is fracture propagation paths?

Answer: ability to consider intact rock deformation and movements, complex hydro-mechanical and dynamic analyses, and additional insight into identification of potential instability mechanisms

(5) Question: How to simulate step-path brittle fracture?

Answer: using a simplified discrete fracture network (DFN) model coupled with a Voronoi polygonal mesh approach

(6) Question: What is the toppling failure?

Answer: rotation of columns or blocks of rocks about some fixed base

Source: 计算岩土力学


Copyright belongs to the author. Sharing is welcome; do not republish without permission.

First published: 2022-11-20