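1 Introduction
The previous article showed how to determine sentence similarity with WMD (Word Mover's Distance). That method relies on vectors trained with Doc2Vec and works better than plain cosine similarity. This article uses a different approach, Transformers, to determine the similarity between sentences.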
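2 How Transformers works
Transformers first requires a model. Unlike WMD, this model is not trained on your own corpus; it is based on one of several pretrained models, loaded through the SentenceTransformer class.
from sentence_transformers import SentenceTransformer
# 'my corpus-model' is a placeholder for the name or local path of a pretrained model, such as those listed below
model = SentenceTransformer('my corpus-model')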
Trained on NLI data
bert-base-nli-mean-tokens
bert-large-nli-mean-tokens
roberta-base-nli-mean-tokens
roberta-large-nli-mean-tokens
distilbert-base-nli-mean-tokens
Trained on STS data
bert-base-nli-stsb-mean-tokens
bert-large-nli-stsb-mean-tokens
roberta-base-nli-stsb-mean-tokens
roberta-large-nli-stsb-mean-tokens
distilbert-base-nli-stsb-mean-tokens
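These models must be downloaded before they can be used.
Read your own file:
with open('corpus-pfc.txt', 'r', encoding='utf-8') as infile:
    _c = infile.read()
Convert the text into a list of sentences, keeping only non-empty lines with at least four space-separated words:
corpus = [i for i in _c.split('\n') if i != '' and len(i.split(' ')) >= 4]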
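The filtering rule can be checked on a small in-memory string before running it on the full file. This is a minimal sketch: the helper name `split_corpus`, its `min_words` parameter, and the sample text are illustrative assumptions, not part of the original script.

```python
def split_corpus(text, min_words=4):
    """Split raw text into lines and keep lines with at least min_words words.

    split_corpus and min_words are illustrative names, not from the original script.
    Words are counted by splitting on single spaces, as in the original one-liner.
    """
    return [line for line in text.split('\n')
            if line != '' and len(line.split(' ')) >= min_words]


# Made-up sample: one empty line and one two-word line should be dropped.
sample = (
    "PFC2D PFC3D slope stability simulation\n"
    "\n"
    "slope stability\n"
    "Fluid coupling in PFC2D and PFC3D\n"
)
corpus = split_corpus(sample)
print(corpus)
```

Short lines such as bare keywords are dropped, so only lines long enough to carry some context are encoded.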
Get a vector for each sentence
corpus_embeddings = model.encode(corpus)
Get the vector for the query sentence
queries = ['PFC2D PFC3D slope stability simulation']
query_embeddings = model.encode(queries)
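Return similar sentences
import scipy.spatial.distance
for query, query_embedding in zip(queries, query_embeddings):
    # cosine distance between the query vector and every corpus vector
    distances = scipy.spatial.distance.cdist(
        [query_embedding], corpus_embeddings, "cosine")[0]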
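The loop above only computes distances; to get a ranked list, the distances still have to be sorted and turned into similarity scores. The sketch below assumes the reported similarity is 1 minus the cosine distance, which is the cosine similarity itself. It uses only the standard library, with made-up 2-D vectors standing in for the output of `model.encode`, so that it runs without downloading a model. The helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Return u.v / (|u| |v|), which equals 1 minus the cosine distance."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, sentences, sentence_vecs, k=10):
    """Return the k (sentence, similarity) pairs most similar to the query."""
    scored = [(sentence, cosine_similarity(query_vec, vec))
              for sentence, vec in zip(sentences, sentence_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D vectors standing in for model.encode() output (made up for illustration).
toy_corpus = ["sentence A", "sentence B", "sentence C"]
toy_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_query = [1.0, 0.0]

results = top_k(toy_query, toy_corpus, toy_vecs, k=2)
for sentence, score in results:
    print("%s (Similarity: %.2f)" % (sentence, score))
```

With real embeddings, the same ranking can be obtained by sorting the `distances` array from `cdist` in ascending order and reporting `1 - distance` for each hit.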
3 Transformers results
The corpus is corpus-pfc.txt (E:\Geotech\mydata), an optimized PFC dataset produced in the previous article. The query sentence is the same as before:
query = 'PFC2D PFC3D slope stability simulation'
The top 10 similar results are as follows:
PFC2D PFC3D slope stability simulation (Similarity: 1.00)
PFC2D PFC3D slope stability (Similarity: 0.89)
slope instability, pfc2d, numerical simulation, parallel bond model, stability analysis (Similarity: 0.84)
PFC2D rock slopes stability simulation (Similarity: 0.81)
General two-dimensional slope stability analysis (Similarity: 0.81)
Then the particle discrete element software PFC2D is used to simulate the stability of slope excavation from the meso-mechanical level. (Similarity: 0.80)
"System reliability analysis of slope stability using generalized Subset Simulation". (Similarity: 0.80)
Application of distinct element analysis in slope stability problems (Similarity: 0.78)
Fluid coupling in PFC2D and PFC3D (Similarity: 0.78)
'Then the particle discrete element software PFC2D is used to simulate the stability of slope excavation from the meso-mechanical level.' (Similarity: 0.82) (Similarity: 0.77)
Running the same similarity query with WMD gives the following top 10 results:
PFC2D PFC3D slope stability simulation (Similarity: 1.00)
PFC2D Simulation on Stability of Loose Deposits Slope in Highway Cutting Excavation (Similarity: 0.99)
PFC2D PFC3D slope stability (Similarity: 0.98)
The PFC3D simulation platform was employed to calcaulate the single-hole blasting processes with different heights,buried depths and charge amounts in the open mine slope,and the slope stability after blasting was discussed. (Similarity: 0.98)
Simulation and analysis of the earthquake stability of the tailing reservoir based on PFC3D (Similarity: 0.98)
"Study on the similar materials simulation of the slope stability of the west-l zone in Luming Molybdenum Mine". (Similarity: 0.98)
NUMERICAL SIMULATION OF A FILLED SLOPE STABILITY ON SOFT SOIL ROADBED REINFORCED BY GRAVEL PILE USING PFC2D (Similarity: 0.98)
'DEM simulation pfc2d slope', (Similarity: 0.84) (Similarity: 0.98)
"The Numerical Simulation on the Stability of Steep Rock Slope by DDA". (Similarity: 0.97)
We show by simulation that the proposed robot model can walk down a slope passively and also verify the stability of this walking by calculating the eigenvalues of the Jacobian of the Poincare map. (Similarity: 0.97)
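Comparing the two lists, the Transformers results look somewhat better. Apart from the exact match, the WMD scores are compressed into the 0.97 to 0.99 range and include an off-topic sentence about a walking robot, while the Transformers scores spread from 0.77 to 0.89 and every hit concerns slope stability or PFC.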
4 Transformers clustering
The sentence vectors produced by Transformers can also be clustered, by importing KMeans from sklearn:
from sklearn.cluster import KMeans
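A minimal sketch of how KMeans might be applied to the sentence vectors. The toy 2-D vectors stand in for `corpus_embeddings`, and the values of `num_clusters`, `n_init` and `random_state` are assumptions, since the original run's settings are not stated.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy vectors standing in for corpus_embeddings (made up for illustration).
toy_sentences = ["a1", "a2", "b1", "b2"]
toy_embeddings = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])

num_clusters = 2  # assumed value; the original article does not state it
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
labels = kmeans.fit_predict(toy_embeddings)

# Group the sentences by their assigned cluster label.
clusters = {}
for sentence, label in zip(toy_sentences, labels):
    clusters.setdefault(int(label), []).append(sentence)

for label, members in sorted(clusters.items()):
    print(label, members)
```

The numeric cluster labels are arbitrary; only the grouping of sentences is meaningful.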
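Below is one of the resulting clusters. Word-frequency counts show that the theme of this cluster is "Failure". Clustering helps us focus on a particular class of topics.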
'3-D Granular Simulation on the Process of Slope Failure and Collapse',
'Failure process simulation of sliding unstable rock based on PFC2D',
'rock slope; step-path failure; rock bridge; slope stability; PFC2D',
'Similar to slope stability failure',
'The effect of discontinuity Persistence an Rock Slope Stability',
'slope stability 1; wedge failure',
'Jointed rock slope Step-path failure Rock bridges Slope stability PFC',
'rock slope step-path failure rock bridge slope stability after blasting was discussed.'
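The word-frequency check on this cluster can be sketched with `collections.Counter`. The sentences are the cluster members listed above, copied verbatim. The tokenizer and the stop list are assumptions: besides a few common English words, the stop list excludes "slope", "rock" and "stability", on the assumption that these terms are common to the whole corpus and so do not distinguish one cluster from another. The original article does not say how its word counts were filtered.

```python
import re
from collections import Counter

cluster = [
    '3-D Granular Simulation on the Process of Slope Failure and Collapse',
    'Failure process simulation of sliding unstable rock based on PFC2D',
    'rock slope; step-path failure; rock bridge; slope stability; PFC2D',
    'Similar to slope stability failure',
    'The effect of discontinuity Persistence an Rock Slope Stability',
    'slope stability 1; wedge failure',
    'Jointed rock slope Step-path failure Rock bridges Slope stability PFC',
    'rock slope step-path failure rock bridge slope stability after blasting was discussed.',
]

# Assumed stop list: common English words plus corpus-wide domain terms.
stop_words = {"the", "of", "on", "and", "to", "an", "slope", "rock", "stability"}


def word_counts(sentences, stop_words):
    """Count lowercase tokens (hyphenated words kept whole), skipping stop words."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts


counts = word_counts(cluster, stop_words)
print(counts.most_common(3))
```

Under these assumptions "failure" is the most frequent remaining word, appearing in seven of the eight sentences.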
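5 Conclusion
This article used Transformers to determine the similarity between sentences. The Transformers results were better than the WMD results, and clustering the Transformers vectors helps us focus on a particular class of topics. Future work will continue to explore what Transformers can do.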