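1 Introduction
In previous posts I discussed two methods for determining sentence similarity. Today I present a third, based on WMD (Word Mover's Distance). This was in fact the first method I implemented, about three or four years ago, while building a large GMAT database. This post first briefly reviews the two earlier methods and then uses WMD to determine the similarity between sentences.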
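2 Review of Previous Work
In the note "A Fast Similarity Query Method for Sentences and Paragraphs" (geotech-words-combination.py), a sentence is split into words, the words are grouped into all C(n,3) three-word combinations, and those combinations are used to query the dataset (.txt). This method extracts every sentence in the dataset that contains one of the keyword combinations; its drawback is that the sentences are not ranked by similarity. The note "Clustering by Text Similarity to Produce New Documents for Further Study" (geotech-similarity-query.py) instead checks the dataset (.txt) with cosine similarity. That method works well for short sentences and less well for long ones. For large datasets, combining the two will probably work better: filter with the first method, then rank by similarity with the second. The two scripts have not been merged yet; I plan to do that soon.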
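The filter-then-rank idea described above can be sketched in a few lines. This is a minimal illustration, not the code in geotech-words-combination.py or geotech-similarity-query.py: the function names, the simple regex tokenizer, and the use of raw word counts for the cosine score are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def combination_filter(query, sentences, k=3):
    """Keep sentences that contain all words of at least one k-word combination of the query."""
    query_words = sorted(set(tokenize(query)))
    if not query_words:
        return []
    k = min(k, len(query_words))
    combos = [set(c) for c in combinations(query_words, k)]  # C(n, k) combinations
    hits = []
    for sentence in sentences:
        words = set(tokenize(sentence))
        if any(combo <= words for combo in combos):
            hits.append(sentence)
    return hits


def cosine_similarity(a, b):
    """Cosine similarity between the bag-of-words count vectors of two texts."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def filter_then_rank(query, sentences, k=3):
    """Stage 1: combination filter. Stage 2: sort the survivors by cosine similarity."""
    candidates = combination_filter(query, sentences, k)
    return sorted(((cosine_similarity(query, s), s) for s in candidates), reverse=True)
```

Calling `filter_then_rank(query, sentences)` returns `(score, sentence)` pairs, highest score first. The filter is cheap because it only needs set containment, so the cosine step runs on a much smaller candidate list.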
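3 The WMD Algorithm for Sentence Similarity
WMD measures the distance between two documents. The method comes from the paper "From Word Embeddings To Document Distances" (Kusner et al., 2015). Here WMD is implemented on top of a vector library built with Doc2Vec (geotech-wmd-doc2vec.py). Its advantage is that it partly achieves semantic querying; its drawback is a long running time.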
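For reference, the distance defined in that paper can be written as a transport problem. The notation below follows the paper, not the script: x_i is the embedding of word i, d_i and d'_j are the normalized bag-of-words weights of word i in the first document and word j in the second, and T_ij is the amount of word i that "travels" to word j.

```latex
\[
\mathrm{WMD}(d, d') \;=\; \min_{T \ge 0} \; \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij}\, \lVert x_i - x_j \rVert_2
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i,
\qquad
\sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j .
\]
```

Each pair of documents requires solving one such optimization. The paper gives the cost as roughly cubic in the number of unique words, which helps explain why the method is slow on a large dataset.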
(1) Read the dataset. Here I use a conditional statement to exclude sentences shorter than 50 characters.
(2) Train Doc2Vec.
(3) Compute the similarity.
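The three steps above might look roughly like the sketch below. It is not the content of geotech-wmd-doc2vec.py. Everything in it is an assumption made for illustration: the function names, the one-sentence-per-line file layout, the regex tokenizer, the hyperparameters, and the conversion of distance to similarity as 1 / (1 + distance). It relies on the third-party gensim library (`Doc2Vec`, `TaggedDocument`, `KeyedVectors.wmdistance`); depending on the gensim version, `wmdistance` also needs the POT or pyemd package installed.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def load_sentences(path, min_chars=50, encoding="utf-8"):
    """Step (1): read one sentence per line, skipping lines shorter than min_chars."""
    sentences = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if len(line) >= min_chars:
                sentences.append(line)
    return sentences


def train_doc2vec(sentences, vector_size=100, epochs=20):
    """Step (2): train Doc2Vec; WMD later uses its word vectors (model.wv)."""
    # Imported here so the helpers above work without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    tagged = [TaggedDocument(words=tokenize(s), tags=[i]) for i, s in enumerate(sentences)]
    model = Doc2Vec(vector_size=vector_size, min_count=2, epochs=epochs)
    model.build_vocab(tagged)
    model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)
    return model


def wmd_query(model, query, sentences, top_n=30):
    """Step (3): rank sentences by WMD to the query, reported as 1 / (1 + distance)."""
    query_tokens = tokenize(query)
    scored = []
    for sentence in sentences:
        # wmdistance returns inf when a document has no in-vocabulary words,
        # which maps to a similarity of 0.0 below.
        distance = model.wv.wmdistance(query_tokens, tokenize(sentence))
        scored.append((1.0 / (1.0 + distance), sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

A typical call sequence would be `sentences = load_sentences("corpus.txt")`, then `model = train_doc2vec(sentences)`, then `wmd_query(model, "PFC2D PFC3D slope stability simulation", sentences)`. The loop in `wmd_query` solves one transport problem per sentence, so its running time grows with the size of the dataset.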
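4 A WMD Sentence Similarity Example
corpus.txt (E:\Geotech\mydata) is a relatively small geotechnical engineering dataset, about 59.6 MB in size. We now use WMD to query this dataset for similar sentences. The query sentence is: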
query = 'PFC2D PFC3D slope stability simulation'
The program ran for 793 seconds, about 13 minutes. The similarity scores range from 0.44 down to 0.31; the top 30 results are listed below.
PFC2D PFC3D slope stability simulation
--------------------------------------
PFC2D Simulation on Stability of Loose Deposits Slope.pdf
Stability analysis of jointed rock slope
Blasting Process Simulation and Stability Study of an Open Mine Slope Based on PFC3D
Slope stability analysis by strength reduction
Jointed Rock Slopes Stability Analysis Using PFC2D
Strength Reduction Method for Rock Slope Stability Analysis Based on PFC2D
rock slope;step-path failure;rock bridge;slope stability;PFC2D
PFC2D Simulation on Stability of Loose Deposits Slope in Highway Cutting Excavation
Stability analysis of slope under earthquake with FLAC3D
Particle Flow Simulation of Slope Instability and Failure
'rock slope stability analysis and slope design',
slope instability, pfc2d, numerical simulation, parallel bond model, stability analysis
The numerical simulation of the stability conditions concurs with our hypothesis.
'PFC2D Simulation on Stability of Loose Deposits Slope in Highway Cutting Excavation',
'Strength Reduction Method for Rock Slope Stability Analysis Based on PFC2D',
'rock slope step-path failure rock bridge slope stability PFC2D',
Cone Crusher Simulation Based on PFC3D
'slope instability, pfc2d, numerical simulation, parallel bond model, stability analysis',
Simulation and analysis of the earthquake stability of the tailing reservoir based on PFC3D
3-D Granular Simulation on the Process of Slope Failure and Collapse
Simulated study on the influence of the slope blasting height on the slope stability
Slope stability analysis for earthquakes
Physical and numerical models in rock and soil slope stability
Application of distinct element analysis in slope stability problems
Failure process simulation of sliding unstable rock based on PFC2D
'General two-dimensional slope stability analysis',
Numerical simulation for different motion forms of sliding soil along slope with PFC3D
Numerical Simulation of Fracture Mechanism by Blasting using PFC2D
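5 An Integrated Solution
The results of the three methods are pooled and then scored again with WMD.
(1) Run the combination query with geotech-words-combination.py. It ran for 122 seconds, about 2 minutes, and returned somewhat more results than WMD.
(2) Run the similarity query with geotech-similarity-query.py.
(3) Put the results of these two steps together with the WMD results and run the WMD query again. The top 10 results are shown below, and they are fairly satisfactory.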
PFC2D PFC3D slope stability simulation
--------------------------------------
0, sim = 1.00
keywords:['pfc2d', 'pfc3d', 'slope', 'stability', 'simulation']
1, sim = 0.98
PFC2D Simulation on Stability of Loose Deposits Slope.
65, sim = 0.98
Simulation and analysis of the earthquake stability of the tailing reservoir based on PFC3D
77, sim = 0.98
The PFC3D simulation platform was employed to calcaulate the single-hole blasting processes with different heights,buried depths and charge amounts in the open mine slope,and the slope stability after blasting was discussed.
8, sim = 0.98
NUMERICAL SIMULATION OF A FILLED SLOPE STABILITY ON SOFT SOIL ROADBED REINFORCED BY GRAVEL PILE USING PFC2D
17, sim = 0.98
Strength Reduction Method for Rock Slope Stability Analysis Based on PFC2D
49, sim = 0.98
The stability analysis of this slope requires in depth geotechnical investigation which are aided by result orientated stability assessment using empirical methods and numerical simulation.
76, sim = 0.98
Impacts of material property variability and spatial variability on slope stability were analyzed using Monte Carlo simulation,
159, sim = 0.97
It is shown that the reformulation of the reference slope is both in keeping with the underlying derivation of the Saint-Venant equations and provides practical numerical stability without altering the realism of the simulation.
5, sim = 0.97
Blasting Process Simulation and Stability Study of an Open Mine Slope Based on PFC3D
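The pooling and re-scoring described in this section could be sketched as follows. This is an illustrative outline only: the function names are hypothetical, the clean-up rule (stripping wrapping quotes and trailing commas, as seen in some of the raw results) is an assumption, and `similarity` stands for whatever scoring function is plugged in, for example a WMD-based one.

```python
import re


def normalize(sentence):
    """Strip wrapping quotes, trailing commas and extra whitespace from a raw result line."""
    cleaned = sentence.strip().strip("',\"").strip()
    return re.sub(r"\s+", " ", cleaned)


def merge_candidates(*result_lists):
    """Pool several result lists, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for results in result_lists:
        for sentence in results:
            cleaned = normalize(sentence)
            key = cleaned.lower()  # compare case-insensitively
            if cleaned and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged


def rerank(query, candidates, similarity, top_n=10):
    """Score every pooled candidate with the supplied function and keep the top_n."""
    scored = [(similarity(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

Because the pooled list is far smaller than the full dataset, the expensive scoring function only runs on sentences that at least one method already considered relevant.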
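6 Conclusion
This post used WMD to compute the similarity between sentences. Combined with the two methods described earlier, it gives relatively satisfactory results. Although WMD is a form of semantic query, it is limited by the training corpus and falls short of true semantic querying. Next we will turn to the Transformer, currently the most popular NLP approach.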