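1. Introduction
As described in the earlier post "Ranking Sentence Similarity by Computing Vector Values with Euclidean Distance", Euclidean distance has shortcomings as a measure of text similarity; in particular, it often returns empty values on unstructured text. I therefore dropped that approach in favor of cosine similarity. This post first covers the cosine similarity algorithm, then shows how to traverse an entire directory to collect all similar content from the documents in it.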
2. Cosine similarity computation
We still use the sklearn library. First, import the modules:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import codecs
import os
Next comes the cosine similarity computation. The condition if (round(each[1],2)) >= 0.3 keeps only the sentences whose similarity is greater than or equal to 0.3.
# Method 1: cosine similarity check
cosine_similarity_cofficient = cosine_similarity(features)
# Rank every sentence against row 0 (the first sentence), most similar first
cos_sort_sims = sorted(enumerate(cosine_similarity_cofficient[0]), key=lambda item: -item[1])
number_of_sentence = 0
for each in cos_sort_sims:
    number_of_sentence = number_of_sentence + 1
    if (round(each[1],2)) >= 0.3:
        print('(' + str(number_of_sentence) + ') ' + str(round(each[1],2)), sentences[each[0]])
print('-'*70 + '\n')
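The snippet above assumes that `sentences` and `features` already exist. A minimal sketch of how they might be built is given here, assuming the query sentence is placed first in the list so that row 0 of the similarity matrix refers to it. The helper name `rank_by_cosine` and the unrelated sample sentence are hypothetical and only serve the illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_by_cosine(sentences, threshold=0.3):
    """Return (index, score) pairs for sentences similar to sentences[0]."""
    # Bag-of-words counts; one row per sentence, kept as a sparse matrix.
    features = CountVectorizer().fit_transform(sentences)
    # Row 0 holds the similarity of the query to every sentence.
    scores = cosine_similarity(features)[0]
    ranked = sorted(enumerate(scores), key=lambda item: -item[1])
    # Keep only the matches at or above the threshold, rounded to 2 decimals.
    return [(i, round(float(s), 2)) for i, s in ranked
            if round(float(s), 2) >= threshold]


sentences = [
    "PFC2D rock slopes stability simulation",  # query (assumed to come first)
    "PFC2D Simulation on Stability of Loose Deposits Slope",
    "Simulation of step-path brittle failure in rock slopes",
    "A recipe for tomato soup",  # hypothetical unrelated sentence
]
for index, score in rank_by_cosine(sentences):
    print(score, sentences[index])
```

The unrelated sentence shares no words with the query, so its score is 0 and the threshold filters it out.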
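3. Traversing the whole directory
We extend this approach to an entire directory in order to find the sentences most similar to the query sentence. The following code collects the path of every file under the directory: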
databases = []
directory = ".\\temp"
# Walk the directory tree and collect the full path of every file
for fpathe,dirs,fs in os.walk(directory):
    for f in fs:
        flname = os.path.join(fpathe,f)
        databases.append(flname)
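Once the file paths are collected, the sentences still have to be read from each file. The sketch below shows one way this might be done, under two assumptions that the original does not state: each file holds one sentence per line, and the files are UTF-8 encoded. The helper names `list_files` and `load_sentences` are hypothetical, and the built-in `open` is used here in place of `codecs.open`.

```python
import os
import tempfile


def list_files(directory):
    """Collect the path of every file under directory, including subdirectories."""
    paths = []
    for folder, _dirs, files in os.walk(directory):
        for name in files:
            paths.append(os.path.join(folder, name))
    return sorted(paths)


def load_sentences(paths, encoding="utf-8"):
    """Read each file and return its non-empty lines (assumed: one sentence per line)."""
    sentences = []
    for path in paths:
        # errors="ignore" drops undecodable bytes; an assumption for messy input.
        with open(path, "r", encoding=encoding, errors="ignore") as handle:
            sentences.extend(line.strip() for line in handle if line.strip())
    return sentences


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "sub"))
    with open(os.path.join(demo_dir, "a.txt"), "w", encoding="utf-8") as f:
        f.write("PFC2D rock slopes toppling failure stability analysis\n\n")
    with open(os.path.join(demo_dir, "sub", "b.txt"), "w", encoding="utf-8") as f:
        f.write("Simulation of step-path brittle failure in rock slopes\n")
    demo_sentences = load_sentences(list_files(demo_dir))
    print(demo_sentences)
```

The resulting list can be passed to the vectorizer with the query sentence placed in front of it.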
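4. Experimental results
We use "PFC2D rock slopes stability simulation" as the target sentence for the similarity computation and set the threshold to 0.5, which keeps the displayed results relatively precise. The results are saved to a file. Some of the main results are listed below: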
PFC2D rock slopes stability simulation
PFC2D Simulation on Stability of Loose Deposits Slope
PFC2D rock slopes toppling failure stability analysis
Strength Reduction Method for Rock Slope Stability Analysis Based on PFC2D
Step-path failure mode and stability calculation of jointed rock slopes by PFC2D
Simulation of step-path brittle failure in rock slopes
Shear resistance of jointed rock masses and stability calculation of rock slopes
Keywords:
Rock-Slope-Stability-Modeling, Stability of Earth Slopes, The stability of rock slopes is considered crucial to public safety in highways passing through rock loose deposit slope, seismic wave, microscopic, rock slope step-path failure rock bridge slope stability, slope instability, numerical simulation, parallel bond model, stability analysis, stability rock slope jointed rock weathered rock, strength reduction, limit analysis stability number to the stability of slopes. rock falls RocFall software rock slopes rock fall roll-out distance catchment ditch.
Reference
[1] Kumsar, H., et al. (2000). "Dynamic and static stability assessment of rock slopes against wedge failures." Rock Mechanics and Rock Engineering 33(1): 31-51.
[2] Zheng, Y., et al. (2019). "Stability analysis of anti-dip bedding rock slopes locally reinforced by rock bolts." Engineering Geology 251: 228-240.
[3] Taheri, A. and K. Tani (2010). "Assessment of the Stability of Rock Slopes by the Slope Stability Rating Classification System." Rock Mechanics and Rock Engineering 43(3): 321-333.
[4] Pal, S., et al. (2012). "Earthquake Stability Analysis of Rock Slopes: a Case Study." Rock Mechanics and Rock Engineering 45(2): 205-215.
[5] J.E. Jennings. (1972) An approach to the stability of rock slopes.
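5. Problems encountered
A single data file cannot be too large. As a rough estimate, files over about 2 MB trigger the error "Unable to allocate 9.04 GiB for an array with shape (53973, 22490) and data type float64". This is caused by insufficient memory; raising the virtual memory to its maximum did not help. I have not yet found a good solution, so for now large files have to be split into smaller ones.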
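One direction that might be worth trying: the shape in the error message, (53973, 22490), has the form sentences by vocabulary, and 53973 × 22490 × 8 bytes is about 9.04 GiB. This suggests (an assumption, since the vectorizing code is not shown) that the feature matrix is being converted to a dense float64 array somewhere. The sketch below keeps the sparse matrix returned by CountVectorizer and compares only the query row against all rows, so the result has shape (1, N) rather than (N, N). The helper name `query_similarities` is hypothetical, and this has not been tested on the original data.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def query_similarities(sentences):
    """Cosine similarity of sentences[0] to every sentence, without densifying."""
    features = CountVectorizer().fit_transform(sentences)  # scipy sparse matrix
    assert issparse(features)  # avoid calling .toarray() on a large corpus
    # One row against all rows yields shape (1, N) instead of (N, N).
    return cosine_similarity(features[0], features)[0]


corpus = [
    "PFC2D rock slopes stability simulation",  # query (assumed to come first)
    "Strength Reduction Method for Rock Slope Stability Analysis Based on PFC2D",
    "Stability of Earth Slopes",
]
scores = query_similarities(corpus)
print(scores.shape)
```

Sorting `enumerate(scores)` then works the same way as in section 2.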
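6. Conclusion
This post uses cosine similarity to produce a text on a specific topic, which can then serve as the basis for derivative writing. The new document is also saved in the dataset so that it can be learned from again.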