
In Python, how do I find similar substrings within a large string, with a similarity score?

similarity distance nlp string python

Sanjay Kamath · Tech Community · 7 years ago

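I'm not just looking for a simple similarity score between two texts, but for a similarity score of a substring within a string. Say:

text1 = 'cat is sleeping on the mat'

text2 = 'The cat is sleeping on the red mat in the living room'

In the example above, every word of text1 appears in text2 exactly, so the similarity should be 100%.

If some words of text1 are missing, the score should be lower.

I'm working with a large dataset of paragraphs of varying size, so finding a smaller paragraph inside a larger one, with such a similarity score, is crucial.

I have only found string similarity measures that compare two whole strings, such as cosine similarity and difflib similarity, but nothing that scores a substring within another string.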

3 answers | last active 7 years ago

DarkCygnus 7 years ago

Based on your description, how about:

>>> a = "cat is sleeping on the mat"
>>> b = "the cat is sleeping on the red mat in the living room"
>>> a = a.split(" ")
>>> score = 0.0
>>> for word in a:  # for every word in your string
...     if word in b:  # b is still a str here, so this is a substring test
...         score += 1
...
>>> score / len(a)  # fraction of the words that were found
1.0

If a word is missing, for example:

>>> c = "the cat is not sleeping on the mat"
>>> c = c.split(" ")
>>> score = 0.0
>>> for w in c:
...     if w in b:
...         score += 1
...
>>> score / len(c)
0.875

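Also, as @roadrunner suggested, you can split b and store it as a set to speed things up: b = set(b.split(" ")). This reduces each membership test to O(1) on average and brings the overall algorithm to O(n) in the number of words.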
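As a rough sketch of the set-based version (assuming Python 3; the function name word_overlap is made up for illustration), the snippet below also shows why the set matters for correctness and not only for speed: testing against the raw string matches fragments of words, while testing against the set matches whole words only.

```python
def word_overlap(small, big):
    """Fraction of the words in `small` that appear as whole words in `big`."""
    small_words = small.split()
    big_words = set(big.split())  # set membership is O(1) on average
    if not small_words:
        return 0.0
    hits = sum(1 for word in small_words if word in big_words)
    return hits / len(small_words)


b = "the cat is sleeping on the red mat in the living room"

# Against the raw string, `in` is a substring test, so fragments match too.
print("at" in b)               # True: "at" occurs inside "cat"
print("at" in set(b.split()))  # False: "at" is not a whole word of b

print(word_overlap("cat is sleeping on the mat", b))          # 1.0
print(word_overlap("the cat is not sleeping on the mat", b))  # 0.875
```

The comparison is case-sensitive, so "The" and "the" count as different words; lowercasing both inputs first would be one way around that if it suits your data.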

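Edit: You said you have already tried some metrics such as cosine similarity. I suspect you might still benefit from looking at Levenshtein distance, which could be useful in this case alongside the solutions provided.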
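To make the Levenshtein idea concrete, here is a minimal pure-Python sketch, not a tuned implementation. The helper names levenshtein and fuzzy_word_overlap and the max_distance parameter are assumptions for illustration; the second function counts a word as found when some word of the larger text is within a given edit distance of it.

```python
def levenshtein(s, t):
    """Edit distance between s and t (insertions, deletions, substitutions)."""
    if len(s) < len(t):
        s, t = t, s  # keep the row that is stored as short as possible
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_word_overlap(small, big, max_distance=1):
    """Fraction of words in `small` within max_distance edits of a word in `big`."""
    small_words = small.split()
    big_words = set(big.split())
    if not small_words:
        return 0.0
    hits = sum(
        1 for w in small_words
        if any(levenshtein(w, v) <= max_distance for v in big_words)
    )
    return hits / len(small_words)


b = "the cat is sleeping on the red mat in the living room"
print(levenshtein("kitten", "sitting"))                    # 3
print(fuzzy_word_overlap("cat is sleping on the mat", b))  # 1.0, typo tolerated
```

Two caveats: this compares every word of the smaller text with every distinct word of the larger one, so it is no longer O(n), and with max_distance=1 short words match each other easily ("is" is one edit away from "in"), so a threshold relative to word length may be a better fit.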

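RoadRunner 7 years ago

You can also use collections.defaultdict to store the counts of the words in word_a that are present in word_b, then sum() the counts and divide by the length of word_a at the end: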

from collections import defaultdict

a = "the cat is not sleeping on the mat"
b = "the cat is sleeping on the red mat in the living room"

word_a = a.split()
word_b = set(b.split())

d = defaultdict(int)
for word in word_a:
    if word in word_b:
        d[word] += 1

print(sum(d.values()) / len(word_a))

Which outputs:

0.875

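Note: Since we only care whether the words of word_a exist in word_b, converting word_b to a set() allows O(1) lookups on average, whereas keeping it as a list would cost O(n) per lookup. This makes the overall time complexity of the code above O(n).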
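One limitation of the set lookup is that a word repeated in word_a is counted every time, even if it occurs only once in word_b. If that matters for your data, a possible variation (not part of the answer above; the name multiset_overlap is hypothetical) is to compare word counts with collections.Counter, whose & operator keeps the per-word minimum of the two counts:

```python
from collections import Counter


def multiset_overlap(small, big):
    """Share of the words in `small` covered by `big`, respecting repeats."""
    small_counts = Counter(small.split())
    big_counts = Counter(big.split())
    total = sum(small_counts.values())
    if total == 0:
        return 0.0
    shared = small_counts & big_counts  # per-word minimum of the two counts
    return sum(shared.values()) / total


a = "the cat is not sleeping on the mat"
b = "the cat is sleeping on the red mat in the living room"
print(multiset_overlap(a, b))                              # 0.875
print(multiset_overlap("the the the the cat", "the cat"))  # 0.4
```

On the original example the result is unchanged, since b contains "the" often enough. The second call shows the difference: the set-based score would be 1.0, while here only one "the" and one "cat" are covered.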

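Carlos Fernández 7 years ago

Similar to DarkCygnus's answer, but the similarity is based on the total number of characters rather than the number of words. On the other hand, this script only counts matches against whole words (text_2.split()).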

from __future__ import division

text_1 = 'cat is sleeping on the mat'
text_2 = 'The cat is sleeping on the red mat in the living room'
no_match = 0
match = 0

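words_2 = text_2.split()  # split once instead of on every iteration

for word in text_1.split():
    if word not in words_2:
        no_match += len(word)  # characters of words missing from text_2
    else:
        match += len(word)  # characters of words found in text_2

similarity = match / (match + no_match)
print('{0:.0%}'.format(similarity))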
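For reuse across many paragraphs, the same character-weighted idea could be wrapped in a function. This is only a sketch assuming Python 3; the name char_weighted_overlap is made up, and it uses a set for the lookups as suggested in the other answers.

```python
def char_weighted_overlap(small, big):
    """Share of the characters of `small` that belong to whole words found in `big`."""
    big_words = set(big.split())
    matched = 0
    total = 0
    for word in small.split():
        total += len(word)
        if word in big_words:
            matched += len(word)
    return matched / total if total else 0.0


text_2 = 'The cat is sleeping on the red mat in the living room'
print('{0:.0%}'.format(char_weighted_overlap('cat is sleeping on the mat', text_2)))          # 100%
print('{0:.0%}'.format(char_weighted_overlap('the cat is not sleeping on the mat', text_2)))  # 89%
```

In the second call only "not" (3 of 27 characters) is missing, so the score is 24/27. Weighting by characters means a missing long word lowers the score more than a missing short one.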
