
Strangely high similarity between two sentences

spacy nlp python

Mr.D · Tech community · 7 years ago

I have downloaded the en_core_web_lg model and am comparing two sentences:

import spacy

nlp = spacy.load('en_core_web_lg')

search_doc = nlp("This was very strange argument between american and british person")

main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(main_doc.similarity(search_doc))

0.9066019751888448

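These two sentences should not be 90% similar; they have very different meanings.

Why is this happening? Do I need to add some extra vocabulary to make the similarity result more reasonable?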

2 replies | as of 7 years ago

dennlinger 7 years ago

The spaCy documentation on vector similarity explains the basic idea:
each word has a vector representation, learned through contextual embeddings (Word2Vec).

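Now, the embedding of a full sentence is simply the average over the vectors of all its words. If many of those words lie in the same semantic region (for example filler words such as "he", "was", "this"), and the remaining vocabulary "cancels out", you can end up with a similarity like the one in your case.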
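The averaging effect described above can be reproduced with a toy example. This is only a sketch: the three-dimensional vectors below are made up for illustration and are not real spaCy vectors, and `mean_vector` and `cosine` are hypothetical helper names.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors (assumption): one shared "filler" direction and
# two content words that point in nearly opposite directions.
filler = [1.0, 0.9, 0.1]    # stands in for words like "he", "was", "this"
topic_a = [0.0, 0.1, 1.0]
topic_b = [0.1, 0.0, -1.0]

sentence_a = [filler, filler, filler, topic_a]
sentence_b = [filler, filler, filler, filler, topic_b]

content_only = cosine(topic_a, topic_b)
averaged = cosine(mean_vector(sentence_a), mean_vector(sentence_b))

print("content words only:", round(content_only, 3))
print("averaged sentences:", round(averaged, 3))
```

Even though the two content words are almost opposite, the averaged sentence vectors come out highly similar because the shared filler direction dominates both means.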

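Since search_doc and main_doc carry additional information, such as the original sentence, you could modify the vectors with a length-difference penalty, or try comparing shorter parts of the sentences and computing pairwise similarities (then again, the question is which parts to compare).

Unfortunately, there is currently no clean way to simply solve this problem.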
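Both ideas can be sketched in a few lines. The answer does not say which penalty or which chunking to use, so everything here is an assumption: the penalty is the ratio of the shorter to the longer token count, the chunk vectors are made up, and `length_penalty`, `penalized_similarity`, and `best_pairwise` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def length_penalty(len_a, len_b):
    """Ratio of the shorter length to the longer one, in (0, 1]."""
    return min(len_a, len_b) / max(len_a, len_b)

def penalized_similarity(similarity, len_a, len_b):
    """Scale a similarity score down when the two texts differ in length."""
    return similarity * length_penalty(len_a, len_b)

def best_pairwise(chunks_a, chunks_b):
    """Highest cosine similarity over all pairs of chunk vectors."""
    return max(cosine(u, v) for u in chunks_a for v in chunks_b)

# Example with assumed numbers: a raw score of 0.9 between a
# 10-token text and a 30-token text.
print(penalized_similarity(0.9, 10, 30))

# Made-up chunk vectors (assumption): one vector per shorter piece of a sentence.
chunks_a = [[1.0, 0.0], [0.0, 1.0]]
chunks_b = [[0.0, 2.0], [-1.0, 0.0]]
print(best_pairwise(chunks_a, chunks_b))
```

With spaCy, the lengths could come from `len(doc)` and the chunk vectors from spans of each `Doc`; which spans to compare remains the open question the answer mentions.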

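demongolem 6 years ago

spaCy constructs a sentence embedding by averaging the word embeddings. Because an ordinary sentence contains many words that carry little meaning (called stop words), the result is poor. You can remove them like this: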

search_doc = nlp("This was very strange argument between american and british person")
main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

search_doc_no_stop_words = nlp(' '.join([str(t) for t in search_doc if not t.is_stop]))
main_doc_no_stop_words = nlp(' '.join([str(t) for t in main_doc if not t.is_stop]))

print(search_doc_no_stop_words.similarity(main_doc_no_stop_words))

Or keep only the nouns, since they carry the most information:

doc_nouns = nlp(' '.join([str(t) for t in doc if t.pos_ in ['NOUN', 'PROPN']]))
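As a variation on the snippets above, the filtered tokens can be averaged directly instead of joining them into a string and parsing it again. The sketch below is not the answerer's method: `filtered_mean_vector` is a hypothetical helper, and the tokens are stand-in objects with made-up two-dimensional vectors so the example runs without a model. It only relies on the attributes `is_stop`, `pos_`, and `vector`, which spaCy tokens also expose.

```python
import math
from types import SimpleNamespace

def filtered_mean_vector(tokens, keep):
    """Average the vectors of the tokens for which keep(token) is true."""
    vectors = [t.vector for t in tokens if keep(t)]
    if not vectors:
        raise ValueError("no tokens left after filtering")
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def not_stop(token):
    return not token.is_stop

def is_noun(token):
    return token.pos_ in ("NOUN", "PROPN")

def tok(text, is_stop, pos, vector):
    """Build a stand-in token (assumption: mimics a few spaCy Token attributes)."""
    return SimpleNamespace(text=text, is_stop=is_stop, pos_=pos, vector=vector)

# Made-up tokens and vectors, for illustration only.
doc_a = [tok("he", True, "PRON", [1.0, 1.0]),
         tok("was", True, "AUX", [1.0, 0.9]),
         tok("gentleman", False, "NOUN", [0.0, 1.0])]
doc_b = [tok("this", True, "PRON", [0.9, 1.0]),
         tok("was", True, "AUX", [1.0, 0.9]),
         tok("argument", False, "NOUN", [1.0, 0.0])]

keep_all = lambda token: True
all_sim = cosine(filtered_mean_vector(doc_a, keep_all), filtered_mean_vector(doc_b, keep_all))
content_sim = cosine(filtered_mean_vector(doc_a, not_stop), filtered_mean_vector(doc_b, not_stop))
noun_sim = cosine(filtered_mean_vector(doc_a, is_noun), filtered_mean_vector(doc_b, is_noun))

print(round(all_sim, 3), round(content_sim, 3), round(noun_sim, 3))
```

One reason to consider this design is that part-of-speech tags are taken from the original parse; re-running `nlp` on a string of isolated words may tag them differently.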

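Martino Mensio 6 years ago

As others have pointed out, you may want to use Universal Sentence Encoder or InferSent.

For Universal Sentence Encoder, you can install pre-built spaCy models that handle the wrapping of TFHub, so that you only need to install the package with pip and the vectors and similarity will work as expected.

You can follow the instructions in this repository (I am the author): https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub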

pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub/releases/download/en_use_md-0.2.0/en_use_md-0.2.0.tar.gz#en_use_md-0.2.0

import spacy
# this loads the wrapper
nlp = spacy.load('en_use_md')

# your sentences
search_doc = nlp("This was very strange argument between american and british person")

main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(main_doc.similarity(search_doc))
# this will print 0.310783598221594

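Mohammed Sunasra 6 years ago

As @dennlinger pointed out, spaCy's sentence embedding is just the average of all the individual word vectors. So if a sentence contains words of opposite meaning, such as "good" and "bad", their vectors may cancel each other out, resulting in a poor contextual embedding. If your use case is specifically about getting sentence embeddings, you should try the SOTA approaches below.

Google's Universal Sentence Encoder: https://tfhub.dev/google/universal-sentence-encoder/2
InferSent: https://github.com/facebookresearch/InferSent

I have tried both of these embeddings. Most of the time they give good results to start with, and they use word embeddings as the basis for building sentence embeddings.

Cheers!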