
Parsing emails to identify keywords

text python

Matt W. · 7 years ago

sentences = [['this is a paragraph there should be lots more words here'],
 ['more information in this one'],
 ['just more words to be honest, not sure what to write']]

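I want to use a regular expression to check whether any word from a keyword list appears in any of the sentences in the list. I do not want "informations" to be caught, only "information".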

keywords = ['information', 'boxes', 'porcupine']

['words' in words for [word for word in [sentence for sentence in sentences]]

or

for sentence in sentences:
    sentence.split(' ')

keywords = ['information', 'boxes']

sentences = [['this is a paragraph there should be lots more words here'],
     ['more information in this one'],
     ['just more words to be honest, not sure what to write']]

output: [False, True, False]

or, ultimately:

parsed_list = [['more information in this one']]

4 answers | latest 7 years ago

Zach Estela · 7 years ago

Here is a simple way to solve your problem. I find the lambda syntax easier to read than nested list comprehensions.

keywords = ['information', 'boxes']

sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more words to be honest, not sure what to write']]


results_lambda = list(
    filter(lambda sentence: any((word in sentence[0] for word in keywords)), sentences))

print(results_lambda)

[['more information in this one']]
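One caveat, given the requirement in the question: `word in sentence[0]` is a substring test, so a sentence containing "informations" would also pass. Below is a minimal sketch of a whole-word variant, assuming a regex with `\b` word boundaries is acceptable. The names `pattern` and `results_whole_word` and the third sentence are made up for illustration.

```python
import re

keywords = ['information', 'boxes']

# The last sentence is added here only to show the partial-match case.
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['several informations in this one']]

# \b anchors each keyword at word boundaries; re.escape guards against
# keywords that contain regex metacharacters.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b')

results_whole_word = list(
    filter(lambda sentence: pattern.search(sentence[0]) is not None, sentences))

print(results_whole_word)
```

With this pattern only the sentence containing the exact word "information" is kept.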

Paul Mikulskis · 7 years ago

This can be done with a quick list comprehension!

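lists = [['here is one sentence'], ['and here is another'], ['let us filter!'], ['more than one word filter']]
filters = ['filter', 'one']  # renamed so the built-in filter() is not shadowed
# lists are unhashable, so set() cannot be used to drop duplicates here;
# any() keeps each sentence at most once and preserves the original order
result = [x for x in lists if any(s in x[0] for s in filters)]
print(result)

[['here is one sentence'], ['let us filter!'], ['more than one word filter']]

Hope this helps!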

trans1st0r · 7 years ago

Do you want to find the sentences that contain all of the words in your keyword list?

If so, you can use a set of keywords and filter each sentence by whether all of the keywords are present in it:

keyword_set = set(keywords)
n = len(keyword_set) # number of keywords
def allKeywdsPresent(sentence):
    return len(set(sentence.split(" ")) & keyword_set) == n # the intersection of both sets should equal the keyword set

# each sentence is a single-element list, so pass the string inside it
filtered = [sentence for sentence in sentences if allKeywdsPresent(sentence[0])]

# filtered is the final set of sentences which satisfy your condition

# if you want a list of booleans:
boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]

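Also, be aware that using a set means duplicates in the keyword list are eliminated. So if your list contains some repeated keywords, use a dict instead of a set to record the count of each keyword, and reuse the logic above.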
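One possible reading of the dict suggestion, sketched with `collections.Counter` (a dict subclass): a sentence passes only if each keyword occurs at least as many times as it does in the keyword list. The helper name `all_keywords_present_counted` and the sample data are hypothetical.

```python
from collections import Counter

keywords = ['more', 'more', 'words']   # 'more' is required twice
keyword_counts = Counter(keywords)

def all_keywords_present_counted(sentence):
    word_counts = Counter(sentence.split())
    # Counter subtraction keeps only positive counts, so an empty result
    # means every keyword appears at least as often as required.
    return not (keyword_counts - word_counts)

sentences = [['more words and more words'],
             ['more words only once']]

boolean_array = [all_keywords_present_counted(s[0]) for s in sentences]
print(boolean_array)
```

The first sentence has "more" twice and passes; the second has it only once and does not.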

Judging from your example, it looks like a match on at least one keyword is enough. In that case allKeywdsPresent() needs to be modified:

def allKeywdsPresent(sentence):
   return any(word in keyword_set for word in sentence.split())
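Applied to the question's data, this variant gives the boolean list the asker described. The small sketch below also shows a limitation worth keeping in mind: `str.split()` leaves punctuation attached to tokens, so `'information,'` does not match. The function is renamed `any_keyword_present` here for clarity, and the last sentence is an added example.

```python
keyword_set = {'information', 'boxes'}

def any_keyword_present(sentence):
    # True if at least one whitespace-separated token is a keyword
    return any(word in keyword_set for word in sentence.split())

sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more words to be honest, not sure what to write'],
             ['some information, followed by a comma']]   # added example

boolean_array = [any_keyword_present(s[0]) for s in sentences]
print(boolean_array)  # the last entry is False: the token is 'information,'
```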

zwer · 7 years ago

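If you want to match only whole words, not just substrings, you have to account for all word separators (whitespace, punctuation, etc.): first split your sentences into words, then match those against your keywords. The simplest, although not foolproof, approach is to split on the regex \W+ (one or more non-word characters).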

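Once you have the list of words in your text and the list of keywords to match, the easiest, and probably the most performant, way to see whether there is a match is to take the set intersection of the two. So: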

import re

# not sure why you have the sentences in single-element lists, but if you insist...
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more disinformation, to make sure we have no partial matches']]

keywords = {'information', 'boxes', 'porcupine'}  # note we're using a set here!

WORD = re.compile(r"\W+")  # a simple regex to split sentences into words

# finally, iterate over each sentence, split it into words and check for intersection
result = [s for s in sentences if set(WORD.split(s[0].lower())) & keywords]
# [['more information in this one']]

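So, how does it work? Simple: we iterate over each sentence (lowercasing it for case insensitivity), then split the sentence into words with the regex mentioned above. This means that, for example, the first sentence is split into: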

['this', 'is', 'a', 'paragraph', 'there', 'should', 'be', 'lots', 'more', 'words', 'here']

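Then we convert it into a set for fast comparison (a set is a hash-based collection, and hash-based intersections are very fast). As a bonus, this also gets rid of duplicate words.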

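Finally, we take the set intersection with our keywords. If anything is returned, the two sets have at least one word in common, which means the if ... comparison evaluates to True, and in that case the current sentence is added to the result.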
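If a list of booleans is wanted instead of the filtered sentences (the first output form in the question), the same intersection can be wrapped in `bool()`. This is a brief sketch that reuses the names from the code above; `flags` is a name introduced here.

```python
import re

sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more disinformation, to make sure we have no partial matches']]

keywords = {'information', 'boxes', 'porcupine'}
WORD = re.compile(r"\W+")

# An empty intersection is falsy, a non-empty one is truthy
flags = [bool(set(WORD.split(s[0].lower())) & keywords) for s in sentences]
print(flags)
```

The sentence with "disinformation" stays False because the comparison is on whole words.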

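If splitting on \W+ turns out to be too crude for your text, a dedicated tokenizer such as the ones in nltk may be worth a look.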