
How to compare a list of words against a list of sentences and print the matching lines

python

wuzz · 6 years ago

I am currently cleaning up our database, and it is becoming very time-consuming. The typical

for email in emails:

loop is nowhere near fast enough.

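For example, I am currently comparing a list of 230,000 emails against a full record list of 39 million lines. Matching these emails to the record lines they belong to and printing those lines takes hours. Does anyone know how to implement threading in this query to speed it up? Something like this is very fast: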

strings = ("string1", "string2", "string3")
# `file` is assumed to be an already-open file object
for line in file:
    if any(s in line for s in strings):
        print("yay!")

However, that will never print the matching line, only the needle.

Thanks in advance.

2 Answers

Filip Młynarski · 6 years ago

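Here is an example solution using threads. This code splits the data into equal chunks and passes each chunk as an argument to compare(), one chunk per thread, according to the number of threads we declare.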

strings = ("string1", "string2", "string3")
lines = ['some random', 'lines with string3', 'and without it',\
         '1234', 'string2', 'string1',\
         "string1", 'abcd', 'xyz']

def compare(x, thread_idx):
    print('Thread-{} started'.format(thread_idx))
    for line in x:
        if any(s in line for s in strings):
            print("We got one of strings in line: {}".format(line))
    print('Thread-{} finished'.format(thread_idx))

Threading part:

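from threading import Thread

threads = []
threads_amount = 3
# Ceiling division, so trailing lines are not dropped when len(lines)
# is not a multiple of threads_amount (and the step is never 0).
chunk_size = max(1, -(-len(lines) // threads_amount))

for idx, start in enumerate(range(0, len(lines), chunk_size), start=1):
    thread = Thread(target=compare, args=(lines[start:start + chunk_size], idx))
    threads.append(thread)
    thread.start()

# Wait for every thread that was actually started.
for thread in threads:
    thread.join()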

Output:

Thread-1 started
Thread-2 started
Thread-3 started
We got one of strings in line: string2
We got one of strings in line: string1
We got one of strings in line: string1
We got one of strings in line: lines with string3
Thread-2 finished
Thread-3 finished
Thread-1 finished
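
Because each thread prints as it goes, the order of the lines in output like the above can differ from run to run. Below is a minimal sketch, not part of the original answer, that assumes the standard-library concurrent.futures.ThreadPoolExecutor is acceptable in place of raw Thread objects. The helper names find_matches and chunked are hypothetical. It collects the matches and returns them in input order. Keep in mind that in standard CPython builds the global interpreter lock prevents threads from running Python bytecode in parallel, so this tidies the output but should not be expected to speed up CPU-bound string matching.

```python
from concurrent.futures import ThreadPoolExecutor

strings = ("string1", "string2", "string3")
lines = ['some random', 'lines with string3', 'and without it',
         '1234', 'string2', 'string1',
         'string1', 'abcd', 'xyz']

def find_matches(chunk):
    """Return the lines in chunk that contain any of the search strings."""
    return [line for line in chunk if any(s in line for s in strings)]

def chunked(seq, n_chunks):
    """Split seq into at most n_chunks consecutive slices of near-equal size."""
    size = max(1, -(-len(seq) // n_chunks))  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() yields results in the same order as the input chunks
    results = list(pool.map(find_matches, chunked(lines, 3)))

# Flatten the per-chunk results into one ordered list of matching lines.
matches = [line for chunk_result in results for line in chunk_result]
for line in matches:
    print("We got one of strings in line: {}".format(line))
```

Returning the matches instead of printing inside the workers keeps the result deterministic and lets the caller decide what to do with the lines.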

slider · 6 years ago

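One possibility is to store the emails in a set. That makes the check if word in emails O(1) on average, so the work done is proportional to the total number of words in the file: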

emails = {"string1", "string2", "string3"} # this is a set

for line in f:
    if any(word in emails for word in line.split()):
        print("yay!")

Your original solution is O(nm) (for n words and m emails), versus O(n) with a set.
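
Since the question asks for the matching record lines rather than a fixed message, here is a minimal sketch of the same set-based idea that yields each matching line together with the emails found in it. The function name matching_lines and the sample data are hypothetical, and the sketch assumes, as the snippet above does, that emails appear as whitespace-separated tokens in each line. Records with other delimiters (commas, for example) would need a different split.

```python
import io

def matching_lines(lines, emails):
    """Yield (line, matched_emails) for each line holding a known email token."""
    wanted = set(emails)  # average O(1) membership checks
    for line in lines:
        # Set intersection finds every wanted email among the line's tokens.
        found = wanted.intersection(line.split())
        if found:
            yield line.rstrip("\n"), sorted(found)

# Hypothetical sample data standing in for the real record file.
records = io.StringIO(
    "1 alice@example.com Alice\n"
    "2 bob@example.com Bob\n"
    "3 carol@example.com Carol\n"
)
emails = ["alice@example.com", "carol@example.com", "dave@example.com"]

result = list(matching_lines(records, emails))
for line, found in result:
    print(line, "->", ", ".join(found))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, so the record file is streamed line by line and never has to be loaded into memory at once.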
