对于在一个大列表中搜索匹配,我相信哈希(元素)然后构建索引将是一个很好的解决方案。
您将获得的好处:
只需构建一次索引,即可节省时间供将来使用(无需为每次搜索反复循环)。
甚至,我们可以在启动程序时建立索引,然后在程序退出时释放它,
下面的代码使用两种方法来获取哈希值
:hash()和str();有时,您应该根据特定场景自定义一个哈希函数。
如果使用str(),代码看起来很简单,不需要考虑哈希冲突。但这可能会导致内存爆炸。
对于hash(),我使用该列表保存所有具有相同哈希值的sub\u lst。您可以使用hash(sub\u lst)%designed\u length来控制哈希大小(但这会增加哈希冲突率)。
以下代码的输出:
By Hash: 0.00023986603994852955
By str(): 0.00022884208565612796
By OP's: 0.3001317172469765
[Finished in 1.781s]
测试代码
:
from random import randint
from string import ascii_lowercase
import timeit
#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(10000)]
#print(my_huge_list_of_lists)
test_lst = [['a', 'b', 'c' ], ['a', 'b', 'c'] ]
#Solution 1: By using built-in hash function
def prepare1(huge_list, interval=1): #use built-in hash function
hash_db = {}
for index in range(len(huge_list) - interval + 1):
hash_sub = hash(str(huge_list[index:index+interval]))
if hash_sub in hash_db:
hash_db[hash_sub].append(huge_list[index:index+interval])
else:
hash_db[hash_sub] = [huge_list[index:index+interval]]
return hash_db
hash_db = prepare1(my_huge_list_of_lists, interval=2)
def check_sublist1(hash_db, sublst): #use built-in hash function
hash_sub = hash(str(sublst))
if hash_sub in hash_db:
return any([sublst == item for item in hash_db[hash_sub]])
return False
print('By Hash:', timeit.timeit("check_sublist1(hash_db, test_lst)", setup="from __main__ import check_sublist1, my_huge_list_of_lists, test_lst, hash_db ", number=100))
#Solution 2: By using str() as hash function
def prepare2(huge_list, interval=1): #use str() as hash function
return { str(huge_list[index:index+interval]):huge_list[index:index+interval] for index in range(len(huge_list) - interval + 1)}
hash_db = prepare2(my_huge_list_of_lists, interval=2)
def check_sublist2(hash_db, sublst): #use str() as hash function
hash_sub = str(sublst)
if hash_sub in hash_db:
return sublst == hash_db[hash_sub]
return False
print('By str():', timeit.timeit("check_sublist2(hash_db, test_lst)", setup="from __main__ import check_sublist2, my_huge_list_of_lists, test_lst, hash_db ", number=100))
#Solution 3: OP's current solution
def check_if_list_is_sublist(lst, sublst):
#checks if a list appears in order in another larger list.
n = len(sublst)
return any((sublst == lst[i:i + n]) for i in range(len(lst) - n + 1))
print('By OP\'s:', timeit.timeit("check_if_list_is_sublist(my_huge_list_of_lists, test_lst)", setup="from __main__ import check_if_list_is_sublist, my_huge_list_of_lists, test_lst ", number=100))
如果您想从一个列表中删除匹配的元素,这是可行的,但其效果是您可能必须为新列表重建索引。除非列表是链表,否则请保存索引中每个元素的指针。我只是谷歌
Python how to get the pointer for one element of a list
,但找不到任何有用的东西。如果有人知道怎么做,请毫不犹豫地分享您的解决方案。谢谢
以下是一个示例:
(
它生成一个新列表而不是返回原始列表,有时我们仍然需要从原始列表中筛选一些内容
)
from random import randint
from string import ascii_lowercase
import timeit
#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 1)] for x in range(2)] for y in range(100)]
#print(my_huge_list_of_lists)
test_lst = [[['a', 'b'], ['a', 'b'] ], [['b', 'a'], ['a', 'b']]]
#Solution 1: By using built-in hash function
def prepare(huge_list, interval=1): #use built-in hash function
hash_db = {}
for index in range(len(huge_list) - interval + 1):
hash_sub = hash(str(huge_list[index:index+interval]))
if hash_sub in hash_db:
hash_db[hash_sub].append({'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]})
else:
hash_db[hash_sub] = [{'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]}]
return hash_db
hash_db = prepare(my_huge_list_of_lists, interval=2)
def check_sublist(hash_db, sublst): #use built-in hash function
hash_sub = hash(str(sublst))
if hash_sub in hash_db:
return [ item for item in hash_db[hash_sub] if sublst == item['data'] ]
return []
def remove_if_match_sublist(target_list, hash_db, sublsts):
matches = []
for sublst in sublsts:
matches += check_sublist(hash_db, sublst)
#make sure delete elements from end to begin
sorted_match = sorted(matches, key=lambda item:item['beg'], reverse=True)
new_list = list(target_list)
for item in sorted_match:
del new_list[item['beg']:item['end']]
return new_list
print('Removed By Hash:', timeit.timeit("remove_if_match_sublist(my_huge_list_of_lists, hash_db, test_lst)", setup="from __main__ import check_sublist, my_huge_list_of_lists, test_lst, hash_db, remove_if_match_sublist ", number=1))