代码之家  ›  专栏  ›  技术社区  ›  Joylove

确定有序子列表是否位于大型列表中的最快方法?

  •  1
  • Joylove  · 技术社区  · 8 年前

    假设我有一个my\u Large\u list\u of\u列表,其中包含2000000个列表,每个列表的长度约为50。

    我想缩短2000000 my_huge_list_of_lists 通过丢弃序列中不包含两个元素的子列表。

    到目前为止,我已经:

    # https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
    def check_if_list_is_sublist(lst, sublst):
        #checks if a list appears in order in another larger list.
        n = len(sublst)
        return any((sublst == lst[i:i + n]) for i in xrange(len(lst) - n + 1))
    
    my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                                if not check_if_list_is_sublist(x, [a,b])]
    my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                                if not check_if_list_is_sublist(x, [b,a])]
    

    搜索词[a,b]或[b,a]的连续性很重要,因此我不能使用 set.issubset()

    我觉得这很慢。我想加快速度。我考虑了一些选项,如使用“提前退出”和声明:

    my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                                if (a in x and not check_if_list_is_sublist(x, [a,b]))]
    

    而且在 for 循环使用 or 声明:

    my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                                if not (check_if_list_is_sublist(x, [a,b])
                                        or check_if_list_is_sublist(x, [b,a]))]
    

    并致力于加速功能(WIP)

    # https://stackoverflow.com/questions/48232080/the-fastest-way-to-check-if-the-sub-list-exists-on-the-large-list
    def check_if_list_is_sublist(lst, sublst):
            checks if a list appears in order in another larger list.
            set_of_sublists = {tuple(sublst) for sublist in lst}
    

    并对堆栈溢出进行了搜索;但是想不出办法因为 check_if_list_is_sublist() 被称为is len(my_huge_list) * 2

    编辑:根据请求添加一些用户数据

    from random import randint
    from string import ascii_lowercase
    my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(2000000)]
    my_neighbor_search_fwd = [i,c]
    my_neighbor_search_rev = my_neighbor_search_fwd.reverse()
    
    4 回复  |  直到 8 年前
        1
  •  2
  •   Oluwafemi Sule    8 年前

    将n大小的子序列中的项解压缩为n个变量。然后写一个列表理解来过滤列表,检查子列表中的a、b或b、a。 e、 g。

    a, b = sublst
    
    def checklst(lst, a, b):
        l = len(lst)
        start = 0
        while True:
            try:
                a_index = lst.index(a, start)
            except ValueError:
                return False
            try:
                return a_index > -1 and lst[a_index+1] == b
            except IndexError:
                try:
                    return a_index > -1 and lst[a_index-1] == b
                except IndexError:
                    start = a_index + 1
                    if start == l:
                        return False
                    continue # keep looking at the next a
    
    %timeit found = [l for l in lst if checklst(l, a, b)]
    1.88 s ± 31.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %timeit found = [x for x in lst if (a in x and not check_if_list_is_sublist(x, [a,b]))]
    22.1 s ± 1.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
        2
  •  1
  •   Paul Becotte    8 年前

    因此,我想不出任何聪明的算法检查来真正减少这里的工作量。然而,您在代码中进行了大量的分配,并且迭代过多。所以,只是把一些声明从函数中移出一点就让我

    sublst = [a, b]
    l = len(sublst)
    indices = range(len(sublst))
    def check_if_list_is_sublist(lst):
        for i in range(len(lst) - (l -1)):
            if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
                return True
            if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
                return True
        return False
    
    my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                               if not check_if_list_is_sublist(x)]
    

    这将上述示例代码的运行时间减少了约50%。对于这样一个大小的列表,生成更多的进程并划分工作可能也会提高性能。但是,我想不出任何方法来真正减少比较的数量。。。

        3
  •  1
  •   martineau    8 年前

    虽然这本身并不是你所说的“答案”,但它是一个基准框架,可以帮助你确定最快的方式来完成你想要的,因为它允许相对容易的修改以及添加不同的方法。

    我已经把当前发布的答案以及与他们一起运行的结果放在了里面。

    注意事项:请注意 尚未验证 测试中的所有答案都是“正确的”,因为他们实际上做了你想要的事情,也不知道他们在这个过程中会消耗多少内存,这可能是另一个考虑因素。

    目前看来,@Oluwafemi Sule的答案是距离最接近的竞争对手一个数量级(10倍)的最快答案。

    from __future__ import print_function
    from collections import namedtuple
    import sys
    from textwrap import dedent
    import timeit
    import traceback
    
    N = 10  # Number of executions of each "algorithm".
    R = 3  # Number of repetitions of those N executions.
    
    from random import randint, randrange, seed
    from string import ascii_lowercase
    
    a, b = 'a', 'b'
    NUM_SUBLISTS = 1000
    SUBLIST_LEN = 50
    PERCENTAGE = 50  # Percentage of sublist that should get removed.
    
    seed(42)  # Initialize random number so the results are reproducible.
    my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for __ in range(SUBLIST_LEN)]
                                    for __ in range(NUM_SUBLISTS)]
    
    # Put the target sequence in percentage of the sublists so they'll be removed.
    for __ in range(NUM_SUBLISTS*PERCENTAGE // 100):
        list_index = randrange(NUM_SUBLISTS)
        sublist_index = randrange(SUBLIST_LEN)
        my_huge_list_of_lists[list_index][sublist_index:sublist_index+2] = [a, b]
    
    # Common setup for all testcases (executed before any algorithm specific setup).
    COMMON_SETUP = dedent("""
        from __main__ import a, b, my_huge_list_of_lists, NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE
    """)
    
    class TestCase(namedtuple('CodeFragments', ['setup', 'test'])):
        """ A test case is composed of separate setup and test code fragments. """
        def __new__(cls, setup, test):
            """ Dedent code fragment in each string argument. """
            return tuple.__new__(cls, (dedent(setup), dedent(test)))
    
    testcases = {
        "OP (Nas Banov)": TestCase("""
            # https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
            def check_if_list_is_sublist(lst, sublst):
                ''' Checks if a list appears in order in another larger list. '''
                n = len(sublst)
                return any((sublst == lst[i:i+n]) for i in range(len(lst) - n + 1))
            """, """
            shortened = [x for x in my_huge_list_of_lists
                            if not check_if_list_is_sublist(x, [a, b])]
            """
        ),
        "Sphinx Solution 1 (hash)": TestCase("""
            # https://stackoverflow.com/a/49518843/355230
    
            # Solution 1: By using built-in hash function.
            def prepare1(huge_list, interval=1): # Use built-in hash function.
                hash_db = {}
                for index in range(len(huge_list) - interval + 1):
                    hash_sub = hash(str(huge_list[index:index+interval]))
                    if hash_sub in hash_db:
                        hash_db[hash_sub].append(huge_list[index:index+interval])
                    else:
                        hash_db[hash_sub] = [huge_list[index:index+interval]]
                return hash_db
    
            def check_sublist1(hash_db, sublst): # Use built-in hash function.
                hash_sub = hash(str(sublst))
                if hash_sub in hash_db:
                    return any([sublst == item for item in hash_db[hash_sub]])
                return False
            """, """
            hash_db = prepare1(my_huge_list_of_lists, interval=2)
            shortened = [x for x in my_huge_list_of_lists
                            if check_sublist1(hash_db, x)]
            """
        ),
        "Sphinx Solution 2 (str)": TestCase("""
            # https://stackoverflow.com/a/49518843/355230
    
            #Solution 2: By using str() as hash function
            def prepare2(huge_list, interval=1): # Use str() as hash function.
                return {str(huge_list[index:index+interval]):huge_list[index:index+interval]
                            for index in range(len(huge_list) - interval + 1)}
    
    
            def check_sublist2(hash_db, sublst): #use str() as hash function
                hash_sub = str(sublst)
                if hash_sub in hash_db:
                    return sublst == hash_db[hash_sub]
                return False
            """, """
            hash_db = prepare2(my_huge_list_of_lists, interval=2)
            shortened = [x for x in my_huge_list_of_lists
                            if check_sublist2(hash_db, x)]
            """
        ),
        "Paul Becotte": TestCase("""
            # https://stackoverflow.com/a/49504792/355230
            sublst = [a, b]
            l = len(sublst)
            indices = range(len(sublst))
    
            def check_if_list_is_sublist(lst):
                for i in range(len(lst) - (l -1)):
                    if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
                        return True
                    if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
                        return True
                return False
            """, """
            shortened = [x for x in my_huge_list_of_lists
                            if not check_if_list_is_sublist(x)]
            """
        ),
        "Oluwafemi Sule": TestCase("""
            # https://stackoverflow.com/a/49504440/355230
            def checklst(lst, a, b):
                try:
                    a_index = lst.index(a)
                except ValueError:
                    return False
                try:
                    return a_index > -1 and lst[a_index+1] == b
                except IndexError:
                    try:
                        return a_index > -1 and lst[a_index-1] == b
                    except IndexError:
                        return False
            """, """
            shortened = [x for x in my_huge_list_of_lists
                            if not checklst(x, a, b)]
            """
        ),
    }
    
    # Collect timing results of executing each testcase multiple times.
    try:
        results = [
            (label,
             min(timeit.repeat(testcases[label].test,
                               setup=COMMON_SETUP + testcases[label].setup,
                               repeat=R, number=N)),
            ) for label in testcases
        ]
    except Exception:
        traceback.print_exc(file=sys.stdout)  # direct output to stdout
        sys.exit(1)
    
    # Display results.
    print('Results for {:,d} sublists of length {:,d} with {}% percent of them matching.'
            .format(NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE))
    major, minor, micro = sys.version_info[:3]
    print('Fastest to slowest execution speeds using Python {}.{}.{}\n'
          '({:,d} executions, best of {:d} repetitions)'.format(major, minor, micro, N, R))
    print()
    
    longest = max(len(result[0]) for result in results)  # length of longest label
    ranked = sorted(results, key=lambda t: t[1]) # ascending sort by execution time
    fastest = ranked[0][1]
    for result in ranked:
        print('{:>{width}} : {:9.6f} secs, rel speed {:5.2f}x, {:6.2f}% slower '
              ''.format(
                    result[0], result[1], round(result[1]/fastest, 2),
                    round((result[1]/fastest - 1) * 100, 2),
                    width=longest))
    print()
    

    输出:

    Results for 1,000 sublists of length 50 with 50% percent of them matching
    Fastest to slowest execution speeds using Python 3.6.4
    (10 executions, best of 3 repetitions)
    
              Oluwafemi Sule :  0.006441 secs, rel speed  1.00x,   0.00% slower
                Paul Becotte :  0.069462 secs, rel speed 10.78x, 978.49% slower
              OP (Nas Banov) :  0.082758 secs, rel speed 12.85x, 1184.92% slower
     Sphinx Solution 2 (str) :  0.152119 secs, rel speed 23.62x, 2261.84% slower
    Sphinx Solution 1 (hash) :  0.154562 secs, rel speed 24.00x, 2299.77% slower
    
        4
  •  1
  •   Sphinx    8 年前

    对于在一个大列表中搜索匹配,我相信哈希(元素)然后构建索引将是一个很好的解决方案。

    您将获得的好处: 只需构建一次索引,即可节省时间供将来使用(无需为每次搜索反复循环)。 甚至,我们可以在启动程序时建立索引,然后在程序退出时释放它,

    下面的代码使用两种方法来获取哈希值 :hash()和str();有时,您应该根据特定场景自定义一个哈希函数。

    如果使用str(),代码看起来很简单,不需要考虑哈希冲突。但这可能会导致内存爆炸。

    对于hash(),我使用该列表保存所有具有相同哈希值的sub\u lst。您可以使用hash(sub\u lst)%designed\u length来控制哈希大小(但这会增加哈希冲突率)。

    以下代码的输出:

    By Hash: 0.00023986603994852955
    By str(): 0.00022884208565612796
    By OP's: 0.3001317172469765
    [Finished in 1.781s]
    

    测试代码 :

    from random import randint
    from string import ascii_lowercase
    import timeit
    
    #Generate Test Data
    my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(10000)]
    #print(my_huge_list_of_lists)
    test_lst = [['a', 'b', 'c' ], ['a', 'b', 'c'] ]
    #Solution 1: By using built-in hash function
    def prepare1(huge_list, interval=1): #use built-in hash function
        hash_db = {}
        for index in range(len(huge_list) - interval + 1):
            hash_sub = hash(str(huge_list[index:index+interval]))
            if hash_sub in hash_db:
                hash_db[hash_sub].append(huge_list[index:index+interval])
            else:
                hash_db[hash_sub] = [huge_list[index:index+interval]]
        return hash_db
    
    hash_db = prepare1(my_huge_list_of_lists, interval=2)
    def check_sublist1(hash_db, sublst): #use built-in hash function
        hash_sub = hash(str(sublst))
        if hash_sub in hash_db:
            return any([sublst == item for item in hash_db[hash_sub]])
        return False
    
    print('By Hash:', timeit.timeit("check_sublist1(hash_db, test_lst)", setup="from __main__ import check_sublist1, my_huge_list_of_lists, test_lst, hash_db ", number=100))
    
    #Solution 2: By using str() as hash function
    def prepare2(huge_list, interval=1): #use str() as hash function
        return { str(huge_list[index:index+interval]):huge_list[index:index+interval] for index in range(len(huge_list) - interval + 1)}
    
    hash_db = prepare2(my_huge_list_of_lists, interval=2)
    def check_sublist2(hash_db, sublst): #use str() as hash function
        hash_sub = str(sublst)
        if hash_sub in hash_db:
            return sublst == hash_db[hash_sub]
        return False
    
    print('By str():', timeit.timeit("check_sublist2(hash_db, test_lst)", setup="from __main__ import check_sublist2, my_huge_list_of_lists, test_lst, hash_db ", number=100))
    
    #Solution 3: OP's current solution
    def check_if_list_is_sublist(lst, sublst):
        #checks if a list appears in order in another larger list.
        n = len(sublst)
        return any((sublst == lst[i:i + n]) for i in range(len(lst) - n + 1))
    
    print('By OP\'s:', timeit.timeit("check_if_list_is_sublist(my_huge_list_of_lists, test_lst)", setup="from __main__ import check_if_list_is_sublist, my_huge_list_of_lists, test_lst ", number=100))
    

    如果您想从一个列表中删除匹配的元素,这是可行的,但其效果是您可能必须为新列表重建索引。除非列表是链表,否则请保存索引中每个元素的指针。我只是谷歌 Python how to get the pointer for one element of a list ,但找不到任何有用的东西。如果有人知道怎么做,请毫不犹豫地分享您的解决方案。谢谢

    以下是一个示例: ( 它生成一个新列表而不是返回原始列表,有时我们仍然需要从原始列表中筛选一些内容 )

    from random import randint
    from string import ascii_lowercase
    import timeit
    
    #Generate Test Data
    my_huge_list_of_lists = [[ascii_lowercase[randint(0, 1)] for x in range(2)] for y in range(100)]
    #print(my_huge_list_of_lists)
    test_lst = [[['a', 'b'], ['a', 'b'] ], [['b', 'a'], ['a', 'b']]]
    #Solution 1: By using built-in hash function
    def prepare(huge_list, interval=1): #use built-in hash function
        hash_db = {}
        for index in range(len(huge_list) - interval + 1):
            hash_sub = hash(str(huge_list[index:index+interval]))
            if hash_sub in hash_db:
                hash_db[hash_sub].append({'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]})
            else:
                hash_db[hash_sub] = [{'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]}]
        return hash_db
    
    hash_db = prepare(my_huge_list_of_lists, interval=2)
    
    def check_sublist(hash_db, sublst): #use built-in hash function
        hash_sub = hash(str(sublst))
        if hash_sub in hash_db:
            return [ item for item in hash_db[hash_sub] if sublst == item['data'] ]
        return []
    
    def remove_if_match_sublist(target_list, hash_db, sublsts):
        matches = []
        for sublst in sublsts:
            matches += check_sublist(hash_db, sublst)
        #make sure delete elements from end to begin
        sorted_match = sorted(matches, key=lambda item:item['beg'], reverse=True)
        new_list = list(target_list)
        for item in sorted_match:
            del new_list[item['beg']:item['end']]
        return new_list
    
    print('Removed By Hash:', timeit.timeit("remove_if_match_sublist(my_huge_list_of_lists, hash_db, test_lst)", setup="from __main__ import check_sublist, my_huge_list_of_lists, test_lst, hash_db, remove_if_match_sublist ", number=1))