
Get elements that appear in 3 or more lists

  •  0
  • Mechanician  ·  8 years ago

    Say I have 5 lists in total:

    # Sample data
    
    a1 = [1,2,3,4,5,6,7]
    
    a2= [1,21,35,45,58]
    a3= [1,2,15,27,36]
    a4=[2,3,1,45,85,51,105,147,201]
    a5=[3,458,665]
    

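    I need to find the elements of a1 that occur at least 3 times across a2, a3, a4 and a5, counting their occurrence in a1 as well.

    In other words, I need every element whose frequency across all the lists (a1 to a5) is greater than or equal to 3, together with that frequency.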

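    For the sample above, the expected output is:

    2, with a frequency of 3

    3, with a frequency of 3

    In my real problem, both the number of lists and their lengths are huge. Can anyone suggest a simple and fast way to do this?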

    Prithivi

    3 answers  |  latest 8 years ago
        1
  •  1
  •   jme    8 years ago

    As Patrick writes in the comments, chain and Counter are your friends here:

    import itertools
    import collections
    
    targets = [1,2,3,4,5,6,7]
    
    lists = [
        [1,21,35,45,58],
        [1,2,15,27,36],
        [2,3,1,45,85,51,105,147,201],
        [3,458,665]
        ]
    
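    # Count occurrences in the other lists only. Each target already occurs
    # once in a1 (assuming a1 has no duplicates), so a total frequency of 3
    # corresponds to a count of at least 2 here.
    chained = itertools.chain.from_iterable(lists)
    counter = collections.Counter(chained)
    result = [(t, counter[t]) for t in targets if counter[t] >= 2]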
    

    So that:

    >>> result
    [(1, 3), (2, 2), (3, 2)]
    

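    You said you have many lists and that each one is long. Try this solution and see how long it takes. If it needs to be faster, that is a separate question. It may turn out that collections.Counter is too slow for your application.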
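If what you want is the number of lists each element appears in (as the title puts it), with a1 included in the count, a minimal sketch along the same lines could look like the following. The function name `elements_in_at_least` and its `min_lists` parameter are made up for illustration, and the code assumes the elements are hashable. Each list is turned into a set first, so a value repeated inside one list is counted only once for that list.

```python
import collections
import itertools


def elements_in_at_least(lists, min_lists=3):
    """Map each element to the number of lists containing it,
    keeping only elements found in at least `min_lists` lists."""
    # set(lst) drops duplicates within a single list, so every list
    # contributes at most 1 to the count of any element.
    counter = collections.Counter(
        itertools.chain.from_iterable(set(lst) for lst in lists)
    )
    return {x: n for x, n in counter.items() if n >= min_lists}


a1 = [1, 2, 3, 4, 5, 6, 7]
a2 = [1, 21, 35, 45, 58]
a3 = [1, 2, 15, 27, 36]
a4 = [2, 3, 1, 45, 85, 51, 105, 147, 201]
a5 = [3, 458, 665]

found = elements_in_at_least([a1, a2, a3, a4, a5])
for element in sorted(found):
    print(element, "with a frequency of", found[element])
```

On the sample data this reports 2 and 3 with a frequency of 3, and also 1, which occurs in four of the five lists. To restrict the result to elements of a1, filter the returned dictionary with `if x in set(a1)`.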

        2
  •  1
  •   Whud    8 years ago
    a1= [1,2,3,4,5,6,7]
    a2= [1,21,35,45,58]
    a3= [1,2,15,27,36]
    a4= [2,3,1,45,85,51,105,147,201]
    a5= [3,458,665]
    
    b = a1+a2+a3+a4+a5                              #concatenate all lists into b
    
    for x in set(b):                                #iterate through the distinct values of b
        print(x, 'with a frequency of', b.count(x)) #print how often x occurs in b
    

    This gives you:

    1 with a frequency of 4
    2 with a frequency of 3
    3 with a frequency of 3
    4 with a frequency of 1
    5 with a frequency of 1
    6 with a frequency of 1
    7 with a frequency of 1
    35 with a frequency of 1
    36 with a frequency of 1
    ...
    

    Edit:

    Using:

    import random
    
    # grow a1..a4 with 9000 random integers each
    for x in range(9000):
        a1.append(random.randint(1,10000))
        a2.append(random.randint(1,10000))
        a3.append(random.randint(1,10000))
        a4.append(random.randint(1,10000))
    

    I used time to check how long the program took (saving the information instead of printing it): it took 4.9395 seconds. I hope that is fast enough.
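Most of that time probably goes into `b.count(x)`, which rescans the whole of `b` once for every distinct value. A single pass with `collections.Counter` produces the same counts. The sketch below compares the two on random data. The list size is an assumption (smaller than the 9000 used above, to keep the run short), the seed is arbitrary, and the timings will differ from machine to machine.

```python
import collections
import random
import time

random.seed(0)  # arbitrary fixed seed so the run is repeatable
n = 2000        # assumed size per list, smaller than the 9000 used above

lists = [[random.randint(1, 10000) for _ in range(n)] for _ in range(4)]
b = [x for lst in lists for x in lst]  # all lists concatenated

start = time.perf_counter()
slow = {x: b.count(x) for x in set(b)}  # one full scan of b per distinct value
slow_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = dict(collections.Counter(b))     # one pass over b in total
fast_seconds = time.perf_counter() - start

print("same counts:", slow == fast)
print(f"list.count: {slow_seconds:.4f}s   Counter: {fast_seconds:.4f}s")
```

The gap should widen as the lists grow, since the `count` version does work proportional to the length of `b` times the number of distinct values.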

        3
  •  1
  •   VersBersch    8 years ago

    A solution using pandas is quite fast:

    import pandas as pd
    
    a1=[1,2,3,4,5,6,7]
    a2=[1,21,35,45,58]
    a3=[1,2,15,27,36]
    a4=[2,3,1,45,85,51,105,147,201]
    a5=[3,458,665]
    
    # convert each list to a DataFrame with an indicator column
    A = [a1, a2, a3, a4, a5]
    D = [ pd.DataFrame({'A': a, 'ind{0}'.format(i):[1]*len(a)}) for i,a in enumerate(A)]
    
    # left join each dataframe onto a1
    # if you know the integers are distinct then you don't need drop_duplicates
    df = pd.merge(D[0], D[1].drop_duplicates(['A']), how='left', on='A')
    for d in D[2:]:
        df = pd.merge(df, d.drop_duplicates(['A']), how='left', on='A')
    
    # sum across the indicators (missing indicators are NaN and are skipped)
    df['freq'] = df[['ind{0}'.format(i) for i,d in enumerate(D)]].sum(axis=1)
    
    # keep only rows whose frequency is at least 3
    print(df[['A','freq']].loc[df['freq'] >= 3])
    

    On my machine, a test with the following larger input runs in under 0.2 seconds:

    import numpy.random as npr
    a1 = list(range(10000))  # xrange in the original Python 2 code
    a2 = npr.randint(10000, size=100000) 
    a3 = npr.randint(10000, size=100000) 
    a4 = npr.randint(10000, size=100000) 
    a5 = npr.randint(10000, size=100000)