代码之家  ›  专栏  ›  技术社区  ›  Jaap Baanders

统计子组中缺少的实例

  •  2
  • Jaap Baanders  · 技术社区  · 7 年前

    我在熊猫身上有一个数据框,里面有收集到的数据;

    import pandas as pd
    df = pd.DataFrame({'Group': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'], 'Subgroup': ['Blue', 'Blue','Blue','Red','Red','Red','Red','Blue','Blue','Blue','Blue','Red','Red','Red'],'Obs':[1,2,4,1,2,3,4,1,2,3,6,1,2,3]})
    
    +-------+----------+-----+
    | Group | Subgroup | Obs |
    +-------+----------+-----+
    | A     | Blue     |   1 |
    | A     | Blue     |   2 |
    | A     | Blue     |   4 |
    | A     | Red      |   1 |
    | A     | Red      |   2 |
    | A     | Red      |   3 |
    | A     | Red      |   4 |
    | B     | Blue     |   1 |
    | B     | Blue     |   2 |
    | B     | Blue     |   3 |
    | B     | Blue     |   6 |
    | B     | Red      |   1 |
    | B     | Red      |   2 |
    | B     | Red      |   3 |
    +-------+----------+-----+
    

    观察结果(“Obs”)应无间隔编号,但您可以看到,我们在A组中“遗漏”了蓝色3,在B组中“遗漏”了蓝色4和5。预期结果是每组所有“遗漏”观察结果(“Obs”)的百分比,因此在示例中:

    +-------+--------------------+--------+--------+
    | Group | Total Observations | Missed |   %    |
    +-------+--------------------+--------+--------+
    | A     |                  8 |      1 | 12.5%  |
    | B     |                  9 |      2 | 22.22% |
    +-------+--------------------+--------+--------+
    

    我尝试了for循环和使用组(例如:

    df.groupby(['Group','Subgroup']).sum()
    print(groups.head)
    

    )但我似乎无法以任何方式让它发挥作用。我是不是走错了方向?

    从…起 another answer (对@Lie Ryan大喊)我找到了一个查找缺失元素的函数,但我还不太明白如何实现它;

    def window(seq, n=2):
        "Returns a sliding window (of width n) over data from the iterable"
        "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
        it = iter(seq)
        result = tuple(islice(it, n))
        if len(result) == n:
            yield result
        for elem in it:
            result = result[1:] + (elem,)
            yield result
    
    def missing_elements(L):
        missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)
        return list(missing)
    

    谁能给我一个指针,它是正确的方向吗?

    3 回复  |  直到 7 年前
        1
  •  4
  •   cs95 abhishek58g    7 年前

    很简单,你需要 groupby 在这里:

    1. 使用 子句 + diff ,计算每个 Group SubGroup
    2. df 在…上 ,并计算 size sum 在上一步中计算的列的
    3. 还有几个更简单的步骤(计算百分比)可以为您提供预期的输出。

    f = [   # declare an aggfunc list in advance, we'll need it later
          ('Total Observations', 'size'), 
          ('Missed', 'sum')
    ]
    

    g = df.groupby(['Group', 'Subgroup'])\
          .Obs.diff()\
          .sub(1)\
          .groupby(df.Group)\
          .agg(f)
    
    g['Total Observations'] += g['Missed']
    g['%'] = g['Missed'] / g['Total Observations'] * 100 
    

    g
    
           Total Observations  Missed          %
    Group                                       
    A                     8.0     1.0  12.500000
    B                     9.0     2.0  22.222222
    
        2
  •  2
  •   Allen Qin    7 年前

    使用groupby、apply和assign的类似方法:

    (
        df.groupby(['Group','Subgroup']).Obs
        .apply(lambda x: [x.max()-x.min()+1, x.max()-x.min()+1-len(x)])
        .apply(pd.Series)
        .groupby(level=0).sum()
        .assign(pct=lambda x: x[1]/x[0]*100)
        .set_axis(['Total Observations', 'Missed', '%'], axis=1, inplace=False)
    )
    
    Out[75]: 
           Total Observations  Missed          %
    Group                                       
    A                       8       1  12.500000
    B                       9       2  22.222222
    
        3
  •  2
  •   piRSquared    7 年前
    from collections import Counter
    
    gs = ['Group', 'Subgroup']
    old_tups = set(zip(*df.values.T))
    
    missed = pd.Series(Counter(
        g for (g, s), d in df.groupby(gs)
        for o in range(d.Obs.min(), d.Obs.max() + 1)
        if (g, s, o) not in old_tups
    ), name='Missed')
    
    hit = df.set_index(gs).Obs.count(level=0)
    total = hit.add(missed).rename('Total')
    ratio = missed.div(total).rename('%')
    
    pd.concat([total, missed, ratio], axis=1).reset_index()
    
      Group  Total  Missed         %
    0     A      8       1  0.125000
    1     B      9       2  0.222222