I have a DataFrame in pandas containing collected data:
import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'], 'Subgroup': ['Blue', 'Blue','Blue','Red','Red','Red','Red','Blue','Blue','Blue','Blue','Red','Red','Red'],'Obs':[1,2,4,1,2,3,4,1,2,3,6,1,2,3]})
+-------+----------+-----+
| Group | Subgroup | Obs |
+-------+----------+-----+
| A | Blue | 1 |
| A | Blue | 2 |
| A | Blue | 4 |
| A | Red | 1 |
| A | Red | 2 |
| A | Red | 3 |
| A | Red | 4 |
| B | Blue | 1 |
| B | Blue | 2 |
| B | Blue | 3 |
| B | Blue | 6 |
| B | Red | 1 |
| B | Red | 2 |
| B | Red | 3 |
+-------+----------+-----+
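The observations ("Obs") should be numbered without gaps, but as you can see we "missed" Blue 3 in group A and Blue 4 and 5 in group B. The desired result is the percentage of all "missed" observations ("Obs") per group, so for the example: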
+-------+--------------------+--------+--------+
| Group | Total Observations | Missed | % |
+-------+--------------------+--------+--------+
| A | 8 | 1 | 12.5% |
| B | 9 | 2 | 22.22% |
+-------+--------------------+--------+--------+
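To make the intended calculation concrete, here is a minimal sketch of how the table above could be derived. It assumes that numbering in every subgroup starts at 1 and that the highest "Obs" seen in a subgroup is the number of observations expected there; the names `per_sub`, `expected`, `seen`, `missed` and `pct` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['A'] * 7 + ['B'] * 7,
    'Subgroup': ['Blue'] * 3 + ['Red'] * 4 + ['Blue'] * 4 + ['Red'] * 3,
    'Obs': [1, 2, 4, 1, 2, 3, 4, 1, 2, 3, 6, 1, 2, 3],
})

# Per (Group, Subgroup): the expected count is the highest Obs seen
# (assumption: numbering starts at 1), the seen count is the number
# of distinct Obs values actually present.
per_sub = df.groupby(['Group', 'Subgroup'])['Obs'].agg(expected='max', seen='nunique')
per_sub['missed'] = per_sub['expected'] - per_sub['seen']

# Roll the subgroups up to Group level and compute the percentage.
result = per_sub.groupby(level='Group')[['expected', 'missed']].sum()
result['pct'] = (100 * result['missed'] / result['expected']).round(2)
print(result)
```

Under that assumption, a missing first or last observation in a subgroup would go unnoticed, since only the largest observed number defines the total.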
I tried for loops and grouping, for example:
groups = df.groupby(['Group','Subgroup']).sum()
print(groups.head())
but I can't seem to get it to work in any way. Am I going in the wrong direction?
From another answer (shout-out to @Lie Ryan) I found a function that finds missing elements, but I don't quite understand how to apply it here:
from itertools import chain, islice

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

def missing_elements(L):
    # For each pair of neighbours (x, y) with a gap, emit the numbers strictly between them
    missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)
    return list(missing)
Can anyone give me a pointer on whether this is the right direction?
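For reference, this is roughly how I imagine the helper might be applied per subgroup. It is only a sketch: it assumes the "Obs" values must be sorted within each subgroup before windowing, and `gaps` and `missed_per_group` are illustrative names. Note that `missing_elements` only reports numbers between two observed values, so a gap at the very start of a subgroup would not show up.

```python
from itertools import chain, islice

import pandas as pd


def window(seq, n=2):
    """Yield a sliding window of width n over the iterable."""
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result


def missing_elements(L):
    """Return the integers missing between consecutive items of a sorted list."""
    missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)
    return list(missing)


df = pd.DataFrame({
    'Group': ['A'] * 7 + ['B'] * 7,
    'Subgroup': ['Blue'] * 3 + ['Red'] * 4 + ['Blue'] * 4 + ['Red'] * 3,
    'Obs': [1, 2, 4, 1, 2, 3, 4, 1, 2, 3, 6, 1, 2, 3],
})

# Sort so each subgroup's Obs values are ascending, then collect the gaps
# for every (Group, Subgroup) pair as a list.
gaps = (
    df.sort_values(['Group', 'Subgroup', 'Obs'])
      .groupby(['Group', 'Subgroup'])['Obs']
      .apply(lambda s: missing_elements(s.tolist()))
)
print(gaps)

# Count the gaps per subgroup and add them up per Group.
missed_per_group = gaps.apply(len).groupby(level='Group').sum()
print(missed_per_group)
```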