Random stratified sampling with pandas

sampling random dataframe pandas python

Giampaolo Levorato · 1 year ago

I have created a pandas DataFrame as follows:

import pandas as pd
import numpy as np

ds = {'col1' : [1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4,4,4,4,4],
      'col2' : [12,3,4,5,4,3,2,3,4,6,7,8,3,3,65,4,3,2,32,1,2,3,4,5,32],
      }

df = pd.DataFrame(data=ds)

The DataFrame looks like this:

print(df)

    col1  col2
0      1    12
1      1     3
2      1     4
3      1     5
4      1     4
5      1     3
6      1     2
7      2     3
8      2     4
9      2     6
10     2     7
11     3     8
12     3     3
13     3     3
14     3    65
15     3     4
16     4     3
17     4     2
18     4    32
19     4     1
20     4     2
21     4     3
22     4     4
23     4     5
24     4    32

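Based on the values in column col1, I need to extract:

3 random records where col1 == 1
2 random records where col1 == 2
2 random records where col1 == 3
3 random records where col1 == 4

Can anyone help me?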

1 answer

mozway · 1 year ago

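I would shuffle the whole input with sample(frac=1), then compute a groupby.cumcount and keep the first N rows of each group (using map and boolean indexing), where N is defined per group in a dictionary: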

# {col1: number of samples}
n = {1: 3, 2: 2, 3: 2, 4: 3}

out = df[df[['col1']].sample(frac=1)
                     .groupby('col1').cumcount()
                     .lt(df['col1'].map(n))]
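The one-liner above can be hard to read, so here is a minimal sketch of the same idea split into steps. The intermediate names (shuffled, rank, mask, out_steps) and the random_state value are illustrative assumptions, not part of the original answer; random_state is only there to make the draw repeatable.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4,4,4,4,4],
    'col2': [12,3,4,5,4,3,2,3,4,6,7,8,3,3,65,4,3,2,32,1,2,3,4,5,32],
})

# {col1: number of samples}
n = {1: 3, 2: 2, 3: 2, 4: 3}

# 1. Shuffle the rows (only col1 is needed to compute the ranks).
#    random_state is an assumption added for reproducibility.
shuffled = df[['col1']].sample(frac=1, random_state=0)

# 2. Number the rows within each group in shuffled order: 0, 1, 2, ...
rank = shuffled.groupby('col1').cumcount()

# 3. Restore the original row order so the mask lines up with df.
rank = rank.reindex(df.index)

# 4. Keep a row if its random rank is below the quota of its group.
mask = rank < df['col1'].map(n)
out_steps = df[mask]

# Per-group sizes of the sample, as a plain dict.
counts = out_steps['col1'].value_counts().sort_index().to_dict()
```

Because every row receives a random rank within its group, keeping ranks below the quota draws exactly n[key] rows per group without replacement, and the result stays in the original row order.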

Alternatively, use a custom groupby.apply that calls sample with a different n for each group; the code is shorter but potentially less efficient:

n = {1: 3, 2: 2, 3: 2, 4: 3}

out = (df.groupby('col1', group_keys=False)
         .apply(lambda g: g.sample(n=n[g.name]))
      )
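If you would rather not depend on how groupby.apply treats the grouping column (this has changed between pandas versions), a sketch that iterates over the groups and concatenates the per-group samples gives the same kind of result. The names parts and out_concat and the random_state value are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4,4,4,4,4],
    'col2': [12,3,4,5,4,3,2,3,4,6,7,8,3,3,65,4,3,2,32,1,2,3,4,5,32],
})

# {col1: number of samples}
n = {1: 3, 2: 2, 3: 2, 4: 3}

# Draw n[key] rows from each group; grouping by a single column label
# yields scalar keys. random_state is an assumption for reproducibility.
parts = [g.sample(n=n[key], random_state=42) for key, g in df.groupby('col1')]

# Stack the per-group samples and restore the original row order.
out_concat = pd.concat(parts).sort_index()

# Per-group sizes of the sample, as a plain dict.
counts_concat = out_concat['col1'].value_counts().sort_index().to_dict()
```

Note that sample raises a ValueError when n exceeds the group size (unless replace=True is passed), and n[key] raises a KeyError for any col1 value missing from the dictionary, so the quotas must cover every group.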

Example output:

    col1  col2
0      1    12
3      1     5
4      1     4
7      2     3
8      2     4
11     3     8
13     3     3
17     4     2
18     4    32
24     4    32
