我刚开始学习Python,有一个很宽的数据文件,我想通过不同的变量聚合它,并创建一个新的文件用于仪表板。我可以使用groupby()一次获取一列的聚合,然后将结果附加到csv文件中。然而,我想知道是否有更有效的方法来使用某种循环来实现这一点。
以下是我的原始数据文件的示例:
data = {'ID': [105, 106, 107, 108, 109, 110, 111, 112],
'Name': ['Bill', 'Jane', 'Mary', 'Rich', 'Tomas', 'Kiki', 'Martin', 'Larry'],
'Cohort': ['Cohort A', 'Cohort A', 'Cohort A', 'Cohort A', 'Cohort B', 'Cohort B', 'Cohort B', 'Cohort B'],
'Program Size': ['small', 'large', 'medium', 'medium', 'large', 'small', 'large', 'medium'],
'Rating': ['excellent', 'good', 'needs improvement', 'needs improvement', 'good', 'excellent', 'good', 'excellent']}
df = pd.DataFrame(data)
看起来像这样:
ID Name Cohort Program Size Rating
0 105 Bill Cohort A small excellent
1 106 Jane Cohort A large good
2 107 Mary Cohort A medium needs improvement
3 108 Rich Cohort A medium needs improvement
4 109 Tomas Cohort B large good
5 110 Kiki Cohort B small excellent
6 111 Martin Cohort B large good
7 112 Larry Cohort B medium excellent
我总是想按队列列分组,再加上其他列中的一列,并获得每种组合的计数。在我的新数据文件中,我将有一个队列列,一个“变量”列(这将是分组中的另一列),选项显示“描述”,然后是频率。以下是我想要的结果:
desired = {'Cohort': ['Cohort A', 'Cohort A', 'Cohort A', 'Cohort B', 'Cohort B', 'Cohort B', 'Cohort A', 'Cohort A', 'Cohort A', 'Cohort B', 'Cohort B'],
'Variable': ['size', 'size', 'size', 'size', 'size', 'size', 'rating', 'rating', 'rating', 'rating', 'rating'],
'Description': ['small', 'medium', 'large', 'small', 'medium', 'large', 'excellent', 'good', 'needs improvement', 'excellent', 'good'],
'Frequency': [1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2]}
desired_df = pd.DataFrame(desired)
Cohort Variable Description Frequency
0 Cohort A size small 1
1 Cohort A size medium 2
2 Cohort A size large 1
3 Cohort B size small 1
4 Cohort B size medium 1
5 Cohort B size large 2
6 Cohort A rating excellent 1
7 Cohort A rating good 1
8 Cohort A rating needs improvement 2
9 Cohort B rating excellent 2
10 Cohort B rating good 2
我已经能够编写一个循环,按队列和我的其他列分组并获得频率,但我不确定如何将它们放在一起(附加数据帧?)以获得我想要的结果。我感谢您对下一步的指导!(以下是我写的循环遍历我的列并获取频率的内容:
cols = ['Program Size', 'Rating']
for i in cols:
grouped_df = df.groupby(['Cohort', (i)], as_index=False).agg(
frequency=('ID', 'count')
)
print(f"Grouped by Cohort and {i}:\n{grouped_df}\n")