
Why does the output differ when using the groupby function in Python's pandas package?

dataframe pandas python

Ray · Tech Community · 1 year ago

Hi, I have recently been practicing data processing with Python pandas and ran into a problem with the groupby function. Here are my data and code:

import pandas as pd

# my data
data = {
    'species': ['a', 'b', 'c', 'd', 'e', 'rt', 'gh', 'ed', 'e', 'd', 'd', 'q', 'ws', 'f', 'fg', 'a', 'a', 'a', 'a', 'a'],
    's1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    's2': [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9],
    's3': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
    's4': [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
}

df = pd.DataFrame(data)

# my code:
# group by the column label (a string)
grouped_df1 = df.groupby(df.columns[0], as_index=False).sum()

# group by the column's values (a Series)
grouped_df2 = df.groupby(df.iloc[:, 0], as_index=False).sum()

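As I understand it, grouped_df1 and grouped_df2 are both grouped by the data in column 0. In the output, however, grouped_df1 merges the rows that share a value in column 0 into a single row, which is the result I want, whereas grouped_df2 concatenates the identical strings in column 0 into one long string instead of collapsing them into a single value. Here is the output: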

print(grouped_df1)
   species  s1  s2  s3  s4
0        a  91  54  97  60
1        b   2   9   3  10
2        c   3   9   4  10
3        d  25  27  28  30
4        e  14  18  16  20
5       ed   8   9   9  10
6        f  14   9  15  10
7       fg  15   9  16  10
8       gh   7   9   8  10
9        q  12   9  13  10
10      rt   6   9   7  10
11      ws  13   9  14  10

print(grouped_df2)
   species  s1  s2  s3  s4
0   aaaaaa  91  54  97  60
1        b   2   9   3  10
2        c   3   9   4  10
3      ddd  25  27  28  30
4       ee  14  18  16  20
5       ed   8   9   9  10
6        f  14   9  15  10
7       fg  15   9  16  10
8       gh   7   9   8  10
9        q  12   9  13  10
10      rt   6   9   7  10
11      ws  13   9  14  10

So far I still don't know the reason. I would be grateful if you could help answer this question.

2 replies | up to 1 year ago

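Bhargav · 1 year ago

In groupby, a column name is treated as an internal grouping key (one that belongs to the DataFrame), whereas a Series is treated as an external key.

Reference: https://pandas.pydata.org/docs/reference/groupby.html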

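When using df.iloc[:, 0]:

Pandas treats the string values from the species column as a separate grouping key, independent of the DataFrame's structure. The species column itself therefore stays among the columns that sum is applied to, and summing strings concatenates them.

When using df.columns[0]:

Pandas uses the DataFrame's own "species" column directly as the grouping key, so that column is excluded from the aggregation. This lets pandas handle the grouping and summing correctly.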

Corrected code

You should always refer to the column by name explicitly:

grouped_df1 = df.groupby('species', as_index=False).sum()

Alternatively, this also works:

grouped_df1 = df.groupby(df[df.columns[0]], as_index=False).sum()
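
As a rough illustration of the label-based approach, here is a minimal sketch on a shortened, made-up frame (the smaller data set is an assumption for brevity, not the asker's file). It suggests that passing the literal name and passing df.columns[0] should give the same one-row-per-species result, since both are just the string 'species'.

```python
import pandas as pd

# Shortened, hypothetical version of the question's data.
df = pd.DataFrame({
    "species": ["a", "b", "a", "d", "d"],
    "s1": [1, 2, 3, 4, 5],
    "s2": [9, 9, 9, 9, 9],
})

# Group by the column label, written out literally.
by_label = df.groupby("species", as_index=False).sum()

# df.columns[0] is the same label, so this is the same grouping.
by_position = df.groupby(df.columns[0], as_index=False).sum()

print(by_label)
print(by_label.equals(by_position))
```

Using df.columns[0] is handy when the first column's name is not known in advance; otherwise the literal name is easier to read.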

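user19077881 · 1 year ago

df.groupby(df.columns[0]... correctly groups by the first column, although you would normally just write df.groupby('species').... With df.groupby(df.iloc[:, 0]..., sum is also applied to the contents of the first column (which concatenates the string values), as well as to the other, numeric columns.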
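
The concatenation can be seen in isolation. The sketch below is not taken from the question; it assumes a small hand-made Series with an explicit object dtype, so that the values are plain Python strings, and applies sum to it.

```python
import pandas as pd

# Explicit object dtype so the elements are ordinary Python strings.
letters = pd.Series(["a", "a", "a"], dtype=object)

# For strings, "adding" means joining them end to end.
joined = letters.sum()
print(joined)  # aaa
```

This is the same effect that turns the six 'a' rows into 'aaaaaa' in grouped_df2.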

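If you try print(df.columns[0]) and print(df.iloc[:, 0]), you will see that the first is the name of the selected column, while the second is a pandas Series holding the values in that column.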
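
A small sketch of that check, again on a made-up three-row frame (an assumption for brevity), printing the type of each object alongside its content:

```python
import pandas as pd

# Tiny, hypothetical frame with the same first column name as the question.
df = pd.DataFrame({"species": ["a", "b", "a"], "s1": [1, 2, 3]})

label = df.columns[0]    # the column's name
values = df.iloc[:, 0]   # the column's data

print(type(label).__name__, repr(label))
print(type(values).__name__, values.tolist())
```

The first object is a plain string that groupby looks up inside the DataFrame; the second is a Series that carries its own data.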
