Pandas vs Dask: sorting columns and an index of strings and numbers

dask pandas python

doplano · 1 year ago

Given:

A small sample pandas DataFrame:

import pandas as pd
import numpy as np
import dask.dataframe as dd

df = pd.DataFrame(
    {"usr": ["ip1", "ip7", "ip12", "ip4"], "colB": [1, 2, 3, 0], "ColA": [3, np.nan, 7, 1]}
).set_index("usr").astype("float32")  # cast after set_index; recent pandas rejects dtype="float32" for the string column
    
        colB    ColA
usr         
ip1     1.0     3.0
ip7     2.0     NaN
ip12    3.0     7.0
ip4     0.0     1.0

I can sort this DataFrame by both its index and its columns using sort_index and reindex, as follows:

df_s = df.sort_index(key=lambda x: ( x.to_series().str[2:].astype(int) )) # sort index
df_s = df_s.reindex(columns=sorted(df_s.columns)) # sort columns

        ColA    colB
usr         
ip1     3.0     1.0
ip4     1.0     0.0
ip7     NaN     2.0
ip12    7.0     3.0
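
The key function above can also be pulled out into a small named helper, which makes the assumption behind it explicit. This is only a sketch: `ip_sort_key` is a hypothetical name, and it assumes every label is the two-character prefix "ip" followed by digits. It also relies on the `key` argument of `sort_index`, which needs pandas 1.1 or later.

```python
import numpy as np
import pandas as pd


def ip_sort_key(index: pd.Index) -> pd.Index:
    """Map labels such as "ip12" to the integer 12 (hypothetical helper)."""
    # Assumes every label is the prefix "ip" followed only by digits.
    return index.str[2:].astype(int)


df = pd.DataFrame(
    {"colB": [1, 2, 3, 0], "ColA": [3, np.nan, 7, 1]},
    index=pd.Index(["ip1", "ip7", "ip12", "ip4"], name="usr"),
    dtype="float32",
)

df_s = df.sort_index(key=ip_sort_key)              # rows in numeric order
df_s = df_s.reindex(columns=sorted(df_s.columns))  # columns in alphabetical order
print(df_s)
```

Without the key, `sort_index` compares the labels as plain strings, so "ip12" would come before "ip4".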

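Problem:

My real dataset is a large DataFrame, so I use Dask to benefit from parallel computation. Since sort_index does not exist in Dask, I tried using sort_values as follows: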

ddf = dd.from_pandas(df, npartitions=2)
ddf_s = ddf.map_partitions(lambda inp_ddf: inp_ddf.sort_values( ["usr"], ascending=True) ).compute()

But the result I get is completely different from my df_s. Neither the index nor the columns are sorted correctly.

        ColA    colB
usr         
ip1     3.0     1.0
ip4     1.0     0.0
ip7     NaN     2.0
ip12    7.0     3.0

How can I sort both the index and the columns in Dask?

Cheers

2 replies | as of 1 year ago

valentinmk · 1 year ago

You should not sort each partition separately like this:

ddf.map_partitions(
    lambda inp_ddf: inp_ddf.sort_values( ....

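That produces a dataset in which each partition is sorted internally, but the result as a whole is not sorted.

I ended up with this version: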
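
The effect can be reproduced without Dask. The sketch below is a pandas-only stand-in: it treats the first two rows and the last two rows as two "partitions", which is an assumption made for illustration (the split Dask actually chooses may differ), and `numeric_key` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"colB": [1, 2, 3, 0], "ColA": [3, np.nan, 7, 1]},
    index=pd.Index(["ip1", "ip7", "ip12", "ip4"], name="usr"),
    dtype="float32",
)


def numeric_key(index: pd.Index) -> pd.Index:
    # Assumes labels are "ip" followed by digits.
    return index.str[2:].astype(int)


# Stand-in for two partitions: rows 0-1 and rows 2-3 of the original frame.
chunks = [df.iloc[:2], df.iloc[2:]]

# Sort each chunk on its own, then concatenate the chunks in their old order.
per_chunk = pd.concat([chunk.sort_index(key=numeric_key) for chunk in chunks])
print(list(per_chunk.index))      # ['ip1', 'ip7', 'ip4', 'ip12']

# A global sort has to look at all rows at once.
global_sorted = df.sort_index(key=numeric_key)
print(list(global_sorted.index))  # ['ip1', 'ip4', 'ip7', 'ip12']
```

Each chunk is in order, but "ip7" still precedes "ip4" because no rows ever move between chunks.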

ddf = dd.from_pandas(df.reset_index(), npartitions=2) # make "usr" a regular column
ddf_s = ddf.assign(
    n=ddf['usr'].str[2:].astype(int) # helper column with the numeric part (1, 7, 12, 4) to sort on
).set_index(
    'usr' # restore the index
).sort_values(
    ['n'] # sort by the numeric values
).drop(
    columns=['n'] # remove the helper column
)

Finally, to sort the columns, I suggest simply selecting the columns of the output in the desired order:

ddf_s[sorted(ddf_s.columns)].compute()
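
One detail worth checking with this approach is that `sorted` compares strings by code point, so uppercase names sort before lowercase ones. The sketch below uses a hypothetical empty frame with mixed-case column names to show the difference; whether a case-insensitive order is wanted depends on the data.

```python
import pandas as pd

# Hypothetical frame with mixed-case column names, for illustration only.
df = pd.DataFrame(columns=["colB", "ColA", "alpha", "Zeta"])

case_sensitive = sorted(df.columns)                   # uppercase letters first
case_insensitive = sorted(df.columns, key=str.lower)  # ignores letter case

print(case_sensitive)    # ['ColA', 'Zeta', 'alpha', 'colB']
print(case_insensitive)  # ['alpha', 'ColA', 'colB', 'Zeta']

# Select the columns in the chosen order, as in the line above.
ordered = df[case_insensitive]
```

For the two columns in the question, ColA and colB, both orderings give the same result.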
