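This is an old post, but since it is the first result that comes up when searching for this error, it probably deserves an answer:
TL;DR:
Run the following sequence on your Dask DataFrame: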
ddf["PROD_NAME"] = ddf["PROD_NAME"].cat.as_known()
ddf = ddf.assign(id=(ddf["PROD_NAME"].cat.codes))
out_df = ddf.compute()
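According to Dask's documentation, a categorical dtype in Dask can be converted between "known categories" and "unknown categories". Here the categories must be "known", because computing the codes requires the category mapping stored in the column metadata.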
>>> import pandas as pd
>>> from dask import dataframe as dd
>>> d = pd.Series(['A','B','D'], dtype='category').to_frame(name='PROD_NAME')
>>> d = d.assign(id=(d["PROD_NAME"]).astype('category').cat.codes)
>>> d
  PROD_NAME  id
0         A   0
1         B   1
2         D   2
>>> ddf = dd.from_pandas(d, npartitions=1)
>>> ddf
Dask DataFrame Structure:
                     PROD_NAME
npartitions=1
0              category[known]
2                          ...
Dask Name: from_pandas, 1 tasks
>>> ddf["PROD_NAME"] = ddf["PROD_NAME"].cat.as_unknown()
>>> ddf
Dask DataFrame Structure:
                       PROD_NAME
npartitions=1
0              category[unknown]
2                            ...
Dask Name: assign, 3 tasks
>>> ddf["PROD_NAME"] = ddf["PROD_NAME"].cat.as_known()
>>> ddf = ddf.assign(id=(ddf["PROD_NAME"].cat.codes))
>>> out_df = ddf.compute()
>>> out_df
  PROD_NAME  id
0         A   0
1         B   1
2         D   2