TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']

smote deep-learning scikit-learn machine-learning python

The Great · Tech Community · 3 years ago

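I have already referred to these posts: here, here, and here. Please don't mark this as a duplicate.

I am working on a binary classification problem, and my dataset has both categorical and numeric columns.

However, some of the categorical columns contain a mix of numeric and string values. Even so, those values only represent category names.

For example, I have a column called biz_category whose values are A, B, C, 4, 5, etc.

I guess the error below is raised because of values like 4 and 5.

So I tried converting these columns to the category dtype, but it still doesn't work.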

cols=X_train.select_dtypes(exclude='int').columns.to_list()
X_train[cols]=X_train[cols].astype('category')

My data info looks like this:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 683 entries, 21 to 965
Data columns (total 9 columns):
 #   Column                                           Non-Null Count  Dtype   
---  ------                                           --------------  -----   
 0   Feature_A                                        683 non-null    category
 1   Product Classification                           683 non-null    category
 2   Industry                                         683 non-null    category
 3   DIVISION                                         683 non-null    category
 4   biz_category                                     683 non-null    category
 5   Country                                          683 non-null    category
 6   Product segment                                  683 non-null    category
 7   SUBREGION                                        683 non-null    category
 8   Quantity 1st year                                683 non-null    int64   
dtypes: category(8), int64(1)

After the dtype conversion, when I try SMOTENC as below, I get an error:

print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
cat_index = [0,1,2,3,4,5,6,7]
# import SMOTE module from imblearn library
# pip install imblearn (if you don't have imblearn in your system)
from imblearn.over_sampling import SMOTE, SMOTENC
sm = SMOTENC(categorical_features=cat_index,random_state = 2,sampling_strategy = 'minority')
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

This results in the error shown below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils\_encode.py in _unique_python(values, return_inverse)
    134
--> 135         uniques = sorted(uniques_set)
    136         uniques.extend(missing_values.to_list())

TypeError: '<' not supported between instances of 'str' and 'int'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
C:\Users\SATHAP~1\AppData\Local\Temp/ipykernel_31168/1931674352.py in <module>
      6 from imblearn.over_sampling import SMOTE, SMOTENC
      7 sm = SMOTENC(categorical_features=cat_index,random_state = 2,sampling_strategy = 'minority')
----> 8 X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
      9
     10 print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))

~\AppData\Roaming\Python\Python39\site-packages\imblearn\base.py in fit_resample(self, X, y)
     81         )
     82
---> 83         output = self._fit_resample(X, y)
     84
     85         y_ = (

~\AppData\Roaming\Python\Python39\site-packages\imblearn\over_sampling\_smote\base.py in _fit_resample(self, X, y)
    511
    512         # the input of the OneHotEncoder needs to be dense
--> 513         X_ohe = self.ohe_.fit_transform(
    514             X_categorical.toarray() if sparse.issparse(X_categorical) else X_categorical
    515         )

~\AppData\Roaming\Python\Python39\site-packages\sklearn\preprocessing\_encoders.py in fit_transform(self, X, y)
    487         self._validate_keywords()
--> 488         return super().fit_transform(X, y)
    489
    490     def transform(self, X):

~\AppData\Roaming\Python\Python39\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
    850         if y is None:
    851             # fit method of arity 1 (unsupervised transformation)
--> 852             return self.fit(X, **fit_params).transform(X)
    853         else:
    854             # fit method of arity 2 (supervised transformation)

~\AppData\Roaming\Python\Python39\site-packages\sklearn\preprocessing\_encoders.py in fit(self, X, y)
    459         """
    460         self._validate_keywords()
--> 461         self._fit(X, handle_unknown=self.handle_unknown, force_all_finite="allow-nan")
    462         self.drop_idx_ = self._compute_drop_idx()
    463         return self

~\AppData\Roaming\Python\Python39\site-packages\sklearn\preprocessing\_encoders.py in _fit(self, X, handle_unknown, force_all_finite)
     93             if self.categories == "auto":
---> 94                 cats = _unique(Xi)
     95             else:
     96                 cats = np.array(self.categories[i], dtype=Xi.dtype)

~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils\_encode.py in _unique(values, return_inverse)
     29     """
     30     if values.dtype == object:
---> 31         return _unique_python(values, return_inverse=return_inverse)
     32     # numerical
     33     out = np.unique(values, return_inverse=return_inverse)

~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils\_encode.py in _unique_python(values, return_inverse)
    138     except TypeError:
    139         types = sorted(t.__qualname__ for t in set(type(v) for v in values))
--> 140         raise TypeError(
    141             "Encoders require their input to be uniformly "
    142             f"strings or numbers. Got {types}"

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']

Should I convert y_train to categorical as well? It is currently int64.

Please help.


Shubham Sharma mkln 3 years ago

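Cause of the problem

SMOTE requires the values in each categorical/numeric column to have a uniform data type. Essentially, no column may contain mixed data types, which is what your biz_category column does here. Also, merely casting a column to the categorical dtype does not necessarily mean that the values in that column will have a uniform data type.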
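A minimal sketch of why the category cast does not help, assuming a small made-up column that mimics biz_category (the sample values are hypothetical, not the asker's data). It shows that the categories still hold both str and int objects, and that sorting the unique values, which is the step that fails in the traceback above, raises the same kind of TypeError.

```python
import pandas as pd

# Hypothetical sample mimicking the biz_category column from the question.
df = pd.DataFrame({"biz_category": ["A", "B", "C", 4, 5]})

# Casting to 'category' keeps the original Python objects as the categories,
# so the column still mixes str and int underneath.
as_cat = df["biz_category"].astype("category")
category_types = sorted({type(v).__name__ for v in as_cat.cat.categories})
print(category_types)

# The encoder sorts the unique values; that step fails on mixed str/int.
try:
    sorted(set(df["biz_category"]))
    sort_failed = False
except TypeError as exc:
    sort_failed = True
    print(exc)
```

The category dtype only changes how pandas stores the column; the individual values keep their original Python types.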

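Possible solution

One possible solution is to re-encode the values in the columns that have mixed data types. For example, you could use LabelEncoder, but I think in your case simply changing the dtype to string would also work.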
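The string-cast suggestion could look roughly like the sketch below. The helper mixed_type_columns and the tiny X_train frame are hypothetical stand-ins introduced for illustration; only the column names come from the question. The idea is to find the columns whose values span more than one Python type, cast them to str, and then to category.

```python
import pandas as pd


def mixed_type_columns(frame: pd.DataFrame) -> list:
    """Return the columns whose values span more than one Python type."""
    return [
        col
        for col in frame.columns
        if frame[col].astype(object).map(type).nunique() > 1
    ]


# Hypothetical stand-in for X_train; column names follow the question.
X_train = pd.DataFrame(
    {
        "biz_category": ["A", "B", "C", 4, 5],
        "Country": ["US", "IN", "US", "DE", "IN"],
        "Quantity 1st year": [10, 20, 30, 40, 50],
    }
)

mixed = mixed_type_columns(X_train)

# Cast every value to str first so the column is uniform, then to category.
for col in mixed:
    X_train[col] = X_train[col].astype(str).astype("category")

remaining = mixed_type_columns(X_train)
categories = list(X_train["biz_category"].cat.categories)
print(mixed, remaining, categories)
```

After this step every categorical column holds strings only, so passing X_train to SMOTENC.fit_resample as in the question should no longer hit the mixed-type sort inside the encoder.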