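Simply setting the categorical variables to dtype="category" is not enough and will not work. You need to convert the categorical values into true categorical values with pd.factorize(), where each category is assigned a numeric label.
Say df is your pandas dataframe. In general, you can use this boilerplate code: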
import pandas as pd

df_numeric = df.select_dtypes(exclude=['object'])
df_obj = df.select_dtypes(include=['object']).copy()

# factorize categoricals columnwise
for c in df_obj:
    df_obj[c] = pd.factorize(df_obj[c])[0]

# if you want to one hot encode then add this line
# (columns= is passed explicitly because the factorized columns are now
# integers, which get_dummies leaves untouched by default):
df_obj = pd.get_dummies(df_obj, columns=list(df_obj.columns), prefix_sep='_', drop_first=True)

# merge dataframes back to one dataframe
df_final = pd.concat([df_numeric, df_obj], axis=1)
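To make it clearer what pd.factorize() returns, here is a minimal sketch on a made-up Series (the colour values are purely illustrative and not from your data). It returns a pair: the integer codes, and the unique values those codes refer to.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
colors = pd.Series(["red", "green", "red", "blue", "green"])

# codes: one integer label per row, numbered in order of first appearance
# uniques: the distinct values, positioned so that uniques[code] gives the label
codes, uniques = pd.factorize(colors)

print(codes)
print(uniques)
```

The boilerplate above keeps only element [0] of the result, i.e. the codes, and discards the lookup of original labels.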
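Since your categorical variables are already factorized (as far as I can see), you can skip the factorization step and try one-hot encoding directly.
See also this post on stats.stackexchange.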
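As a rough sketch of one-hot encoding a column that is already stored as integer codes: the frame and its values below are made up for illustration, and only the column names are borrowed from the kind of data you describe. The point it shows is that pd.get_dummies() needs the columns named explicitly in this case.

```python
import pandas as pd

# Hypothetical toy frame: EDUCATION is already an integer-coded category
df_toy = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000],
    "EDUCATION": [1, 2, 3, 2],
})

# Without columns=, get_dummies only converts object/string/category columns,
# so the integer-coded EDUCATION column would pass through unchanged.
untouched = pd.get_dummies(df_toy)

# Naming the column forces it to be encoded; drop_first removes the first level
encoded = pd.get_dummies(df_toy, columns=["EDUCATION"], prefix_sep="_", drop_first=True)

print(untouched.columns.tolist())
print(encoded.columns.tolist())
```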
If you want to standardize/normalize the numeric data (but not the categorical data), use the following function:
import pandas as pd
from sklearn import preprocessing

def scale_data(data, scale="robust"):
    # pick the scaler by name; fail loudly on an unknown name
    if scale == "minmax":
        scaler = preprocessing.MinMaxScaler()
    elif scale == "standard":
        scaler = preprocessing.StandardScaler()
    elif scale == "quantile":
        scaler = preprocessing.QuantileTransformer()
    elif scale == "robust":
        scaler = preprocessing.RobustScaler()
    else:
        raise ValueError(f"unknown scale: {scale!r}")
    x_scaled = scaler.fit_transform(data.values)
    # keep the original column names and row index
    return pd.DataFrame(x_scaled, columns=data.columns, index=data.index)

scaled_df = scale_data(df_numeric, "robust")
Putting it all together for your dataset:
import pandas as pd
from sklearn import preprocessing

df = pd.read_excel("default of credit card clients.xls", skiprows=1)
y = df['default payment next month']
del df['default payment next month']

# positions of the categorical columns; every other column is treated as continuous
cat_idx = [2, 3, 4]
con_idx = [i for i in range(0, 24) if i not in cat_idx]
df_cat = df.iloc[:, cat_idx].copy()
df_con = df.iloc[:, con_idx].copy()

for c in df_cat:
    df_cat[c] = pd.factorize(df_cat[c])[0]

scaler = preprocessing.MinMaxScaler()
df_scaled = scaler.fit_transform(df_con)
df_scaled = pd.DataFrame(df_scaled, columns=df_con.columns, index=df_con.index)

df_final = pd.concat([df_cat, df_scaled], axis=1)
# restore the original column order
cols = df.columns
df_final = df_final[cols]
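To improve the code further, do the train/test split before normalizing: call fit_transform() on the training data and transform() on the test data. Otherwise you will introduce data leakage.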
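A minimal sketch of that split-then-scale order, assuming randomly generated stand-in data rather than your actual file (the names X, y, the column labels, and the 80/20 split are all illustrative choices, not something from your dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the continuous features and the target
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y = pd.Series(rng.integers(0, 2, size=100))

# 1) split first, so the test rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2) learn min/max from the training rows only
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)

# 3) reuse the training min/max on the test rows (transform only, no fit)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)
```

Because the scaler only saw the training rows, the scaled test values can fall slightly outside [0, 1]; that is expected and not a bug.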