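I ran into a strange problem while trying to generate one-hot encoded vectors for categorical features using pyspark's OneHotEncoder (https://spark.apache.org/docs/2.1.0/ml-features.html#onehotencoder): the one-hot vectors appear to be missing some categories (or are perhaps just formatted oddly when displayed?).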
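Having now answered this question (or at least provided an answer), it appears that the details below are not entirely relevant to understanding the problem.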
I have a dataset of the form
1. Wife's age (numerical)
2. Wife's education (categorical) 1=low, 2, 3, 4=high
3. Husband's education (categorical) 1=low, 2, 3, 4=high
4. Number of children ever born (numerical)
5. Wife's religion (binary) 0=Non-Islam, 1=Islam
6. Wife's now working? (binary) 0=Yes, 1=No
7. Husband's occupation (categorical) 1, 2, 3, 4
8. Standard-of-living index (categorical) 1=low, 2, 3, 4=high
9. Media exposure (binary) 0=Good, 1=Not good
10. Contraceptive method used (class attribute) 1=No-use, 2=Long-term, 3=Short-term
with the actual data looking like
wife_age,wife_edu,husband_edu,num_children,wife_religion,wife_working,husband_occupation,SoL_index,media_exposure,contraceptive
24,2,3,3,1,1,2,3,0,1
45,1,3,10,1,1,3,4,0,1
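Sourced from: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.
After doing some other preprocessing on the data, I then try to encode the categorical and binary (just for the sake of practice) features via...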
from pyspark.ml.feature import OneHotEncoder

for inds in ['wife_edu', 'husband_edu', 'husband_occupation', 'SoL_index', 'wife_religion', 'wife_working', 'media_exposure', 'contraceptive']:
    encoder = OneHotEncoder(inputCol=inds, outputCol='%s_1hot' % inds)
    print encoder.k
    dataset = encoder.transform(dataset)
which produces a row like
Row(
....,
numeric_features=DenseVector([24.0, 3.0]), numeric_features_normalized=DenseVector([-1.0378, -0.1108]),
wife_edu_1hot=SparseVector(4, {2: 1.0}),
husband_edu_1hot=SparseVector(4, {3: 1.0}),
husband_occupation_1hot=SparseVector(4, {2: 1.0}),
SoL_index_1hot=SparseVector(4, {3: 1.0}),
wife_religion_1hot=SparseVector(1, {0: 1.0}),
wife_working_1hot=SparseVector(1, {0: 1.0}),
media_exposure_1hot=SparseVector(1, {0: 1.0}),
contraceptive_1hot=SparseVector(2, {0: 1.0})
)
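For reference, here is a minimal plain-Python sketch of a one-hot encoding that drops the last category, which is what the Spark documentation describes as the default (`dropLast=True`) for OneHotEncoder. The helper name `one_hot_drop_last` and the category counts are hypothetical, chosen only for illustration. The sketch makes no claim about how this particular dataset was indexed or preprocessed.

```python
def one_hot_drop_last(index, num_categories, drop_last=True):
    """Return (size, {position: 1.0}) for a category index, in sparse style.

    Hypothetical helper: mimics the idea of a drop-last one-hot encoding,
    not Spark's actual implementation.
    """
    if not 0 <= index < num_categories:
        raise ValueError("index must be in [0, num_categories)")
    size = num_categories - 1 if drop_last else num_categories
    # With drop_last, the final category is encoded as the all-zeros vector.
    entries = {index: 1.0} if index < size else {}
    return size, entries


print(one_hot_drop_last(2, 5))  # (4, {2: 1.0})
print(one_hot_drop_last(0, 2))  # (1, {0: 1.0})
print(one_hot_drop_last(1, 2))  # (1, {})
```

Under this scheme a two-category feature yields a vector of size 1, so a size-1 one-hot vector is not necessarily a sign that categories went missing.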
My understanding of the sparse vector format is that
SparseVector(S, {i1: v1}, {i2: v2}, ..., {in: vn})
denotes a vector of length S in which all values are 0 except at indices i1,...,in, which hold the corresponding values v1,...,vn (https://www.cs.umd.edu/Outreach/hsContest99/questions/node3.html).
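To make that reading concrete, here is a small plain-Python sketch that expands a (size, entries) pair into a dense list. It assumes the first argument is the vector length and that the index/value pairs arrive as a single dict, which is how pyspark prints them in the row above. `to_dense` is a hypothetical helper, not a pyspark function.

```python
def to_dense(size, entries):
    """Expand a sparse (size, {index: value}) description into a dense list."""
    dense = [0.0] * size
    for index, value in entries.items():
        if not 0 <= index < size:
            raise IndexError("index %d out of range for size %d" % (index, size))
        dense[index] = value
    return dense


print(to_dense(4, {2: 1.0}))  # [0.0, 0.0, 1.0, 0.0]
print(to_dense(1, {0: 1.0}))  # [1.0]
```

If the first argument were the highest index rather than the size, the first call would have to produce five slots instead of four.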
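Based on this, it seems like the first argument of SparseVector in this case actually denotes the highest index in the vector (not the size). Furthermore, combining all the features (via pyspark's VectorAssembler) and checking the array version of the resulting
dataset.head(n=1)
vector shows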
input_features=SparseVector(23, {0: -1.0378, 1: -0.1108, 4: 1.0, 9: 1.0, 12: 1.0, 17: 1.0, 18: 1.0, 19: 1.0, 20: 1.0, 21: 1.0})
indicates a vector looking like
indices: 0 1 2 3 4... 9 12 17 18 19 20 21
[-1.0378, -0.1108, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
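I would think it should be impossible to have a consecutive run of >= 3 1s (as can be seen near the tail of the vector above), since that would mean some one-hot vector (e.g. the one containing the middle 1) has a size of only 1, which would not make sense for any of the data features.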
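One way to check this is to split the assembled indices back into per-column pieces, treating each first argument as a size. The sketch below is plain Python and rests on two assumptions that should be stated loudly: the columns were assembled in the order they appear in the Row output, and the two leading values come from numeric_features_normalized. `split_by_column` is a hypothetical helper.

```python
# Assumption: (column name, size) pairs read off the Row output, in Row order.
columns = [
    ("numeric_features_normalized", 2),
    ("wife_edu_1hot", 4),
    ("husband_edu_1hot", 4),
    ("husband_occupation_1hot", 4),
    ("SoL_index_1hot", 4),
    ("wife_religion_1hot", 1),
    ("wife_working_1hot", 1),
    ("media_exposure_1hot", 1),
    ("contraceptive_1hot", 2),
]

# Non-zero entries of the assembled input_features vector shown above.
assembled = {0: -1.0378, 1: -0.1108, 4: 1.0, 9: 1.0, 12: 1.0,
             17: 1.0, 18: 1.0, 19: 1.0, 20: 1.0, 21: 1.0}


def split_by_column(columns, entries):
    """Map global indices back to (column, local index) using cumulative sizes."""
    result = {}
    offset = 0
    for name, size in columns:
        result[name] = {i - offset: v for i, v in entries.items()
                        if offset <= i < offset + size}
        offset += size
    return offset, result


total, per_column = split_by_column(columns, assembled)
print(total)  # 23
for name, _ in columns:
    print(name, per_column[name])
```

Under these assumptions the sizes sum to 23, matching the assembled vector, and every local index matches the per-column output. The 1s at indices 18, 19 and 20 would then come from the three size-1 binary columns rather than from one oversized one-hot vector.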
I am very new to machine learning, so I may be confused about some basic concepts here, but does anyone know what could be going on?