Are (…) the rows here in the same order as the input columns I specified?
Yes, they are. Let's trace what happens:
from pyspark.ml.feature import PCA, VectorAssembler
data = [
    (0.0, 1.0, 0.0, 7.0, 0.0),
    (2.0, 0.0, 3.0, 4.0, 5.0),
    (4.0, 0.0, 0.0, 6.0, 7.0),
]
df = spark.createDataFrame(data, ["u", "v", "x", "y", "z"])
VectorAssembler follows the order of the input columns:
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
vectors = assembler.transform(df).select("features")
vectors.schema[0].metadata
# {'ml_attr': {'attrs': {'numeric': [{'idx': 0, 'name': 'u'},
# {'idx': 1, 'name': 'v'},
# {'idx': 2, 'name': 'x'},
# {'idx': 3, 'name': 'y'},
# {'idx': 4, 'name': 'z'}]},
# 'num_attrs': 5}}
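To make the index-to-column mapping explicit, the metadata can be turned into an ordered list of names. This is only a minimal sketch in plain Python: the dictionary is copied from the output above, and the names `metadata`, `attrs` and `feature_names` are illustrative, not part of any Spark API.

```python
# Metadata copied from the output above.
metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [
                {"idx": 0, "name": "u"},
                {"idx": 1, "name": "v"},
                {"idx": 2, "name": "x"},
                {"idx": 3, "name": "y"},
                {"idx": 4, "name": "z"},
            ]
        },
        "num_attrs": 5,
    }
}

# Attributes are grouped by type (only "numeric" appears here), so flatten
# every group and then sort by position in the assembled vector.
attrs = [a for group in metadata["ml_attr"]["attrs"].values() for a in group]
feature_names = [a["name"] for a in sorted(attrs, key=lambda a: a["idx"])]

print(feature_names)  # ['u', 'v', 'x', 'y', 'z']
```

Position i in the features vector therefore corresponds to feature_names[i], which matches the order passed as inputCols.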
The principal components follow the same order:
model = PCA(inputCol="features", outputCol="pc_features", k=3).fit(vectors)
?model.pc
# Type: property
# String form: <property object at 0x7feb5bdc1d68>
# Docstring:
# Returns a principal components Matrix.
# Each column is one principal component.
#
# .. versionadded:: 2.0.0
Finally, a sanity check:
import numpy as np
x = np.array(data)
y = model.pc.values.reshape(3, 5).transpose()
z = np.array(model.transform(vectors).rdd.map(lambda row: row.pc_features).collect())
np.linalg.norm(x.dot(y) - z)
# 8.881784197001252e-16
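The reshape(3, 5).transpose() step works because the matrix values are stored column by column (column-major), so each run of 5 consecutive values is one principal component. The following is a rough pure-Python sketch of that unpacking. The helper `column_major_to_rows` and the toy data are hypothetical and only illustrate the layout; in practice `model.pc.toArray()` should give the same 5 x 3 array directly.

```python
def column_major_to_rows(values, num_rows, num_cols):
    """Rebuild a num_rows x num_cols matrix (list of rows) from a flat,
    column-major list of values."""
    return [
        [values[col * num_rows + row] for col in range(num_cols)]
        for row in range(num_rows)
    ]

# Toy 5 x 3 matrix whose entry at (row, col) is 10 * row + col,
# flattened one column at a time, as a column-major layout would store it.
flat = [10 * row + col for col in range(3) for row in range(5)]

matrix = column_major_to_rows(flat, num_rows=5, num_cols=3)

print(matrix[0])  # [0, 1, 2]
print(matrix[4])  # [40, 41, 42]
```

Each row of the rebuilt matrix corresponds to one input column (u, v, x, y, z) and each column to one principal component, which is why x.dot(y) reproduces the transformed features.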