
How do I stop VectorAssembler from compressing the data?

  • mentongwu  · 8 years ago

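    I want to combine multiple columns into a single column with VectorAssembler, but by default the data is compressed (some rows come out as sparse vectors), and there is no option to turn this off.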

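    // Assumes spark-shell, where sc and spark.implicits._ are already in scope
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.functions.col

    // Four rows with a decreasing number of trailing zeros
    val arr2 = Array((1,2,0,0,0),(1,2,3,0,0),(1,2,4,5,0),(1,2,2,5,6))
    val df = sc.parallelize(arr2).toDF("a","b","c","e","f")
    val colNames = Array("a","b","c","e","f")
    // Assemble all five columns into a single vector column
    val assembler = new VectorAssembler()
      .setInputCols(colNames)
      .setOutputCol("newCol")
    val transDF = assembler.transform(df).select(col("newCol"))
    transDF.show(false)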
    

    The input is:

      +---+---+---+---+---+
      |  a|  b|  c|  e|  f|
      +---+---+---+---+---+
      |  1|  2|  0|  0|  0|
      |  1|  2|  3|  0|  0|
      |  1|  2|  4|  5|  0|
      |  1|  2|  2|  5|  6|
      +---+---+---+---+---+
    

    The result is:

    +---------------------+
    |newCol               |
    +---------------------+
    |(5,[0,1],[1.0,2.0])  |
    |[1.0,2.0,3.0,0.0,0.0]|
    |[1.0,2.0,4.0,5.0,0.0]|
    |[1.0,2.0,2.0,5.0,6.0]|
    +---------------------+
    

    My expected result is:

    +---------------------+
    |newCol               |
    +---------------------+
    |[1.0,2.0,0.0,0.0,0.0]|
    |[1.0,2.0,3.0,0.0,0.0]|
    |[1.0,2.0,4.0,5.0,0.0]|
    |[1.0,2.0,2.0,5.0,6.0]|
    +---------------------+
    

    What should I do to get the expected result?

    1 answer  |  as of 8 years ago
  •   GPI    8 years ago

    If you really want to force every vector into its dense representation, you can use a user-defined function:

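    import org.apache.spark.sql.functions.udf
    // Convert every vector (sparse or dense) to its dense form
    val toDense = udf((v: org.apache.spark.ml.linalg.Vector) => v.toDense)
    transDF.select(toDense($"newCol")).show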
    
    +--------------------+
    |         UDF(newCol)|
    +--------------------+
    |[1.0,2.0,0.0,0.0,...|
    |[1.0,2.0,3.0,0.0,...|
    |[1.0,2.0,4.0,5.0,...|
    |[1.0,2.0,2.0,5.0,...|
    +--------------------+
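
As for why only the first row was stored sparsely: as far as I recall, VectorAssembler calls `compressed` on the assembled vector, and that method picks the sparse form when `1.5 * (nnz + 1) < size`, where `nnz` is the number of non-zero entries. Treat that rule as an assumption to check against your Spark version. The Spark-free sketch below only mimics it; `CompressionSketch`, `prefersSparse` and `render` are hypothetical names, not Spark APIs.

```scala
// Spark-free sketch of the size heuristic assumed to back Vector.compressed.
// All names in this object are hypothetical and exist only for illustration.
object CompressionSketch {
  // Count the entries that are not exactly zero.
  def numNonzeros(values: Array[Double]): Int = values.count(_ != 0.0)

  // Assumed rule: the sparse form wins when 1.5 * (nnz + 1) < size.
  def prefersSparse(values: Array[Double]): Boolean =
    1.5 * (numNonzeros(values) + 1.0) < values.length

  // Render a row in the notation used for the chosen representation:
  // (size,[indices],[values]) for sparse, [v0,v1,...] for dense.
  def render(values: Array[Double]): String =
    if (prefersSparse(values)) {
      val nonZero = values.zipWithIndex.filter(_._1 != 0.0)
      val indices = nonZero.map(_._2).mkString(",")
      val entries = nonZero.map(_._1).mkString(",")
      s"(${values.length},[$indices],[$entries])"
    } else {
      values.mkString("[", ",", "]")
    }

  def main(args: Array[String]): Unit = {
    // The four rows from the question.
    val rows = Seq(
      Array(1.0, 2.0, 0.0, 0.0, 0.0),
      Array(1.0, 2.0, 3.0, 0.0, 0.0),
      Array(1.0, 2.0, 4.0, 5.0, 0.0),
      Array(1.0, 2.0, 2.0, 5.0, 6.0)
    )
    rows.foreach(row => println(render(row)))
  }
}
```

Under that assumed rule, the first row has 2 non-zeros out of 5 (`1.5 * 3 = 4.5 < 5`), so it is the only one kept sparse, which matches the output in the question. Both forms hold the same values, so the UDF above changes only the representation. To reproduce the expected table exactly, you would probably also alias the UDF column with `.as("newCol")` and call `show(false)` so the vectors are not truncated.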