
Spark - convert RDD[Vector] to a DataFrame with a variable number of columns

  •  0
  • Arturo Gatto  · Tech Community  · 8 years ago

    The input is an RDD[Vector] whose vectors have varying lengths.

    I tried using the shapeless library, but it requires the number of columns and their types to be declared up front. E.g.:

    val df = rddVector.map(_.toArray.toList)
      .collect  {
              case t: List[Double] if t.length == 3 => t.toHList[Double :: Double :: Double :: HNil].get.tupled.productArity
      }
      .toDF( "column_1", "column_2", "column_3" )
    

    Thanks

    1 answer  |  last activity 8 years ago
        1
  •  3
  •   Sanchit Grover    8 years ago

    This worked for me.

      // Imports for vectors, rows and schema types (Spark 1.x / mllib API)
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

      // Create an RDD of vectors with varying lengths
      val vectorRDD = sc.parallelize(Seq(Seq(123L, 345L), Seq(567L, 789L), Seq(567L, 789L, 233334L))).
        map(s => Vectors.dense(s.toSeq.map(_.toString.toDouble).toArray))

      // Find the maximum vector length, which determines the number of columns
      val vectorLength = vectorRDD.map(x => x.toArray.length).max()

      // Build the schema dynamically: one nullable double column per position
      var schema = new StructType()
      var i = 0
      while (i < vectorLength) {
        schema = schema.add(StructField(s"val${i}", DoubleType, true))
        i = i + 1
      }

      // Convert each vector to a Row of fixed arity, zero-padding shorter vectors
      val rowRDD = vectorRDD.map { x =>
        val row = new Array[Double](vectorLength)
        val values = x.toArray

        System.arraycopy(values, 0, row, 0, values.length)

        Row.fromSeq(row)
      }

      // Create the DataFrame from the row RDD and the dynamic schema
      val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
    

     root
     |-- val0: double (nullable = true)
     |-- val1: double (nullable = true)
     |-- val2: double (nullable = true)
    
    +-----+-----+--------+
    | val0| val1|    val2|
    +-----+-----+--------+
    |123.0|345.0|     0.0|
    |567.0|789.0|     0.0|
    |567.0|789.0|233334.0|
    +-----+-----+--------+
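    The zero-padding step in the answer can also be expressed with Scala's built-in `Seq.padTo`, which avoids the manual `System.arraycopy` over a pre-sized array. A minimal sketch of just that padding logic, independent of Spark (`PadDemo` and `padRows` are illustrative names, not part of the answer above):

    ```scala
    // Sketch: pad variable-length numeric rows to a common arity, the same
    // normalization the rowRDD step performs before building each Row.
    object PadDemo {
      def padRows(rows: Seq[Seq[Double]]): Seq[Seq[Double]] = {
        // The longest row determines the number of columns
        val width = rows.map(_.length).max
        // padTo appends 0.0 to each row until it reaches that width
        rows.map(_.padTo(width, 0.0))
      }

      def main(args: Array[String]): Unit = {
        val padded = padRows(Seq(Seq(123.0, 345.0), Seq(567.0, 789.0, 233334.0)))
        padded.foreach(println)
      }
    }
    ```

    Each padded row could then be passed to `Row.fromSeq` exactly as in the answer.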
    