
Error when concatenating two arrays in a Spark DataFrame column

  • chaouki  ·  asked 6 years ago

       df.printSchema()
       root
         |-- context_id: long (nullable = true)
         |-- data1: array (nullable = true)
         |    |-- element: struct (containsNull = true)
         |    |    |-- k: struct (nullable = false)
         |    |    |    |-- v: string (nullable = true)
         |    |    |    |-- t: string (nullable = false)
         |    |    |-- resourcename: string (nullable = true)
         |    |    |-- criticity: string (nullable = true)
         |    |    |-- v: string (nullable = true)
         |    |    |-- vn: double (nullable = true)
         |-- data2: array (nullable = true)
         |    |-- element: struct (containsNull = true)
         |    |    |-- k: struct (nullable = false)
         |    |    |    |-- v: string (nullable = true)
         |    |    |    |-- t: string (nullable = false)
         |    |    |-- resourcename: string (nullable = true)
         |    |    |-- criticity: string (nullable = true)
         |    |    |-- v: string (nullable = true)
         |    |    |-- vn: double (nullable = true)
    

    I created a udf to concatenate the two arrays, and I provided the schema of the result:

    val schema = df.select("data1").schema
    val concatArray = udf({ (x: Seq[Row], y: Seq[Row]) => x ++ y }, schema)
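
    I then apply it to the two array columns. The exact call isn't shown above, but it is roughly the following (the output column name "data" is only an example):

    // hypothetical application of concatArray to the two array columns from the schema above
    df.withColumn("data", concatArray(df("data1"), df("data2")))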
    

    When I apply it, I get this error:

    org.apache.spark.SparkException: Failed to execute user defined function($anonfun$11: (array<struct<k:struct<v:string,t:string>,resourcename:string,criticity:string,v:string,vn:double>>, array<struct<k:struct<v:string,t:string>,resourcename:string,criticity:string,v:string,vn:double>>) => struct<data1:array<struct<k:struct<v:string,t:string>,resourcename:string,criticity:string,v:string,vn:double>>>)
    

    Any suggestions?

    1 Answer

  •  user10462628  ·  answered 6 years ago

    The way you provide the schema is incorrect. The schema of a single-column DataFrame

    df.select("data1").schema
    

    is not the same as the schema of the column itself. Instead, you should use the schema of the field:

    val schema = df.schema("data1").dataType
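
    Putting it together, a minimal sketch (assuming df is the DataFrame whose schema is printed above; the withColumn call and the output column name "data" are illustrative, not from the question):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.udf

    // dataType of the field gives the type of the column itself,
    // i.e. array<struct<k:..., resourcename:..., criticity:..., v:..., vn:...>>
    val schema = df.schema("data1").dataType

    // untyped udf with an explicit result type, as in the question
    val concatArray = udf({ (x: Seq[Row], y: Seq[Row]) => x ++ y }, schema)

    // apply it to the two array columns; "data" is just an example output column name
    val result = df.withColumn("data", concatArray(df("data1"), df("data2")))

    As a side note, on Spark 2.4 or later the built-in concat function also accepts array columns, so the same result can be obtained without a udf: concat(df("data1"), df("data2")).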