
Error when concatenating two arrays in a Spark DataFrame column

  • chaouki  ·  asked 6 years ago

       df.printSchema()
       root
         |-- context_id: long (nullable = true)
         |-- data1: array (nullable = true)
         |    |-- element: struct (containsNull = true)
         |    |    |-- k: struct (nullable = false)
         |    |    |    |-- v: string (nullable = true)
         |    |    |    |-- t: string (nullable = false)
         |    |    |-- resourcename: string (nullable = true)
         |    |    |-- criticity: string (nullable = true)
         |    |    |-- v: string (nullable = true)
         |    |    |-- vn: double (nullable = true)
         |-- data2: array (nullable = true)
         |    |-- element: struct (containsNull = true)
         |    |    |-- k: struct (nullable = false)
         |    |    |    |-- v: string (nullable = true)
         |    |    |    |-- t: string (nullable = false)
         |    |    |-- resourcename: string (nullable = true)
         |    |    |-- criticity: string (nullable = true)
         |    |    |-- v: string (nullable = true)
         |    |    |-- vn: double (nullable = true)
    

    I created a udf to concatenate the two arrays, and I provided the schema of the result:

    val schema = df.select("data1").schema
    val concatArray = udf({ (x: Seq[Row], y: Seq[Row]) => x ++ y }, schema)
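
    I then apply it to the two array columns. The exact call isn't shown above, but it is roughly the following (the output column name "data" is only an example):

    // hypothetical application of concatArray to the two array columns from the schema above
    df.withColumn("data", concatArray(df("data1"), df("data2")))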
    

    When I apply it, I get this error:

    org.apache.spark.SparkException: Failed to execute user defined function($anonfun$11: (array<struct<k:struct<v:string,t:string>,resourcename:string,criticity:string,v:string,vn:double>>, array<struct<k:struct<v:string,t:string>,resourcename:string,criticity:string,v:string,vn:double>>) => struct<data1:array<struct<k:struct<v:string,t:string>,resourcename:string,criticity:string,v:string,vn:double>>>)
    

    Any suggestions?

    1 Answer

  •  user10462628  ·  answered 6 years ago

    The way you provide the schema is incorrect. The schema of a single-column DataFrame

    df.select("data1").schema
    

    is not the same as the schema of the column itself. Instead, you should use the schema of the field:

    val schema = df.schema("data1").dataType
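
    Putting it together, a minimal sketch (assuming df is the DataFrame whose schema is printed above; the withColumn call and the output column name "data" are illustrative, not from the question):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.udf

    // dataType of the field gives the type of the column itself,
    // i.e. array<struct<k:..., resourcename:..., criticity:..., v:..., vn:...>>
    val schema = df.schema("data1").dataType

    // untyped udf with an explicit result type, as in the question
    val concatArray = udf({ (x: Seq[Row], y: Seq[Row]) => x ++ y }, schema)

    // apply it to the two array columns; "data" is just an example output column name
    val result = df.withColumn("data", concatArray(df("data1"), df("data2")))

    As a side note, on Spark 2.4 or later the built-in concat function also accepts array columns, so the same result can be obtained without a udf: concat(df("data1"), df("data2")).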