One option is to sum up the array per row and then compute the overall total. The per-row sum can be done with the Spark SQL higher-order function aggregate:
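For reference, an input DataFrame matching the output shown below could be built like this (a hypothetical sketch, since df is defined elsewhere; it assumes a SparkSession named spark with its implicits in scope):

// Hypothetical input data matching the output shown below
import spark.implicits._
val df = Seq((1, Seq(3, 4, 5)), (2, Seq(1, 2))).toDF("id", "val")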
import org.apache.spark.sql.functions.{expr, sum}

val tmp = df.withColumn("summed_val", expr("aggregate(val, 0, (acc, x) -> acc + x)"))
tmp.show()
+---+---------+----------+
| id|      val|summed_val|
+---+---------+----------+
|  1|[3, 4, 5]|        12|
|  2|   [1, 2]|         3|
+---+---------+----------+
// One-row DataFrame with the overall sum; collecting it to a scalar value is possible too (sketched below the output).
tmp.agg(sum("summed_val").alias("total")).show()
+-----+
|total|
+-----+
|   15|
+-----+
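As noted above, the total can also be pulled back to the driver as a plain Scala value. A minimal sketch, assuming the sums fit in a Long (Spark's sum over an integer column returns a bigint):

// Collect the single aggregated row and read the total as a Long
val total: Long = tmp.agg(sum("summed_val")).first().getLong(0)
// total: Long = 15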
Another option is to explode the array into one row per element and then sum those elements:
import org.apache.spark.sql.functions.{explode, sum}
import spark.implicits._ // needed for the $"col" syntax outside spark-shell

val tmp = df.withColumn("elem", explode($"val"))
tmp.agg(sum($"elem").alias("total")).show()
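With the same example data this should again print:

+-----+
|total|
+-----+
|   15|
+-----+

Note the trade-off: explode materializes one row per array element before aggregating, whereas the aggregate variant sums within each row and keeps the row count unchanged.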