简短的回答
spark.sql("SELECT name , " +
"element_at(filter(info.privateInfo.salary, salary -> salary is not null), 1) AS salary ," +
"element_at(filter(info.privateInfo.sex, sex -> sex is not null), 1) AS sex" +
" FROM people1 ")
+-----------+------+---+
| name|salary|sex|
+-----------+------+---+
| helloworld| 1200| M|
|helloworld2| null| M|
+-----------+------+---+
冗长的回答
主要关注的是数组的可空性
root
|-- Name: string (nullable = true)
|-- info: struct (nullable = true)
| |-- house: string (nullable = true)
| |-- privateInfo: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- salary: long (nullable = true)
| | | |-- sex: string (nullable = true)
|-- otherinfo: long (nullable = true)
所以我们需要一种过滤空值的方法,幸运的是spark
2.4
有内置的
Higher-Order Functions
第一次尝试是
array_remove
,但不幸的是
null
不能等于
无效的
.
使用更详细的语法仍然是可能的
df.selectExpr("filter(info.privateInfo.salary, salary -> salary is not null)")
+------+
|salary|
+------+
|[1200]|
| []|
+------+
现在我们需要一些方法来爆炸阵列,幸运的是我们的星火
explode
功能!
df.selectExpr(
"explode(filter(info.privateInfo.salary, salary -> salary is not null)) AS salary",
"explode(filter(info.privateInfo.sex, sex -> sex is not null)) AS sex")
繁荣
Exception in thread "main" org.apache.spark.sql.AnalysisException: Only one generator allowed per select clause but found 2
我们知道数组中应该只有一个值,我们可以使用
element_at
df.selectExpr(
"element_at(filter(info.privateInfo.salary, salary -> salary is not null), 1) AS salary",
"element_at(filter(info.privateInfo.sex, sex -> sex is not null), 1) AS sex")
p.s.还没注意到10个月前有人问过这个问题