
Reading JSON with an inner array using Spark SQL

apache-spark-sql apache-spark scala


HbnKing · Tech Community · 6 years ago

I am trying to read JSON into a Dataset (Spark 2.3.2). Unfortunately, it does not work well.

Here is the data; it is a JSON file with an inner array:

{ "Name": "helloworld", 
  "info": { "privateInfo": [ {"salary":1200}, {"sex":"M"}],
            "house": "sky road" 
          }, 
  "otherinfo":2
}   
{ "Name": "helloworld2",
  "info": { "privateInfo": [ {"sex":"M"}],
            "house": "sky road" 
          }, 
  "otherinfo":3
}

I use the SparkSession to select the columns, but there is a problem: the result is not the value itself but the value wrapped in an array.

val sqlDF = spark.sql("SELECT name, info.privateInfo.salary, info.privateInfo.sex FROM people1")
sqlDF.show()

But the columns salary & sex come back as arrays:

+-----------+-------+-----+
|       name| salary|  sex|
+-----------+-------+-----+
| helloworld|[1200,]|[, M]|
|helloworld2|     []|  [M]|
+-----------+-------+-----+

How can I get each value with its own data type instead of an array?

For example:

+-----------+---------+---+
|       name|   salary|sex|
+-----------+---------+---+
| helloworld|     1200|  M|
|helloworld2|none/null|  M|
+-----------+---------+---+



Gelerion 6 years ago

Short answer

spark.sql("SELECT name , " +
      "element_at(filter(info.privateInfo.salary, salary -> salary is not null), 1) AS salary ," +
      "element_at(filter(info.privateInfo.sex, sex -> sex is not null), 1) AS sex" +
      "   FROM people1 ")

+-----------+------+---+
|       name|salary|sex|
+-----------+------+---+
| helloworld|  1200|  M|
|helloworld2|  null|  M|
+-----------+------+---+
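
For anyone who wants to run the short answer end to end, here is a minimal sketch. It assumes Spark 2.4 or later on the classpath and a local session; the app name, the in-memory `records` dataset (standing in for the JSON file), and the lambda variable `s` are my own choices, not part of the question. The `spark.sql.ansi.enabled` setting is an assumption for newer Spark versions, where `element_at` may raise an error on an empty array instead of returning null.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; filter and element_at need Spark 2.4 or later.
val spark = SparkSession.builder()
  .appName("inner-array-json")
  .master("local[*]")
  // Assumption: with ANSI mode on, element_at may throw on an empty array
  // instead of returning null, so it is switched off here.
  .config("spark.sql.ansi.enabled", "false")
  .getOrCreate()
import spark.implicits._

// The two records from the question, one JSON object per line.
val records = Seq(
  """{"Name":"helloworld","info":{"privateInfo":[{"salary":1200},{"sex":"M"}],"house":"sky road"},"otherinfo":2}""",
  """{"Name":"helloworld2","info":{"privateInfo":[{"sex":"M"}],"house":"sky road"},"otherinfo":3}"""
).toDS()

spark.read.json(records).createOrReplaceTempView("people1")

// Drop the nulls from each projected array, then take the first remaining element.
val result = spark.sql(
  """SELECT name,
    |       element_at(filter(info.privateInfo.salary, s -> s IS NOT NULL), 1) AS salary,
    |       element_at(filter(info.privateInfo.sex, s -> s IS NOT NULL), 1) AS sex
    |FROM people1""".stripMargin)

result.show()
```

Newer Spark releases also ship `try_element_at`, which may be a better fit if ANSI mode has to stay enabled.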

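Long answer
The main concern is the nullability of the array elements. Spark merges the objects inside privateInfo into a single struct type, so every element carries both salary and sex, and whichever key is missing in the JSON comes back as null: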

root
 |-- Name: string (nullable = true)
 |-- info: struct (nullable = true)
 |    |-- house: string (nullable = true)
 |    |-- privateInfo: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- salary: long (nullable = true)
 |    |    |    |-- sex: string (nullable = true)
 |-- otherinfo: long (nullable = true)
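
To make the nulls visible, the array elements can be flattened into rows. The following is only a sketch: the local session, the in-memory `records` dataset, and the name `elements` are illustrative assumptions. It uses the built-in `inline` generator, which turns an array of structs into one row per element.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("inspect-private-info")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// The two records from the question, one JSON object per line.
val records = Seq(
  """{"Name":"helloworld","info":{"privateInfo":[{"salary":1200},{"sex":"M"}],"house":"sky road"},"otherinfo":2}""",
  """{"Name":"helloworld2","info":{"privateInfo":[{"sex":"M"}],"house":"sky road"},"otherinfo":3}"""
).toDS()

val df = spark.read.json(records)
df.printSchema()

// One row per array element; each row has both a salary and a sex column,
// and the field that was absent in the JSON object is null.
val elements = df.selectExpr("Name", "inline(info.privateInfo)")
elements.show()
```

This is why `info.privateInfo.salary` yields an array with a null slot for every element that only had a `sex` key.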

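So we need a way to filter out the null values. Fortunately, Spark 2.4 has built-in Higher-Order Functions (the question uses 2.3.2, where they are not yet available).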

My first attempt was array_remove, but unfortunately null can never equal null.
It is still possible with a more verbose syntax:

df.selectExpr("filter(info.privateInfo.salary, salary -> salary is not null)")

+------+
|salary|
+------+
|[1200]|
|    []|
+------+
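
The difference between `array_remove` and `filter` can be checked on a literal array. This is a sketch under my own assumptions (local session, hand-written literal array, the names `removed` and `filtered`); as far as I can tell, `array_remove` evaluates to NULL as a whole when the element to remove is NULL, whereas the lambda in `filter` can test `IS NOT NULL` directly.

```scala
import org.apache.spark.sql.SparkSession

// Requires Spark 2.4 or later for array_remove and filter.
val spark = SparkSession.builder()
  .appName("array-remove-vs-filter")
  .master("local[*]")
  .getOrCreate()

// array_remove compares elements by equality, and a comparison with NULL is never true.
// The whole expression is expected to come back as NULL rather than [1200].
val removed = spark.sql(
  "SELECT array_remove(array(CAST(1200 AS BIGINT), CAST(NULL AS BIGINT)), CAST(NULL AS BIGINT)) AS r")

// filter evaluates a predicate per element, so the null check works.
val filtered = spark.sql(
  "SELECT filter(array(CAST(1200 AS BIGINT), CAST(NULL AS BIGINT)), x -> x IS NOT NULL) AS r")

removed.show()
filtered.show()
```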

Now we need some way to explode the array; luckily Spark has an explode function!

df.selectExpr(
 "explode(filter(info.privateInfo.salary, salary -> salary is not null)) AS salary",
 "explode(filter(info.privateInfo.sex, sex -> sex is not null)) AS sex")

Boom:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Only one generator allowed per select clause but found 2
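
If exploding is still preferred, one possible workaround is to use a single generator per select and chain two selects. This is a sketch, not the approach the answer settles on: the session setup, the in-memory `records` dataset, and the names `step1` and `step2` are assumptions. It uses `explode_outer` rather than `explode`, because a plain `explode` drops rows whose filtered array is empty, which would lose helloworld2.

```scala
import org.apache.spark.sql.SparkSession

// Requires Spark 2.4 or later for filter; explode_outer has been available since 2.2.
val spark = SparkSession.builder()
  .appName("chained-explode")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// The two records from the question, one JSON object per line.
val records = Seq(
  """{"Name":"helloworld","info":{"privateInfo":[{"salary":1200},{"sex":"M"}],"house":"sky road"},"otherinfo":2}""",
  """{"Name":"helloworld2","info":{"privateInfo":[{"sex":"M"}],"house":"sky road"},"otherinfo":3}"""
).toDS()
val df = spark.read.json(records)

// First select: the only generator is the one for salary; info is carried along.
val step1 = df.selectExpr(
  "Name",
  "info",
  "explode_outer(filter(info.privateInfo.salary, s -> s is not null)) AS salary")

// Second select: the only generator is the one for sex.
val step2 = step1.selectExpr(
  "Name",
  "salary",
  "explode_outer(filter(info.privateInfo.sex, s -> s is not null)) AS sex")

step2.show()
```

Note that this multiplies rows if an array ever holds more than one non-null value, so it only matches the desired output when each filtered array has at most one element.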

Since we know there should be only one value left in each array, we can use element_at instead:

df.selectExpr(
  "element_at(filter(info.privateInfo.salary, salary -> salary is not null), 1) AS salary",
  "element_at(filter(info.privateInfo.sex, sex -> sex is not null), 1) AS sex")

P.S. I had not noticed that this question was asked 10 months ago.