
Reading an XML tag value with spark-xml: I want a single value but get an array

  •  tree em  ·  asked 6 years ago
    <row id='185685445477437.020001' xml:space='preserve'>
        <c2>KH0013001</c2>
        <c3>-2271164.00</c3>
        <c4>9</c4>
        <c7>65395</c7>
        <c9>1</c9>
        <c12>KHR</c12>
        <c16>TR</c16>
        <c17>6-71-10-1-001-030</c17>
        <c20>1</c20>
        <c22>1</c22>
        <c23>DC183050001030071</c23>
        <c24>DC</c24>
        <c25>20181101</c25>
        <c26>185685445477437.02</c26>
        <c26 m='3'>1</c26>
        <c29>1</c29>
        <c30>5011_DMUSER__OFS_DM.OFS.SRC.VAL</c30>
        <c31>1811012130</c31>
        <c32>6010_DMUSER</c32>
        <c56>PL.65395.......1.....KH0013001</c56>
        <c98></c98>
    </row>
    

    Spark code using spark-xml:

    import org.apache.spark.sql.SparkSession

    object sparkXml {
      def main(args: Array[String]): Unit = {

        val spark = SparkSession
          .builder.master("local[*]")
          //.config("spark.debug.maxToStringFields", "100")
          .appName("Insight Application Big Data")
          .getOrCreate()

        val df = spark.read
          .format("com.databricks.spark.xml")
          .option("rowTag", "row")
          .load("src/main/resources/in/FBNK_CATEG_ENTRY.xml")
        df.createOrReplaceTempView("categ_entry")

        df.printSchema()
        spark.sql("Select c26['_VALUE'] as value, c26['_m'] as option from categ_entry").show()
      }
    }
    

    Printed schema:

    root
     |-- _id: double (nullable = true)
     |-- _space: string (nullable = true)
     |-- c12: string (nullable = true)
     |-- c16: string (nullable = true)
     |-- c17: string (nullable = true)
     |-- c2: string (nullable = true)
     |-- c20: long (nullable = true)
     |-- c22: long (nullable = true)
     |-- c23: string (nullable = true)
     |-- c24: string (nullable = true)
     |-- c25: long (nullable = true)
     |-- c26: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- _VALUE: double (nullable = true)
     |    |    |-- _m: long (nullable = true)
     |-- c29: long (nullable = true)
     |-- c3: double (nullable = true)
     |-- c30: string (nullable = true)
     |-- c31: long (nullable = true)
     |-- c32: string (nullable = true)
     |-- c4: long (nullable = true)
     |-- c56: string (nullable = true)
     |-- c7: long (nullable = true)
     |-- c9: long (nullable = true)
     |-- c98: string (nullable = true)
    

    Result after running:

    +--------------------+------+
    |[1.85685445477437...| [, 3]|
    +--------------------+------+
    

    I want the result to look like this:

    +------------------+------+
    |185685445477437.02|3     |
    +------------------+------+
    

    Can anyone guide me on how to correct the code to produce the expected result?

    1 answer  |  active 6 years ago

  •  1
  •   NNK  ·  answered 6 years ago

    The way the data arrives makes it hard to produce the output you want.

    <c26>185685445477437.02</c26>   
    <c26 m='3'>1</c26>
    

    You have two tags, so Spark's inferred schema treats c26 as an array. You want 185685445477437.02 from the first c26's _VALUE and 3 from the second c26's m attribute, which may not be what you really intend.
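    To see why spark-xml infers an array here, the row can be parsed with Python's standard-library XML parser. This is only an illustration of the shape of the data (not Spark code), using a trimmed copy of the row from the question:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the row from the question: it contains two <c26>
# tags, which is why spark-xml infers an array type for column c26.
row_xml = """
<row id='185685445477437.020001'>
    <c26>185685445477437.02</c26>
    <c26 m='3'>1</c26>
</row>
"""

row = ET.fromstring(row_xml)
c26_tags = row.findall("c26")

# The first tag carries the wanted value as its text; the second
# carries the wanted option as its m attribute.
value = c26_tags[0].text       # "185685445477437.02"
option = c26_tags[1].get("m")  # "3"
print(value, option)
```

    Note that pulling the value from one sibling tag and the attribute from the other is an assumption about the data's meaning, which is exactly the concern raised above.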

    If you are looking for the output below, replace the last line with the following statements:

    import org.apache.spark.sql.functions.{col, explode}

    val df2 = df.withColumn("c26Struct", explode(col("c26")))
    df2.select(col("c26Struct._VALUE").alias("value"), col("c26Struct._m").alias("option")).show(false)
    
    
    +---------------------+------+
    |value                |option|
    +---------------------+------+
    |1.8568544547743703E14|null  |
    |1.0                  |3     |
    +---------------------+------+
    

    Here I created a new column, c26Struct, by exploding the c26 array, then selected the values from the exploded column.
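    As a plain-Python sketch (not Spark) of what that explode step does, assuming the schema inferred in the question, where c26 is an array of structs with _VALUE and _m fields:

```python
# One parsed row, shaped like the inferred schema; the c26 values
# are copied from the question's data.
row = {
    "c23": "DC183050001030071",
    "c26": [
        {"_VALUE": 185685445477437.02, "_m": None},
        {"_VALUE": 1.0, "_m": 3},
    ],
}

# explode(col("c26")) turns each array element into its own output row,
# keeping the other columns; the struct fields then become selectable
# per row, which is why the answer's table has two rows.
exploded = [{"c23": row["c23"], "c26Struct": elem} for elem in row["c26"]]

for r in exploded:
    print(r["c26Struct"]["_VALUE"], r["c26Struct"]["_m"])
```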

    Hope this helps!!

    Thanks, Naveen