
How do I create a JSON list with PySpark?

  •  -1
  • Shankar Panda  · 7 years ago

    Target output:

    [{
        "Loaded_data": [{
            "Loaded_numeric_columns": ["id", "val"],
            "Loaded_category_columns": ["name", "branch"]
        }],
        "enriched_data": [{
            "enriched_category_columns": ["country__4"],
            "enriched_index_columns": ["id__1", "val__3"]
        }]
    }]
    

    I can create the list for each section; see the code below. I am stuck at this point, though. Could you help me?

    Sample data: [image not included]

    input_data = spark.read.csv("/tmp/test234.csv", header=True, inferSchema=True)

    def is_numeric(data_type):
        # Treats every type except date, string and boolean as numeric.
        return data_type not in ('date', 'string', 'boolean')

    def is_nonnumeric(data_type):
        # ('string') is a plain str, so `in` would be a substring test; compare directly.
        return data_type == 'string'

    sub = "__"  # marker that identifies enriched columns
    Loaded_numeric_columns = [name for name, data_type in input_data.dtypes if is_numeric(data_type) and (sub not in name)]
    print(Loaded_numeric_columns)
    Loaded_category_columns = [name for name, data_type in input_data.dtypes if is_nonnumeric(data_type) and (sub not in name)]
    print(Loaded_category_columns)
    enriched_category_columns = [name for name, data_type in input_data.dtypes if is_nonnumeric(data_type) and (sub in name)]
    print(enriched_category_columns)
    enriched_index_columns = [name for name, data_type in input_data.dtypes if is_numeric(data_type) and (sub in name)]
    print(enriched_index_columns)
    
    1 answer  |  up to 7 years ago
        1
  •  1
  •   Steven    7 years ago

    You just need to create new columns of the struct and array types:

    from pyspark.sql import functions as F
    
    df.show()
    
    +---+-----+-------+------+----------+-----+-------+
    | id|  val|   name|branch|country__4|id__1| val__3|
    +---+-----+-------+------+----------+-----+-------+
    |  1|67.87|Shankar|     a|         1|67.87|Shankar|
    +---+-----+-------+------+----------+-----+-------+
    
    
    
    df.select(
      F.struct(
        F.array(F.col("id"), F.col("val")).alias("Loaded_numeric_columns"),
        F.array(F.col("name"), F.col("branch")).alias("Loaded_category_columns"),
      ).alias("Loaded_data"),
      F.struct(
        F.array(F.col("country__4")).alias("enriched_category_columns"),
        F.array(F.col("id__1"), F.col("val__3")).alias("enriched_index_columns"),
      ).alias("enriched_data"),
    ).printSchema()
    
    root
     |-- Loaded_data: struct (nullable = false)
     |    |-- Loaded_numeric_columns: array (nullable = false)
     |    |    |-- element: double (containsNull = true)
     |    |-- Loaded_category_columns: array (nullable = false)
     |    |    |-- element: string (containsNull = true)
     |-- enriched_data: struct (nullable = false)
     |    |-- enriched_category_columns: array (nullable = false)
     |    |    |-- element: long (containsNull = true)
     |    |-- enriched_index_columns: array (nullable = false)
     |    |    |-- element: string (containsNull = true)
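
The select above packs column values into structs and arrays. If the goal is the JSON of column names shown in the question, one possible approach is to assemble the four name lists into nested Python lists and dicts and serialize them with the standard `json` module. The sketch below is only an illustration: `dtypes` is a hypothetical stand-in for `input_data.dtypes` (a list of `(column_name, type_name)` tuples), its type names are assumed so that the result matches the target output, and `build_column_summary` is a helper name introduced here, not part of PySpark.

```python
import json

# Hypothetical stand-in for `input_data.dtypes`; the type names are assumed
# for illustration and chosen to reproduce the target output in the question.
dtypes = [
    ("id", "int"), ("val", "double"),
    ("name", "string"), ("branch", "string"),
    ("country__4", "string"),
    ("id__1", "double"), ("val__3", "double"),
]

SUB = "__"  # marker that identifies enriched columns


def is_numeric(data_type):
    # Same rule as in the question: everything except these types is numeric.
    return data_type not in ("date", "string", "boolean")


def is_nonnumeric(data_type):
    return data_type == "string"


def build_column_summary(dtypes, sub=SUB):
    """Group column names by type and by whether they carry the `sub` marker."""
    def pick(type_test, enriched):
        return [name for name, data_type in dtypes
                if type_test(data_type) and ((sub in name) == enriched)]

    return [{
        "Loaded_data": [{
            "Loaded_numeric_columns": pick(is_numeric, False),
            "Loaded_category_columns": pick(is_nonnumeric, False),
        }],
        "enriched_data": [{
            "enriched_category_columns": pick(is_nonnumeric, True),
            "enriched_index_columns": pick(is_numeric, True),
        }],
    }]


summary = build_column_summary(dtypes)
print(json.dumps(summary, indent=4))
```

With a real DataFrame, passing `input_data.dtypes` in place of the hard-coded list should yield the same structure, since only column metadata is read and no Spark job is triggered.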
    