代码之家  ›  专栏  ›  技术社区  ›  Connell.O'Donnell

PySpark DataFrame-筛选器嵌套列

  •  0
  • Connell.O'Donnell  · 技术社区  · 4 年前

    使用复制标志触发。我正在用Spark 3.0.1在Azure Databricks中使用Python3笔记本。

    +---+---------+--------+
    |ID |FirstName|LastName|
    +---+---------+--------+
    |1  |John     |Doe     |
    |2  |Michael  |        |
    |3  |Angela   |Merkel  |
    +---+---------+--------+
    

    from pyspark.sql.types import StructType,StructField, StringType, IntegerType
    import pyspark.sql.functions as F
    
    data2 = [(1,"John","Doe"),
        (2,"Michael",""),
        (3,"Angela","Merkel")
      ]
    
    schema = StructType([ \
        StructField("ID",IntegerType(),True), \
        StructField("FirstName",StringType(),True), \
        StructField("LastName",StringType(),True), \
      ])
     
    df1 = spark.createDataFrame(data=data2,schema=schema)
    df1.printSchema()
    df1.show(truncate=False)
    

    我把它转换成这个数据帧

    +---+-----------------------------------------+
    |ID |Names                                    |
    +---+-----------------------------------------+
    |1  |[[FirstName, John], [LastName, Doe]]     |
    |2  |[[FirstName, Michael], [LastName, ]]     |
    |3  |[[FirstName, Angela], [LastName, Merkel]]|
    +---+-----------------------------------------+
    

    使用此代码

    df2 = df1.select(
                'ID', 
                F.array(
                    F.struct(
                        F.lit('FirstName').alias('NameType'), 
                        F.col('FirstName').alias('Name')
                    ), 
                    F.struct(
                        F.lit('LastName').alias('NameType'), 
                        F.col('LastName').alias('Name')
                    )
                ).alias('Names')
            )
    
    df2.printSchema()
    df2.show(truncate=False)
    

    现在,我尝试过滤掉姓氏为null或是空字符串的名称。 Names 一个空的 Name 值被排除在外。

    这样地

    [
        {
            "ID": 1,
            "Names": [
                {
                    "NameType": "FirstName",
                    "Name": "John"
                },
                {
                    "NameType": "LastName",
                    "Name": "Doe"
                }
            ]
        },
        {
            "ID": 2,
            "Names": [
                {
                    "NameType": "FirstName",
                    "Name": "Michael"
                }
            ]
        },
        {
            "ID": 3,
            "Names": [
                {
                    "NameType": "FirstName",
                    "Name": "Angela"
                },
                {
                    "NameType": "LastName",
                    "Name": "Merkel"
                }
            ]
        }
    ]
    

    我试过了

    df2 = df1.select(
                'ID', 
                F.array(
                    F.struct(
                        F.lit('FirstName').alias('NameType'), 
                        F.col('FirstName').alias('Name')
                    ), 
                    F.struct(
                        F.lit('LastName').alias('NameType'), 
                        F.col('LastName').alias('Name')
                    )
                ).filter(lambda x: x.col('LastName').isNotNull()).alias('Names')
            )
    

    但我得到了错误 'Column' object is not callable .

    df2 = df2.filter(F.col('Names')['LastName']) > 0) 但这给了我一个机会 invalid syntax 错误。

    我试过了

    df2 = df2.filter(lambda x: (len(x)>0), F.col('Names')['LastName'])
    

    TypeError: filter() takes 2 positional arguments but 3 were given .

    有人能告诉我怎么做吗?

    1 回复  |  直到 4 年前
        1
  •  2
  •   mck    4 年前

    你可以使用高阶函数 filter

    import pyspark.sql.functions as F
    
    df3 = df2.withColumn(
        'Names', 
        F.expr("filter(Names, x -> case when x.NameType = 'LastName' and length(x.Name) = 0 then false else true end)")
    )
    
    df3.show(truncate=False)
    +---+-----------------------------------------+
    |ID |Names                                    |
    +---+-----------------------------------------+
    |1  |[[FirstName, John], [LastName, Doe]]     |
    |2  |[[FirstName, Michael]]                   |
    |3  |[[FirstName, Angela], [LastName, Merkel]]|
    +---+-----------------------------------------+