
Overwrite column values with values from another column based on a condition

pyspark apache-spark

6

User12345 · Tech Community · 7 years ago

I have a data frame in pyspark, shown below.

df.show()

+-----------+------------+-------------+
|customer_id|product_name|      country|
+-----------+------------+-------------+
|   12870946|        null|       Poland|
|     815518|       MA401|United States|
|    3138420|     WG111v2|           UK|
|    3178864|    WGR614v6|United States|
|    7456796|       XE102|United States|
|   21893468|     AGM731F|United States|
+-----------+------------+-------------+

I have another data frame, shown below: df1.show()

+-----------+------------+
|customer_id|product_name|
+-----------+------------+
|   12870946|     GS748TS|
|     815518|       MA402|
|    3138420|        null|
|    3178864|    WGR614v6|
|    7456796|       XE102|
|   21893468|     AGM731F|
|       null|       AE171|
+-----------+------------+

Now I want to do a full outer join on these tables and update the product_name column values as follows.

1) Overwrite the values in `df` using values in `df1` if there are values in `df1`.
2) If there are `null` values or no values in `df1`, leave the values in `df` as they are.

expected result

+-----------+------------+-------------+
|customer_id|product_name|      country|
+-----------+------------+-------------+
|   12870946|     GS748TS|       Poland|
|     815518|       MA402|United States|
|    3138420|     WG111v2|           UK|
|    3178864|    WGR614v6|United States|
|    7456796|       XE102|United States|
|   21893468|     AGM731F|United States|
|       null|       AE171|         null|
+-----------+------------+-------------+

I have tried the following:

import pyspark.sql.functions as f
df2 = df.join(df1, df.customer_id == df1.customer_id, 'full_outer').select(df.customer_id, f.coalesce(df.product_name, df1.product_name).alias('product_name'), df.country)

But the result I get is different:

df2.show()

+-----------+------------+-------------+
|customer_id|product_name|      country|
+-----------+------------+-------------+
|   12870946|        null|       Poland|
|     815518|       MA401|United States|
|    3138420|     WG111v2|           UK|
|    3178864|    WGR614v6|United States|
|    7456796|       XE102|United States|
|   21893468|     AGM731F|United States|
|       null|       AE171|         null|
+-----------+------------+-------------+

How can I get the expected result?

3 replies | latest 7 years ago

1

6

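pault Tanjin 7 years ago

The code you wrote produces the correct output for me, so I can't reproduce your problem. I have seen other posts where using aliases in the join resolved similar issues, so here is a slightly modified version of your code that does the same thing: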

import pyspark.sql.functions as f

df.alias("r").join(df1.alias("l"), on="customer_id", how='full_outer')\
    .select(
        "customer_id",
        f.coalesce("r.product_name", "l.product_name").alias('product_name'),
        "country"
    )\
    .show()
#+-----------+------------+-------------+
#|customer_id|product_name|      country|
#+-----------+------------+-------------+
#|    7456796|       XE102|United States|
#|    3178864|    WGR614v6|United States|
#|       null|       AE171|         null|
#|     815518|       MA401|United States|
#|    3138420|     WG111v2|           UK|
#|   12870946|     GS748TS|       Poland|
#|   21893468|     AGM731F|United States|
#+-----------+------------+-------------+

When I run your code, I get the same result (your code is copied below):

df.join(df1, df.customer_id == df1.customer_id, 'full_outer')\
    .select(
        df.customer_id,
        f.coalesce(df.product_name, df1.product_name).alias('product_name'),
        df.country
    )\
    .show()

I am using Spark 2.1 and Python 2.7.13.
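One thing worth noting: the output above still shows MA401 for customer 815518, while the expected result in the question shows MA402. A likely reading is that the argument order of coalesce matters here, since it returns its first non-null argument. The following is only a plain-Python model of that behaviour (the helper `coalesce` and the two dictionaries are illustrative stand-ins, not Spark objects), using three rows from the question:

```python
def coalesce(*values):
    """Return the first argument that is not None (models f.coalesce)."""
    for value in values:
        if value is not None:
            return value
    return None


# product_name per customer_id, copied from the df and df1 tables in the question
df_names = {12870946: None, 815518: "MA401", 3138420: "WG111v2"}
df1_names = {12870946: "GS748TS", 815518: "MA402", 3138420: None}

# df first: df1 only fills the gaps in df (the order used in the posted code)
df_first = {k: coalesce(df_names[k], df1_names[k]) for k in df_names}

# df1 first: df1 overwrites df unless the df1 value is null (the rule in the question)
df1_first = {k: coalesce(df1_names[k], df_names[k]) for k in df_names}

print(df_first[815518])   # MA401
print(df1_first[815518])  # MA402
```

If that reading is right, the corresponding change in the PySpark code would be to swap the arguments, i.e. `f.coalesce(df1.product_name, df.product_name)`.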

2

3

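Ramesh Maharjan 7 years ago

If the values are not the string null, then your code is fine. But judging from the df2 dataframe you got, the values in product_name appear to be the string "null" rather than real nulls. You need to check for string null values as well, using the built-in when and isnull functions, like this: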

import pyspark.sql.functions as f
df2 = df.join(df1, df.customer_id == df1.customer_id, 'full_outer')\
    .select(df.customer_id, f.when(f.isnull(df.product_name) | (df.product_name == "null"), df1.product_name).otherwise(df.product_name).alias('product_name'), df.country)
df2.show(truncate=False)

This should give you

+-----------+------------+------------+
|customer_id|product_name|country     |
+-----------+------------+------------+
|7456796    |XE102       |UnitedStates|
|3178864    |WGR614v6    |UnitedStates|
|815518     |MA401       |UnitedStates|
|3138420    |WG111v2     |UK          |
|12870946   |GS748TS     |Poland      |
|21893468   |AGM731F     |UnitedStates|
|null       |AE171       |null        |
+-----------+------------+------------+
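To see why the extra string check can matter, here is a small plain-Python model of the when/otherwise expression above. The function `pick_product_name` and the sample pairs are hypothetical, and the first pair assumes the value in df really is the literal string "null":

```python
def pick_product_name(df_value, df1_value):
    """Plain-Python model of
    when(isnull(x) | (x == "null"), y).otherwise(x)."""
    if df_value is None or df_value == "null":
        return df1_value
    return df_value


# (df.product_name, df1.product_name) pairs; the first pair assumes the
# value in df is the literal string "null" rather than a real null.
pairs = [("null", "GS748TS"), (None, "GS748TS"), ("WG111v2", None), ("MA401", "MA402")]
picked = [pick_product_name(a, b) for a, b in pairs]
print(picked)  # ['GS748TS', 'GS748TS', 'WG111v2', 'MA401']
```

A plain coalesce would keep the string "null" in the first pair, because a string is never a real null. Like the expression above, this model still prefers the value from df, which is why 815518 stays MA401.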

3

1

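AiDev 7 years ago

Since there are some conflicting reports: first, assuming the data frames have the same dimensions, just create a new column in df1 from the column in df2 that you want to use, or join them as needed. Then you can use SQL conditionals.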

from pyspark.sql import functions as F
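# Keep 'column' where the other column is null; otherwise take the other column's value.
df1 = df1.withColumn('column', F.when(df1['other-column-originally-from-df2'].isNull(), df1['column']).otherwise(df1['other-column-originally-from-df2']))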
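The join-then-conditional idea can also be checked end to end without Spark. The sketch below is a plain-Python model only: `full_outer_join` and `merge_row` are hypothetical helpers, rows are plain dictionaries built from the tables in the question, and the rule applied is the one stated in the question (take the df1 value when it exists, otherwise keep the df value).

```python
def full_outer_join(left, right, key):
    """Pair rows with equal keys; a None key never matches (SQL semantics)."""
    pairs, matched = [], set()
    for l_row in left:
        hits = [i for i, r_row in enumerate(right)
                if l_row[key] is not None and l_row[key] == r_row[key]]
        for i in hits:
            matched.add(i)
            pairs.append((l_row, right[i]))
        if not hits:
            pairs.append((l_row, None))
    # Right-side rows that found no partner are kept with an empty left side.
    pairs += [(None, r_row) for i, r_row in enumerate(right) if i not in matched]
    return pairs


def merge_row(l_row, r_row):
    """Take df1's product_name when it is present, otherwise keep df's."""
    l_row = l_row or {}
    r_row = r_row or {}
    new_name = r_row.get("product_name")
    return {
        "customer_id": l_row.get("customer_id", r_row.get("customer_id")),
        "product_name": new_name if new_name is not None else l_row.get("product_name"),
        "country": l_row.get("country"),
    }


# Rows copied from df and df1 in the question.
df_rows = [dict(zip(("customer_id", "product_name", "country"), t)) for t in [
    (12870946, None, "Poland"), (815518, "MA401", "United States"),
    (3138420, "WG111v2", "UK"), (3178864, "WGR614v6", "United States"),
    (7456796, "XE102", "United States"), (21893468, "AGM731F", "United States"),
]]
df1_rows = [dict(zip(("customer_id", "product_name"), t)) for t in [
    (12870946, "GS748TS"), (815518, "MA402"), (3138420, None),
    (3178864, "WGR614v6"), (7456796, "XE102"), (21893468, "AGM731F"),
    (None, "AE171"),
]]

result = [merge_row(l, r) for l, r in full_outer_join(df_rows, df1_rows, "customer_id")]
for row in result:
    print(row["customer_id"], row["product_name"], row["country"])
```

The printed rows line up with the expected result table in the question, including the last row where only df1 contributes a value.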