如何在Pypark中执行此操作?
join
A.join(other=B, on=(A['lkey'] == B['rkey']), how='outer')\
.select(A['lkey'], A['value'].alias('value_x'), B['rkey'], B['value'].alias('value_y'))\
.show(truncate=False)
这应该给你
+----+-------+----+-------+
|lkey|value_x|rkey|value_y|
+----+-------+----+-------+
|bar |2 |bar |6 |
|bar |2 |bar |8 |
|null|null |qux |7 |
|foo |1 |foo |5 |
|foo |4 |foo |5 |
|baz |3 |null|null |
+----+-------+----+-------+
更进一步,如何将lkey和rkey合并到一个列中,补充两边缺少的值?
rename
列和使用
参加
作为
from pyspark.sql.functions import col
A.select(col('lkey').alias('key'), col('value').alias('value_x'))\
.join(other=B.select(col('rkey').alias('key'), col('value').alias('value_y')), on=['key'], how='outer')\
.show(truncate=False)
+---+-------+-------+
|key|value_x|value_y|
+---+-------+-------+
|bar|2 |6 |
|bar|2 |8 |
|qux|null |7 |
|foo|1 |5 |
|foo|4 |5 |
|baz|3 |null |
+---+-------+-------+