
PySpark: getting percentage results after groupBy

  •  2
  •  ZK Zhao  ·  6 years ago

    test = spark.createDataFrame([
        (0, 1, 5, "2018-06-03", "Region A"),
        (1, 1, 2, "2018-06-04", "Region B"),
        (2, 2, 1, "2018-06-03", "Region B"),
        (3, 3, 1, "2018-06-01", "Region A"),
        (3, 1, 3, "2018-06-05", "Region A"),
    ])\
      .toDF("orderid", "customerid", "price", "transactiondate", "location")
    test.show()
    

    I can get summary data like this:

    from pyspark.sql.functions import sum
    test.groupBy("customerid", "location").agg(sum("price")).show()
    

    but I would also like to add a column showing each row's share of the customer's total, like this:

    +----------+--------+----------+----------+
    |customerid|location|sum(price)|percentage|
    +----------+--------+----------+----------+
    |         1|Region B|         2|       20%|
    |         1|Region A|         8|       80%|
    |         3|Region A|         1|      100%|
    |         2|Region B|         1|      100%|
    +----------+--------+----------+----------+
    

    I would like to know:

    • How can I do this? Perhaps with a window function?



    I have only found How to get percentage of counts of a column after groupby in Pandas, which is about Pandas rather than PySpark.

    Update:

    With @Gordon Linoff's help, I can get this result with:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import col, sum

    test.groupBy("customerid", "location").agg(sum("price"))\
      .withColumn("percentage", col("sum(price)")/sum("sum(price)").over(Window.partitionBy(test['customerid']))).show()
    
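    A slightly cleaner variant (a sketch built on the same approach, not from the original post) is to alias the aggregated column, so the window expression can refer to a plain name such as total_price instead of the auto-generated "sum(price)":

    from pyspark.sql.window import Window
    from pyspark.sql.functions import col, sum

    # Window over all rows of the same customer, used to compute each
    # location's share of that customer's total.
    by_customer = Window.partitionBy("customerid")

    test.groupBy("customerid", "location")\
        .agg(sum("price").alias("total_price"))\
        .withColumn("percentage",
                    col("total_price") / sum("total_price").over(by_customer))\
        .show()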
    1 Answer  |  6 years ago
  •  1
  •   Gordon Linoff    6 years ago

    This answers the original version of the question.

    In SQL, you can use window functions:

    select customerid, location, sum(price),
           sum(price) / sum(sum(price)) over (partition by customerid) as ratio
    from t
    group by customerid, location;
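
    For completeness, the same SQL can be run from PySpark by registering the question's DataFrame as a temporary view; this is a minimal sketch, with the view name "t" chosen to match the query above:

    # Register the DataFrame from the question under the name used in the SQL.
    test.createOrReplaceTempView("t")

    spark.sql("""
        select customerid, location, sum(price) as total_price,
               sum(price) / sum(sum(price)) over (partition by customerid) as ratio
        from t
        group by customerid, location
    """).show()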