
PySpark: getting percentage results after groupBy

  •  2
  •  ZK Zhao  ·  6 years ago

    test = spark.createDataFrame([
        (0, 1, 5, "2018-06-03", "Region A"),
        (1, 1, 2, "2018-06-04", "Region B"),
        (2, 2, 1, "2018-06-03", "Region B"),
        (3, 3, 1, "2018-06-01", "Region A"),
        (3, 1, 3, "2018-06-05", "Region A"),
    ])\
      .toDF("orderid", "customerid", "price", "transactiondate", "location")
    test.show()
    

    I can get summary data like this:

    from pyspark.sql.functions import sum
    test.groupBy("customerid", "location").agg(sum("price")).show()
    

    but I would also like to add a column showing each row's share of the customer's total, like this:

    +----------+--------+----------+----------+
    |customerid|location|sum(price)|percentage|
    +----------+--------+----------+----------+
    |         1|Region B|         2|       20%|
    |         1|Region A|         8|       80%|
    |         3|Region A|         1|      100%|
    |         2|Region B|         1|      100%|
    +----------+--------+----------+----------+
    

    I would like to know:

    • How can I do this? Perhaps with a window function?



    I have only found How to get percentage of counts of a column after groupby in Pandas, which is about Pandas rather than PySpark.

    Update:

    With @Gordon Linoff's help, I can get this result with:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import col, sum

    test.groupBy("customerid", "location").agg(sum("price"))\
      .withColumn("percentage", col("sum(price)")/sum("sum(price)").over(Window.partitionBy(test['customerid']))).show()
    
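    A slightly cleaner variant (a sketch built on the same approach, not from the original post) is to alias the aggregated column, so the window expression can refer to a plain name such as total_price instead of the auto-generated "sum(price)":

    from pyspark.sql.window import Window
    from pyspark.sql.functions import col, sum

    # Window over all rows of the same customer, used to compute each
    # location's share of that customer's total.
    by_customer = Window.partitionBy("customerid")

    test.groupBy("customerid", "location")\
        .agg(sum("price").alias("total_price"))\
        .withColumn("percentage",
                    col("total_price") / sum("total_price").over(by_customer))\
        .show()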
    1 Answer  |  6 years ago
  •  1
  •   Gordon Linoff    6 years ago

    This answers the original version of the question.

    In SQL, you can use window functions:

    select customerid, location, sum(price),
           sum(price) / sum(sum(price)) over (partition by customerid) as ratio
    from t
    group by customerid, location;
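
    For completeness, the same SQL can be run from PySpark by registering the question's DataFrame as a temporary view; this is a minimal sketch, with the view name "t" chosen to match the query above:

    # Register the DataFrame from the question under the name used in the SQL.
    test.createOrReplaceTempView("t")

    spark.sql("""
        select customerid, location, sum(price) as total_price,
               sum(price) / sum(sum(price)) over (partition by customerid) as ratio
        from t
        group by customerid, location
    """).show()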