代码之家  ›  专栏  ›  技术社区  ›  Simon Breton

如何用大查询按7天移动平均线阈值分组

  •  0
  • Simon Breton  · 技术社区  · 7 年前

    以下是我拥有的数据表示例:

    Account     Date        Items
    Account1    2018-01-01  249
    Account1    2018-01-02  298
    Account1    2018-01-03  297
    Account1    2018-01-04  263
    Account2    2018-02-05  251
    Account2    2018-02-12  249
    Account2    2018-02-13  298
    Account2    2018-03-14  297
    Account2    2018-05-23  263
    Account2    2018-05-24  251
    etc...
    

    我有100多个账户的数据在过去的10年中,每个帐户每天都有可用的项目数。

    使用以下查询,我有一个7天移动平均值列表,分类在我用case函数创建的bucket中:

    SELECT Account,
    CASE
        WHEN max(mov_avg_7d) <1000 THEN "less than 1k"
        WHEN max(mov_avg_7d) >=1000 AND max(mov_avg_7d) <2500 THEN "1k"
        WHEN max(mov_avg_7d) >=2500 AND max(mov_avg_7d) <5000 THEN "2.5K"
        WHEN max(mov_avg_7d) >=5000 AND max(mov_avg_7d) <10000 THEN "5k"
        WHEN max(mov_avg_7d) >=10000 AND max(mov_avg_7d) <20000 THEN "10k"
        WHEN max(mov_avg_7d) >=20000 AND max(mov_avg_7d) <30000 THEN "20k"
        WHEN max(mov_avg_7d) >=30000 AND max(mov_avg_7d) <40000 THEN "30k"
        WHEN max(mov_avg_7d) >=40000 AND max(mov_avg_7d) <50000 THEN "40k"
        WHEN max(mov_avg_7d) >=50000 AND max(mov_avg_7d) <60000 THEN "50k"
        WHEN max(mov_avg_7d) >=60000 AND max(mov_avg_7d) <70000 THEN "60k"
        WHEN max(mov_avg_7d) >=70000 AND max(mov_avg_7d) <90000 THEN "70k"
        WHEN max(mov_avg_7d) >=90000 AND max(mov_avg_7d) <100000 THEN "90k"
        WHEN max(mov_avg_7d) >=100000 THEN "100k"
        ELSE "error"
    END AS status
    FROM (
      SELECT Account,date,Items,
      AVG(Items) OVER (PARTITION BY Account ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg_7d
      FROM `my_big_table_here` 
    )
    group by Account
    

    我想要每个桶和每个帐户的最大值(mov_avg_7d)。换句话说,我希望构建相同的查询,但是 group by Account,status

    我不太愿意解释这些。我希望我的问题是清楚的。

    编辑:

    我的输出是这样的:

    Account     Status        Items_max
    Account1    less than 1k  249
    Account1    1k            1500
    Account1    2.5K          2400
    Account2    less than 1k  133
    Account2    30k           25000
    Account2    50k           49000
    etc...
    
    1 回复  |  直到 7 年前
        1
  •  2
  •   Cedric    7 年前

    我假设你想得到以下结果:

    Account      Status             Max
    Account1     less than 1k       999
    Account1     1k                 2499
    ...
    Account2     less than 1k       888
    Account2     1k                 2488
    ...
    

    另外,我对BigQuery和SQL还不熟悉,所以如果查询不起作用或者有其他方法可以解决这个问题,请纠正我。

    我的代码是:

    CREATE TEMP FUNCTION STATUS(mov_avg_7d FLOAT64) AS (
      CASE 
        WHEN mov_avg_7d <1000 THEN "less than 1k"
        WHEN mov_avg_7d >=1000 AND mov_avg_7d <2500 THEN "1k"
        WHEN mov_avg_7d >=2500 AND mov_avg_7d <5000 THEN "2.5K"
        WHEN mov_avg_7d >=5000 AND mov_avg_7d <10000 THEN "5k"
        WHEN mov_avg_7d >=10000 AND mov_avg_7d <20000 THEN "10k"
        WHEN mov_avg_7d >=20000 AND mov_avg_7d <30000 THEN "20k"
        WHEN mov_avg_7d >=30000 AND mov_avg_7d <40000 THEN "30k"
        WHEN mov_avg_7d >=40000 AND mov_avg_7d <50000 THEN "40k"
        WHEN mov_avg_7d >=50000 AND mov_avg_7d <60000 THEN "50k"
        WHEN mov_avg_7d >=60000 AND mov_avg_7d <70000 THEN "60k"
        WHEN mov_avg_7d >=70000 AND mov_avg_7d <90000 THEN "70k"
        WHEN mov_avg_7d >=90000 AND mov_avg_7d <100000 THEN "90k"
        WHEN mov_avg_7d >=100000 THEN "100k"
        ELSE "error"
      END);
    SELECT Account, STATUS(mov_avg_7d) as status, max(mov_avg_7d) as max
    FROM (
      SELECT Account,date,Items,
      AVG(Items) OVER (PARTITION BY Account ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg_7d
      FROM `my_big_table_here` 
    )
    group by Account, status
    

    希望这有帮助!


    说明:

    内部子查询:

    SELECT Account,date,Items,
          AVG(Items) OVER (PARTITION BY Account ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg_7d
          FROM `my_big_table_here`
    

    提供以下结果(内部结果表):

    Account     Date     Items  mov_avg_7d
    Account1  01012018    xx       xxx
    Account1  02012018    xx       xxx
    Account1  03012018    xx       xxx
    ... 
    Account2  01012018    xx       xxx
    Account2  02012018    xx       xxx
    Account2  03012018    xx       xxx
    ...
    

    至于外部子查询:

    SELECT Account, STATUS(mov_avg_7d) as status, max(mov_avg_7d) as max
    FROM (
      <inner subquery> 
    )
    group by Account, status
    

    status()是一个用户定义的函数,它为内部结果表的每一行获取mov_avg_7d的值,并将其放入相应的类别中。

    让我们分离select语句,使事情更简单。例1:

    SELECT Account, max(mov_avg_7d) as max
    ...
    GROUP BY Account
    

    给出了7天的最大滚动平均值。 解释 ,无论状态如何即:

    Account       Max
    Account1      99199
    Account2      82849
    ...
    

    那么,例2:

    SELECT STATUS(mov_avg_7d) as status, max(mov_avg_7d) as max
    ...
    GROUP BY status
    

    给出了7天的最大滚动平均值。 地位 ,与帐户无关。即:

      Status         Max
    less than 1k     899
        1k           2488
       2.5k          4500
        ...
    

    因此,如果您想获得每个帐户的最大7天滚动平均值,每个状态,您只需通过帐户和状态来选择和分组。

    我希望这个解释能有所帮助,我不知道该如何解释才能让事情更清楚,因为我对SQL还很陌生如果您有任何疑问,请随时发表评论,我们可以共同努力!:)