代码之家 › 专栏 › 技术社区 › Merlin

红移:计算窗口分区上的不同客户数

amazon-redshift sql

Merlin · 技术社区 · 6 年前

红移不支持 DISTINCT 在其窗口函数中聚合。 AWS documentation for COUNT states this 和 distinct 不支持任何窗口函数。

我的用例:在不同的时间间隔和流量通道上统计客户

我希望每月和年初至今 唯一的 本年度的客户数量,也按流量渠道以及所有渠道的总数量划分。由于客户可以访问多次,我只需要统计不同的客户,因此红移窗口聚合不会有帮助。

我可以使用 count(distinct customer_id)...group by ,但这将只给出所需四个结果中的一个。
我不要想养成对堆在一堆数据之间的每个所需计数运行完整查询的习惯吗 union all . 我希望这不是唯一的解决办法。

这就是我在postgres(或Oracle)中所写的内容:

select order_month
       , traffic_channel
       , count(distinct customer_id) over(partition by order_month, traffic_channel) as customers_by_channel_and_month
       , count(distinct customer_id) over(partition by traffic_channel) as ytd_customers_by_channel
       , count(distinct customer_id) over(partition by order_month) as monthly_customers_all_channels
       , count(distinct customer_id) over() as ytd_total_customers

from orders_traffic_channels
/* otc is a table of dated transactions of customers, channels, and month of order */

where to_char(order_month, 'YYYY') = '2017'

我如何在红移中解决这个问题?

结果需要在红移集群上工作,此外,这是一个简化的问题,实际期望的结果具有产品类别和客户类型,这将乘以所需的分区数量。因此 联合所有 汇总不是一个好的解决方案。

3 回复 | 直到 6 年前

Merlin 2 年前

A. blog post from 2016 调用这个问题并提供了一个基本的解决方法,所以感谢Mark D.Adams。奇怪的是,我在网上几乎找不到什么,所以我分享了我的(测试过的)解决方案。

关键是 dense_rank() ,按问题项排序,为相同项提供相同的秩,因此最高秩也是唯一项的计数。如果您尝试为我想要的每个分区交换以下内容,这将是一个可怕的混乱:

dense_rank() over(partition by order_month, traffic_channel order by customer_id)

由于您需要最高秩,因此必须子查询所有内容,并从每个排名中选择最大值。 将外部查询中的分区与子查询中的相应分区相匹配很重要。

/* multigrain windowed distinct count, additional grains are one dense_rank and one max over() */
select distinct
       order_month
       , traffic_channel
       , max(tc_mth_rnk) over(partition by order_month, traffic_channel) customers_by_channel_and_month
       , max(tc_rnk) over(partition by traffic_channel)  ytd_customers_by_channel
       , max(mth_rnk) over(partition by order_month)  monthly_customers_all_channels
       , max(cust_rnk) over()  ytd_total_customers

from (
       select order_month
              , traffic_channel
              , dense_rank() over(partition by order_month, traffic_channel order by customer_id)  tc_mth_rnk
              , dense_rank() over(partition by traffic_channel order by customer_id)  tc_rnk
              , dense_rank() over(partition by order_month order by customer_id)  mth_rnk
              , dense_rank() over(order by customer_id)  cust_rnk

       from orders_traffic_channels

       where to_char(order_month, 'YYYY') = '2017'
     )

order by order_month, traffic_channel
;

笔记

分区 max() 和 稠密\u秩() 必须匹配
稠密\u秩() 将为空值排序(所有值都处于相同的排序,最大值)。如果你不想数数 null 你需要的价值观 case when customer_id is not null then dense_rank() ...etc... ,或者可以从如果你知道有空值。

2022年更新

红移中分区上的不同计数仍未实现。

我的结论是,如果您在将其纳入生产管道时注意以下几点,那么这种变通方法是合理的:

它创建了大量可能会影响可读性和维护的代码。
将分组计数过程隔离到一个转换阶段,而不是将其与同一查询中的其他逻辑概念混合。
使用子查询和非分区组 count(distinct ..) 获取每个不同的计数更为混乱,可读性更低。

然而,更好的方法是使用支持分组汇总(如Spark或Pandas)的数据帧语言。按组列出的Spark汇总是紧凑且可读的,权衡的是将另一个执行环境和语言引入到您的流中。

albielin 4 年前

虽然红移在其窗口函数中不支持不同的聚合,但它确实有一个 listaggdistinct 作用所以你可以这样做:

regexp_count(
   listaggdistinct(customer_id, ',') over (partition by field2), 
   ','
) + 1

当然,如果你有 , 在customer\u id字符串中自然出现,您必须找到一个安全的分隔符。

Ilja 2 年前

另一种方法是使用

在第一选择中:

row_number() over (partition by customer_id,order_month,traffic_channel) as row_n_month_channel

在下一个选择中

sum(case when row_n_month_channel=1 then 1 else 0 end)