代码之家  ›  专栏  ›  技术社区  ›  Shekhar

基于一列的配置单元自连接

  •  0
  • Shekhar  · 技术社区  · 8 年前

    我在Hive中有一个表,其中的数据来自SAP系统。此表包含以下列和数据:

    +======================================================================+
    |document_number | year | cost_centre | vendor_account_number | amount | 
    +----------------------------------------------------------------------+
    |       1        | 2016 |     XZ10    |                       |  123.5 |
    +----------------------------------------------------------------------+
    |       1        | 2016 |     XZ10    |      1234567890       |  25.96 |
    +----------------------------------------------------------------------+
    |       1        | 2016 |     XZ10    |                       |  586   |
    +----------------------------------------------------------------------+
    

    vendor_account_number 列只存在于一行中,我想将其带到所有其他行中。

    +======================================================================+
    |document_number | year | cost_centre | vendor_account_number | amount | 
    +----------------------------------------------------------------------+
    |       1        | 2016 |     XZ10    |      1234567890       |  123.5 |
    +----------------------------------------------------------------------+
    |       1        | 2016 |     XZ10    |      1234567890       |  25.96 |
    +----------------------------------------------------------------------+
    |       1        | 2016 |     XZ10    |      1234567890       |  586   |
    +----------------------------------------------------------------------+
    

    为了实现这一点,我在Hive中编写了以下CTE

    with non_blank_account_no as(
      select document_number, vendor_account_number
      from my_table
      where vendor_account_number != ''
    )
    

    然后进行自左外连接,如下所示:

    select 
        a.document_number, a.year, 
        a.cost_centre, a.amount,
        b.vendor_account_number
    from my_table a
    left outer join non_blank_account_no b on a.document_number = b.document_number
    where a.document_number = ' '
    

    但我得到了如下所示的重复输出

    +======================================================================+
    |document_number | year | cost_centre | vendor_account_number | amount | 
    +----------------------------------------------------------------------+
    |       1        | 2016 |     XZ10    |      1234567890       |  123.5 |
    +----------------------------------------------------------------------+
    |       1        | 2016 |     XZ10    |      1234567890       |  25.96 |
    +----------------------------------------------------------------------+
    |       1        | 2016 |     XZ10    |      1234567890       |  586   |
    +----------------------------------------------------------------------+
    |       1        | 2016 |     XZ10    |      1234567890       |  123.5 |
    +----------------------------------------------------------------------+
    |       1        | 2016 |     XZ10    |      1234567890       |  25.96 |
    +----------------------------------------------------------------------+
    |       1        | 2016 |     XZ10    |      1234567890       |  586   |
    +----------------------------------------------------------------------+
    

    谁能帮我理解我的蜂巢查询出了什么问题?

    1 回复  |  直到 8 年前
        1
  •  1
  •   David דודו Markovitz    8 年前

    在许多用例中,自连接可以被windows函数替换

    select  document_number
           ,year
           ,cost_center
    
           ,max (case when vendor_account_number <> '' then vendor_account_number end) over 
            (
                partition by    document_number
            )                                       as vendor_account_number
    
           ,amount
    
    from    my_table    
    
    推荐文章