代码之家  ›  专栏  ›  技术社区  ›  te time

按组获取完整的行集

  •  0
  • te time  · 技术社区  · 2 年前

    编辑将扩大范围以包括R标记

    你好

    我有一张按id和月份排序的表格,与下面的表格相似,只是它有数百万 id's 和数百 var month 列可以是2023年起的任何一个月。我需要填写每个id缺少的月份,并添加一列,指示该id是否在特定月份处于活动状态。假设 id 如果表中没有特定月份的记录,则处于非活动状态。所需表中的总行数为12*个不同的id。我还为示例输入包含了我想要的输出。对于R解决方案,我更喜欢 tidyverse 解决方案但是 data.table 选项也会起作用,因为它会更快,我只是可能不太擅长修改它。任何帮助都将不胜感激。

    这是R的dputs

    input =structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 3L, 3L), month = c(202301L, 202304L, 202305L, 
    202301L, 202302L, 202303L, 202304L, 202305L, 202306L, 202307L, 
    202308L, 202309L, 202310L, 202311L, 202312L, 202307L, 202308L
    ), var1 = c(1L, 2L, 3L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 
    14L, 15L, 16L, 17L, 18L), var2 = 1:17), class = "data.frame", row.names = c(NA, 
    -17L))
    
    output = structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), month = c(202301L, 202302L, 
    202303L, 202304L, 202305L, 202306L, 202307L, 202308L, 202309L, 
    202310L, 202311L, 202312L, 202301L, 202302L, 202303L, 202304L, 
    202305L, 202306L, 202307L, 202308L, 202309L, 202310L, 202311L, 
    202312L, 202301L, 202302L, 202303L, 202304L, 202305L, 202306L, 
    202307L, 202308L, 202309L, 202310L, 202311L, 202312L), var1 = c(1L, 
    NA, NA, 2L, 3L, NA, NA, NA, NA, NA, NA, NA, 5L, 6L, 7L, 8L, 9L, 
    10L, 11L, 12L, 13L, 14L, 15L, 16L, NA, NA, NA, NA, NA, NA, 17L, 
    18L, NA, NA, NA, NA), var2 = c(1L, NA, NA, 2L, 3L, NA, NA, NA, 
    NA, NA, NA, NA, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 
    15L, NA, NA, NA, NA, NA, NA, 16L, 17L, NA, NA, NA, NA), active = c(1L, 
    0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 
    0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -36L))
    

    样本输入:

    +----+--------+------+------+
    | id | month  | var1 | var2 |
    +----+--------+------+------+
    |  1 | 202301 |    1 |    1 |
    |  1 | 202304 |    2 |    2 |
    |  1 | 202305 |    3 |    3 |
    |  2 | 202301 |    5 |    4 |
    |  2 | 202302 |    6 |    5 |
    |  2 | 202303 |    7 |    6 |
    |  2 | 202304 |    8 |    7 |
    |  2 | 202305 |    9 |    8 |
    |  2 | 202306 |   10 |    9 |
    |  2 | 202307 |   11 |   10 |
    |  2 | 202308 |   12 |   11 |
    |  2 | 202309 |   13 |   12 |
    |  2 | 202310 |   14 |   13 |
    |  2 | 202311 |   15 |   14 |
    |  2 | 202312 |   16 |   15 |
    |  3 | 202307 |   17 |   16 |
    |  3 | 202308 |   18 |   17 |
    +----+--------+------+------+
    

    所需输出:

    +----+--------+------+------+--------+
    | id | month  | var1 | var2 | active |
    +----+--------+------+------+--------+
    |  1 | 202301 |    1 |    1 |      1 |
    |  1 | 202302 |      |      |      0 |
    |  1 | 202303 |      |      |      0 |
    |  1 | 202304 |    2 |    2 |      1 |
    |  1 | 202305 |    3 |    3 |      1 |
    |  1 | 202306 |      |      |      0 |
    |  1 | 202307 |      |      |      0 |
    |  1 | 202308 |      |      |      0 |
    |  1 | 202309 |      |      |      0 |
    |  1 | 202310 |      |      |      0 |
    |  1 | 202311 |      |      |      0 |
    |  1 | 202312 |      |      |      0 |
    |  2 | 202301 |    5 |    4 |      1 |
    |  2 | 202302 |    6 |    5 |      1 |
    |  2 | 202303 |    7 |    6 |      1 |
    |  2 | 202304 |    8 |    7 |      1 |
    |  2 | 202305 |    9 |    8 |      1 |
    |  2 | 202306 |   10 |    9 |      1 |
    |  2 | 202307 |   11 |   10 |      1 |
    |  2 | 202308 |   12 |   11 |      1 |
    |  2 | 202309 |   13 |   12 |      1 |
    |  2 | 202310 |   14 |   13 |      1 |
    |  2 | 202311 |   15 |   14 |      1 |
    |  2 | 202312 |   16 |   15 |      1 |
    |  3 | 202301 |      |      |      0 |
    |  3 | 202302 |      |      |      0 |
    |  3 | 202303 |      |      |      0 |
    |  3 | 202304 |      |      |      0 |
    |  3 | 202305 |      |      |      0 |
    |  3 | 202306 |      |      |      0 |
    |  3 | 202307 |   17 |   16 |      1 |
    |  3 | 202308 |   18 |   17 |      1 |
    |  3 | 202309 |      |      |      0 |
    |  3 | 202310 |      |      |      0 |
    |  3 | 202311 |      |      |      0 |
    |  3 | 202312 |      |      |      0 |
    +----+--------+------+------+--------+
    
    0 回复  |  直到 2 年前
        1
  •  2
  •   shawnt00    2 年前
    with ids as (
        select distinct id from T
    ), months as (
        -- assuming there's at least one active id every month
        select distinct month from T
        -- where month between '202301' and '202312'
    )
    select i.id, m.month, t.var1, t.var2,
        case when t.id is null then 1 else 0 end as active
    from ids i cross join months m left outer join T t
        on t.id = i.id and t.month = m.month;
    

    老实说,我对Hive并没有什么特别的了解,但这是“加密”你的结果的一般想法。缺少行的输出将为null。

        2
  •  -1
  •   xynoDigi    2 年前

    要将缺失的月份添加到表中,可以将INSERT语句与生成缺失月份的SELECT语句一起使用。以下是将缺失的月份插入DOCTOR_CHECKUP表的示例:

    INSERT INTO DOCTOR_CHECKUP (CHECKED_DATE)
    SELECT TO_CHAR(ADD_MONTHS(DATE '2016-01-01', LEVEL - 1), 'MM-YYYY') MTH
    FROM DUAL
    CONNECT BY LEVEL <= 24
    MINUS
    SELECT TO_CHAR(TO_DATE(CHECKED_DATE, 'DD-MON-YY'), 'MM-YYYY')
    FROM DOCTOR_CHECKUP;
    

    此查询将丢失的月份插入DOCTOR_CHECKUP表的CHECKED_DATE列。您可能需要调整列名和日期格式以匹配您的表结构。

    可以使用子查询从同一表或不同表中的另一列中选择数据,并将其包含在INSERT语句中。以下是一个示例:

    INSERT INTO DOCTOR_CHECKUP (CHECKED_DATE, ANOTHER_COLUMN)
    SELECT TO_CHAR(ADD_MONTHS(DATE '2016-01-01', LEVEL - 1), 'MM-YYYY') MTH, ANOTHER_COLUMN
    FROM ANOTHER_TABLE
    CONNECT BY LEVEL <= 24
    MINUS
    SELECT TO_CHAR(TO_DATE(CHECKED_DATE, 'DD-MON-YY'), 'MM-YYYY'), ANOTHER_COLUMN
    FROM DOCTOR_CHECKUP;
    

    在本例中,来自ANOTHER_TABLE表中ANOTHER_COLUMN列的数据包含在INSERT语句中。可以将ANOTHER_TABLE替换为包含要包含列的表的名称,将ANOTHER_column替换为要包含的列的名称。