
How to define a run-length sequence with an additional column condition

  •  0
  • jakes  · Tech Community  · 7 years ago

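    My problem is similar to this thread. As there, I need to define a run-length-type ID over the group column (ignoring NA), but with an additional condition on another column: seq_break marks where a sequence should break. The catch is that a row with seq_break = TRUE should not start a new sequence; it should be included as the last event of the previous sequence. Sample data is attached below. The difference can be seen in row 46, which would otherwise fall into sequence 13, whereas I need it included in sequence 12.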

    df <- structure(list(group = c(NA, NA, "home", "home", "home", "home", 
    "home", "home", "away", NA, NA, "home", "home", "home", NA, NA, 
    NA, "home", "away", "away", NA, "away", "away", "away", "home", 
    "away", "away", "away", NA, "home", "home", NA, NA, "away", NA, 
    NA, "home", NA, NA, "home", "home", "home", "home", "home", "home", 
    "home", "away", "away", NA, NA), seq_break = c(FALSE, FALSE, 
    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, 
    FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, 
    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, 
    FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, 
    FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, 
    TRUE), expected_output = c(NA, NA, 1, 1, 1, 1, 1, 1, 2, NA, NA, 
    3, 3, 3, NA, NA, NA, 4, 5, 5, NA, 6, 6, 6, 7, 8, 8, 8, NA, 9, 
    9, NA, NA, 10, NA, NA, 11, NA, NA, 12, 12, 12, 12, 12, 12, 12, 
    13, 13, NA, NA)), .Names = c("group", "seq_break", "expected_output"
    ), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
    -50L))
    

    Do you have any idea how to do this, ideally with tidyverse? I suppose something based on cumsum could work here...

    2 answers  |  as of 7 years ago
        1
  •  0
  •   www    7 years ago

    We can create a new column called seq_break2 and add it to the pipeline as follows. This produces a result identical to the expected output.

    library(tidyverse)
    library(data.table)
    
    df2 <- df %>% 
      select(-expected_output) %>%
      rowid_to_column() 
    
    df3 <- df2 %>%
      mutate(seq_break2 = ifelse(seq_break & !is.na(group), FALSE, seq_break)) %>%
      mutate(ID = rleid(group, seq_break2)) %>%
      group_by(group, seq_break2, ID) %>%
      filter(!(is.na(group) & seq_break2 & row_number() > 1)) %>%
      ungroup() %>%
      mutate(ID2 = cumsum(seq_break2)) %>%
      drop_na(group) %>%
      mutate(expected_output = rleid(group, ID2)) %>%
      select(rowid, expected_output) %>%
      left_join(df2, ., by = "rowid") %>%
      select(-rowid)
    
        2
  •  1
  •   Frank    7 years ago

    Using rleid and shift from data.table...

    library(data.table)
    setDT(df)
    
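    # make groups: lag the break counter so a seq_break = TRUE row closes the
    # current run instead of opening a new one (fill = 0L avoids a leading NA)
    df[, v := rleid(group, shift(cumsum(seq_break), fill = 0L))]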
    
    # drop if group is NA
    df[is.na(group), v := NA]
    
    # renumber the others
    df[!is.na(group), v := .GRP, by=v]
    
    # check
    stopifnot( df[, all.equal(v, expected_output)] )
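
To see why the break counter is shifted before it is passed to rleid, here is a minimal base-R sketch on toy data. The vectors and the `run_id` helper are illustrative assumptions, not part of the answer; `run_id` is only a stand-in for `rleid`, and the manual lag stands in for `shift`.

```r
# Toy input (hypothetical): a "home" run with a break flag on its 3rd row.
group     <- c("home", "home", "home", "home", "away")
seq_break <- c(FALSE, FALSE, TRUE, FALSE, FALSE)

# Unshifted counter: it increments ON the flagged row,
# so the flagged row would open a new run.
counter_raw <- cumsum(seq_break)

# Lagged counter: it increments on the row AFTER the flag,
# so the flagged row stays the last member of the current run.
counter_lag <- c(0L, head(counter_raw, -1))

# Stand-in for rleid: start a new id whenever the key differs from the previous row.
# (No NA handling here; the toy data has no NA.)
run_id <- function(key) cumsum(c(TRUE, key[-1] != key[-length(key)]))

id_raw <- run_id(paste(group, counter_raw))  # 1 1 2 2 3: flagged row starts run 2
id_lag <- run_id(paste(group, counter_lag))  # 1 1 1 2 3: flagged row ends run 1
```

With the lagged counter, the flagged third row keeps id 1 and the new run starts on the fourth row, which is the behaviour the question asks for in row 46.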
    

    The seq_break column is actually irrelevant in the example, so I'm not sure whether I'm using it correctly:

    df[, v2 := rleid(group)][is.na(group), v2 := NA][!is.na(group), v2 := .GRP, by=v2]
    
    # check
    stopifnot( df[, all.equal(v2, expected_output)] )
    

    library(dplyr)
    res = df  %>% mutate(
      v2 = data.table::rleid(group) %>% replace(is.na(group), NA),
      v2 = match(v2, na.omit(unique(v2)))
    ) 
    
    # check
    stopifnot( with(res, all.equal(v2, expected_output)) )
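
Since the question asks for a tidyverse approach and the dplyr version above ignores seq_break, the following is a hedged sketch of a dplyr-only variant that does use it. It assumes dplyr 1.1.0 or later, where `consecutive_id()` is available as a counterpart to `rleid`; that function, the toy data frame, and the helper columns `brk` and `run` are not part of either answer.

```r
library(dplyr)

# Toy data (hypothetical): row 3 carries a break flag in the middle of a "home" run.
toy <- data.frame(
  group     = c(NA, "home", "home", "home", "away", NA, "away"),
  seq_break = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

res <- toy %>%
  mutate(
    # Lagged break counter: a flagged row still belongs to the run it closes.
    brk = lag(cumsum(seq_break), default = 0L),
    # Run id over (group, brk); requires dplyr >= 1.1.0.
    run = consecutive_id(group, brk),
    # Rows without a group get no id.
    run = replace(run, is.na(group), NA),
    # Renumber the remaining runs as 1, 2, 3, ... in order of appearance.
    id = match(run, unique(na.omit(run)))
  )

res$id
```

On the toy data the flagged row 3 stays in the first run and row 4 opens the second one. Whether this matches expected_output on the full df has not been checked here.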