
Adding a conditional group identifier using aggregate functions

  • Nir Regev  ·  9 years ago

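    I have a data frame that contains subsequences (groups of rows). The condition for identifying these subsequences is a surge in the diff column. This is what the data looks like: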

    > dput(test)
    structure(list(vid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
        .Label = "2a38ebc2-dd97-43c8-9726-59c247854df5", class = "factor"), 
        events = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 
        2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L), .Label = c("click", 
        "mousedown", "mousemove", "mouseup"), class = "factor"), 
        deltas = structure(6:25, .Label = c("154875", "154878", "154880", 
        "155866", "155870", "38479", "38488", "38492", "38775", "45595", 
        "45602", "45606", "45987", "50280", "50285", "50288", "50646", 
        "54995", "55001", "55005", "55317", "59528", "59533", "59537", 
        "59921", "63392", "63403", "63408", "63822", "66706", "66710", 
        "66716", "67002", "73750", "73755", "73759", "74158", "77999", 
        "78003", "78006", "78076", "81360", "81367", "81371", "82381", 
        "93365", "93370", "93374", "93872"), class = "factor"), 
        serial = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
        19, 20), diff = c(0, 9, 4, 283, 6820, 7, 4, 381, 4293, 5, 3, 358, 4349, 6, 4,
        312, 4211, 5, 4, 384)), 
        .Names = c("vid", "events", "deltas", "serial", "diff"),
        row.names = c(NA, 20L), class = "data.frame")
    

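    I am trying to add a column that marks where a new subsequence begins and assigns a unique id to the whole subsequence. I will demonstrate the grouping criterion with the following example:
    The diff value at row 5 is 6820, which is more than 10 times the largest value before that row (283). The result should be a df like this: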

    structure(list(vid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
        .Label = "2a38ebc2-dd97-43c8-9726-59c247854df5", class = "factor"), 
        events = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 
        2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L), .Label = c("click", 
        "mousedown", "mousemove", "mouseup"), class = "factor"), 
        deltas = structure(6:25, .Label = c("154875", "154878", "154880", 
        "155866", "155870", "38479", "38488", "38492", "38775", "45595", 
        "45602", "45606", "45987", "50280", "50285", "50288", "50646", 
        "54995", "55001", "55005", "55317", "59528", "59533", "59537", 
        "59921", "63392", "63403", "63408", "63822", "66706", "66710", 
        "66716", "67002", "73750", "73755", "73759", "74158", "77999", 
        "78003", "78006", "78076", "81360", "81367", "81371", "82381", 
        "93365", "93370", "93374", "93872"), class = "factor"), serial = c(1, 
        2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
        19, 20), 
        diff = c(0, 9, 4, 283, 6820, 7, 4, 381, 4293, 5, 
        3, 358, 4349, 6, 4, 312, 4211, 5, 4, 384), 
        group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5)), 
        .Names =  c("vid", "events", "deltas", "serial", "diff", "group"), 
        row.names = c(NA, 20L), class = "data.frame")
    

    Any help is much appreciated.

    2 Answers
        1
  •   Gopala    9 years ago

    Let me explain in more detail how and why this works.

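    First, let's add a tag column without the cumsum part (df here is the data frame from the question, where it is called test):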

    df$tag <- df$diff > 500
    head(df)
                                       vid    events deltas serial diff   tag
    1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  38479      1    0 FALSE
    2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  38488      2    9 FALSE
    3 2a38ebc2-dd97-43c8-9726-59c247854df5   mouseup  38492      3    4 FALSE
    4 2a38ebc2-dd97-43c8-9726-59c247854df5     click  38775      4  283 FALSE
    5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  45595      5 6820  TRUE
    6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  45602      6    7 FALSE
    

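    As you can see, this simply creates a logical TRUE/FALSE tag column that indicates whether the difference is "large enough" (based on the chosen threshold, 500 here).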

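    Now, when you apply cumsum to that column and store the result in the group column, it keeps a running total. Each TRUE value increases the cumulative sum by 1, and each FALSE value leaves it the same as it was before that row.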
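
To make the running-total behaviour concrete, here is a minimal sketch on the first eight diff values from the question. The variable names d, tag and running are only for illustration and do not appear in the original answer.

```r
# First eight diff values from the question's data
d <- c(0, 9, 4, 283, 6820, 7, 4, 381)

# Logical flag: TRUE only where the jump exceeds the chosen threshold
tag <- d > 500

# cumsum() treats TRUE as 1 and FALSE as 0, so the total steps up
# by one at each TRUE and stays flat on every FALSE
running <- cumsum(tag)
print(running)
```

Only the fifth value exceeds 500, so the running total is 0 for the first four rows and 1 from the fifth row onward.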

    So this gives you the desired incrementing group values:

    df$group <- cumsum(df$tag)
    head(df)
                                       vid    events deltas serial diff   tag group
    1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  38479      1    0 FALSE     0
    2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  38488      2    9 FALSE     0
    3 2a38ebc2-dd97-43c8-9726-59c247854df5   mouseup  38492      3    4 FALSE     0
    4 2a38ebc2-dd97-43c8-9726-59c247854df5     click  38775      4  283 FALSE     0
    5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  45595      5 6820  TRUE     1
    6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  45602      6    7 FALSE     1
    

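    Note that the group values start at zero, because the cumulative sum of the first few FALSE values is zero. However, you probably want the group identifiers to start at 1. That is why I added 1 to the cumsum, but you can also do it as an extra step, as follows.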

    df$group <- df$group + 1
    head(df)
                                       vid    events deltas serial diff   tag group
    1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  38479      1    0 FALSE     1
    2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  38488      2    9 FALSE     1
    3 2a38ebc2-dd97-43c8-9726-59c247854df5   mouseup  38492      3    4 FALSE     1
    4 2a38ebc2-dd97-43c8-9726-59c247854df5     click  38775      4  283 FALSE     1
    5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  45595      5 6820  TRUE     2
    6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  45602      6    7 FALSE     2
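
As a quick sanity check, the steps can be collapsed into one expression and compared with the group column the question asks for. This is only a sketch: it rebuilds a reduced data frame from the question's serial and diff values, and the names diff_vals and expected are assumptions made for the example.

```r
# diff values and the desired group column, copied from the question
diff_vals <- c(0, 9, 4, 283, 6820, 7, 4, 381, 4293, 5,
               3, 358, 4349, 6, 4, 312, 4211, 5, 4, 384)
expected  <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3,
               3, 3, 4, 4, 4, 4, 5, 5, 5, 5)

# Reduced data frame with only the columns needed here
df <- data.frame(serial = seq_along(diff_vals), diff = diff_vals)

# Flag the surges, accumulate them, and shift so ids start at 1
df$group <- cumsum(df$diff > 500) + 1

# Stops with an error if the result differs from the desired column
stopifnot(all(df$group == expected))
print(table(df$group))
```

The threshold of 500 works here because it sits between the largest within-group value (384) and the smallest surge (4211); other data would need a different cutoff.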
    

    Hope this helps.

        2
  •   Nir Regev    9 years ago

    Courtesy of user Gopala: How about df$group <- cumsum(df$diff > 500) + 1 (or whatever criterion you specify)? - Gopala, 31 minutes ago
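
The sample data contains a single vid. If a data frame held several vid values and the group ids were meant to restart for each one, base R's ave() could apply the same cumsum within each vid. This goes beyond what the thread discusses, so treat it as a sketch: the data frame df2 and the vid labels "a" and "b" are made up for illustration.

```r
# Hypothetical data with two sessions; the vid labels are invented
df2 <- data.frame(
  vid  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  diff = c(0, 9, 6820, 7, 0, 4293, 5, 4349)
)

# ave() splits the flags by vid, runs cumsum on each piece, and
# puts the results back in the original row order
df2$group <- ave(as.integer(df2$diff > 500), df2$vid, FUN = cumsum) + 1
print(df2)
```

With this approach the ids are only unique within a vid, so any later grouping would need to use vid and group together.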