
Numbering repeated values in a data frame

r
  •  0
  • arnyeinstein  · 1 year ago

    I have the following problem:

    mydata <- structure(list(Nr = 1:10, sgv = c(72L, 72L, 68L, 62L, 83L, 83L, 
    86L, 86L, 85L, 85L), Date = structure(c(1605969695, 1605969700.306, 
    1605970000.593, 1605970300.593, 1605970595, 1605970600.594, 1605970895, 
    1605970900.417, 1605971195, 1605971200.243), tzone = "CET", class = c("POSIXct", 
    "POSIXt")), Year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020, 
    2020, 2020, 2020), Weekday = c(7, 7, 7, 7, 7, 7, 7, 7, 7, 7), 
        Week = c(47, 47, 47, 47, 47, 47, 47, 47, 47, 47), mmol = c(3.996, 
        3.996, 3.774, 3.441, 4.6065, 4.6065, 4.773, 3.8, 4.7175, 
        4.7175), check_time = structure(c(294.695000171661, 5.30599999427795, 
        300.286999940872, 300, 294.40700006485, 5.5939998626709, 
        294.406000137329, 5.41700005531311, 294.582999944687, 5.24300003051758
        ), class = "difftime", units = "secs"), below = c(FALSE, 
        FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE
        )), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
    ))
    
    # A tibble: 10 × 9
          Nr   sgv Date                 Year Weekday  Week  mmol check_time   below
       <int> <int> <dttm>              <dbl>   <dbl> <dbl> <dbl> <drtn>       <lgl>
     1     1    72 2020-11-21 15:41:35  2020       7    47  4.00 294.695 secs FALSE
     2     2    72 2020-11-21 15:41:40  2020       7    47  4.00   5.306 secs FALSE
     3     3    68 2020-11-21 15:46:40  2020       7    47  3.77 300.287 secs TRUE 
     4     4    62 2020-11-21 15:51:40  2020       7    47  3.44 300.000 secs TRUE 
     5     5    83 2020-11-21 15:56:35  2020       7    47  4.61 294.407 secs FALSE
     6     6    83 2020-11-21 15:56:40  2020       7    47  4.61   5.594 secs FALSE
     7     7    86 2020-11-21 16:01:35  2020       7    47  4.77 294.406 secs FALSE
     8     8    86 2020-11-21 16:01:40  2020       7    47  3.8    5.417 secs TRUE 
     9     9    85 2020-11-21 16:06:35  2020       7    47  4.72 294.583 secs FALSE
    10    10    85 2020-11-21 16:06:40  2020       7    47  4.72   5.243 secs FALSE
    

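    My goal is to compute the total time (the sum of check_time) for each group of consecutive TRUE values in below. My data frame has about 600,000 rows, and the TRUE values occur in runs of 1, 2, 3, or more. To do this, I want to number the TRUE values with an identifier, so that all TRUE values in the same run share the same identifier. For the example above, the result should look like this: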

          Nr   sgv Date                 Year Weekday  Week  mmol check_time   below    ID
       <int> <int> <dttm>              <dbl>   <dbl> <dbl> <dbl> <drtn>       <lgl> <dbl>
     1     1    72 2020-11-21 15:41:35  2020       7    47  4.00 294.695 secs FALSE    NA
     2     2    72 2020-11-21 15:41:40  2020       7    47  4.00   5.306 secs FALSE    NA
     3     3    68 2020-11-21 15:46:40  2020       7    47  3.77 300.287 secs TRUE      1
     4     4    62 2020-11-21 15:51:40  2020       7    47  3.44 300.000 secs TRUE      1
     5     5    83 2020-11-21 15:56:35  2020       7    47  4.61 294.407 secs FALSE    NA
     6     6    83 2020-11-21 15:56:40  2020       7    47  4.61   5.594 secs FALSE    NA
     7     7    86 2020-11-21 16:01:35  2020       7    47  4.77 294.406 secs FALSE    NA
     8     8    86 2020-11-21 16:01:40  2020       7    47  3.8    5.417 secs TRUE      2
     9     9    85 2020-11-21 16:06:35  2020       7    47  4.72 294.583 secs FALSE    NA
    10    10    85 2020-11-21 16:06:40  2020       7    47  4.72   5.243 secs FALSE    NA
    
    1 answer  |  up to 1 year ago
  •  3
  •   Ronak Shah    1 year ago

    Here is a base R option using rle:

    transform(mydata, ID = replace(with(rle(below), rep(cumsum(values), lengths)), !below, NA))
    
    #   Nr sgv                Date Year Weekday Week   mmol   check_time below ID
    #1   1  72 2020-11-21 15:41:35 2020       7   47 3.9960 294.695 secs FALSE NA
    #2   2  72 2020-11-21 15:41:40 2020       7   47 3.9960   5.306 secs FALSE NA
    #3   3  68 2020-11-21 15:46:40 2020       7   47 3.7740 300.287 secs  TRUE  1
    #4   4  62 2020-11-21 15:51:40 2020       7   47 3.4410 300.000 secs  TRUE  1
    #5   5  83 2020-11-21 15:56:35 2020       7   47 4.6065 294.407 secs FALSE NA
    #6   6  83 2020-11-21 15:56:40 2020       7   47 4.6065   5.594 secs FALSE NA
    #7   7  86 2020-11-21 16:01:35 2020       7   47 4.7730 294.406 secs FALSE NA
    #8   8  86 2020-11-21 16:01:40 2020       7   47 3.8000   5.417 secs  TRUE  2
    #9   9  85 2020-11-21 16:06:35 2020       7   47 4.7175 294.583 secs FALSE NA
    #10 10  85 2020-11-21 16:06:40 2020       7   47 4.7175   5.243 secs FALSE NA
    

    Explanation:

    With rle, we create a running counter that increases by one at the start of each run of TRUE values:

    a <- with(rle(mydata$below), rep(cumsum(values), lengths))
    a
    #[1] 0 0 1 1 1 1 1 2 2 2
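
    To see why this works, it can help to look at the intermediate pieces. The following is a minimal sketch using only the below column from the example data; the names r and run_counter are introduced here for illustration and are not part of the original answer.

```r
# The `below` column from the question's example data
below <- c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)

# rle() compresses the vector into runs: one value and one length per run
r <- rle(below)
r$values   # FALSE TRUE FALSE TRUE FALSE
r$lengths  # 2 2 3 1 2

# cumsum() counts TRUE as 1 and FALSE as 0, so the counter only
# increases when a TRUE run starts
run_counter <- cumsum(r$values)  # 0 1 1 2 2

# rep() expands the per-run counter back to one value per row
a <- rep(run_counter, r$lengths)
a
```

    The FALSE rows that follow a TRUE run inherit that run's counter value, which is why the next step is needed.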
    

    Since we want NA for the FALSE values, we use replace:

    replace(a, !mydata$below, NA)
    #[1] NA NA  1  1 NA NA NA  2 NA NA
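
    With the ID column in place, the stated goal (total check_time per run of TRUE values) could be reached with a grouped sum. The sketch below is one possible way, not part of the original answer. It assumes the times are held as plain numeric seconds in a hypothetical column check_time_secs, with values copied from the printed table above; for the real difftime column, something like as.numeric(mydata$check_time, units = "secs") would presumably be needed first.

```r
# Reduced copy of the example data; check_time_secs is a hypothetical
# numeric column holding the check_time values in seconds
df <- data.frame(
  check_time_secs = c(294.695, 5.306, 300.287, 300, 294.407,
                      5.594, 294.406, 5.417, 294.583, 5.243),
  below = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Run identifier as in the answer: NA for FALSE rows, 1, 2, ... for TRUE runs
df$ID <- replace(with(rle(df$below), rep(cumsum(values), lengths)), !df$below, NA)

# The formula interface of aggregate() drops rows with NA in ID by default
# (na.action = na.omit), so only the TRUE runs are summed
totals <- aggregate(check_time_secs ~ ID, data = df, FUN = sum)
totals
# ID 1 sums to 600.287 seconds, ID 2 to 5.417 seconds
```

    On roughly 600,000 rows, a base R aggregate should still be workable, though a grouped summary in dplyr or data.table may be faster.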