代码之家  ›  专栏  ›  技术社区  ›  Japhir

在dplyr管道中按组线性插值(近似值)

  •  1
  • Japhir  · 技术社区  · 7 年前

    我有一个问题,我觉得很难用MRE和简单的语言来解释 答案,主要是因为我不完全理解问题所在

    我有一个tibble与许多样品和参考测量,这是我想要的 对每个样本进行线性插值。我现在是通过 approx 不能很好地在一个小组里用dplyr管道方式。现在我用一个 对tibble进行采样,然后使用for循环。

    进管道,这样我就可以在小组里做任何事?我做过实验 dplyr::do() ,并遇到了“用dplyr编程”的小插曲,但是 搜索主要给我 broom::augment lm 不同的是(e、 g.见 Using approx() with groups in dplyr ). 这条线索似乎也很有希望: How do you use approx() inside of mutate_at()?

    irc的人建议使用条件突变 case_when ,但我

    对于以下的变异操作,但是变异操作依赖于 我刚刚筛选出来的分组数据,如果有意义的话。

    library(tidyverse) # or just dplyr, tibble
    
    # create fake data
    data <- data.frame(
      # in reality a dttm with the measurement time
      timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
      # measurement cycle, normally 40 for sample, 41 for reference
      cycle = rep(c(rep(1:3, 2), 4), 3),
      # wheather the measurement is a reference or a sample
      isref = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
      # measurement intensity for mass 44
      r44 = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)) %>%
      # measurement intensity for mass 45, normally also masses up to mass 49
      mutate(r45 = r44 + rnorm(21, 20))
    # of course this could be tidied up to "intensity" with a new column "mass"
    # (44, 45, ...), but that would make making comparisons even harder...
    
    # overview plot
    data %>%
      ggplot(aes(x = cycle, y = r44, colour = isref)) +
      geom_line() +
      geom_line(aes(y = r45), linetype = 2) +
      geom_point() +
      geom_point(aes(y = r45), shape = 1) +
      facet_grid(~ timestamp)
    
    # what I would like to do
    data %>%
      group_by(timestamp) %>%
      do(target_cycle = approx(x = data %>% filter(isref) %>% pull(r44),
        y = data %>% filter(isref) %>% pull(cycle),
        xout = data %>% filter(!isref) %>% pull(r44))$y) %>%
      unnest()
    # immediately append this new column to the original dataframe for all the
    # samples (!isref) and then apply another approx for those values.
    
    # here's my current attempt for one of the timestamps
    matchref <- function(dat) {
      # split the data into sample gas and reference gas
      ref <- filter(dat, isref)
      smp <- filter(dat, !isref)
    
      # calculate the "target cycle", the points at which the reference intensity
      # 44 matches the sample intensity 44 with linear interpolation
      target_cycle <- approx(x = ref$r44,
        y = ref$cycle, xout = smp$r44)
    
      # append the target cycle to the sample gas
      smp <- smp %>%
        group_by(timestamp) %>%
        mutate(target = target_cycle$y)
    
      # linearly interpolate each reference gas to the target cycle
      ref <- ref %>%
        group_by(timestamp) %>%
        # this is needed because the reference has one more cycle
        mutate(target = c(target_cycle$y, NA)) %>%
        # filter out all the failed ones (no interpolation possible)
        filter(!is.na(target)) %>%
        # calculate interpolated value based on r44 interpolation (i.e., don't
        # actually interpolate this value but shift it based on the 44
        # interpolation)
        mutate(r44 = approx(x = cycle, y = r44, xout = target)$y,
          r45 = approx(x = cycle, y = r45, xout = target)$y) %>%
        select(timestamp, target, r44:r45)
    
      # add new reference gas intensities to the correct sample gasses by the target cycle
      left_join(smp, ref, by = c("time", "target"))
    }
    
    matchref(data)
    # and because now "target" must be length 3 (the group size) or one, not 9
    # I have to create this ugly for-loop
    
    # for which I create a copy of data that has the new columns to be created
    mr <- data %>%
      # filter the sample gasses (since we convert ref to sample)
      filter(!isref) %>%
      # add empty new columns
      mutate(target = NA, r44 = NA, r45 = NA)
    
    # apply matchref for each group timestamp
    for (grp in unique(data$timestamp)) {
      mr[mr$timestamp == grp, ] <- matchref(data %>% filter(timestamp == grp))
    }
    
    1 回复  |  直到 7 年前
        1
  •  2
  •   Dan    7 年前

    下面是一种将引用和示例传播到新列的方法。我放弃了 r45

      data %>% 
        select(-r45) %>% 
        mutate(isref = ifelse(isref, "REF", "SAMP")) %>% 
        spread(isref, r44) %>% 
        group_by(timestamp) %>% 
        mutate(target_cycle = approx(x = REF, y = cycle, xout = SAMP)$y) %>% 
        ungroup
    

    给予,

      # timestamp      cycle  REF  SAMP target_cycle
      # <fct>     <dbl> <dbl> <dbl>        <dbl>
      # 1  a             1    30    28          3  
      # 2  a             2    29    27          4  
      # 3  a             3    28    26         NA  
      # 4  a             4    27    NA         NA  
      # 5  b             1    31    26         NA  
      # 6  b             2    38    36          2.5
      # 7  b             3    34    33          4  
      # 8  b             4    33    NA         NA  
      # 9  c             1    15    31         NA  
      # 10 c             2    19    18          3  
      # 11 c             3    18    16          2.5
      # 12 c             4    17    NA         NA  
    

    编辑以下评论

    保留 r45型 您可以使用如下聚集-联合-扩散方法:

    df %>% 
      mutate(isref = ifelse(isref, "REF", "SAMP")) %>% 
      gather(r, value, r44:r45) %>% 
      unite(ru, r, isref, sep = "_") %>% 
      spread(ru, value) %>%
      group_by(timestamp) %>% 
      mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>% 
      ungroup
    

    # # A tibble: 12 x 7
    #    timestamp cycle r44_REF r44_SAMP r45_REF r45_SAMP target_cycle_r44
    # <fct>        <dbl>   <dbl>    <dbl>   <dbl>    <dbl>        <dbl>
    # 1  a             1      30       28    49.5     47.2          3  
    # 2  a             2      29       27    48.8     48.7          4  
    # 3  a             3      28       26    47.2     46.8         NA  
    # 4  a             4      27       NA    47.9     NA           NA  
    # 5  b             1      31       26    51.4     45.7         NA  
    # 6  b             2      38       36    57.5     55.9          2.5
    # 7  b             3      34       33    54.3     52.4          4  
    # 8  b             4      33       NA    52.0     NA           NA  
    # 9  c             1      15       31    36.0     51.7         NA  
    # 10 c             2      19       18    39.1     37.9          3  
    # 11 c             3      18       16    39.2     35.3          2.5
    # 12 c             4      17       NA    39.0     NA           NA