
Randomly add NA values to a data frame at a proportion set per group

  •  0
  • Eric Green  · Tech Community  · 4 years ago

    I want to randomly add NA values to my data frame, with the proportion of NAs set separately for each group.

    library(tidyverse)
    set.seed(1)
    dat <- tibble(group = c(rep("A", 100),
                            rep("B", 100)),
                  value = rnorm(200))
    
    pA <- 0.5
    pB <- 0.2
    
    # does not work
    # was trying to create another column that i could use with
    # case_when to set value to NA if missing==1
    dat %>%
      group_by(group) %>%
      mutate(missing = rbinom(n(), 1, c(pA, pB))) %>%
      summarise(mean = mean(missing))
    
    1 reply  |  as of 4 years ago
        1
  •  1
  •   dipetkov    4 years ago

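    I would create a small tibble that keeps track of the expected missingness rate for each group and join it to the first data frame. Then, row by row, decide whether to set the value to missing.

    This also generalizes easily to more than two groups.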

    library("tidyverse")
    
    set.seed(1)
    
    dat <- tibble(
      group = c(
        rep("A", 100),
        rep("B", 100)
      ),
      value = rnorm(200)
    )
    
    expected_nans <- tibble(
      group = c("A", "B"),
      p = c(0.5, 0.2)
    )
    
    dat_with_nans <- dat %>%
      inner_join(
        expected_nans,
        by = "group"
      ) %>%
      mutate(
        r = runif(n()),
        value = if_else(r < p, NA_real_, value)
      ) %>%
      select(
        -p, -r
      )
    
    dat_with_nans %>%
      group_by(
        group
      ) %>%
      summarise(
        mean(is.na(value))
      )
    #> # A tibble: 2 × 2
    #>   group `mean(is.na(value))`
    #>   <chr>                <dbl>
    #> 1 A                     0.53
    #> 2 B                     0.17
    

    Created on 2022-03-11 by the reprex package (v2.0.1)
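The join approach above draws each row independently, so the realized share of NAs only approximates `p` (0.53 and 0.17 in the output above). If an exact count per group is wanted, a minimal sketch along the same lines, extended to three groups, could look like the following. The names `na_rates`, `n_missing`, and `dat_exact`, as well as the third group `"C"` and its rate, are assumptions made for illustration and are not part of the original answer.

```r
library(tidyverse)

set.seed(1)

dat <- tibble(
  group = rep(c("A", "B", "C"), each = 100),
  value = rnorm(300)
)

# Hypothetical lookup table: one row per group with its target NA rate.
na_rates <- tibble(
  group = c("A", "B", "C"),
  p = c(0.5, 0.2, 0.1)
)

dat_exact <- dat %>%
  inner_join(na_rates, by = "group") %>%
  group_by(group) %>%
  mutate(
    # Number of rows to blank out in this group.
    n_missing = round(n() * first(p)),
    # sample.int(n()) is a random permutation of 1..n(), so exactly
    # n_missing rows per group receive a rank <= n_missing.
    value = if_else(sample.int(n()) <= n_missing, NA_real_, value)
  ) %>%
  ungroup() %>%
  select(-p, -n_missing)

# prop_na is exactly 0.5, 0.2 and 0.1 for groups A, B and C.
dat_exact %>%
  group_by(group) %>%
  summarise(prop_na = mean(is.na(value)))
```

Which rows are blanked is still random; only the count per group is fixed. Adding another group only requires another row in the lookup table.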

        2
  •  0
  •   Eric Green    4 years ago

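    Nesting and then unnesting works: nest the data by group, attach each group's probability, draw the missingness indicator inside each nested tibble, then unnest.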

    library(tidyverse)
    dat <- tibble(group = c(rep("A", 1000),
                            rep("B", 1000)),
                  value = rnorm(2000))
    
    pA <- .1
    pB <- 0.5
    
    set.seed(1)
    dat %>%
      group_by(group) %>%
      nest() %>%
      mutate(p = case_when(
        group=="A" ~ pA,
        TRUE ~ pB
      )) %>%
      mutate(data = purrr::map(data, ~ mutate(.x, missing = rbinom(n(), 1, p)))) %>% 
      unnest(cols = data) %>%
      summarise(mean = mean(missing))
    
    # A tibble: 2 × 2
      group  mean
      <chr> <dbl>
    1 A     0.11 
    2 B     0.481
    
    set.seed(1)
    dat %>%
      group_by(group) %>%
      nest() %>%
      mutate(p = case_when(
        group=="A" ~ pA,
        TRUE ~ pB
      )) %>%
      mutate(data = purrr::map(data, ~ mutate(.x, missing = rbinom(n(), 1, p)))) %>% 
      unnest(cols = data) %>%
      ungroup() %>%
      mutate(value = case_when(
        missing == 1 ~ NA_real_,
        TRUE ~ value
      )) %>%
      select(-p, -missing)
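For context on why the attempt in the question fails: `rbinom()` recycles its `prob` argument across the draws, so `rbinom(n(), 1, c(pA, pB))` alternates between the two probabilities row by row within every group, and both groups end up near the average of `pA` and `pB`. A possible shortcut that avoids nesting is to give `rbinom()` one probability per row through a named lookup vector. This is only a sketch; `p_by_group` and `dat_na` are names introduced here for illustration.

```r
library(tidyverse)

set.seed(1)

dat <- tibble(
  group = rep(c("A", "B"), each = 1000),
  value = rnorm(2000)
)

# Named vector mapping each group to its NA probability
# (same values as pA and pB in the answer above).
p_by_group <- c(A = 0.1, B = 0.5)

dat_na <- dat %>%
  mutate(
    # Indexing by group expands the lookup to one probability per row,
    # so each row's 0/1 flag is drawn with its own group's probability.
    missing = rbinom(n(), 1, unname(p_by_group[group])),
    value = if_else(missing == 1, NA_real_, value)
  ) %>%
  select(-missing)

# Share of NAs per group; it should land close to 0.1 and 0.5.
dat_na %>%
  group_by(group) %>%
  summarise(prop_na = mean(is.na(value)))
```

As with the nested version, the realized proportions vary from run to run around the target rates unless the seed is fixed.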
    