代码之家  ›  专栏  ›  技术社区  ›  tic-toc-choc

在`dplyr中高效使用列表进行过滤`

  •  4
  • tic-toc-choc  · 技术社区  · 11 月前

    我的 filter_list 具有大量元素。下面的过滤是有效的,但如何使 dplyr::filter 更简洁?

    我做不到 all_of 工作。

    filter_list <- list(
      hair_color = c("blond", "brown"),
      skin_color = "light"
    )
    
    dplyr::starwars |> 
      dplyr::filter(
        hair_color %in% filter_list[["hair_color"]],
        skin_color %in% filter_list[["skin_color"]]
      )
    
    3 回复  |  直到 11 月前
        1
  •  7
  •   Axeman    11 月前

    我们可以用 reduce2 迭代应用 filter 陈述,例如:

    library(purrr); library(dplyr)
    
    out <- starwars |> 
      reduce2(
        .x = filter_list, .y = names(filter_list), .init = _,
        .f = \(df, x, y) filter(df, .data[[y]] %in% x)
      )
    
    # A tibble: 8 × 14
      name     height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles
      <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  <chr>     <chr>   <lis> <list>  
    1 Leia Or…    150    49 brown      light      brown             19 fema… femin… Alderaan  Human   <chr> <chr>   
    2 Beru Wh…    165    75 brown      light      blue              47 fema… femin… Tatooine  Human   <chr> <chr>   
    3 Padmé A…    185    45 brown      light      brown             46 fema… femin… Naboo     Human   <chr> <chr>   
    4 Cordé       157    NA brown      light      brown             NA NA    NA     Naboo     NA      <chr> <chr>   
    5 Dormé       165    NA brown      light      brown             NA fema… femin… Naboo     Human   <chr> <chr>   
    6 Raymus …    188    79 brown      light      brown             NA male  mascu… Alderaan  Human   <chr> <chr>   
    7 Rey          NA    NA brown      light      hazel             NA fema… femin… NA        Human   <chr> <chr>   
    8 Poe Dam…     NA    NA brown      light      brown             NA male  mascu… NA        Human   <chr> <chr>
    

    检查是否正确:

    all.equal(
      out, 
      dplyr::starwars |> 
        dplyr::filter(
          hair_color %in% filter_list[["hair_color"]],
          skin_color %in% filter_list[["skin_color"]]
        )
    )
    
        2
  •  4
  •   I_O    11 月前

    使用基座 Map ing和 Reduce 惯性导航与制导:

    
        names(filter_list) |> 
          Map(f = \(varname) starwars |> filter(.data[[varname]] %in% filter_list[[varname]])) |> 
          Reduce(f = \(stack, piece) inner_join(stack, piece))
    
    

    请注意,已接受的解决方案 purrr::reduce2 速度是原来的两倍多。

        3
  •  4
  •   ThomasIsCoding    11 月前

    你可以试试 rowMeans + mapply 如下图所示

    starwars %>%
      filter(
        rowMeans(mapply(`%in%`, select(., names(filter_list)), filter_list)) == 1
      )
    

    Reduce + Map

    starwars %>%
      filter(Reduce(`&`, Map(`%in%`, select(., names(filter_list)), filter_list)))
    

    或者只是一个基本的R组合 subset + 减少 + 地图

    subset(starwars, Reduce(`&`, Map(`%in%`, starwars[names(filter_list)], filter_list)))
    

    这给了

    # A tibble: 8 × 14
      name      height  mass hair_color skin_color eye_color birth_year sex   gender
      <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
    1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
    2 Beru Whi…    165    75 brown      light      blue              47 fema… femin…
    3 Padmé Am…    185    45 brown      light      brown             46 fema… femin…
    4 Cordé        157    NA brown      light      brown             NA NA    NA
    5 Dormé        165    NA brown      light      brown             NA fema… femin…
    6 Raymus A…    188    79 brown      light      brown             NA male  mascu…
    7 Rey           NA    NA brown      light      hazel             NA fema… femin…
    8 Poe Dame…     NA    NA brown      light      brown             NA male  mascu…
    # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
    #   vehicles <list>, starships <list>
    

    基准

    如果标题中的“效率”指的是速度,你可以在这里查看基准测试

    axeman <- \() {
      starwars |>
        reduce2(
          .x = filter_list, .y = names(filter_list), .init = _,
          .f = \(df, x, y) filter(df, .data[[y]] %in% x)
        )
    }
    
    i_o <- \() {
      names(filter_list) |>
        Map(f = \(varname) starwars |> filter(.data[[varname]] %in% filter_list[[varname]])) |>
        Reduce(f = \(stack, piece) inner_join(stack, piece))
    }
    
    tic1 <- \() {
      starwars %>%
        filter(
          rowMeans(mapply(`%in%`, select(., names(filter_list)), filter_list)) == 1
        )
    }
    
    tic2 <- \() {
      starwars %>%
        filter(Reduce(`&`, Map(`%in%`, select(., names(filter_list)), filter_list)))
    }
    
    tic3 <- \() {
      subset(starwars, Reduce(`&`, Map(`%in%`, starwars[names(filter_list)], filter_list)))
    }
    
    microbenchmark(
      axeman(),
      i_o(),
      tic1(),
      tic2(),
      tic3(),
      unit = "relative",
      check = "equal"
    )
    

    其中显示

    Unit: relative
         expr       min         lq      mean    median        uq       max neval
     axeman()  11.98158   9.977999  9.679677  10.74786  9.652521  4.009427   100
        i_o() 172.43091 130.316298 96.607907 121.01399 96.094325 15.142344   100
       tic1()  12.45654  11.237299 11.433905  12.15965 12.796552  2.417425   100
       tic2()  12.14343  10.864622 10.723350  11.55505 11.580282  4.656169   100
       tic3()   1.00000   1.000000  1.000000   1.00000  1.000000  1.000000   100