代码之家  ›  专栏  ›  技术社区  ›  stephr

循环为多个变量选择最接近另一个日期的日期

  •  0
  • stephr  · 技术社区  · 10 月前

    对于每个MRN ID,我想选择与多个其他列(即基准、六个月、十二个月和二十四个月)中列出的日期最接近的各个列(即hdl、ldl和vldl)相关联的日期(在OBObservation_date中)。因此,例如,对于报告HDL数据的行,我想评估OBSEVATION_DATE中的日期,以找到最接近Base、SixMonths等的日期。应该为该示例创建4个新列(列出的4个时间点各一个)。此外,对于该示例,我希望创建另一组4列,其值与其他4列新列中标识的日期相关联。因此,对于我的示例数据,总共应该有24个新列(见下文)。

    该算法应仅查找给定列中有数据的观测值,作为最接近基准、六个月、十二个月和二十四个月的日期的潜在选择。例如,如果一行没有HDL的数据,则不应将其视为从中选择日期的选择池的一部分。并非所有为某人提供的日期都会在我关心的所有列中都有数据。因此,仅仅依靠一种算法为一个人选择日期是没有用的,而不管该日期是否有该变量的数据。

    HDL OBSERVATION_DATE date closest to Base,
    HDL OBSERVATION_DATE date closest to SixMonths,
    HDL OBSERVATION_DATE date closest to TwelveMonths,
    HDL OBSERVATION_DATE date closest to TwentyFourMonths,
    HDL value associated with OBSERVATION_DATE date closest to Base,
    HDL value associated with OBSERVATION_DATE date closest to SixMonths,
    HDL value associated with OBSERVATION_DATE date closest to TwelveMonths,
    HDL value associated with OBSERVATION_DATE date closest to TwentyFourMonths,
    
    LDL OBSERVATION_DATE date closest to Base,
    LDL OBSERVATION_DATE date closest to SixMonths,
    LDL OBSERVATION_DATE date closest to TwelveMonths,
    LDL OBSERVATION_DATE date closest to TwentyFourMonths,
    LDL value associated with OBSERVATION_DATE date closest to Base,
    LDL value associated with OBSERVATION_DATE date closest to SixMonths,
    LDL value associated with OBSERVATION_DATE date closest to TwelveMonths,
    LDL value associated with OBSERVATION_DATE date closest to TwentyFourMonths,
    
    VLDL OBSERVATION_DATE date closest to Base,
    VLDL OBSERVATION_DATE date closest to SixMonths,
    VLDL OBSERVATION_DATE date closest to TwelveMonths,
    VLDL OBSERVATION_DATE date closest to TwentyFourMonths,
    VLDL value associated with OBSERVATION_DATE date closest to Base,
    VLDL value associated with OBSERVATION_DATE date closest to SixMonths,
    VLDL value associated with OBSERVATION_DATE date closest to TwelveMonths,
    VLDL value associated with OBSERVATION_DATE date closest to TwentyFourMonths,
    

    以下是我的数据:

    structure(list(MRN = c(15842, 15842, 15842, 19463, 19463, 19463, 
    19463, 19463, 19463, 19463, 19463, 19463, 19463, 19463, 19463, 
    19463, 19463, 19463, 19463, 19463, 19463, 34025, 34025, 34025, 
    34025, 34025, 34025, 37465, 37465, 37465, 68874, 68874, 68874, 
    76133, 76133, 76133, 76133, 76133, 76133, 76133, 76133, 76133, 
    76133, 76133, 76133, 76133, 76133, 76133, 76133, 76133), OBSERVATION_DATE = structure(c(18289, 
    18289, 18289, 16073, 16073, 16073, 16434, 16434, 16434, 16536, 
    16536, 16536, 16821, 16821, 16821, 17196, 17196, 17196, 17604, 
    17604, 17604, 19114, 19114, 19114, 19338, 19338, 19338, 19060, 
    19060, 19060, 19730, 19730, 19730, 17326, 17326, 17326, 17331, 
    17331, 17331, 17333, 17333, 17333, 17336, 17336, 17336, 17339, 
    17339, 17339, 17347, 17347), class = "Date"), HDL = c(NA, 47, 
    NA, 40, NA, NA, NA, 43, NA, 38, NA, NA, NA, 41, NA, NA, 48, NA, 
    NA, 45, NA, NA, 44, NA, NA, 42, NA, NA, NA, 56, 16, NA, NA, NA, 
    34, NA, 34, NA, NA, 31, NA, NA, 33, NA, NA, NA, 32, NA, NA, NA
    ), LDL = c(NA, NA, 83, NA, 92, NA, 107, NA, NA, NA, NA, 112, 
    93, NA, NA, 96, NA, NA, 109, NA, NA, NA, NA, 76, 56, NA, NA, 
    141, NA, NA, NA, NA, 49, 55, NA, NA, NA, NA, 57, NA, 53, NA, 
    NA, NA, 59, 55, NA, NA, NA, 55), VLDL = c(14, NA, NA, NA, NA, 
    46, NA, NA, 30, NA, 30, NA, NA, NA, 28, NA, NA, 20, NA, NA, 28, 
    17, NA, NA, NA, NA, 21, NA, 35, NA, NA, 15, NA, NA, NA, 24, NA, 
    24, NA, NA, NA, 20, NA, 23, NA, NA, NA, 26, 22, NA), Base = structure(c(17647, 
    17647, 17647, 17032, 17032, 17032, 17032, 17032, 17032, 17032, 
    17032, 17032, 17032, 17032, 17032, 17032, 17032, 17032, 17032, 
    17032, 17032, 18577, 18577, 18577, 18577, 18577, 18577, 18894, 
    18894, 18894, 19431, 19431, 19431, 16751, 16751, 16751, 16751, 
    16751, 16751, 16751, 16751, 16751, 16751, 16751, 16751, 16751, 
    16751, 16751, 16751, 16751), class = "Date"), SixMonths = structure(c(17830, 
    17830, 17830, 17215, 17215, 17215, 17215, 17215, 17215, 17215, 
    17215, 17215, 17215, 17215, 17215, 17215, 17215, 17215, 17215, 
    17215, 17215, 18760, 18760, 18760, 18760, 18760, 18760, 19077, 
    19077, 19077, 19614, 19614, 19614, 16934, 16934, 16934, 16934, 
    16934, 16934, 16934, 16934, 16934, 16934, 16934, 16934, 16934, 
    16934, 16934, 16934, 16934), class = "Date"), TwelveMonths = structure(c(18012, 
    18012, 18012, 17397, 17397, 17397, 17397, 17397, 17397, 17397, 
    17397, 17397, 17397, 17397, 17397, 17397, 17397, 17397, 17397, 
    17397, 17397, 18942, 18942, 18942, 18942, 18942, 18942, 19259, 
    19259, 19259, 19796, 19796, 19796, 17116, 17116, 17116, 17116, 
    17116, 17116, 17116, 17116, 17116, 17116, 17116, 17116, 17116, 
    17116, 17116, 17116, 17116), class = "Date"), TwentyFourMonths = structure(c(18377, 
    18377, 18377, 17762, 17762, 17762, 17762, 17762, 17762, 17762, 
    17762, 17762, 17762, 17762, 17762, 17762, 17762, 17762, 17762, 
    17762, 17762, 19307, 19307, 19307, 19307, 19307, 19307, 19624, 
    19624, 19624, 20161, 20161, 20161, 17481, 17481, 17481, 17481, 
    17481, 17481, 17481, 17481, 17481, 17481, 17481, 17481, 17481, 
    17481, 17481, 17481, 17481), class = "Date")), row.names = c(NA, 
    -50L), class = "data.frame")
    
    1 回复  |  直到 10 月前
        1
  •  0
  •   Jon Spring    10 月前

    我不确定我是否遵守了所有要求,但我认为这基本上得到了你想要的。

    我制作了一张包含所有有效非NA度量的表格。我为每个患者的所有里程碑日期制作了另一个。

    然后我 full_join 这些,按照日期差异的顺序排列,并保留每个MRN/里程碑/测量的第一个观察值。

    (对于每个患者有数千个观察值的大数据,这种完全连接可能会产生太大而无法处理的输出。截至 dplyr 1.1.4在2023年年中,我们无法直接通过加入获得“最接近”的匹配,但我们可以在最近的日期加入,即 <= 并分别加入最近的日期 >= ,并在这些之间进行选择作为最后一步。)

    library(tidyverse)
    df_measurements <- df |>
      select(MRN:VLDL) |>
      pivot_longer(HDL:VLDL, names_to = "measure", values_drop_na = TRUE)
    
    df_patient_timeline <- df |>
      select(-(OBSERVATION_DATE:VLDL)) |>
      distinct() |>
      pivot_longer(-MRN, names_to = "milestone", values_to = "DATE")
    
    df_patient_timeline |>
      full_join(df_measurements, join_by(MRN)) |>
      arrange(MRN, milestone, measure, abs(DATE - OBSERVATION_DATE)) |>
      slice(1, .by = c(MRN, milestone, measure))
    

    结果:

         MRN milestone        DATE       OBSERVATION_DATE measure value
       <dbl> <chr>            <date>     <date>           <chr>   <dbl>
     1 15842 Base             2018-04-26 2020-01-28       HDL        47
     2 15842 Base             2018-04-26 2020-01-28       LDL        83
     3 15842 Base             2018-04-26 2020-01-28       VLDL       14
     4 15842 SixMonths        2018-10-26 2020-01-28       HDL        47
     5 15842 SixMonths        2018-10-26 2020-01-28       LDL        83
     6 15842 SixMonths        2018-10-26 2020-01-28       VLDL       14
     7 15842 TwelveMonths     2019-04-26 2020-01-28       HDL        47
     8 15842 TwelveMonths     2019-04-26 2020-01-28       LDL        83
     9 15842 TwelveMonths     2019-04-26 2020-01-28       VLDL       14
    10 15842 TwentyFourMonths 2020-04-25 2020-01-28       HDL        47
    # ℹ 62 more rows
    

    我推测你的预期输出是这些数据的更广泛版本。如果我们在下面添加代码。。。

    ... |> 
    pivot_wider(names_from = c(measure, milestone), 
                values_from = c(OBSERVATION_DATE, value))
    

    我们得到了这个超宽的桌子,里面(我认为)有你想要的列:

    # A tibble: 24 × 26
         MRN DATE       OBSERVATION_DATE_HDL_Base OBSERVATION_DATE_LDL_Base OBSERVATION_DATE_VLDL_Base OBSERVATION_DATE_HDL_SixMonths OBSERVATION_DATE_LDL_S…¹ OBSERVATION_DATE_VLD…² OBSERVATION_DATE_HDL…³ OBSERVATION_DATE_LDL…⁴ OBSERVATION_DATE_VLD…⁵
       <dbl> <date>     <date>                    <date>                    <date>                     <date>                         <date>                   <date>                 <date>                 <date>                 <date>                
     1 15842 2018-04-26 2020-01-28                2020-01-28                2020-01-28                 NA                             NA                       NA                     NA                     NA                     NA                    
     2 15842 2018-10-26 NA                        NA                        NA                         2020-01-28                     2020-01-28               2020-01-28             NA                     NA                     NA                    
     3 15842 2019-04-26 NA                        NA                        NA                         NA                             NA                       NA                     2020-01-28             2020-01-28             2020-01-28            
     4 15842 2020-04-25 NA                        NA                        NA                         NA                             NA                       NA                     NA                     NA                     NA                    
     5 19463 2016-08-19 2017-01-30                2017-01-30                2017-01-30                 NA                             NA                       NA                     NA                     NA                     NA                    
     6 19463 2017-02-18 NA                        NA                        NA                         2017-01-30                     2017-01-30               2017-01-30             NA                     NA                     NA                    
     7 19463 2017-08-19 NA                        NA                        NA                         NA                             NA                       NA                     2017-01-30             2017-01-30             2017-01-30            
     8 19463 2018-08-19 NA                        NA                        NA                         NA                             NA                       NA                     NA                     NA                     NA                    
     9 34025 2020-11-11 2022-05-02                2022-05-02                2022-05-02                 NA                             NA                       NA                     NA                     NA                     NA                    
    10 34025 2021-05-13 NA                        NA                        NA                         2022-05-02                     2022-05-02               2022-05-02             NA                     NA                     NA                    
    # ℹ 14 more rows
    # ℹ abbreviated names: ¹​OBSERVATION_DATE_LDL_SixMonths, ²​OBSERVATION_DATE_VLDL_SixMonths, ³​OBSERVATION_DATE_HDL_TwelveMonths, ⁴​OBSERVATION_DATE_LDL_TwelveMonths, ⁵​OBSERVATION_DATE_VLDL_TwelveMonths
    # ℹ 15 more variables: OBSERVATION_DATE_HDL_TwentyFourMonths <date>, OBSERVATION_DATE_LDL_TwentyFourMonths <date>, OBSERVATION_DATE_VLDL_TwentyFourMonths <date>, value_HDL_Base <dbl>, value_LDL_Base <dbl>, value_VLDL_Base <dbl>,
    #   value_HDL_SixMonths <dbl>, value_LDL_SixMonths <dbl>, value_VLDL_SixMonths <dbl>, value_HDL_TwelveMonths <dbl>, value_LDL_TwelveMonths <dbl>, value_VLDL_TwelveMonths <dbl>, value_HDL_TwentyFourMonths <dbl>, value_LDL_TwentyFourMonths <dbl>,
    #   value_VLDL_TwentyFourMonths <dbl>
    # ℹ Use `print(n = ...)` to see more rows