
Remove middle names and initials from a column and store them in a separate column

  •  2
  • bear_525  · Tech Community  · 6 months ago

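    I have a column of names; some include a middle name or a middle initial. I want to remove those middle names/initials from the fullname column and create a new column next to it to store them.

    From my research, this post offers some solutions for removing middle names. How can I move those middle names/initials into a new column instead?

    Below is a 41-row sample of my data, produced with dput(). The full data has 43 million rows and is about 3.8 GB.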

    data <- structure(list(id = c(439116595, 439317458, 439373574, 439434694, 
    439508848, 439632143, 439778306, 439917155, 440009485, 440147556, 
    440207880, 441479247, 442115059, 441569787, 438192228, 438215998, 
    438307365, 438317476, 438389110, 438409963, 438479736, 438509859, 
    438634859, 438662407, 438764944, 438846700, 438884094, 438954147, 
    439227370, 439243020, 439248564, 439272667, 439357884, 439403127, 
    439446363, 439511276, 439546441, 439586141, 439804213, 439862550, 
    439889286), fullname = c("Shawn Chase", "Steven Hofer", "Stephen Paradise", 
    "Shengho Yang", "Nelson Carvalho", "RICK GILLICK", "Marie Jhoanne Morilla", 
    "Sanjay Kulkarni", "Sam Bunn", "Iran Murphy", "Kathryn Cutler", 
    "Diane Sik", "Donna Yee", "Christine Coltrain", "Maher Dakkouri", 
    "Ray Perl", "Abid Khalil", "Ian Crombie", "Allen Carr", "Daniel Angeline", 
    "Jimmy Tan", "Thierry LAMBERT", "Diene Faye", "Greg Greene", 
    "Laura Holsopple", "Roberta Minkus", "Bridget Chenette", "Joshua Polite", 
    "John Liberty", "David Smith", "Igor Baratta", "Pierre Schmitz", 
    "alejandra salvanes", "Malcolm K Knight", "Xiaoyan Hu", "Joe Pawl", 
    "Bryan Armstrong", "Christina Spezio", "Robert Gibson", "Peter Head", 
    "Mike Russo"), degree = c("Bachelor", "Bachelor", "MBA", 
    "", "", "", "", "Master", "", "MBA", "", "", "Bachelor", "Doctor", 
    "Bachelor", "Master", "Master", "", "", "", "", "", "", "Bachelor", 
    "Master", "", "Master", "", "Bachelor", "Master", "", "", "", 
    "", "Bachelor", "Associate", "", "Bachelor", "", "", "Associate"
    )), row.names = 20:60, class = "data.frame")
    

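    I tried the following on a small subset of the data. However, it produced a warning:

    Warning message: In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) : argument is not an atomic vector; coercing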

    str_match(data, '^(\\S+)\\s*(.*?)\\s*(\\S+)$')[,-1]
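
A note on the warning: `str_match()` expects a character vector, but here it receives the whole data frame `data`, which is why stringi reports that the argument is not an atomic vector and coerces it. Passing the column itself (`data$fullname`) should avoid this. The sketch below illustrates the idea with base R's `regexec()`/`regmatches()` and the same pattern, so it runs without stringr. The names `toy`, `parts` and `fullname_no_mid` are illustrative and not part of the original post.

```r
# Toy data frame standing in for the real one (names taken from the sample).
toy <- data.frame(
  fullname = c("Shawn Chase", "Marie Jhoanne Morilla", "Malcolm K Knight"),
  stringsAsFactors = FALSE
)

# Same pattern as in the question: first token, optional middle part, last token.
pattern <- "^(\\S+)\\s*(.*?)\\s*(\\S+)$"

# Pass the character column, not the data frame itself.
m <- regmatches(toy$fullname, regexec(pattern, toy$fullname, perl = TRUE))

# Each list element holds the full match followed by the three capture groups;
# bind them into a matrix and drop the full-match column.
parts <- do.call(rbind, m)[, -1, drop = FALSE]

toy$mid_name        <- parts[, 2]
toy$fullname_no_mid <- paste(parts[, 1], parts[, 3])
print(toy)
```

This sketch assumes every name matches the pattern; rows that do not match would need separate handling before the `rbind` step.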
    
    2 answers  |  latest 6 months ago
        1
  •  2
  •   ThomasIsCoding    6 months ago

    Maybe you can try this:

    transform(data,
      no_mid_name = sub("^(\\w+).*?(\\w+)$", "\\1 \\2", fullname),
      mid_name = trimws(sub("^\\w+(.*?)\\w+$", "\\1", fullname))
    )
    

    which gives

              id              fullname    degree        no_mid_name mid_name
    20 439116595           Shawn Chase  Bachelor        Shawn Chase
    21 439317458          Steven Hofer  Bachelor       Steven Hofer
    22 439373574      Stephen Paradise       MBA   Stephen Paradise
    23 439434694          Shengho Yang                 Shengho Yang
    24 439508848       Nelson Carvalho              Nelson Carvalho
    25 439632143          RICK GILLICK                 RICK GILLICK
    26 439778306 Marie Jhoanne Morilla                Marie Morilla  Jhoanne
    27 439917155       Sanjay Kulkarni    Master    Sanjay Kulkarni
    28 440009485              Sam Bunn                     Sam Bunn
    29 440147556           Iran Murphy       MBA        Iran Murphy
    30 440207880        Kathryn Cutler               Kathryn Cutler
    31 441479247             Diane Sik                    Diane Sik
    32 442115059             Donna Yee  Bachelor          Donna Yee
    33 441569787    Christine Coltrain    Doctor Christine Coltrain
    34 438192228        Maher Dakkouri  Bachelor     Maher Dakkouri
    35 438215998              Ray Perl    Master           Ray Perl
    36 438307365           Abid Khalil    Master        Abid Khalil
    37 438317476           Ian Crombie                  Ian Crombie
    38 438389110            Allen Carr                   Allen Carr
    39 438409963       Daniel Angeline              Daniel Angeline
    40 438479736             Jimmy Tan                    Jimmy Tan
    41 438509859       Thierry LAMBERT              Thierry LAMBERT
    42 438634859            Diene Faye                   Diene Faye
    43 438662407           Greg Greene  Bachelor        Greg Greene
    44 438764944       Laura Holsopple    Master    Laura Holsopple
    45 438846700        Roberta Minkus               Roberta Minkus
    46 438884094      Bridget Chenette    Master   Bridget Chenette
    47 438954147         Joshua Polite                Joshua Polite
    48 439227370          John Liberty  Bachelor       John Liberty
    49 439243020           David Smith    Master        David Smith
    50 439248564          Igor Baratta                 Igor Baratta
    51 439272667        Pierre Schmitz               Pierre Schmitz
    52 439357884    alejandra salvanes           alejandra salvanes
    53 439403127      Malcolm K Knight               Malcolm Knight        K
    54 439446363            Xiaoyan Hu  Bachelor         Xiaoyan Hu
    55 439511276              Joe Pawl Associate           Joe Pawl
    56 439546441       Bryan Armstrong              Bryan Armstrong
    57 439586141      Christina Spezio  Bachelor   Christina Spezio
    58 439804213         Robert Gibson                Robert Gibson
    59 439862550            Peter Head                   Peter Head
    60 439889286            Mike Russo Associate         Mike Russo
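
One caveat that may be worth checking on the full data: `\w` matches only letters, digits and underscores, so a surname containing an apostrophe or a hyphen can be cut at the punctuation. The sketch below compares the `\w`-based patterns with a `\S`-based variant (the pattern from the question) on two hypothetical names that do not appear in the sample. `perl = TRUE` is added here only to pin down the regex flavour.

```r
# Hypothetical names (not from the sample data) with punctuation in the surname.
x <- c("Anne Marie O'Neil", "Jean Paul Smith-Jones")

# \w-based patterns, as in the answer above.
w_no_mid <- sub("^(\\w+).*?(\\w+)$", "\\1 \\2", x, perl = TRUE)
w_mid    <- trimws(sub("^\\w+(.*?)\\w+$", "\\1", x, perl = TRUE))

# \S-based variant: any run of non-space characters counts as one name token.
s_pattern <- "^(\\S+)\\s*(.*?)\\s*(\\S+)$"
s_no_mid  <- sub(s_pattern, "\\1 \\3", x, perl = TRUE)
s_mid     <- sub(s_pattern, "\\2", x, perl = TRUE)

print(data.frame(fullname = x, w_no_mid, w_mid, s_no_mid, s_mid))
# w_no_mid: "Anne Neil",   "Jean Jones"        (surname truncated)
# w_mid:    "Marie O'",    "Paul Smith-"       (surname fragment leaks in)
# s_no_mid: "Anne O'Neil", "Jean Smith-Jones"
# s_mid:    "Marie",       "Paul"
```

Whether this matters depends on how many names in the real data contain such characters, which the sample alone cannot show.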
    
        2
  •  1
  •   Ben Bolker    6 months ago

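    The OP's str_match solution is actually slightly faster than @ThomasIsCoding's (perhaps because it avoids running the regex search twice). The full problem is about a million times larger than the example given here (43 million rows vs. 41 rows); since one iteration takes only about 200 µs, we would expect the whole thing to finish in roughly 200 seconds, assuming linear scaling in the number of rows (and assuming I've done the arithmetic correctly). So something else may be going on ... one might need to (1) run some experiments to see how these solutions scale as the problem size grows, and (2) be more careful about memory management etc., e.g. by using data.table ...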
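
The back-of-the-envelope estimate above can be reproduced with a few lines of R. The 187 µs figure is the median for the str_match version taken from the benchmark output in this answer; linear scaling in the number of rows is an assumption, not a measurement.

```r
n_sample <- 41       # rows in the example data
n_full   <- 43e6     # rows in the full data set
t_sample <- 187e-6   # median seconds per call on the 41-row sample (str_match version)

# How many times larger the full problem is than the sample.
scale_factor <- n_full / n_sample

# Projected run time under the linear-scaling assumption.
est_seconds <- scale_factor * t_sample

round(scale_factor)  # 1048780
round(est_seconds)   # 196
```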

    thomas <- function(data) with(data,
          data.frame(no_mid_name = sub("^(\\w+).*?(\\w+)$", "\\1 \\2", fullname),
                     mid_name = trimws(sub("^\\w+(.*?)\\w+$", "\\1", fullname))))
    
    OP <- function(data) {
        ss <- stringr::str_match(data$fullname, '^(\\S+)\\s*(.*?)\\s*(\\S+)$')[,-1]
        data.frame(no_mid_name = paste(ss[,1], ss[,3]),
                   mid_name = ss[,2])
    }
    
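    # Benchmark both approaches on the 41-row sample and keep the key columns
    # (the recorded output below shows the data frame under the name `dd`).
    bench::mark(thomas(data), OP(data))  |>
     dplyr::select(expression, median, `itr/sec`)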
    
    ##   expression   median `itr/sec`
    ##  <bch:expr> <bch:tm>     <dbl>
    ## 1 thomas(dd)    289µs     3351.
    ## 2 OP(dd)        187µs     5266.
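
The scaling experiment suggested in point (1) could look roughly like the sketch below. It uses only base R (`system.time()` rather than bench) and times the `sub()`-based approach on inputs of increasing size built by recycling a few names from the sample. The function name `split_sub`, the chosen sizes and the `sec_per_million` column are illustrative assumptions.

```r
# A few names from the sample, recycled to build larger inputs.
base_names <- c("Shawn Chase", "Marie Jhoanne Morilla",
                "Malcolm K Knight", "RICK GILLICK")

# Candidate under test: the sub()-based approach from the first answer.
split_sub <- function(x) {
  data.frame(
    no_mid_name = sub("^(\\w+).*?(\\w+)$", "\\1 \\2", x),
    mid_name    = trimws(sub("^\\w+(.*?)\\w+$", "\\1", x)),
    stringsAsFactors = FALSE
  )
}

# Illustrative sizes; increase toward the real row count as memory allows.
sizes <- c(1e3, 1e4, 1e5)

# Elapsed seconds for one run at each size.
timings <- vapply(sizes, function(n) {
  x <- rep_len(base_names, n)
  system.time(split_sub(x))[["elapsed"]]
}, numeric(1))

# Seconds per million rows should stay roughly constant if scaling is linear.
scaling <- data.frame(rows = sizes,
                      seconds = timings,
                      sec_per_million = timings / sizes * 1e6)
print(scaling)
```

A single run per size is noisy, so repeating each measurement a few times would give a more reliable picture. Recycled names are also more uniform than real data, so the timings are only indicative.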