代码之家  ›  专栏  ›  技术社区  ›  Danny

基于条件连接中的行中的值

  •  0
  • Danny  · 技术社区  · 7 年前

    我有一个数据框 df (见下面的代码),其中包含近100000行,显示我的程序联系人列表。列表中有一列显示程序 program 和组织 O_ID 联系人与之关联,以及联系人在程序中的角色列。当联系人在多个程序中或在程序中有多个角色时,会为该联系人创建另一行,程序和联系人角色字段值会发生变化。

    First   Last    C_ID    OrgName O_ID Program    Role
    John    Smith   10045   Acme    901 Buildings   Primary
    John    Smith   10045   Acme    901 Buildings   Communications
    John    Smith   10045   Acme    901 Homes       Primary
    Teddy   Bush    10046   Acme    901 Buildings   Primary
    Teddy   Bush    10046   Acme    901 Buildings   Signatory
    Jess    Clinton 10050   Consult 904 Homes       Signatory
    Jess    Clinton 10050   Consult 904 Homes       Primary
    Jess    Clinton 10050   Consult 904 Homes       Communications
    

    出于演示目的,我尽量减少行数。具体来说,如果一个联系人在同一个组织和同一个程序中,我只希望该联系人出现在一行上(而不是现在的几行),并将联系人角色组合成一个字符串。

    我尝试了这段代码,它部分有效: ddply(df,.(df$C_ID, df$Program, df$O_ID), paste, sep=",")

    以下是结果:

    df$C_ID df$Program df$O_ID                        V1                                 V2
    1       10045      Buildings         901         c("John", "John")                c("Smith", "Smith")
    2       10045          Homes         901                      John                              Smith
    3       10046      Buildings         901       c("Teddy", "Teddy")                  c("Bush", "Bush")
    4       10050          Homes         904 c("Jess", "Jess", "Jess") c("Clinton", "Clinton", "Clinton")
                          V3                                 V4               V5                           V6
    1        c(10045, 10045)                  c("Acme", "Acme")      c(901, 901)  c("Buildings", "Buildings")
    2                  10045                               Acme              901                        Homes
    3        c(10046, 10046)                  c("Acme", "Acme")      c(901, 901)  c("Buildings", "Buildings")
    4 c(10050, 10050, 10050) c("Consult", "Consult", "Consult") c(904, 904, 904) c("Homes", "Homes", "Homes")
                                               V7
    1              c("Primary", "Communications")
    2                                     Primary
    3                   c("Primary", "Signatory")
    4 c("Signatory", "Primary", "Communications")
    

    问题是

    1) 列被重新排列(注意,在我的实际数据集中有更多的列),列名消失了

    2) 值发生更改的唯一列应位于 Role 柱但是,即使合并的值相同,结果也会合并大多数列的值。例如,在“结果”列中 V1 (名字列),返回 c("John", "John") .应该是“约翰”。唯一应该具有不同值的列是column V7 c("Primary", "Communications")

    df<-structure(list(First = c("John", "John", "John", "Teddy", "Teddy", 
    "Jess", "Jess", "Jess"), Last = c("Smith", "Smith", "Smith", 
    "Bush", "Bush", "Clinton", "Clinton", "Clinton"), C_ID = c(10045L, 
    10045L, 10045L, 10046L, 10046L, 10050L, 10050L, 10050L), OrgName = c("Acme", 
    "Acme", "Acme", "Acme", "Acme", "Consult", "Consult", "Consult"
    ), O_ID = c(901L, 901L, 901L, 901L, 901L, 904L, 904L, 904L), 
        Program = c("Buildings", "Buildings", "Homes", "Buildings", 
        "Buildings", "Homes", "Homes", "Homes"), Role = c("Primary", 
        "Communications", "Primary", "Primary", "Signatory", "Signatory", 
        "Primary", "Communications")), .Names = c("First", "Last", 
    "C_ID", "OrgName", "O_ID", "Program", "Role"), class = "data.frame", row.names = c(NA, 
    -8L))
    
    2 回复  |  直到 7 年前
        1
  •  1
  •   camille    7 年前

    你需要什么 paste collapse = ", " sep .使用 collapse 从所有输入创建一个字符串。我是通过对所有识别栏、组织、程序等进行分组,然后分解其中的角色来实现这一点的 summarise .

    library(tidyverse)
    
    df %>%
      group_by(First, Last, C_ID, OrgName, O_ID, Program) %>%
      summarise(roles_mult = paste(Role, collapse = ", "))
    #> # A tibble: 4 x 7
    #> # Groups:   First, Last, C_ID, OrgName, O_ID [?]
    #>   First Last     C_ID OrgName  O_ID Program   roles_mult                  
    #>   <chr> <chr>   <int> <chr>   <int> <chr>     <chr>                       
    #> 1 Jess  Clinton 10050 Consult   904 Homes     Signatory, Primary, Communi…
    #> 2 John  Smith   10045 Acme      901 Buildings Primary, Communications     
    #> 3 John  Smith   10045 Acme      901 Homes     Primary                     
    #> 4 Teddy Bush    10046 Acme      901 Buildings Primary, Signatory
    
        2
  •  0
  •   Kamil    7 年前

    你可以使用 dplyr

    > df %>% distinct(First, Last, .keep_all=T)
      First    Last  C_ID OrgName O_ID   Program      Role
    1  John   Smith 10045    Acme  901 Buildings   Primary
    2 Teddy    Bush 10046    Acme  901 Buildings   Primary
    3  Jess Clinton 10050 Consult  904     Homes Signatory