代码之家  ›  专栏  ›  技术社区  ›  Shubham

如何在R中查找数据表的两个多类别列之间的差异/setdiff()

  •  0
  • Shubham  · 技术社区  · 3 年前

    我有两个数据表中的数据,如下所示 (比此处显示的列多得多) -

    数据表1 =data\u销售

    站点Id 产品ID
    1000375476 加拿大 UG10000-WISD
    1000375476 加拿大 UGD12895
    1000706152 瑞士 UG10000-WISD
    1000706152 瑞士 UG80000-NTCD-G
    1000797366 意大利 UG10000-WISD
    1000797366 意大利 UG12210

    数据表2 =data\u许可证

    站点Id 产品ID
    1000375476 加拿大 UG10000-WISD
    1000375476 加拿大 UGD12895
    1000797366 意大利 UG12785
    1000797366 意大利 UG12210

    我想计算data\u sale中所有站点ID的唯一产品ID的设置差异,保留所有行。

    这是我到目前为止所做的-

    1. 对于这两个数据表,我创建了一个包含所有独特产品的新列。
    data_sale <-
      data_sale[, `unique_products` := paste0(unique(`Product ID`), collapse = ","), 
                  keyby = c("Site Id")]
    
    data_licenses <-
      data_licenses[, .(`unique_products` = paste0(unique(`Product ID`), collapse = ",")), 
                  keyby = c("Site Id")]
    
    1. 左侧合并了data\u销售和data\u许可证
    merge(data_sale, data_licenses, by = 'Site Id', all.x = TRUE)
    

    现在合并的数据表如下所示-

    站点Id 产品ID 独特的产品。data\u销售 独特的产品。data\u许可证
    1000375476 加拿大 UG10000-WISD UG10000-WISD,UGD12895 UG10000-WISD,UGD12895
    1000375476 加拿大 UGD12895 UG10000-WISD,UGD12895 UG10000-WISD,UGD12895
    1000706152 瑞士 UG10000-WISD UG10000-WISD,UG80000-NTCD-G 不适用
    1000706152 瑞士 UG80000-NTCD-G UG10000-WISD,UG80000-NTCD-G 不适用
    1000797366 意大利 UG10000-WISD UG10000-WISD,UG12210 UG12785、UG12210
    1000797366 意大利 UG12210 UG10000-WISD,UG12210 UG12785、UG12210

    问题是在我的最后一步中,我希望有一个新的列显示data\u sale和data\u许可证产品之间的差异,它应该如下所示-

    站点Id 产品ID 独特的产品。data\u销售 独特的产品。data\u许可证 差别
    1000375476 加拿大 UG10000-WISD UG10000-WISD,UGD12895 UG10000-WISD,UGD12895 不适用
    1000375476 加拿大 UGD12895 UG10000-WISD,UGD12895 UG10000-WISD,UGD12895 不适用
    1000706152 瑞士 UG10000-WISD UG10000-WISD,UG80000-NTCD-G 不适用 UG10000-WISD,UG80000-NTCD-G
    1000706152 瑞士 UG80000-NTCD-G UG10000-WISD,UG80000-NTCD-G 不适用 UG10000-WISD,UG80000-NTCD-G
    1000797366 意大利 UG10000-WISD UG10000-WISD,UG12210 UG12785、UG12210 UG10000-WISD
    1000797366 意大利 UG12210 UG10000-WISD,UG12210 UG12785、UG12210 UG10000-WISD

    任何关于如何实现这一目标的线索都将大有裨益。谢谢

    下面是使用dput()对合并数据表进行处理的数据

    structure(list(`Site Id` = c("1000375476", "1000375476", "1000706152", 
    "1000706152", "1000797366", "1000797366"), Country = c("Canada", 
    "Canada", "Switzerland", "Switzerland", "Italy", "Italy"), `Product ID` = c("UG10000-WISD", 
    "UGD12895", "UG10000-WISD", "UG80000-NTCD-G", "UG10000-WISD", 
    "UG12210"), unique_products.x = c("UG10000-WISD,UGD12895", "UG10000-WISD,UGD12895", 
    "UG10000-WISD,UG80000-NTCD-G", "UG10000-WISD,UG80000-NTCD-G", 
    "UG10000-WISD,UG12210", "UG10000-WISD,UG12210"), unique_products.y = c("UG10000-WISD,UGD12895", 
    "UG10000-WISD,UGD12895", NA, NA, "UG12785,UG12210", "UG12785,UG12210"
    )), sorted = "Site Id", class = c("data.table", "data.frame"), row.names = c(NA, 
    -6L), .internal.selfref = <pointer: 0x556bb5c10a40>)
    
    0 回复  |  直到 3 年前
        1
  •  0
  •   jblood94    3 年前

    这将使产品进入 data_sale 不在 data_license 通过 Site Id 。与串联唯一的产品ID不同,使用唯一列作为字符向量更容易。

    library(data.table)
    
    data_licenses <- data.table(`Site Id` = c("1000375476", "1000375476", "1000797366", "1000797366"),
                                Country = c("Canada", "Canada", "Italy", "Italy"),
                                `Product ID` = c("UG10000-WISD", "UGD12895", "UG12785", "UG12210"))
    data_sale <- data.table(`Site Id` = c("1000375476", "1000375476", "1000706152", "1000706152", "1000797366", "1000797366"),
                            Country = c("Canada", "Canada", "Switzerland", "Switzerland", "Italy", "Italy"),
                            `Product ID` = c("UG10000-WISD", "UGD12895", "UG10000-WISD", "UG80000-NTCD-G", "UG10000-WISD", "UG12210"))
    
    data_unique <- data_sale[
      , .(unique_products.data_sale = .(unique(`Product ID`))), c("Site Id", "Country")
    ][
      data_licenses[, .(unique_products = .(unique(`Product ID`))), "Site Id"],
      unique_products.data_licenses := i.unique_products,
      on = "Site Id"
    ][
      , difference := lapply(.I, function(i) setdiff(unique_products.data_sale[[i]], unique_products.data_licenses[[i]]))
    ]
    print(data_unique)
    #>       Site Id     Country   unique_products.data_sale unique_products.data_licenses                  difference
    #> 1: 1000375476      Canada       UG10000-WISD,UGD12895         UG10000-WISD,UGD12895                            
    #> 2: 1000706152 Switzerland UG10000-WISD,UG80000-NTCD-G                               UG10000-WISD,UG80000-NTCD-G
    #> 3: 1000797366       Italy        UG10000-WISD,UG12210               UG12785,UG12210                UG10000-WISD
    
        2
  •  0
  •   RobertoT    3 年前

    可能有一种方法可以尝试组合一些内置函数,但有一个简单自定义函数的示例:

    find_differences = function(x,y){
      # x: column list of strings we want to compare to
      # y: other column list
      x = strsplit(x,',') # transform strings to lists
      y = strsplit(y,',')
      
      differences = list()
    
      for(i in seq(1,length(x))){  # for every row (nested-list)
        
        if(identical(x[[i]],y[[i]])){
          row_diff = NA
        }
        else{
        row_diff = paste(x[[i]][ ! x[[i]] %in% y[[i]] ],collapse=',')
        }
        
        differences = c(differences,row_diff)
      }
      return(differences)
    }
    

    以您的例子:

    example = rename(example, 
                     unique_products.data_sale = unique_products.x,
                     unique_products.data_licenses = unique_products.y)
    
    example$difference = find_differences(example$unique_products.data_sale, example$unique_products.data_license)
    
    > example
          Site Id     Country     Product ID   unique_products.data_sale unique_products.data_licenses                  difference
    1: 1000375476      Canada   UG10000-WISD       UG10000-WISD,UGD12895         UG10000-WISD,UGD12895                          NA
    2: 1000375476      Canada       UGD12895       UG10000-WISD,UGD12895         UG10000-WISD,UGD12895                          NA
    3: 1000706152 Switzerland   UG10000-WISD UG10000-WISD,UG80000-NTCD-G                          <NA> UG10000-WISD,UG80000-NTCD-G
    4: 1000706152 Switzerland UG80000-NTCD-G UG10000-WISD,UG80000-NTCD-G                          <NA> UG10000-WISD,UG80000-NTCD-G
    5: 1000797366       Italy   UG10000-WISD        UG10000-WISD,UG12210               UG12785,UG12210                UG10000-WISD
    6: 1000797366       Italy        UG12210        UG10000-WISD,UG12210               UG12785,UG12210                UG10000-WISD
    
        3
  •  0
  •   langtang    3 年前

    如何快速获得每个站点的差异;然后可以将结果合并回具有 Site Id :

    data_licenses[, .(licen_p = .(.(`Product ID`))), by = `Site Id`] %>% 
      .[data_sale[, .(sale_p= .(.(`Product ID`))), by=`Site Id`],on=.(`Site Id`)] %>% 
      .[,.(difference = toString(unlist(setdiff(sale_p, licen_p)))), by=`Site Id`]
    

    输出:

          Site Id                   difference
    1: 1000375476                             
    2: 1000706152 UG10000-WISD, UG80000-NTCD-G
    3: 1000797366        UG10000-WISD, UG12210