代码之家  ›  专栏  ›  技术社区  ›  PrzeM

R中的模糊匹配(非行到行)

  •  1
  • PrzeM  · 技术社区  · 8 年前

    我需要按照以下模式进行模糊匹配:表a包含带有地址的字符串(我已经预先格式化了,比如删除空格等),我必须验证它们的正确性。我有表B,其中包含所有可能的地址(格式与表A相同),因此我不想只将表A中的第1行与表B中的第1行进行匹配,以此类推,而是将表A中的每一行与整个表B进行比较,找出最接近的匹配。

    根据我的检查, adist agrep 默认情况下,以行到行的方式工作,通过尝试使用它们,我也会立即得到内存不足的消息。在只有8 GB RAM的情况下,是否可以在R中执行此操作?

    我找到了一个类似问题的示例代码,并以此为基础提出了解决方案,但性能仍然是个问题。它在表a中的600行和表B中的2000行样本上运行良好,但完整的数据集分别是600000行和900000行。

    adresy_odl <- adist(TableA$Adres, TableB$Adres, partial=FALSE, ignore.case = TRUE)
    min_odl<-apply(adresy_odl, 1, min)
    
    match.s1.s2<-NULL  
    for(i in 1:nrow(adresy_odl))
    {
      s2.i<-match(min_odl[i],adresy_odl[i,])
      s1.i<-i
      match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=TableB[s2.i,]$Adres, s1name=TableA[s1.i,]$Adres, adist=min_odl[i]),match.s1.s2)
    }
    

    内存错误已经发生在第一行(adist函数):

    Error: cannot allocate vector of size 1897.0 Gb
    

    下面是我使用的数据示例(CSV),tableA和tableB看起来完全相同,唯一的区别是tableB具有Zipcode、Street和City的所有可能组合,而tableA中的Zipcode大多错误,或者街道拼写有一些错误。

    表A:

    "","Zipcode","Street","City","Adres"
    "33854","80-221","Traugutta","Gdańsk","80-221TrauguttaGdańsk"
    "157093","80-276","KsBernardaSychty","Gdańsk","80-276KsBernardaSychtyGdańsk"
    "200115","80-339","Grunwaldzka","Gdańsk","80-339GrunwaldzkaGdańsk"
    "344514","80-318","Wąsowicza","Gdańsk","80-318WąsowiczaGdańsk"
    "355415","80-625","Stryjewskiego","Gdańsk","80-625StryjewskiegoGdańsk"
    "356414","80-452","Kilińskiego","Gdańsk","80-452KilińskiegoGdańsk"
    

    表B:

    "","Zipcode","Street","City","Adres"
    "47204","80-180","11Listopada","Gdańsk","80-18011ListopadaGdańsk"
    "47205","80-041","3BrygadySzczerbca","Gdańsk","80-0413BrygadySzczerbcaGdańsk"
    "47206","80-802","3Maja","Gdańsk","80-8023MajaGdańsk"
    "47207","80-299","Achillesa","Gdańsk","80-299AchillesaGdańsk"
    "47208","80-316","AdamaAsnyka","Gdańsk","80-316AdamaAsnykaGdańsk"
    "47209","80-405","AdamaMickiewicza","Gdańsk","80-405AdamaMickiewiczaGdańsk"
    "47210","80-425","AdamaMickiewicza","Gdańsk","80-425AdamaMickiewiczaGdańsk"
    "47211","80-456","AdolfaDygasińskiego","Gdańsk","80-456AdolfaDygasińskiegoGdańsk"
    

    我的代码结果的前几行:

    "","s2.i","s1.i","s2name","s1name","adist"
    "1",1333,614,"80-152PowstańcówWarszawskichGdańsk","80-158PowstańcówWarszawskichGdańsk",1
    "2",257,613,"80-180CzerskaGdańsk","80-180ZEUSAGdańsk",3
    "3",1916,612,"80-119WojskiegoGdańsk","80-355BeniowskiegoGdańsk",8
    "4",1916,611,"80-119WojskiegoGdańsk","80-180PorębskiegoGdańsk",6
    "5",181,610,"80-204BraciŚniadeckichGdańsk","80-210ŚniadeckichGdańsk",7
    "6",181,609,"80-204BraciŚniadeckichGdańsk","80-210ŚniadeckichGdańsk",7
    "7",21,608,"80-401alGenJózefaHalleraGdańsk","80-401GenJózefaHalleraGdańsk",2
    "8",1431,607,"80-264RomanaDmowskiegoGdańsk","80-264DmowskiegoGdańsk",6
    "9",1610,606,"80-239StefanaCzarnieckiegoGdańsk","80-239StefanaCzarnieckiegoGdańsk",0
    
    1 回复  |  直到 8 年前
        1
  •  3
  •   dmi3kno    8 年前

    我会努力的 fuzzyjoin StackOverflow的drob包装

    library(dplyr)
    
    dict_df <- tibble::tribble(
         ~ID,~Zipcode,~Street,~City,~Adres,
    "33854","80-221","Traugutta","Gdańsk","80-221TrauguttaGdańsk",
    "157093","80-276","KsBernardaSychty","Gdańsk","80-276KsBernardaSychtyGdańsk",
    "200115","80-339","Grunwaldzka","Gdańsk","80-339GrunwaldzkaGdańsk",
    "344514","80-318","Wąsowicza","Gdańsk","80-318WąsowiczaGdańsk",
    "355415","80-625","Stryjewskiego","Gdańsk","80-625StryjewskiegoGdańsk",
    "356414","80-452","Kilińskiego","Gdańsk","80-452KilińskiegoGdańsk") %>% 
      select(ID, Adres)
    
        noise_df <- tibble::tribble(
      ~Zipcode,~Street,~City,~Adres,
      "80-221","Trauguta","Gdansk","80-221TraugutaGdansk",
      "80-211","Traugguta","Gdansk","80-211TrauggutaGdansk",
      "80-276","KsBernardaSychty","Gdańsk","80-276KsBernardaSychtyGdańsk",
      "80-267","KsBernardaSyschty","Gdańsk","80-276KsBernardaSyschtyGdańsk",
      "80-339","Grunwaldzka","Gdańsk","80-339GrunwaldzkaGdańsk",
      "80-399","Grunwaldzka","dansk","80-399Grunwaldzkadańsk",
      "80-318","Wasowicza","Gdańsk","80-318WasowiczaGdańsk",
      "80-625","Stryjewskiego","Gdańsk","80-625StryjewskiegoGdańsk",
      "80-625","Stryewskogo","Gdansk","80-625StryewskogoGdansk",
      "80-452","Kilinskiego","Gdańsk","80-452KilinskiegoGdańsk")
    
    library(fuzzyjoin)
    
    noise_df %>% 
      # using jaccard with max_dist=0.5. Try other distance methods with different max_dist to save memory use
      stringdist_left_join(dict_df, by="Adres", distance_col="dist", method="jaccard", max_dist=0.5) %>%
      select(Adres.x, ID, Adres.y, dist) %>% 
      group_by(Adres.x) %>% 
      # select best fit record
      top_n(-1, dist)
    

    结果表由原始地址组成( Adres.x )和字典中的最佳匹配( ID Adres.y )以及字符串距离。

    # A tibble: 10 x 4
    # Groups:   Adres.x [10]
                             Adres.x     ID                      Adres.y       dist
                               <chr>  <chr>                        <chr>      <dbl>
     1          80-221TraugutaGdansk  33854        80-221TrauguttaGdańsk 0.11764706
     2         80-211TrauggutaGdansk  33854        80-221TrauguttaGdańsk 0.11764706
     3  80-276KsBernardaSychtyGdańsk 157093 80-276KsBernardaSychtyGdańsk 0.00000000
     4 80-276KsBernardaSyschtyGdańsk 157093 80-276KsBernardaSychtyGdańsk 0.00000000
     5       80-339GrunwaldzkaGdańsk 200115      80-339GrunwaldzkaGdańsk 0.00000000
     6        80-399Grunwaldzkadańsk 200115      80-339GrunwaldzkaGdańsk 0.00000000
     7         80-318WasowiczaGdańsk 344514        80-318WąsowiczaGdańsk 0.05555556
     8     80-625StryjewskiegoGdańsk 355415    80-625StryjewskiegoGdańsk 0.00000000
     9       80-625StryewskogoGdansk 355415    80-625StryjewskiegoGdańsk 0.17391304
    10       80-452KilinskiegoGdańsk 356414      80-452KilińskiegoGdańsk 0.05263158
    

    我发现将所有内容转换为小写ASCII时,模糊匹配效果最好( iconv() tolower() )

    更新:这可能会占用较小的内存:

    library(purrr)
    library(dplyr)
      noise_df %>% split(.$Adres) %>% 
      # using jaccard with max_dist=0.5. Try other distance methods with different max_dist to save memory use
      map_df(~stringdist_left_join(.x, dict_df, by="Adres", distance_col="dist", method="jaccard", max_dist=0.5, ignore_case = TRUE) %>%
              select(Adres.x, ID, Adres.y, dist) %>% 
              group_by(Adres.x) %>% 
              # select best fit record
              top_n(-1, dist))
    

    更新2:当使用“lv”距离算法时,您会得到太多缺少的值和NAs。在某些情况下,如果找不到匹配项, string_dist_join 删除 distance 您创建的列。这就是管道其余部分故障的原因,首先 select 稍后在 top_n . 为了查看发生了什么,请对您的数据进行小样本更改 map_df map 并浏览结果列表。