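I need to do fuzzy matching along the following lines: table A contains strings with addresses (which I have already preformatted, e.g. by removing whitespace), and I have to verify their correctness. I have table B, which contains all possible addresses (in the same format as table A), so I don't want to match just row 1 of table A against row 1 of table B and so on, but to compare each row of table A against the whole of table B and find the closest match for each.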
As far as I can tell, adist and agrep work row by row by default, and when I try to use them I immediately get an out-of-memory error. Is it possible to do this in R with only 8 GB of RAM?
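I found sample code for a similar problem and based my solution on it, but performance is still an issue. It works well on a sample of 600 rows in table A and 2000 rows in table B, but the full datasets have 600000 and 900000 rows respectively.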
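adresy_odl <- adist(TableA$Adres, TableB$Adres, partial=FALSE, ignore.case = TRUE)
# smallest distance in each row (one row per TableA address)
min_odl<-apply(adresy_odl, 1, min)
match.s1.s2<-NULL
for(i in 1:nrow(adresy_odl))
{
# column (TableB row) where that minimum first occurs
s2.i<-match(min_odl[i],adresy_odl[i,])
s1.i<-i
# prepend the match, so the result ends up in reverse row order
match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=TableB[s2.i,]$Adres, s1name=TableA[s1.i,]$Adres, adist=min_odl[i]),match.s1.s2)
}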
The memory error already occurs on the first line (the adist call):
Error: cannot allocate vector of size 1897.0 Gb
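A rough back-of-the-envelope check of why that allocation is so large. This is only a sketch: it uses the rounded row counts quoted above, and the bytes per cell (4 for an integer matrix, 8 for a double matrix) are an assumption, since I have not checked which one adist allocates.

```r
# Approximate size of a full TableA x TableB distance matrix.
n_a <- 600000   # rounded number of rows in table A
n_b <- 900000   # rounded number of rows in table B
cells <- n_a * n_b

# Assumed storage per cell: 4 bytes (integer) or 8 bytes (double).
size_gib_int <- cells * 4 / 1024^3
size_gib_dbl <- cells * 8 / 1024^3

round(c(integer = size_gib_int, double = size_gib_dbl))
```

Either way the result is in the thousands of gigabytes, the same order of magnitude as the 1897.0 Gb in the error, so the full matrix cannot fit in 8 GB no matter how adist stores it.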
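Below is a sample of the data I use (CSV). tableA and tableB look exactly the same; the only difference is that tableB has all possible combinations of Zipcode, Street and City, whereas in tableA the Zipcode is mostly wrong or the street name is misspelled.
Table A: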
"","Zipcode","Street","City","Adres"
"33854","80-221","Traugutta","Gdańsk","80-221TrauguttaGdańsk"
"157093","80-276","KsBernardaSychty","Gdańsk","80-276KsBernardaSychtyGdańsk"
"200115","80-339","Grunwaldzka","Gdańsk","80-339GrunwaldzkaGdańsk"
"344514","80-318","Wąsowicza","Gdańsk","80-318WąsowiczaGdańsk"
"355415","80-625","Stryjewskiego","Gdańsk","80-625StryjewskiegoGdańsk"
"356414","80-452","Kilińskiego","Gdańsk","80-452KilińskiegoGdańsk"
Table B:
"","Zipcode","Street","City","Adres"
"47204","80-180","11Listopada","Gdańsk","80-18011ListopadaGdańsk"
"47205","80-041","3BrygadySzczerbca","Gdańsk","80-0413BrygadySzczerbcaGdańsk"
"47206","80-802","3Maja","Gdańsk","80-8023MajaGdańsk"
"47207","80-299","Achillesa","Gdańsk","80-299AchillesaGdańsk"
"47208","80-316","AdamaAsnyka","Gdańsk","80-316AdamaAsnykaGdańsk"
"47209","80-405","AdamaMickiewicza","Gdańsk","80-405AdamaMickiewiczaGdańsk"
"47210","80-425","AdamaMickiewicza","Gdańsk","80-425AdamaMickiewiczaGdańsk"
"47211","80-456","AdolfaDygasińskiego","Gdańsk","80-456AdolfaDygasińskiegoGdańsk"
The first few rows of the result of my code:
"","s2.i","s1.i","s2name","s1name","adist"
"1",1333,614,"80-152PowstańcówWarszawskichGdańsk","80-158PowstańcówWarszawskichGdańsk",1
"2",257,613,"80-180CzerskaGdańsk","80-180ZEUSAGdańsk",3
"3",1916,612,"80-119WojskiegoGdańsk","80-355BeniowskiegoGdańsk",8
"4",1916,611,"80-119WojskiegoGdańsk","80-180PorębskiegoGdańsk",6
"5",181,610,"80-204BraciŚniadeckichGdańsk","80-210ŚniadeckichGdańsk",7
"6",181,609,"80-204BraciŚniadeckichGdańsk","80-210ŚniadeckichGdańsk",7
"7",21,608,"80-401alGenJózefaHalleraGdańsk","80-401GenJózefaHalleraGdańsk",2
"8",1431,607,"80-264RomanaDmowskiegoGdańsk","80-264DmowskiegoGdańsk",6
"9",1610,606,"80-239StefanaCzarnieckiegoGdańsk","80-239StefanaCzarnieckiegoGdańsk",0
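To make the "each row of table A against the whole of table B" comparison concrete without building the full matrix, here is a minimal sketch that processes table A in chunks and keeps only the best match per row. The function name closest_match and the chunk_size parameter are hypothetical, the toy vectors are simplified ASCII versions of the addresses above, and the sketch assumes non-empty inputs without NA values. It is not a tested solution for the full data.

```r
# Hypothetical helper: for each string in `a`, find the closest string in `b`.
# Only a chunk_size x length(b) block of distances is in memory at a time.
closest_match <- function(a, b, chunk_size = 100) {
  n <- length(a)
  best_idx  <- integer(n)
  best_dist <- numeric(n)
  for (start in seq(1, n, by = chunk_size)) {
    rows <- start:min(start + chunk_size - 1, n)
    d <- adist(a[rows], b, partial = FALSE, ignore.case = TRUE)
    # index of the first minimum in each row of the block
    idx <- max.col(-d, ties.method = "first")
    best_idx[rows]  <- idx
    best_dist[rows] <- d[cbind(seq_along(rows), idx)]
  }
  data.frame(s1.i = seq_len(n), s2.i = best_idx,
             s1name = a, s2name = b[best_idx], adist = best_dist,
             stringsAsFactors = FALSE)
}

# Toy data (ASCII-only variants of the sample addresses)
b <- c("80-180CzerskaGdansk",
       "80-239StefanaCzarnieckiegoGdansk",
       "80-264RomanaDmowskiegoGdansk")
a <- c("80-264DmowskiegoGdansk",
       "80-239StefanaCzarnieckiegoGdansk")
res <- closest_match(a, b, chunk_size = 1)
res
```

On the toy data the first address is matched to the third entry of b with distance 6 and the second to the second entry with distance 0. Chunking only bounds memory use. The number of distance computations is unchanged, so with 600000 and 900000 rows the running time would still be a problem.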