Fuzzy matching columns of different DataFrames

fuzzywuzzy fuzzy-comparison fuzzy-logic pandas python

Rahul Agarwal · Tech Community · 7 years ago

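I have two DataFrames that share no common key on which I could merge them. Both have a column containing an "entity name". One DataFrame holds more than 8,000 entities, the other close to 2,000.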

Sample data:

vendor_df=
     Name of Vendor                             City         State  ZIP
     FREDDIE LEES AMERICAN GOURMET SAUCE       St. Louis    MO     63101
     CITYARCHRIVER 2015 FOUNDATION             St. Louis    MO     63102
     GLAXOSMITHKLINE CONSUMER HEALTHCARE       St. Louis    MO     63102
     LACKEY SHEET METAL                        St. Louis    MO     63102

regulator_df = 
     Name of Entity                    Committies
     LACKEY SHEET METAL                 Private
     PRIMUS STERILIZER COMPANY LLC      Private  
     HELGET GAS PRODUCTS INC            Autonomous
     ORTHOQUEST LLC                     Governmant

Problem statement:

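I need to fuzzy-match the two entity-name columns (Name of vendor and Name of Entity) and get a score. In other words, I need to know whether the first value of DataFrame 1 (vendor_df) matches any of the 2,000 entities in DataFrame 2 (regulator_df).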

StackOverflow links I have already checked:

fuzzy match between 2 columns (Python)

create new column in dataframe using fuzzywuzzy

Apply fuzzy matching across a dataframe column and save results in a new column

Code

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

vendor_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Vendors_Sheet.xlsx', sheet_name=0)

regulator_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Regulated_Vendors_Sheet.xlsx', sheet_name=0)

compare = pd.MultiIndex.from_product([vendor_df['Name of vendor'],
                                      regulator_df['Name of Entity']]).to_series()


def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

#compare.apply(metrics) -- Either this works or the below line

result = compare.apply(metrics).unstack().idxmax().unstack(0)

Is there a solution that does the same thing faster, or that can handle large datasets?

Update 1

If we pass in or hard-code a score, say 80, so that the series/DataFrame is filtered to fuzzy scores > 80 only, can the code above be made faster?
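To illustrate why a fixed cutoff can help: with roughly 8,000 × 2,000 names there are about 16 million pairs to score, and a threshold makes it possible to skip the expensive comparison for pairs that cannot reach it. The sketch below uses the standard library's difflib as a stand-in scorer, not fuzzywuzzy's API. The function name matches_above and its parameters are hypothetical. SequenceMatcher offers two cheap upper bounds on ratio(), so the full ratio is only computed for pairs that pass both bounds.

```python
from difflib import SequenceMatcher


def matches_above(vendors, entities, threshold=80):
    """Return (vendor, entity, score) for every pair scoring above threshold.

    Hypothetical helper: scores are difflib ratios scaled to 0..100,
    used here as a stand-in for a fuzzywuzzy scorer.
    """
    cutoff = threshold / 100.0
    results = []
    matcher = SequenceMatcher()
    for vendor in vendors:
        # SequenceMatcher caches data about seq2, so set it once per vendor.
        matcher.set_seq2(vendor)
        for entity in entities:
            matcher.set_seq1(entity)
            # real_quick_ratio() and quick_ratio() are upper bounds on ratio();
            # if either is at or below the cutoff, the pair cannot qualify.
            if matcher.real_quick_ratio() > cutoff and matcher.quick_ratio() > cutoff:
                score = matcher.ratio()
                if score > cutoff:
                    results.append((vendor, entity, round(100 * score)))
    return results
```

Called with the two name columns as plain lists, this returns only the pairs above the cutoff instead of the full score table. How much time the pruning saves depends on the data, so it is worth timing on a sample before relying on it.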

Answers

Rahul Agarwal · 7 years ago

The solution below is faster than the one I posted in the question, but if anyone has a faster approach, please let me know:

matched_vendors = []

# DataFrame.get_value was removed in pandas 1.0; .at is the scalar accessor.
for row in vendor_df.index:
    vendor_name = vendor_df.at[row, "Name of vendor"]
    for reg_row in regulator_df.index:
        regulated_vendor_name = regulator_df.at[reg_row, "Name of Entity"]
        matched_token = fuzz.partial_ratio(vendor_name, regulated_vendor_name)
        # Keep only the pairs that score above the threshold of 80.
        if matched_token > 80:
            matched_vendors.append([vendor_name, regulated_vendor_name, matched_token])
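If only the single best match per vendor is needed, rather than every pair above 80, the same loop can be written as a small function with a pluggable scorer. This is a sketch under assumptions: best_matches and difflib_ratio are hypothetical names, and the default scorer is a standard-library stand-in so the example runs without fuzzywuzzy installed.

```python
from difflib import SequenceMatcher


def difflib_ratio(a, b):
    """Stand-in scorer returning an int in 0..100 (not fuzzywuzzy itself)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_matches(vendor_names, entity_names, scorer=difflib_ratio, threshold=80):
    """Map each vendor to (best entity, score) when the score exceeds threshold.

    entity_names must be re-iterable (a list or a pandas Series, not a generator),
    because it is scanned once per vendor.
    """
    best = {}
    for vendor in vendor_names:
        # Score every entity and keep the highest (score, entity) tuple;
        # ties on score are broken by the lexicographically larger name.
        scored = ((scorer(vendor, entity), entity) for entity in entity_names)
        score, entity = max(scored, default=(0, None))
        if score > threshold:
            best[vendor] = (entity, score)
    return best
```

With the DataFrames from the question, a call along the lines of best_matches(vendor_df["Name of vendor"], regulator_df["Name of Entity"], scorer=fuzz.partial_ratio) should give one entry per matched vendor. Note that this does the same number of comparisons as the nested loop; it only reduces what is stored.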

Sathish Kothandam · 6 years ago

In my case I also only needed matches scoring above 80, so I modified your code for my use case. Hope this helps.

compare = compare.apply(metrics)
compare_80 = compare[(compare['ratio'] > 80) & (compare['token'] > 80)]