代码之家  ›  专栏  ›  技术社区  ›  gogasca

在Pandas数据框中匹配一到多个列

  •  1
  • gogasca  · 技术社区  · 7 年前

    我在CSV文件中有两个数据集,使用panda每个文件都转换成两个不同的数据帧。

    数据集1

    uuid, company_name, website
    YAHOO,Yahoo,yahoo.com    
    CSCO,Cisco,cisco.com
    APPL,Apple,
    

    company_name, company_website, support_website, privacy_website
    Yahoo,,yahoo.com,yahoo.com
    Google,google.com,,
    Cisco,,,cisco.com
    

    结果数据集

    company_name, company_website, support_website, privacy_website, uuid
    Yahoo,,yahoo.com,yahoo.com,YAHOO
    Google,google.com,,
    Cisco,,,cisco.com,CSCO
    
    • 数据集2包含约4百万条记录。

    1. If字段 网站 在数据集1中与字段相同 公司网站 在数据集2中,提取标识符。

    2. 如果不匹配,请检查If字段 网站 在数据集1中与字段相同 支持网站 在数据集2中,提取标识符。

    3. 如果不匹配,请检查If字段 网站 在数据集1中与字段相同 隐私网站 在数据集2中,提取标识符。

    4. 如果不匹配,请检查If字段 公司名称 公司名称 在数据集2中,提取标识符。

    这是我当前的功能:

    def MatchCompanies(
        companies: pandas.Dataframe,
        competitor_companies: pandas.Dataframe) -> Optional[Sequence[str]]:
      """Find Competitor companies in companies dataframe and generate a new list.
    
      Args:
        companies: A dataframe with company information from CSV file.
        competitor_companies: A dataframe with Competitor information from CSV file.
    
      Returns:
        A sequence of matched companies and their UUID.
    
      Raises:
        ValueError: No companies found.
      """
    
      if _IsEmpty(companies):
        raise ValueError('No companies found')
      # Clean up empty fields. Use extra space to avoid matching on empty TLD.
      companies.fillna({'website': ' '}, inplace=True)
      competitor_companies = competitor_companies.fillna('')
      logging.info('Found: %d records.', len(competitor_companies))
      # Rename column to TLD to compare matching companies.
      companies.rename(columns={'website': 'tld'}, inplace=True)
      logging.info('Cleaning up company name.')
      companies.company_name = companies.company_name.apply(_NormalizeText)
      competitor_companies.company_name = competitor_companies.company_name.apply(
          _NormalizeText)
      # Rename column to TLD since Competitor already contains TLD in company_website.
      competitor_companies.rename(columns={'company_website': 'tld'}, inplace=True)
      logging.info('Extracting UUID')
      merge_tld = competitor_companies.merge(
          companies[['tld', 'uuid']], on='tld', how='left')
      # Extracts UUID for company name matches.
      competitor_companies = competitor_companies.merge(
          companies[['company_name', 'uuid']], on='company_name', how='left')
      # Combines dataframes.
      competitor_companies['uuid'] = competitor_companies['uuid'].combine_first(
          merge_tld['uuid'])
      match_companies = len(
          competitor_companies[competitor_companies['uuid'].notnull()])
      total_companies = len(competitor_companies)
      logging.info('Results found: %d out of %d', match_companies, total_companies)
      competitor_companies.rename(columns={'tld': 'company_website'}, inplace=True)
      return competitor_companies
    

    2 回复  |  直到 7 年前
        1
  •  2
  •   jezrael    7 年前

    使用 map 通过 Series combine_first ,但有一个要求是必需的-在 df1['website'] df1['company_name'] :

    df1 = df1.dropna()
    s1 = df1.set_index('website')['uuid']
    s2 = df1.set_index('company_name')['uuid']
    
    w1 = df2['company_website'].map(s1)
    w2 = df2['support_website'].map(s1)
    w3 = df2['privacy_website'].map(s1)
    c = df2['company_name'].map(s2)
    
    df2['uuid'] = w1.combine_first(w2).combine_first(w3).combine_first(c)
    print (df2)
      company_name company_website support_website privacy_website   uuid
    0        Yahoo             NaN       yahoo.com       yahoo.com  YAHOO
    1       Google      google.com             NaN             NaN    NaN
    2        Cisco             NaN             NaN       cisco.com   CSCO
    
        2
  •  -1
  •   Ghasem Naddaf    7 年前

    看一看 dataframe.merge company_website 做一些类似的事情

    A.merge(B, on='company_website', indicator=True)