
pandas str.contains does not match the full string

contains pandas

jason · Tech Community · 6 years ago

I'm having trouble getting .str.contains to work on this df. Why doesn't it match my string? The df clearly contains the string, but it only matches on 'Chief'.

import pandas as pd
link = 'https://www.sec.gov/Archives/edgar/data/1448056/000119312518215760/d619223ddef14a.htm'
ceo = 'Chief Executive Officer'
df_list = pd.read_html(link)
df = df_list[62]
df = df.fillna('')

for column in df:
    if column == 4:
        print ('try #1', df[column].str.contains(ceo, case=True, regex=True))
        print ('try #2', df[column].str.contains(ceo, case=True, regex=False))
        print ('try #3', df[column].str.contains(ceo, regex=False))
        print ('try #4', df[column].str.contains(ceo, regex=True))
        print ('try #5', df[column].str.contains(pat=ceo, regex=False))
        print ('try #6', df[column].str.contains(pat=ceo, case=True, regex=True))

1 reply | as of 6 years ago

Bruno Carballo 6 years ago

The problem is the encoding. You can see it if you run the following:

df[4].iloc[2]

which prints:

'Founder,\xa0Chief\xa0Executive\xa0Officer,\xa0and\xa0Director'
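The \xa0 characters in that output are non-breaking spaces (U+00A0), which do not compare equal to the ordinary spaces in ceo. A minimal sketch of the mismatch follows. It uses a hand-built Series instead of downloading the filing, and the name sample is a hypothetical stand-in for df[4]:

```python
import pandas as pd

ceo = 'Chief Executive Officer'

# Hypothetical stand-in for df[4]: one cell copied from the output above,
# plus the same text typed with ordinary spaces for comparison.
sample = pd.Series([
    'Founder,\xa0Chief\xa0Executive\xa0Officer,\xa0and\xa0Director',
    'Founder, Chief Executive Officer, and Director',
])

# A single word matches in both rows because it contains no spaces.
word_match = sample.str.contains('Chief', regex=False)

# The full phrase matches only where the separators are ordinary spaces.
phrase_match = sample.str.contains(ceo, regex=False)

print(word_match.tolist())
print(phrase_match.tolist())
```

This is why searching for 'Chief' alone succeeds while the three-word phrase does not.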

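The words are separated by non-breaking spaces (\xa0) rather than ordinary spaces, so a pattern written with ordinary spaces never matches. To fix this, transliterate the text with unidecode (a third-party package) before searching: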

import unidecode

# Transliterate column 4 to ASCII once; this turns each \xa0 into a plain space.
col = df[4].apply(unidecode.unidecode)

print('try #1', col.str.contains(ceo, case=True, regex=True))
print('try #2', col.str.contains(ceo, case=True, regex=False))
print('try #3', col.str.contains(ceo, regex=False))
print('try #4', col.str.contains(ceo, regex=True))
print('try #5', col.str.contains(pat=ceo, regex=False))
print('try #6', col.str.contains(pat=ceo, case=True, regex=True))
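If adding a dependency is undesirable, pandas alone may be enough. The sketch below is an alternative to the answer's approach rather than a part of it. It assumes the only problem characters are non-breaking spaces, and sample is a hypothetical stand-in for df[4]:

```python
import pandas as pd

ceo = 'Chief Executive Officer'

# Hypothetical stand-in for df[4] after fillna('').
sample = pd.Series([
    'Founder,\xa0Chief\xa0Executive\xa0Officer,\xa0and\xa0Director',
    'Chief Financial Officer',
    '',
])

# Option 1: replace the non-breaking space explicitly, then search literally.
replaced = sample.str.replace('\xa0', ' ', regex=False)
match_replace = replaced.str.contains(ceo, regex=False)

# Option 2: NFKC normalization maps U+00A0 to an ordinary space.
normalized = sample.str.normalize('NFKC')
match_nfkc = normalized.str.contains(ceo, regex=False)

# Option 3: leave the data untouched and let the regex accept any whitespace;
# in Python 3, \s also matches the non-breaking space in str patterns.
pattern = r'Chief\s+Executive\s+Officer'
match_regex = sample.str.contains(pattern, regex=True)

print(match_replace.tolist())
print(match_nfkc.tolist())
print(match_regex.tolist())
```

Options 1 and 2 clean the column, which helps if it will be searched repeatedly. Option 3 keeps the original text intact at the cost of a more verbose pattern.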
