
Creating a yes/no table for extracted words in Python

pandas python-3.x python

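Rahul Agarwal · Tech Community · 7 years ago

I have a list of documents and a list of keywords, and in the end I need a table that tells me which keywords are present in which file.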

import ntpath
import re

import PyPDF2

# pathlist (an iterable of PDF paths) and regex3 (the keyword pattern)
# are assumed to be defined earlier.
d = {}
for path in pathlist:
    # path is a path object, not a string
    path_in_str = str(path)
    file_name = ntpath.basename(path_in_str)

    text = ""
    with open(path_in_str, 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        for i in range(read_pdf.numPages):
            text += read_pdf.getPage(i).extractText()

    # search once per document, after all pages have been read
    matches = re.findall(regex3, text, re.IGNORECASE)
    d["string{0}".format(file_name)] = [x[1] for x in matches]

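So the keys of the dict "d" are document names and the values are fruit names. For example:

Please note: one key can have multiple values. Everything works fine up to this point.

Can someone tell me how to convert the dictionary into the output above?

To be clearer: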

3 replies | up to 7 years ago

Ala Tarighati · 7 years ago

Let's start from the point where you have created the dataframe pd_df:

print(pd_df)

Output:

                0       1       2
Document1   apple  banana  orange
Document2    None  orange  banana
Document3  banana   apple    None
Document4   apple    None    None

Now create the fruit-name columns (regardless of how many columns pd_df has):

for fruit_name in ['apple', 'orange', 'banana']:
    # 'y' if the fruit appears anywhere in the row, otherwise 'n'
    pd_df.loc[:, fruit_name] = pd_df.apply(lambda x: 'y' if fruit_name in x.values.tolist() else 'n', axis=1)
print(pd_df[['apple', 'orange', 'banana']])

Output:

          apple orange banana
Document1     y      y      y
Document2     n      y      y
Document3     y      n      y
Document4     y      n      n
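
The row-wise apply above can also be written with an element-wise comparison, which leaves pd_df itself unchanged. The following is only a sketch: the sample frame is rebuilt by hand from the printed pd_df, and the names fruit_names and yes_no are made up for illustration.

```python
import pandas as pd

# Sample data rebuilt by hand from the pd_df printed above
# (None marks a missing value).
pd_df = pd.DataFrame(
    [["apple", "banana", "orange"],
     [None, "orange", "banana"],
     ["banana", "apple", None],
     ["apple", None, None]],
    index=["Document1", "Document2", "Document3", "Document4"],
)

fruit_names = ["apple", "orange", "banana"]  # hypothetical keyword list

# For each fruit: compare every cell to the fruit, reduce each row with any(),
# then map True/False to 'y'/'n'. The result is a separate frame.
yes_no = pd.DataFrame(
    {fruit: pd_df.eq(fruit).any(axis=1).map({True: "y", False: "n"})
     for fruit in fruit_names}
)
print(yes_no)
```

With this sample data the result should match the y/n table shown above, and the original columns 0, 1, 2 are not mixed with the new ones.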

wwii · 7 years ago

Build the dictionary the way you need it before using it as input to the DataFrame.

import pandas as pd
import collections, re

d1 = 'apple banana cutie'
d2 = 'foo bar'
d3 = 'kiwi plum cherry'
d4 = 'orange fig tomato'
docs = [d1, d2, d3, d4]

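For each document, determine whether it contains each fruit of interest and collect that information in a dictionary keyed by fruit (each key: value pair will become a column in the DataFrame). Collect the document names in a separate container and use it as the DataFrame index. The position of an item in a dictionary value corresponds to the position of the document in the collection of document names.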

fruits_i_care_about = ['apple', 'kiwi', 'banana', 'plum']
pattern = '|'.join(fruits_i_care_about)
fruit_regex = re.compile(pattern)

d = collections.defaultdict(list)

doc_names = []
for n, doc in enumerate(docs):
    doc_names.append('d{}'.format(n))
    fruits_in_doc = set(fruit_regex.findall(doc))
    print(fruits_in_doc)
    for fruit in fruits_i_care_about:
        d[fruit].append('y' if fruit in fruits_in_doc else 'n')

df = pd.DataFrame(d, index=doc_names)

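In my solution doc is a string; if you read one page at a time, it would correspond to a single page. If possible, consider reading the whole PDF first, so that you only run the regex search once per document.

The dictionary looks like this: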

defaultdict(<class 'list'>,
            {'apple': ['y', 'n', 'n', 'n'],
             'banana': ['y', 'n', 'n', 'n'],
             'kiwi': ['n', 'n', 'y', 'n'],
             'plum': ['n', 'n', 'y', 'n']})

And the resulting DataFrame:

   apple kiwi banana plum
d0     y    n      y    n
d1     n    n      n    n
d2     n    y      n    y
d3     n    n      n    n
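
The suggestion to search each document only once could look roughly like the sketch below. It assumes the page texts have already been extracted into lists of strings (pages_by_doc is a made-up stand-in for the PDF reading step). The helper name build_yes_no_columns is hypothetical. The word boundaries and case-insensitive matching are additions of this sketch, not part of the code above.

```python
import re


def build_yes_no_columns(pages_by_doc, keywords):
    """Return (doc_names, columns); columns maps keyword -> list of 'y'/'n'."""
    # \b keeps 'apple' from matching inside 'pineapple';
    # re.escape guards keywords that contain regex metacharacters.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    doc_names = []
    columns = {k: [] for k in keywords}
    for name, pages in pages_by_doc.items():
        full_text = "\n".join(pages)  # join all pages, then search once
        found = {m.lower() for m in pattern.findall(full_text)}
        doc_names.append(name)
        for k in keywords:
            columns[k].append("y" if k.lower() in found else "n")
    return doc_names, columns


# Hypothetical page texts, one list of pages per document.
pages_by_doc = {
    "d0": ["apple banana", "cutie"],
    "d1": ["foo bar"],
    "d2": ["Kiwi plum", "cherry"],
    "d3": ["orange fig tomato pineapple"],
}
names, columns = build_yes_no_columns(
    pages_by_doc, ["apple", "kiwi", "banana", "plum"]
)
print(names)
print(columns)
```

The returned columns and names can then be passed to pd.DataFrame(columns, index=names), in the same way d and doc_names are used above.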

Josh Friedlander · 7 years ago

import pandas as pd
df = pd.DataFrame.from_dict(d, orient='index')
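
To illustrate what this call produces, here is a small sketch with a made-up dictionary in the shape described in the question (document name mapped to a list of fruit names). The sample data and the names keywords, wide and table are assumptions for illustration only. The second step, which builds the y/n table straight from the dictionary, is one possible follow-up and not part of the answer above.

```python
import pandas as pd

# Hypothetical sample: document name -> list of matched fruit names.
d = {
    "Document1": ["apple", "banana", "orange"],
    "Document2": ["orange", "banana"],
    "Document3": ["banana", "apple"],
    "Document4": ["apple"],
}

# One row per document; shorter lists are padded with missing values.
wide = pd.DataFrame.from_dict(d, orient="index")
print(wide)

# Possible follow-up: build the y/n table directly from the dictionary.
keywords = ["apple", "orange", "banana"]  # hypothetical keyword list
table = pd.DataFrame(
    {k: ["y" if k in values else "n" for values in d.values()]
     for k in keywords},
    index=list(d.keys()),
)
print(table)
```

The wide frame has one row per document and positional columns 0, 1, 2, while table has one y/n column per keyword.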