
Unable to read a PDF with tabula

  • Karn Kumar  · Tech Community  · 6 years ago

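    I get the error below when I try to read a PDF file with tabula (tabula-py).

    Is there a way to read a PDF in Python the way pandas or other libraries read other file formats?

    Any suggestions would be appreciated.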

    >>> from tabula import read_pdf
    >>> df = read_pdf('OpTransactionHistory28-08-2018.pdf')
    Aug 29, 2018 10:40:27 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
    WARNING: New fonts found, font cache will be re-built
    Aug 29, 2018 10:40:27 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
    WARNING: Building on-disk font cache, this may take a while
    Aug 29, 2018 10:40:32 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
    WARNING: Finished building on-disk font cache, found 328 fonts
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/karn/.local/lib/python3.6/site-packages/tabula/wrapper.py", line 119, in read_pdf
        return pd.read_csv(io.BytesIO(output), **pandas_options)
      File "/home/karn/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/home/karn/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 446, in _read
        data = parser.read(nrows)
      File "/home/karn/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
        ret = self._engine.read(nrows)
      File "/home/karn/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
        data = self._reader.read(nrows)
      File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
      File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
      File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
      File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
      File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
    pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 4, saw 9
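
The final line of the traceback suggests that the CSV text handed to pandas has rows of different widths (8 fields expected, 9 found on line 4). A minimal sketch for locating such rows with the standard library follows; the helper name `find_ragged_rows` and the sample text are hypothetical and are not taken from the PDF in question.

```python
import csv
import io


def find_ragged_rows(csv_text):
    """Return (row_number, field_count) for rows whose width differs from the first row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    expected = len(rows[0])
    # Row numbers are 1-based, matching how the pandas error message counts lines.
    return [(i, len(row)) for i, row in enumerate(rows, start=1) if len(row) != expected]


# Hypothetical sample: three columns, with a fourth field on row 4.
sample = "a,b,c\n1,2,3\n4,5,6\n7,8,9,10\n"
print(find_ragged_rows(sample))  # [(4, 4)]
```

A non-empty result means a single `read_csv` call over the whole text is likely to fail in the same way.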
    

    One workaround I have seen is converting the file with pdftotext:

    $ pdftotext OpTransactionHistory28-08-2018.pdf
    

    I just looked at the link @ace provided and found something relevant:

    >>> from tabula import read_pdf
    >>> df = read_pdf('OpTransactionHistory28-08-2018.pdf', pages='all', encoding='ISO-8859-1', multiple_tables=True)
    
    1 answer  |  last active 4 years ago
  •   chezou    6 years ago

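    Because pandas tries to build a single DataFrame from the tabula-java output, a difference in the number of columns between tables often causes an error at the pandas layer. Using multiple_tables=True avoids this limitation, because tabula is then aware of the table boundaries.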
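
As a rough sketch of that suggestion: with `multiple_tables=True`, `read_pdf` returns a list of DataFrames (one per detected table) rather than a single frame, so tables with different column counts no longer collide. The wrapper names `read_all_tables` and `describe_tables` are hypothetical, and the call assumes tabula-py and a Java runtime are installed.

```python
def read_all_tables(pdf_path, encoding="utf-8"):
    """Read every table on every page; returns a list of DataFrames."""
    # Imported lazily so this module still loads where tabula-py is missing.
    from tabula import read_pdf

    return read_pdf(pdf_path, pages="all", encoding=encoding, multiple_tables=True)


def describe_tables(tables):
    """Return the (rows, columns) shape of each extracted table."""
    return [tuple(table.shape) for table in tables]


# Example call, using the file name and encoding from the question:
# tables = read_all_tables("OpTransactionHistory28-08-2018.pdf", encoding="ISO-8859-1")
# print(describe_tables(tables))
```

Comparing the shapes shows which tables differ in width; each one can then be cleaned up separately before any concatenation.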

    I have also noted this related error, although it seems to differ from what I am seeing here: https://github.com/chezou/tabula-py#i-faced-cparsererror-how-can-i-extract-multiple-tables

    It would help if you could share your pandas version.
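
One way to collect that information is sketched below. It assumes Python 3.8 or later for `importlib.metadata`; on the Python 3.6 setup shown in the traceback, `import pandas; print(pandas.__version__)` is the simpler route. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages involved in the traceback.
for name in ("pandas", "tabula-py"):
    print(name, installed_version(name))
```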