
Python crashes when ImageMagick and PyPDF2 are used together

pypdf2 tesseract python-imaging-library imagemagick python

Neo · Tech community · 7 years ago

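I have a PDF file of roughly 20-25 pages. The goal of this tool is to split the PDF into pages (using PyPDF2), save each page to a directory (using PyPDF2), convert each page to an image (using ImageMagick), and then run OCR on it with Tesseract (using PIL and PyOCR) to extract data. The tool will eventually be a GUI built with tkinter, so users can run the same operation many times by clicking a button. During heavy testing I noticed that if the whole process is repeated 6-7 times, the tool/Python script crashes, showing as "Not Responding" on Windows. I have done some debugging, but unfortunately no error is thrown. Memory and CPU usage look fine, so that does not seem to be the problem. By observation I narrowed the problem down: the failure happens before the Tesseract part is reached, when PyPDF2 and ImageMagick run together. I can reproduce the problem by reducing it to the following Python code: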

from wand.image import Image as Img
from PIL import Image as PIL
import pyocr
import pyocr.builders
import io, sys, os 
from PyPDF2 import PdfFileWriter, PdfFileReader


def splitPDF (pdfPath):
    #Read the PDF file that needs to be parsed.
    pdfNumPages =0
    with open(pdfPath, "rb") as pdfFile:
        inputpdf = PdfFileReader(pdfFile)

        #Iterate on every page of the PDF.
        for i in range(inputpdf.numPages):
            #Create the PDF Writer Object
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            with open("tempPdf%s.pdf" %i, "wb") as outputStream:
                output.write(outputStream)

        #Get the number of pages that have been split.
        pdfNumPages = inputpdf.numPages

    return pdfNumPages

pdfPath = "Test.pdf"
for run in range(1,20):
    print ("Run %s\n--------" %run)
    #Split the PDF into Pages & Get PDF number of pages.
    pdfNumPages = splitPDF (pdfPath)
    print(pdfNumPages)
    #Use a separate loop variable so the run counter is not shadowed.
    for page in range(pdfNumPages):
        #Convert the split pdf page to image to run tesseract on it.
        with Img(filename="tempPdf%s.pdf" %page, resolution=300) as pdfImg:
            print("Processing Page %s" %page)

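I have used with statements to handle opening and closing the files properly, so there should be no memory leak. I tried running the splitting part and the image conversion part separately, and each works fine on its own. However, when the code is combined, it fails after 5-6 iterations. I used try/except blocks, but no error was caught. I am also using the latest versions of all the libraries. Any help or guidance is appreciated.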
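When a hang or native crash raises no Python exception, try/except has nothing to catch. One option for getting more information is the standard-library faulthandler module. The following is only a minimal sketch: the helper names watch_for_hang and stop_watching and the 60-second timeout are illustrative assumptions, not part of the original script.

```python
import faulthandler
import sys


def watch_for_hang(timeout_seconds=60, stream=None):
    """Report where the script is stuck if it crashes natively or stalls.

    timeout_seconds is an illustrative value, not one taken from the post.
    stream must be a real file object (it needs a file descriptor).
    """
    stream = stream if stream is not None else sys.stderr
    # Write a Python traceback if the process dies from a fatal native error.
    faulthandler.enable(file=stream)
    # If the script is still running after timeout_seconds, dump the traceback
    # of every thread, and keep doing so at that interval until cancelled.
    faulthandler.dump_traceback_later(timeout_seconds, repeat=True, file=stream)


def stop_watching():
    """Cancel the pending traceback dump and turn the fault handler off."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()
```

Calling watch_for_hang() before the outer loop and stop_watching() after it would print the stack of the stalled call to stderr instead of leaving a silent "Not Responding" window.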

Thank you.

1 reply | latest 7 years ago

Neo 7 years ago

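For future reference: the problem was caused by the 32-bit build of ImageMagick, as pointed out in one of the comments (thanks, emcconville). Uninstalling the 32-bit versions of Python and ImageMagick and installing the 64-bit versions of both fixed the issue. Hope this helps.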
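To check which build of Python is actually running, a small standard-library snippet is enough. This is a sketch only: the function names interpreter_bits and describe_interpreter are made up for illustration, and it reports the interpreter's architecture, not ImageMagick's. Since Wand loads the ImageMagick libraries through ctypes, the installed ImageMagick build has to match whatever this reports.

```python
import platform
import struct


def interpreter_bits():
    """Return the pointer width of the running Python interpreter (32 or 64)."""
    # struct.calcsize("P") is the size in bytes of a native pointer.
    return struct.calcsize("P") * 8


def describe_interpreter():
    """Build a one-line summary of the interpreter version and architecture."""
    return "Python %s, %d-bit, machine %s" % (
        platform.python_version(),
        interpreter_bits(),
        platform.machine(),
    )
```

Printing describe_interpreter() at startup makes it easy to confirm that the script is running under the 64-bit interpreter after reinstalling.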
