How to read Arabic text from a PDF using a Python script

arabic character-encoding utf-8 pdf python

Rany Fahed · Tech Community · 7 years ago

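I have a Python script that reads PDF files and converts them to text files.

The problem appears when I try to read Arabic text from a PDF file. I know the error is in the encoding process, but I don't know how to fix it.

The script converts the Arabic PDF file, but the resulting text file is empty, and this error is displayed: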

Traceback (most recent call last):
  File "C:\Users\test\Downloads\pdf txt\text maker.py", line 68, in <module>
    f.write(content)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 50: ordinal not in range(128)

Code:

import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime

def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path    

print "\n"

folder = check_path("Provide absolute path for the folder: ")

list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)

            list.append(t)

m=len(list)
print (m)
i=0
while i<=m-1:

    path=list[i]
    print(path)
    head,tail=os.path.split(path)
    var="\\"

    tail=tail.replace(".pdf",".txt")
    name=head+var+tail

    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
            # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf  -> txt "
    f=open(name,'w')
    content.encode('utf-8')
    f.write(content)
    f.close
    i=i+1

2 answers

Mark Tolonen 7 years ago

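You have a few problems:

content.encode('utf-8') does nothing on its own. Its return value is the encoded content, but you have to assign that value to a variable. Better yet, open the file with an encoding and write Unicode strings to it. content appears to be Unicode data.

Example (works in both Python 2 and Python 3):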

 import io
 f = io.open(name,'w',encoding='utf8')
 f.write(content)
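
To see why the bare content.encode('utf-8') call in the question has no effect, here is a minimal Python 3 sketch. The sample string is an assumption made up for illustration; it is not taken from the asker's PDF.

```python
# Strings are immutable: encode() returns a new bytes object
# and leaves the original string untouched.
# Hypothetical sample: the Arabic word "marhaba" followed by the (c) sign.
content = u"\u0645\u0631\u062d\u0628\u0627 \xa9"

content.encode("utf-8")            # result is discarded; content is unchanged
still_text = isinstance(content, str)

encoded = content.encode("utf-8")  # keep the return value to get the bytes
round_trip = encoded.decode("utf-8")

# The ASCII codec cannot represent these characters, which is the
# kind of failure reported in the traceback.
try:
    content.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

In Python 2, writing a unicode object to a file opened with plain open(name, 'w') triggers that implicit ASCII encoding, which is why opening the file with an explicit encoding avoids the error.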

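If the file is not closed properly, you may see no content because the file is never flushed to disk. You have f.close, not f.close(). It is better to use with, which ensures the file is closed when the block exits.

Example: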

import io
with io.open(name,'w',encoding='utf8') as f:
    f.write(content)

In Python 3 you don't need to import and use io.open, although it still works; the built-in open is equivalent. Python 2 needs the io.open form.
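
The two points above (write through an encoding-aware file object, and make sure the file is actually closed) can be sketched together as follows. The helper name write_text, the sample string, and the temporary file are assumptions for illustration and are not part of the original script.

```python
import io
import os
import tempfile


def write_text(path, content):
    """Write a Unicode string to path as UTF-8; the with block closes the file."""
    with io.open(path, "w", encoding="utf8") as f:
        f.write(content)


# Hypothetical sample standing in for text extracted from a PDF.
sample = u"\u0645\u0631\u062d\u0628\u0627 \xa9\n"

tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, "example.txt")
write_text(out_path, sample)

# Reading the file back with the same encoding returns the original text.
with io.open(out_path, encoding="utf8") as f:
    read_back = f.read()

# f.close without parentheses only names the method; it does not call it.
g = io.open(out_path, encoding="utf8")
g.close
closed_without_call = g.closed   # False: the file is still open
g.close()
closed_after_call = g.closed     # True
```

In the question's loop, the same helper could replace the open / encode / write / close lines, so that each output file is flushed before the next PDF is processed.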

Ameen Reda 3 years ago

You can use another library called pdfplumber instead of pypdf or PyPDF2:

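import arabic_reshaper
import pdfplumber
from bidi.algorithm import get_display

with pdfplumber.open(r'example.pdf') as pdf:
    my_page = pdf.pages[10]  # index 10 is the 11th page
    thepages = my_page.extract_text()
    # Join the Arabic letters into their contextual (connected) forms
    reshaped_text = arabic_reshaper.reshape(thepages)
    # Reorder the text for right-to-left display
    bidi_text = get_display(reshaped_text)
    print(bidi_text)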
