
Convert unicode sequences to a string in Python 3, but allow paths in the string

unicode python-3.x python

Bram Vanroy · 7 years ago

There is at least one related question on SO that proved useful when trying to decode unicode sequences.

'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek.

A string like this needs to be converted to actual characters:

'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek.

This can be done as follows:

# Raw string, so the backslash escapes stay literal, as they would in text read from a file.
s = r"'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek."
s = s.encode('utf-8').decode('unicode-escape')

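(At least, this works when s is a string taken from a utf-8 encoded text file. I can't seem to get it to work on an online service like REPL.it, because the output is encoded/decoded differently there.)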

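In most cases this works fine. However, when a directory path appears in the input string (which is often the case for the technical documents in my data set), a UnicodeDecodeError is raised.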

Given the following data in unicode.txt:

'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).

With this bytestring representation:

b"'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\\u0115ch \\u010camek, Financial Director and Director of Controlling.\r\nVoor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\\udfs\\math.dll (op Windows))."

The following script fails when decoding the second line of the input file:

with open('unicode.txt', 'r', encoding='utf-8') as fin, open('unicode-out.txt', 'w', encoding='utf-8') as fout:
    lines = ''.join(fin.readlines())
    lines = lines.encode('utf-8').decode('unicode-escape')

    fout.write(lines)

With this traceback:

Traceback (most recent call last):
  File "C:/Python/files/fast_aligning/unicode-encoding.py", line 3, in <module>
    lines = lines.encode('utf-8').decode('unicode-escape')
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 275-278: truncated \uXXXX escape

Process finished with exit code 1

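How can I make sure that the first sentence is still 'translated' correctly, as shown before, while the second one stays untouched? The expected output for the two given lines would look as follows, where the first line has changed and the second has not.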

'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).

3 answers | 7 years ago

jfs 7 years ago

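The input is ambiguous. There is no correct answer in the general case. We can use heuristics that produce output that looks right most of the time. For example, we could use a rule such as "if a \uxxxx sequence (6 characters) is part of an existing path, don't interpret it as a Unicode escape", and the same for \Uxxxxxxxx sequences (10 characters). An input similar to the one in the question, b"c:\\U0001f60f\\math.dll", is then interpreted differently depending on whether the file c:\U0001f60f\math.dll actually exists on disk: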

#!/usr/bin/env python3
import re
from pathlib import Path


def decode_unicode_escape_if_path_doesnt_exist(m):
    path = m.group(0)
    return path if Path(path).exists() else replace_unicode_escapes(path)


def replace_unicode_escapes(text):
    return re.sub(
        fr"{unicode_escape}+",
        lambda m: m.group(0).encode("latin-1").decode("raw-unicode-escape"),
        text,
    )


input_text = Path('broken.txt').read_text(encoding='ascii')
hex = "[0-9a-fA-F]"
unicode_escape = fr"(?:\\u{hex}{{4}}|\\U{hex}{{8}})"
drive_letter = "[a-zA-Z]"
print(
    re.sub(
        fr"{drive_letter}:\S*{unicode_escape}\S*",
        decode_unicode_escape_if_path_doesnt_exist,
        input_text,
    )
)

Specify the actual encoding of your broken.txt file in read_text() if the encoded text contains non-ASCII characters.

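You could complicate the code by trying to substitute one possible Unicode sequence at a time. In that case the number of replacements grows exponentially with the number of candidate sequences: if a path contains 10 possible Unicode escape sequences, there are 2**10 decoded paths to try.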
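The exponential growth can be made concrete with a small sketch. It is not part of the answer's code: candidate_decodings and ESCAPE_RE are hypothetical names, and the sketch only handles four-digit \uXXXX escapes. Each escape is either kept literally or decoded, so n escapes yield 2**n variants.

```python
import itertools
import re

# Hypothetical helper names; only four-digit escapes are considered here.
ESCAPE_RE = re.compile(r'\\u[0-9a-fA-F]{4}')


def candidate_decodings(path):
    """Yield every variant of path with each escape either kept or decoded."""
    parts = ESCAPE_RE.split(path)       # literal text between the escapes
    escapes = ESCAPE_RE.findall(path)   # the escape sequences themselves
    # One boolean per escape: False keeps it literal, True decodes it.
    for choices in itertools.product((False, True), repeat=len(escapes)):
        out = [parts[0]]
        for esc, decode, tail in zip(escapes, choices, parts[1:]):
            out.append(chr(int(esc[2:], 16)) if decode else esc)
            out.append(tail)
        yield ''.join(out)


variants = list(candidate_decodings(r"c:\u0041\u0042"))
print(len(variants))  # 2 escapes give 2**2 == 4 variants
```

A caller could then test each variant with Path(variant).exists() and keep the first one that is found on disk, at the cost of up to 2**n file system checks per path.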

AKX Bryan Oakley 7 years ago

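The raw_unicode_escape codec in ignore mode seems to do the trick. I'm inlining the input here as a raw bytes long string, which by my reasoning should be equivalent to reading it from a binary file.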

input = br"""
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).
"""

print(input.decode('raw_unicode_escape', 'ignore'))

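'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:s\math.dll (op Windows)).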

Note that the \udf in d:\udfs gets mangled: the codec starts reading a \uXXXX sequence but gives up at the s.
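If losing those characters is a problem, one possible variation (not part of this answer) is to register a custom error handler that puts the rejected bytes back instead of dropping them. The handler name keep_undecodable is made up for this sketch; codecs.register_error is the standard library hook it relies on.

```python
import codecs


def keep_undecodable(exc):
    """Error handler that keeps the bytes the codec rejected."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # Re-insert the rejected span unchanged and resume right after it.
    kept = exc.object[exc.start:exc.end].decode('latin-1')
    return kept, exc.end


# 'keep_undecodable' is a hypothetical name chosen for this example.
codecs.register_error('keep_undecodable', keep_undecodable)

data = rb"says Vojt\u0115ch \u010camek; see d:\udfs\math.dll"
print(data.decode('raw_unicode_escape', 'keep_undecodable'))
```

This only rescues escapes that fail to decode. A path segment that happens to look like a complete escape, such as c:\udfff, would still be decoded.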

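An alternative (likely slower) is to use a regexp to find the valid Unicode sequences in the decoded data. This assumes that .decode()ing the full input string as UTF-8 is possible, though. (The .encode().decode() round trip is needed because str objects can't be decoded, only bytes; one could also use chr(int(m.group(0)[2:], 16)).)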

import re
# Only complete \uXXXX escapes (four lowercase hex digits) are matched and decoded.
escape_re = re.compile(r'\\u[0-9a-f]{4}')
output = escape_re.sub(lambda m: m.group(0).encode().decode('unicode_escape'), input.decode())

outputs

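'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).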

Since \udfs does not have 4 hex characters, the d:\udfs path is spared here.
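The chr(int(...)) variant mentioned above could look roughly like this. It is a sketch only: decode_complete_escapes and ESCAPE_RE are names invented here, and the pattern also accepts uppercase hex digits, which differs slightly from the regexp in this answer.

```python
import re

# Hypothetical names; the capture group holds the four hex digits.
ESCAPE_RE = re.compile(r'\\u([0-9a-fA-F]{4})')


def decode_complete_escapes(text):
    """Replace each complete \\uXXXX escape with the character it names."""
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


sample = r"says Vojt\u0115ch \u010camek; see d:\udfs\math.dll"
print(decode_complete_escapes(sample))
```

Here \udfs is left alone because s is not a hex digit. A path segment with four hex digits after \u would still be converted, so this shares the ambiguity discussed in the first answer.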

Bram Vanroy 7 years ago

I had already written this code when AKX posted his answer. I still think it applies.

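The idea is to capture unicode sequence candidates with a regular expression and to try to exclude paths, i.e. parts that are preceded by a letter and a colon (e.g. c:\udfff). If decoding fails, we return the original string.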

import re


def unicode_replace(s):
    # Directory paths in a text are seen as unicode sequences but will fail to decode, e.g. d:\udfs\math.dll
    # In case of such failure, we'll pass on these sentences - we don't try to decode them but leave them
    # as-is. Note that this may leave some unicode sequences alive in your text.
    def repl(match):
        match = match.group()
        try:
            return match.encode('utf-8').decode('unicode-escape')
        except UnicodeDecodeError:
            return match

    return re.sub(r'(?<!\b[a-zA-Z]:)(\\u[0-9A-Fa-f]{4})', repl, s)


# unicode_replace must be defined before this block runs, so it comes last.
with open('unicode.txt', 'r', encoding='utf-8') as fin, open('unicode-out.txt', 'w', encoding='utf-8') as fout:
    lines = ''.join(fin.readlines())
    lines = lines.strip()
    lines = unicode_replace(lines)
    fout.write(lines)
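To see what the lookbehind does and does not protect, here is a small check on inline strings. The function is repeated from this answer so the snippet runs on its own; the sample strings are made up for illustration.

```python
import re


def unicode_replace(s):
    def repl(match):
        text = match.group()
        try:
            return text.encode('utf-8').decode('unicode-escape')
        except UnicodeDecodeError:
            return text

    # Skip a \uXXXX candidate that directly follows a drive letter and colon.
    return re.sub(r'(?<!\b[a-zA-Z]:)(\\u[0-9A-Fa-f]{4})', repl, s)


# The escape in the name is decoded; the one right after "c:" is left alone,
# and \udfs never matches because s is not a hex digit.
print(unicode_replace(r"Vojt\u0115ch keeps files in c:\u0041bc and d:\udfs"))

# Deeper in a path the lookbehind no longer applies, so this becomes c:\dirAbc.
print(unicode_replace(r"c:\dir\u0041bc"))
```

The second call shows the limit of the heuristic: only a candidate immediately after the drive letter is excluded, so escapes-like segments further down a path are still converted.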