There is at least one related question on SO that proved useful when trying to decode Unicode escape sequences.
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek.
A string like this needs to be converted to the actual characters:
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek.
This can be done as follows:
# Raw string so the \uXXXX sequences stay literal, as they would when read from a file
s = r"'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek."
s = s.encode('utf-8').decode('unicode-escape')
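(At least, that is how it works when `s` is a line read from a `utf-8` encoded text file. I can't seem to get this to work on an online service such as REPL.it, because the output is encoded/decoded differently there.)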
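A possible explanation for the difference between file input and a pasted literal (an assumption, not something checked against REPL.it itself): in a regular, non-raw Python string literal the `\u0115` escapes are already interpreted by the parser, so the encode/decode round trip then reads the resulting UTF-8 bytes as Latin-1. A small sketch of both cases, with illustrative variable names:

```python
# Text as it arrives from a file: the backslash-u sequences are literal characters.
from_file = r"Vojt\u0115ch \u010camek"
decoded = from_file.encode('utf-8').decode('unicode-escape')
print(decoded == "Vojt\u0115ch \u010camek")  # True

# A regular literal: Python has already turned the escapes into real characters,
# so the round trip mis-decodes their UTF-8 bytes as Latin-1.
already_decoded = "Vojt\u0115ch \u010camek"
mangled = already_decoded.encode('utf-8').decode('unicode-escape')
print(mangled == already_decoded)  # False
```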
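In most cases this works fine. However, when a directory path appears in the input string (which is often the case for the technical documentation in my data set), a `UnicodeDecodeError` is raised.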
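The failure can be reproduced in isolation. In the sketch below, `windows_path` is just an illustrative name holding the path from the sample data: the `\u` in `d:\udfs` is not followed by four hexadecimal digits, so the codec rejects it.

```python
# "\u" followed by "dfs\" is not a well-formed \uXXXX escape ("s" is not a hex digit).
windows_path = r"d:\udfs\math.dll"

raised = False
try:
    windows_path.encode('utf-8').decode('unicode-escape')
except UnicodeDecodeError as exc:
    raised = True
    print(type(exc).__name__)  # UnicodeDecodeError
```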
Given the following data in `unicode.txt`:
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).
Represented as a bytestring:
b"'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\\u0115ch \\u010camek, Financial Director and Director of Controlling.\r\nVoor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\\udfs\\math.dll (op Windows))."
The following script fails when decoding the second line of the input file:
with open('unicode.txt', 'r', encoding='utf-8') as fin, open('unicode-out.txt', 'w', encoding='utf-8') as fout:
    lines = ''.join(fin.readlines())
    lines = lines.encode('utf-8').decode('unicode-escape')
    fout.write(lines)
With the traceback:
Traceback (most recent call last):
  File "C:/Python/files/fast_aligning/unicode-encoding.py", line 3, in <module>
    lines = lines.encode('utf-8').decode('unicode-escape')
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 275-278: truncated \uXXXX escape
Process finished with exit code 1
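How can I make sure that the first sentence is still 'translated' correctly, as shown before, while the second one is left untouched? The expected output for the two given lines would therefore be as follows, with the first line changed and the second unchanged.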
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).
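One direction that might give this output, sketched here under assumptions rather than offered as a definitive fix: replace only well-formed `\uXXXX` sequences (a backslash, `u`, and exactly four hexadecimal digits) with a regular expression and leave every other backslash alone. The names `UNICODE_ESCAPE` and `decode_unicode_escapes` are hypothetical helpers, not part of any library.

```python
import re

# A backslash, "u", and exactly four hex digits: a well-formed escape.
UNICODE_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_unicode_escapes(text):
    # Convert each matched escape to its character; anything else is untouched.
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

line1 = r"says Vojt\u0115ch \u010camek."
line2 = r"d:\udfs\math.dll (op Windows))."

print(decode_unicode_escapes(line1) == "says Vojt\u0115ch \u010camek.")  # True
print(decode_unicode_escapes(line2) == line2)  # True
```

This is a heuristic: a path component that happens to start with `u` plus four hex digits (for example a directory named `udead`) would still be converted, and escaped surrogate pairs are not combined into a single character.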