Normal string with "xe2x80x93" ("–") characters

unicode-escapes bytestring python-3.x string python

Lukas · 6 years ago

I have a problem with a string in Python 3 that contains "xe2x80x93" because it comes from a web parser. I want to convert it into the proper character.

content = str(urllib.request.urlopen(site, timeout=10).read())
g = content.split('<h1 itemprop="name"')[1].split('</span></h1>')[0].split('<span>')[1].replace("\\", "")

print(type(g))  # --> string
print(g)  # --> "Flash xe2x80x93 der rote Blitz"

print(g.encode('latin-1').decode('utf-8'))  # --> AttributeError: 'str' object has no attribute 'decode'
print(repr(g.decode('unicode-escape')))  # --> AttributeError: 'str' object has no attribute 'decode'
print(g.encode('ascii','replace'))  # --> b'Flash xe2x80x93 der rote Blitz'
print(bytes(g, "utf-8").decode())  # --> "Flash xe2x80x93 der rote Blitz"
print(bytes(g, "utf-8").decode("unicode_escape"))  # --> "Flash Ã¢ der rote Blitz"

How does this work? I'm out of ideas.

1 answer | active until 4 years ago

jedwards · 6 years ago

You're on the right track with decode.

By wrapping the output in str(...):

content = str(urllib.request.urlopen(site, timeout=10).read())

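you are either converting a bytes object into a string (which would show up as a leading b' and a trailing ' in content), or, if it has already been decoded as ISO-8859-1, doing nothing.

In either case, don't do this: remove the wrapping str call.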
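To see why the str(...) wrapper is the culprit, here is a minimal sketch that needs no network access. The bytes literal is an assumption standing in for what read() returns for the page in the question.

```python
# Stand-in for the raw bytes returned by read(); the page text is assumed.
raw = "Flash \u2013 der rote Blitz".encode("utf-8")

# str() on a bytes object returns its repr, including the b'...' wrapper
# and the escape sequences spelled out as literal backslash characters.
as_repr = str(raw)
print(as_repr)

# Removing the backslashes (as the question does) leaves the text "xe2x80x93".
stripped = as_repr.replace("\\", "")
print(stripped)

# Decoding the bytes directly gives the intended text.
decoded = raw.decode("utf-8")
print(decoded)
```

Once the backslashes are gone, the original bytes cannot be recovered by any combination of encode and decode, which is why the fix has to happen before str is called.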

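Now, content will be either a bytes or a str object.

So, if it's a string, it will have been decoded (incorrectly) as ISO-8859-1. You need to encode it back into a bytes object and then decode it properly: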

content = urllib.request.urlopen(site, timeout=10).read()

if isinstance(content, str):
    content = content.encode('iso-8859-1')
content = content.decode('utf8')

After this, the \xe2\x80\x93 byte sequence should be decoded properly as an en dash (–).
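The encode-then-decode round trip can be checked on a fixed byte string. This is only a sketch: the bytes literal and the variable names wrong and fixed are made up for illustration.

```python
# UTF-8 bytes for "Flash – der rote Blitz" (assumed sample data).
raw = b"Flash \xe2\x80\x93 der rote Blitz"

# Decoding as ISO-8859-1 never fails, but maps each byte to one character,
# so the three-byte en dash turns into three wrong characters.
wrong = raw.decode("iso-8859-1")

# ISO-8859-1 is a one-to-one byte mapping, so encoding with it restores
# the original bytes, which can then be decoded as UTF-8.
fixed = wrong.encode("iso-8859-1").decode("utf8")
print(fixed)
```

This only works because ISO-8859-1 maps every byte value to exactly one character; a wrong decode with a lossy error handler (such as 'replace') could not be undone this way.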

Update:

content = urllib.request.urlopen(site, timeout=10).read().decode('utf8')
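If the page might not be UTF-8, one possible variation is to use the charset declared in the Content-Type header and fall back to UTF-8 only when none is given. This goes beyond the answer above; decode_body and fetch_text are hypothetical helper names, and the UTF-8 fallback is an assumption.

```python
import urllib.request
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode a response body, falling back to UTF-8 when no charset is given."""
    return raw.decode(charset or "utf-8")


def fetch_text(site: str) -> str:
    """Hypothetical helper: fetch a page and decode it with its declared charset."""
    with urllib.request.urlopen(site, timeout=10) as response:
        # get_content_charset() returns None when the header names no charset.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

For the page in the question, fetch_text(site) should give the same result as the one-liner above, provided the server either declares UTF-8 or declares nothing.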
