How do I handle UTF-8 encoded strings with BeautifulSoup?

beautifulsoup python

vikingosegundo · Tech Community · 14 years ago

How can I replace the HTML entities in a Unicode string with the proper Unicode characters?

u'&quot;HAUS Kleider&quot; - &Uuml;ber das Bekleiden und Entkleiden, das Verh&Yuml;llen und Veredeln'

to

u'"HAUS-Kleider" - Über das Bekleiden und Entkleiden, das Verhüllen und Veredeln'

Edit
Actually, the entities are wrong. It looks like BeautifulSoup f… it up.

So the question is: how do I handle UTF-8 encoded strings with BeautifulSoup?

from BeautifulSoup import BeautifulSoup

f = open('path_to_file','r')
lines = [i for i in f.readlines()]
soup = BeautifulSoup(''.join(lines))
allArticles = []
for row in rows:
    l =[]
    for r in row.findAll('td'):
        l += [r.string] # here things seem to go wrong
    allArticles+=[l]

Ã -> &Yuml; instead of Ü. But actually, I don't want to change the encoding.

>>> soup.originalEncoding
'utf-8'

But I can't produce a proper Unicode string.
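
A stray "Ã" where an umlaut is expected is the usual sign of UTF-8 bytes having been decoded as Latin-1 somewhere along the way. Whether that is what happened in this file is an assumption. The sketch below is written for Python 3 rather than the Python 2 of the question, and `repair_mojibake` is a hypothetical helper name. It reverses that one specific mix-up and leaves the text alone when the round trip does not apply.

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Returns the input unchanged when the round trip is not possible,
    which is what happens for text that was decoded correctly.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# 'Verh\u00c3\u00bcllen' is the garbled form: the two bytes of UTF-8 'ü'
# shown as the two Latin-1 characters 'Ã' and '¼'.
garbled = 'Verh\u00c3\u00bcllen'
print(repair_mojibake(garbled))

# Correctly decoded text is passed through untouched.
print(repair_mojibake('\u00dcber'))
```

This only treats the symptom. If the file really is UTF-8, decoding the bytes as UTF-8 before any parsing avoids the problem in the first place.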

3 replies | latest 13 years ago

towi 14 years ago

ICU transliterators

Hex/XML-Any

BlueTrance 14 years ago

>>> import htmlentitydefs
>>> htmlentitydefs.entitydefs["quot"]
'"'
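
For readers on Python 3, where `htmlentitydefs` was renamed to `html.entities`, the same lookup can be sketched as follows. The sample string is a shortened copy of the one in the question. `html.unescape` is shown as one possible way to decode a whole string in a single call, not as the approach this reply had in mind.

```python
import html
from html.entities import name2codepoint

# Look up a single named entity, the Python 3 counterpart of
# htmlentitydefs.entitydefs["quot"].
quote_char = chr(name2codepoint['quot'])
print(repr(quote_char))

# Decode every entity in a string at once.
raw = '&quot;HAUS Kleider&quot; - &Uuml;ber das Bekleiden und Entkleiden'
decoded = html.unescape(raw)
print(decoded)
```

Note that this only helps when the entities themselves are correct. It will faithfully decode a wrong entity such as `&Yuml;` to "Ÿ".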

vikingosegundo 14 years ago

rows

from BeautifulSoup import BeautifulSoup

# Read the whole file and hand it to BeautifulSoup in one piece.
with open('path_to_file', 'r') as f:
    soup = BeautifulSoup(f.read())

# Collect the string of every table cell, one list per table row.
rows = soup.findAll('tr')
allArticles = []
for row in rows:
    cells = []
    for cell in row.findAll('td'):
        cells.append(cell.string)
    allArticles.append(cells)
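
The same row and cell collection can be sketched in Python 3 with the bytes decoded explicitly as UTF-8 before parsing, so that no encoding guess is involved. This swaps BeautifulSoup for the standard library's `html.parser` to keep the example self-contained. `CellCollector`, `collect_cells` and the sample markup are hypothetical names and data, not part of the original answer.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &Uuml; into text.
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'td' and self.rows:
            self._in_cell = True
            self.rows[-1].append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def collect_cells(raw_bytes, encoding='utf-8'):
    """Decode the bytes first, then parse the resulting text."""
    parser = CellCollector()
    parser.feed(raw_bytes.decode(encoding))
    parser.close()
    return parser.rows


sample = '<table><tr><td>&Uuml;ber</td><td>Verhüllen</td></tr></table>'
print(collect_cells(sample.encode('utf-8')))
```

With a real file, the bytes would come from `open('path_to_file', 'rb').read()`. The point of the sketch is only that decoding happens once, explicitly, before any parsing.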