
BeautifulSoup encoding error: character maps to <undefined> (Python)

  •  0
  • David J.  · Tech Community  · 7 years ago

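    I have written a script that is supposed to retrieve HTML pages from a site and update their contents. The following function looks for a certain file on my system, then tries to open and edit it: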

    def update_sn(files_to_update, sn, table, title):
        paths = files_to_update['files']
        print('updating the sn')
        try:
            sn_htm = [s for s in paths if re.search('^((?!(Default|Notes|Latest_Addings)).)*htm$', s)][0]
            notes_htm = [s for s in paths if re.search('_Notes\.htm$', s)][0]
    
        except Exception:
            print('no sns were found')
            pass
    
        new_path_name = new_path(sn_htm, files_to_update['predecessor'], files_to_update['original'])
        new_sn_number = sn
    
        htm_text = open(sn_htm, 'rb').read().decode('cp1252')
        content = re.findall(r'(<table>.*?<\/table>.*)(?:<\/html>)', htm_text, re.I | re.S) 
        minus_content = htm_text.replace(content[0], '')
        table_soup = BeautifulSoup(table, 'html.parser')
        new_soup = BeautifulSoup(minus_content, 'html.parser')
        head_title = new_soup.title.string.replace_with(new_sn_number)
        new_soup.link.insert_after(table_soup.div.next)
    
        with open(new_path_name, "w+") as file:
            result = str(new_soup)
            try:
                file.write(result)
            except Exception:
                print('Met exception.  Changing encoding to cp1252')
                try:
                    file.write(result('cp1252'))
                except Exception:
                    print('cp1252 did\'nt work.  Changing encoding to utf-8')
                    file.write(result.encode('utf8'))
                    try:
                        print('utf8 did\'nt work.  Changing encoding to utf-16')
                        file.write(result.encode('utf16'))
                    except Exception:
                        pass

    Running it produces the following output:
    

    updating the sn
    Met exception.  Changing encoding to cp1252
    cp1252 did'nt work.  Changing encoding to utf-8
    Traceback (most recent call last):
      File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 145, in update_sn
        file.write(result)
      File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 4006-4007: character maps to <undefined>
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
        file.write(result('cp1252'))
    TypeError: 'str' object is not callable
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "scraper.py", line 79, in <module>
        get_latest(entries[0], int(num), entries[1])
      File "scraper.py", line 56, in get_latest
        update_files.update_sn(files_to_update, data['number'], data['table'], data['title'])
      File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 152, in update_sn
        file.write(result.encode('utf8'))
    TypeError: write() argument must be str, not bytes
    

    Can anyone give me some tips on how to better handle HTML data that may have inconsistent encodings?

    2 answers  |  7 years ago
        1
  •  1
  •   t.m.adam    7 years ago

    In your code you open the file in text mode but then try to write bytes ( str.encode returns bytes), so Python raises an exception:

    TypeError: write() argument must be str, not bytes
    

    If you want to write bytes, you should open the file in binary mode.
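
    A minimal sketch of the difference between the two modes, using a temporary file. The file name demo.html and the sample text are illustrative assumptions, not part of the original script:

```python
import os
import tempfile

text = 'Somé téxt'
# Hypothetical output path inside a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), 'demo.html')

# Text mode: write() accepts str only, so passing bytes raises TypeError.
text_mode_error = None
with open(path, 'w', encoding='utf-8') as f:
    try:
        f.write(text.encode('utf-8'))
    except TypeError as exc:
        text_mode_error = str(exc)

# Binary mode: write() accepts bytes, so the text is encoded explicitly.
with open(path, 'wb') as f:
    f.write(text.encode('utf-8'))

# Read the raw bytes back to confirm what landed on disk.
with open(path, 'rb') as f:
    written = f.read()
```

    Here text_mode_error captures the same kind of TypeError shown in the question, while the binary-mode write succeeds.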

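    BeautifulSoup detects the document's encoding (if it is given bytes) and converts it to a string automatically. We can read the detected encoding from .original_encoding and use it to encode the content when writing to the file. For example,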

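    from bs4 import BeautifulSoup

    # Passing bytes lets BeautifulSoup detect the encoding itself.
    soup = BeautifulSoup(b'<tag>ascii characters</tag>', 'html.parser')
    data = soup.tag.text
    # Fall back to UTF-8 if no encoding was detected.
    encoding = soup.original_encoding or 'utf-8'
    print(encoding)
    # ascii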
    
    with open('my.file', 'wb+') as file:
        file.write(data.encode(encoding))
    

    For this to work, you should pass the HTML to BeautifulSoup as bytes, so don't decode the response content.

    If BeautifulSoup fails to detect the correct encoding for some reason, you can try a list of possible encodings, as you did in your code.

    data = 'Somé téxt'
    encodings = ['ascii', 'utf-8', 'cp1252']
    
    with open('my.file', 'wb+') as file:
        for encoding in encodings:
            try:
                file.write(data.encode(encoding))
                break
            except UnicodeEncodeError:
                print(encoding + ' failed.')
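
    The same fallback idea can be wrapped in a small helper that also reports which encoding succeeded. This is only a sketch: the name encode_with_fallback and the default candidate list are hypothetical, not something from the original script or from BeautifulSoup.

```python
def encode_with_fallback(text, encodings=('cp1252', 'utf-8')):
    """Return (encoded_bytes, encoding_used) for the first encoding that works.

    Hypothetical helper: candidates are tried in order, so the most
    restrictive encodings should come first and UTF-8 last.
    """
    for encoding in encodings:
        try:
            return text.encode(encoding), encoding
        except UnicodeEncodeError:
            continue
    raise ValueError('none of the candidate encodings could encode the text')


# cp1252 can represent 'é', so the first candidate is used.
latin_bytes, latin_encoding = encode_with_fallback('Somé téxt')

# cp1252 has no mapping for the CJK character, so UTF-8 is used instead.
mixed_bytes, mixed_encoding = encode_with_fallback('Somé téxt \u4e2d')
```

    Knowing which encoding was used matters because the bytes should then be written to a file opened in binary mode, and any charset declared in the HTML should match it.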
    

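    Alternatively, you can open the file in text mode and pass the encoding to open (instead of encoding the content yourself), but note that this option is not available in Python 2, where the built-in open has no encoding parameter.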
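
    A rough sketch of the text-mode approach in Python 3. The file names and sample markup are made up for illustration. The second variant goes beyond what the answer describes: it assumes the output must stay in cp1252 and uses the standard xmlcharrefreplace error handler, which swaps unencodable characters for numeric character references that HTML parsers understand.

```python
import os
import tempfile

# The snowman (U+2603) cannot be represented in cp1252.
html = '<p>Somé téxt \u2603</p>'
out_dir = tempfile.mkdtemp()

# Option 1: let open() do the encoding, using UTF-8, which covers all of Unicode.
utf8_path = os.path.join(out_dir, 'utf8.html')
with open(utf8_path, 'w', encoding='utf-8') as f:
    f.write(html)

# Option 2 (assumption: the file must remain cp1252): unencodable characters
# are written as numeric character references instead of raising an error.
cp1252_path = os.path.join(out_dir, 'cp1252.html')
with open(cp1252_path, 'w', encoding='cp1252', errors='xmlcharrefreplace') as f:
    f.write(html)

# Read the raw bytes back to inspect the result.
with open(utf8_path, 'rb') as f:
    utf8_bytes = f.read()
with open(cp1252_path, 'rb') as f:
    cp1252_bytes = f.read()
```

    With either option write() receives a str, so no manual call to encode is needed.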

        2
  •  1
  •   Toto Lele    7 years ago

    Just out of curiosity, is this line a typo? file.write(result('cp1252')) It seems to be missing the .encode method.

    Traceback (most recent call last):
      File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
        file.write(result('cp1252'))
    TypeError: 'str' object is not callable
    

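    That "'str' object is not callable" error goes away if you change the line to: file.write(result.encode('cp1252')) (the file then has to be opened in binary mode, because encode returns bytes).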

    I once had a similar encoding problem when writing to a file, and worked out my own solution with the help of the following thread:

    Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence.

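    My problem was solved by switching the parser from html.parser to html5lib. The cause was malformed HTML tags, which the html5lib parser repaired. For reference, this is the documentation for each parser supported by BeautifulSoup.

    Hope this helps.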