
How can I make Python friendlier for reading and writing Unicode text files?

  •  1
  • bogdan  · Tech Community  · 14 years ago

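    I found that even modern Python versions (such as 3.x) do not detect the BOM in text files. Is there a module that adds this by replacing the open() and codecs.open() functions used for reading and writing text files?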
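
    To make the problem concrete, here is a minimal sketch, assuming Python 3 and a throwaway temporary file (the file name is arbitrary). It shows that the plain 'utf-8' codec leaves the BOM in the decoded text as a leading U+FEFF character, while the standard 'utf-8-sig' codec strips it. Neither codec detects UTF-16 or UTF-32 files on its own.

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file that starts with a BOM (the file name is arbitrary).
path = os.path.join(tempfile.mkdtemp(), "bom_demo.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "h\u00e9llo".encode("utf-8"))

# The plain 'utf-8' codec keeps the BOM as a leading U+FEFF character.
with open(path, "r", encoding="utf-8") as f:
    plain = f.read()

# The 'utf-8-sig' codec strips a leading BOM if one is present.
with open(path, "r", encoding="utf-8-sig") as f:
    stripped = f.read()

print(plain.startswith("\ufeff"))   # True
print(stripped == "h\u00e9llo")     # True
```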

    3 answers  |  most recent 11 years ago
        1
  •  2
  •   Alex Martelli    14 years ago

    The solution suggested here still works well for me (below is a modified version of that code, still for Python 2 rather than Python 3, with a usage example):

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    
    import codecs, logging, sys
    logging.basicConfig(level=logging.INFO)
    bomdict = {
        codecs.BOM_UTF8 : 'UTF8',
        codecs.BOM_UTF16_BE : 'UTF-16BE',
        codecs.BOM_UTF16_LE : 'UTF-16LE' }
    
    def read_unicode(filename):
      # binary mode, so the BOM bytes arrive unmodified on every platform
      with open(filename, 'rb') as f:
          the_text = f.read()
      for bom, encoding in bomdict.items():
          if the_text.startswith(bom):
              logging.info('BOM found, using %s', encoding)
              the_text = the_text[len(bom):]
              break
      else:
          logging.info('No BOM, using utf8')
          encoding = 'UTF8'
      return the_text.decode(encoding)
    
    f = open('x.txt', 'wb')
    f.write(codecs.BOM_UTF16_LE)
    f.write(u'zeé fóo!'.encode('UTF-16LE'))
    f.close()
    
    print read_unicode('x.txt')
    
        2
  •  1
  •   sorin    14 years ago

    Here is a partially working replacement for file.open(). It does work on Python 2.6, but on Python 3.1 I get an error:

    Traceback (most recent call last):
      File "unicode-file.py", line 15, in <module>
        old_file_write = file.write
    NameError: name 'file' is not defined
    

    Unicode-friendly file.open() replacement

    #!/usr/bin/python
    import codecs, sys, types
    
    # we save the file function handler because we want to override it
    open_old = open
    
    # on Python 3.x we overwrite write method in order to make it accept bytes in addition to str
    old_file_write = file.write
    
    class file():
        def write(self, d):
            if isinstance(d, types.bytes):
                self.buffer.write(d)
            else:
                old_file_write(d)
    
    def open(filename, mode="r", bufsize=-1):
        # we read the first 4 bytes just to be sure we use the right encoding
        if mode == "r":  # we only detect the encoding when reading text
            f = open_old(filename, "rb")
            aBuf = f.read(4)
            f.close()  # the probe handle is no longer needed
            # UTF-32 is tested before UTF-16 because their LE BOMs share a prefix
            if aBuf[:3] == b'\xEF\xBB\xBF':
                f = codecs.open(filename, mode, "utf_8")
                f.seek(3, 0)
            elif aBuf[:4] == b'\xFF\xFE\x00\x00':
                f = codecs.open(filename, mode, "utf_32_le")
                f.seek(4, 0)
            elif aBuf[:4] == b'\x00\x00\xFE\xFF':
                f = codecs.open(filename, mode, "utf_32_be")
                f.seek(4, 0)
            elif aBuf[:2] == b'\xFF\xFE':
                f = codecs.open(filename, mode, "utf_16_le")
                f.seek(2, 0)
            elif aBuf[:2] == b'\xFE\xFF':
                f = codecs.open(filename, mode, "utf_16_be")
                f.seek(2, 0)
            else:  # we assume that if there is no BOM, the encoding is UTF-8
                f = codecs.open(filename, mode, "utf-8")
                f.seek(0)
            return f
        else:
            return open_old(filename, mode, bufsize)
    
    # now use the open(file, "r")
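
    For comparison, here is a rough sketch of the same idea without touching the built-in file type. The names sniff_encoding and open_unicode are hypothetical, not part of any library. Instead of seeking past the BOM by hand, it picks the standard 'utf-8-sig', 'utf-16' and 'utf-32' codecs, which consume a leading BOM themselves. Like the code above, it assumes UTF-8 when no BOM is found.

```python
import codecs
import io
import os
import tempfile

# Longest BOMs first: BOM_UTF32_LE starts with the same two bytes as BOM_UTF16_LE.
_BOM_ENCODINGS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(filename, default="utf-8"):
    """Return a codec name based on the file's BOM, or `default` if none is found."""
    with io.open(filename, "rb") as f:
        head = f.read(4)
    for bom, encoding in _BOM_ENCODINGS:
        if head.startswith(bom):
            return encoding
    return default


def open_unicode(filename, mode="r", **kwargs):
    """Hypothetical open() wrapper: sniff the BOM for text reads, else defer to io.open."""
    if mode in ("r", "rt") and "encoding" not in kwargs:
        kwargs["encoding"] = sniff_encoding(filename)
    return io.open(filename, mode, **kwargs)


# Usage: write the same text with different BOMs, then read each file back.
text = u"ze\u00e9 f\u00f3o!"
samples = {
    "utf8.txt": text.encode("utf-8"),  # no BOM, falls back to UTF-8
    "utf8_bom.txt": codecs.BOM_UTF8 + text.encode("utf-8"),
    "utf16_le.txt": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "utf32_be.txt": codecs.BOM_UTF32_BE + text.encode("utf-32-be"),
}
tmp_dir = tempfile.mkdtemp()
results = {}
for name, raw in samples.items():
    path = os.path.join(tmp_dir, name)
    with io.open(path, "wb") as f:
        f.write(raw)
    with open_unicode(path) as f:
        results[name] = f.read()

print(all(value == text for value in results.values()))  # True
```

    Because the wrapper has its own name, the built-in open() stays untouched, so code that relies on the standard behavior is unaffected.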
    
        3
  •  1
  •   Anton Backer    11 years ago

    I have polished Alex's and sorin's examples so that they work on both Python 3 and Python 2:

    import codecs
    import io
    
    _boms = [
        (codecs.BOM_UTF8, 'utf-8-sig', 0),
        (codecs.BOM_UTF32_LE, 'utf-32le', 4),
        (codecs.BOM_UTF32_BE, 'utf-32be', 4),
        (codecs.BOM_UTF16_LE, 'utf-16le', 2),
        (codecs.BOM_UTF16_BE, 'utf-16be', 2)]
    
    
    def read_unicode(file_path):
        with io.open(file_path, 'rb') as f:
            data = f.read(4)
        for bom, encoding, seek_to in _boms:
            if data.startswith(bom):
                break
        else:
            encoding, seek_to = 'utf-8', 0
        with io.open(file_path, 'r', encoding=encoding) as f:
            f.seek(seek_to)
            return f.read()
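
    The read_unicode helper above covers reading only. For the writing half of the question, a minimal sketch could rely on the codecs that emit a BOM themselves: 'utf-8-sig' writes the UTF-8 BOM, and 'utf-16' writes a BOM in the machine's native byte order. The helper name write_unicode and the file names are assumptions made for this illustration.

```python
import codecs
import io
import os
import tempfile


def write_unicode(file_path, text, encoding="utf-8-sig"):
    """Hypothetical helper: write `text`, letting the codec emit the BOM."""
    with io.open(file_path, "w", encoding=encoding) as f:
        f.write(text)


text = u"ze\u00e9 f\u00f3o!"
tmp_dir = tempfile.mkdtemp()

utf8_path = os.path.join(tmp_dir, "out_utf8.txt")
write_unicode(utf8_path, text)  # 'utf-8-sig' prepends the UTF-8 BOM
with io.open(utf8_path, "rb") as f:
    utf8_raw = f.read()

utf16_path = os.path.join(tmp_dir, "out_utf16.txt")
write_unicode(utf16_path, text, encoding="utf-16")  # BOM in native byte order
with io.open(utf16_path, "rb") as f:
    utf16_raw = f.read()

print(utf8_raw.startswith(codecs.BOM_UTF8))    # True
print(utf16_raw.startswith(codecs.BOM_UTF16))  # True
```

    Files written this way start with a BOM that read_unicode recognizes, so the two helpers should round-trip the same text.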