
How to replace/ignore invalid Unicode/UTF8 characters from C stdio.h getline()?

  •  1
  • Evandro Coan  · Tech Community  · 7 years ago

    In Python, the open function has the option errors='ignore':

    open( '/filepath.txt', 'r', encoding='UTF-8', errors='ignore' )
    

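    This way, when a file containing invalid UTF8 characters is read, those characters are replaced with nothing, i.e., they are ignored. For example, a file with the characters Føö»BÃ¥r will be read as FøöBÃ¥r .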

    If a line such as Føö»BÃ¥r is read with getline() from stdio.h , it will be read as Føö�BÃ¥r :

    FILE* cfilestream = fopen( "/filepath.txt", "r" );
    size_t linebuffersize = 131072;
    char* readline = (char*) malloc( linebuffersize );
    
    while( true )
    {
        if( getline( &readline, &linebuffersize, cfilestream ) != -1 ) {
            std::cerr << "readline=" << readline << std::endl;
        }
        else {
            break;
        }
    }
    

    How can I make stdio.h getline() read it as FøöBÃ¥r instead of Føö�BÃ¥r , i.e., ignoring the invalid UTF8 character?

    One overkill solution I can think of is to iterate over all the characters of each line read and build a new readline without these characters. For example:

    FILE* cfilestream = fopen( "/filepath.txt", "r" );
    size_t linebuffersize = 131072;
    char* readline = (char*) malloc( linebuffersize );
    char* fixedreadline = (char*) malloc( linebuffersize );
    
    int index;
    int charsread;
    int invalidcharsoffset;
    
    while( true )
    {
        if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
        {
            invalidcharsoffset = 0;
            for( index = 0; index < charsread; ++index )
            {
                if( readline[index] != '�' ) {
                    fixedreadline[index-invalidcharsoffset] = readline[index];
                } 
                else {
                    ++invalidcharsoffset;
                }
            }
            // Terminate the rebuilt line before printing it
            fixedreadline[index-invalidcharsoffset] = '\0';
            std::cerr << "fixedreadline=" << fixedreadline << std::endl;
        }
        else {
            break;
        }
    }
    

    Related questions:

    1. Fixing invalid UTF8 characters
    2. Replacing non UTF8 characters
    3. python replace unicode characters
    4. Python unicode: how to replace character that cannot be decoded using utf8 with whitespace?
        1
  •  4
  •   Community Mohan Dere    6 years ago

    You are confusing what you see with what is really going on. The getline function does not replace any characters. [Note 1]

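    You are seeing a replacement character (U+FFFD) because your console outputs that character when it is asked to render an invalid UTF-8 code. Most consoles will do that if they are in UTF-8 mode; that is, if the current locale is UTF-8.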
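
To confirm that the replacement character comes from the console and not from the data, one option is to search the buffer for the UTF-8 encoding of U+FFFD, which is the three bytes EF BF BD. The sketch below is only an illustration; the function name contains_replacement_char is made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the UTF-8 encoding of U+FFFD
 * (the bytes EF BF BD) occurs in the first 'len' bytes of 's', else 0.
 */
int contains_replacement_char(const char *s, size_t len) {
  static const char fffd[3] = { '\xEF', '\xBF', '\xBD' };
  for (size_t i = 0; i + sizeof fffd <= len; ++i) {
    /* memcmp compares raw bytes, so the signedness of char does not matter. */
    if (memcmp(s + i, fffd, sizeof fffd) == 0) return 1;
  }
  return 0;
}
```

Because U+FFFD takes three bytes in UTF-8, a comparison of a single char against '�' (as in the question's loop) cannot match it, even if the character were present.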

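    Also, saying that a file contains the "characters Føö»BÃ¥r " is at best imprecise. A file does not really contain characters. It contains byte sequences which may be interpreted as characters (for example, by a console or other user presentation software which renders them into glyphs) according to some encoding. Different encodings produce different results; in this particular case, you have a file which was created by software using the Windows-1252 encoding (or, roughly equivalently, ISO 8859-15), and you are rendering it on a console using UTF-8.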

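    What that means is that the data read by getline contains an invalid UTF-8 sequence, but it (probably) does not contain the replacement character code. Based on the character string you present, it contains the hex character \xbb , which is a guillemet ( » ) in Windows code page 1252.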
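
A simple way to see which bytes getline actually delivered is to print them in hexadecimal instead of letting the console interpret them. This is a minimal sketch; bytes_to_hex is a hypothetical helper name, not part of any library.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the first 'len' bytes of 's' into 'out' as
 * space-separated two-digit hex values and returns 'out'.
 * 'out' must have room for 3 * len bytes (or 1 byte when len is 0).
 */
char *bytes_to_hex(const char *s, size_t len, char *out) {
  char *p = out;
  for (size_t i = 0; i < len; ++i) {
    /* Go through unsigned char so bytes >= 0x80 are not sign-extended. */
    p += sprintf(p, i ? " %02x" : "%02x", (unsigned char)s[i]);
  }
  *p = '\0';
  return out;
}
```

Calling it on the buffer filled by getline and printing the result to stderr would show a stray byte such as bb as exactly that, with no substitution.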

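    Finding all the invalid UTF-8 sequences in a string read by getline (or any other C library function which reads files) requires scanning the string, but not for a particular code sequence. Rather, you need to decode UTF-8 sequences one at a time, looking for the ones which are not valid. That is not a simple task, but the mbtowc function can help (if you have enabled a UTF-8 locale). As its manpage describes, mbtowc returns the number of bytes contained in a valid "multibyte sequence" (which is UTF-8 in a UTF-8 locale), or -1 to indicate an invalid or incomplete sequence. In the scan, you should pass through the bytes in a valid sequence, or remove/ignore the single byte starting an invalid sequence, and then continue the scan until you reach the end of the string.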

    Here is some lightly-tested example code (in C):

    #include <stdlib.h>
    #include <string.h>
    
    /* Removes in place any invalid UTF-8 sequences from at most 'len' characters of the
     * string pointed to by 's'. (If a NUL byte is encountered, conversion stops.)
     * If the length of the converted string is less than 'len', a NUL byte is
     * inserted.
     * Returns the length of the possibly modified string (with a maximum of 'len'),
     * not including the NUL terminator (if any).
     * Requires that a UTF-8 locale be active; since there is no way to test for
     * this condition, no attempt is made to do so. If the current locale is not UTF-8,
     * behaviour is undefined.
     */
    size_t remove_bad_utf8(char* s, size_t len) {
      char* in = s;
      /* Skip over the initial correct sequence. Avoid relying on mbtowc returning
       * zero if n is 0, since Posix is not clear whether mbtowc returns 0 or -1.
       */
      int seqlen;
      while (len && (seqlen = mbtowc(NULL, in, len)) > 0) { len -= seqlen; in += seqlen; }
      char* out = in;
    
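      if (len && seqlen < 0) {
        /* An invalid or incomplete sequence may leave the conversion state
         * unspecified, so reset it before dropping the offending byte.
         */
        mbtowc(NULL, NULL, 0);
        ++in;
        --len;
        /* If we find an invalid sequence, we need to start shifting correct sequences.  */
        for (; len; in += seqlen, len -= seqlen) {
          seqlen = mbtowc(NULL, in, len);
          if (seqlen > 0) {
            /* Shift the valid sequence (if one was found) */
            memmove(out, in, seqlen);
            out += seqlen;
          }
          else if (seqlen < 0) {
            /* Skip only the single byte that starts the invalid sequence. */
            mbtowc(NULL, NULL, 0);
            seqlen = 1;
          }
          else /* (seqlen == 0) */ break;
        }
        /* Terminate without counting the NUL in the returned length. */
        *out = 0;
      }
      return out - s;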
    }
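
If depending on the active locale is inconvenient, the same scan can be done with a hand-written check of the byte ranges that make up well-formed UTF-8 (as listed in the Unicode Standard's table of well-formed byte sequences). The following is a sketch under that assumption, not a replacement for the code above; both function names are invented for this example. Unlike the mbtowc version it does not stop at a NUL byte, and it expects s[len] to be writable, which holds for a buffer filled by getline.

```c
#include <stddef.h>
#include <string.h>

/* Returns the length (1 to 4) of the well-formed UTF-8 sequence that starts
 * at 's', looking at no more than 'n' bytes, or 0 if there is none.
 */
static size_t utf8_sequence_length(const unsigned char *s, size_t n) {
  if (n == 0) return 0;
  unsigned char b0 = s[0];
  unsigned char lo = 0x80, hi = 0xBF; /* allowed range of the second byte */
  size_t need;
  if (b0 < 0x80) return 1;
  else if (b0 >= 0xC2 && b0 <= 0xDF) need = 2;
  else if (b0 >= 0xE0 && b0 <= 0xEF) {
    need = 3;
    if (b0 == 0xE0) lo = 0xA0;        /* rejects overlong encodings */
    if (b0 == 0xED) hi = 0x9F;        /* rejects UTF-16 surrogates */
  }
  else if (b0 >= 0xF0 && b0 <= 0xF4) {
    need = 4;
    if (b0 == 0xF0) lo = 0x90;        /* rejects overlong encodings */
    if (b0 == 0xF4) hi = 0x8F;        /* rejects values above U+10FFFF */
  }
  else return 0;                      /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */
  if (n < need) return 0;             /* truncated sequence */
  if (s[1] < lo || s[1] > hi) return 0;
  for (size_t i = 2; i < need; ++i)
    if (s[i] < 0x80 || s[i] > 0xBF) return 0;
  return need;
}

/* Removes in place every byte of the first 'len' bytes of 's' that is not part
 * of a well-formed UTF-8 sequence, NUL-terminates the result and returns its
 * length (not counting the terminator).
 */
size_t strip_invalid_utf8(char *s, size_t len) {
  unsigned char *in = (unsigned char *)s;
  unsigned char *out = in;
  while (len > 0) {
    size_t seqlen = utf8_sequence_length(in, len);
    if (seqlen == 0) {                /* drop one byte and rescan from the next */
      ++in;
      --len;
      continue;
    }
    memmove(out, in, seqlen);         /* regions may overlap, hence memmove */
    out += seqlen;
    in += seqlen;
    len -= seqlen;
  }
  *out = '\0';
  return (size_t)(out - (unsigned char *)s);
}
```

With a line read by getline, the call would be strip_invalid_utf8(readline, charsread), using the byte count that getline returned.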
    

    Notes

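    1. Other than possible line-end translation by the underlying I/O library, which will replace the two-character CR-LF sequence with a single \n on systems like Windows, where CR-LF is used as the line-end indication.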
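
Where no such translation takes place (for instance, a file with CR-LF line ends read on Linux), the buffer returned by getline still ends in \r\n. A small sketch of removing either kind of line end; trim_line_ending is a hypothetical name chosen for this example.

```c
#include <stddef.h>

/* Hypothetical helper: removes any trailing '\n' and '\r' bytes from the first
 * 'len' bytes of 'line', NUL-terminates it and returns the new length.
 */
size_t trim_line_ending(char *line, size_t len) {
  while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r')) {
    --len;
  }
  line[len] = '\0';
  return len;
}
```

The length check in the loop keeps it from reading before the start of the buffer when the line consists only of a line end.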
        2
  •  3
  •   Stephan Schlecht    7 years ago

    As @rici explains in his answer, there can be several invalid UTF-8 sequences in a byte sequence.

    iconv(3) could be worth a look, e.g. see https://linux.die.net/man/3/iconv_open .

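    When the string "//IGNORE" is appended to tocode , characters that cannot be represented in the target character set will be silently discarded.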

    Example

    This byte sequence, if interpreted as UTF-8, contains some invalid UTF-8:

    "some invalid\xFE\xFE\xFF\xFF stuff"
    

    If you display this, you would see something like

    some invalid���� stuff
    

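    When this string passes through the remove_invalid_utf8 function in the following C program, the invalid UTF-8 bytes are removed using the iconv function mentioned above.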

    So the result is then:

    some invalid stuff
    

    C program

    #include <stdio.h>
    #include <iconv.h>
    #include <string.h>
    #include <stdlib.h>
    #include <stdbool.h>
    #include <errno.h>
    
    char *remove_invalid_utf8(char *utf8, size_t len) {
        size_t inbytes_len = len;
        char *inbuf = utf8;
    
        size_t outbytes_len = len;
        char *result = calloc(outbytes_len + 1, sizeof(char));
        char *outbuf = result;
    
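        iconv_t cd = iconv_open("UTF-8//IGNORE", "UTF-8");
        if(cd == (iconv_t)-1) {
            perror("iconv_open");
            return result; /* empty string: the descriptor cannot be used */
        }
        /* iconv() signals an error with (size_t)-1; some implementations may
         * report one even after //IGNORE has discarded the invalid bytes. */
        if(iconv(cd, &inbuf, &inbytes_len, &outbuf, &outbytes_len) == (size_t)-1) {
            perror("iconv");
        }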
        iconv_close(cd);
        return result;
    }
    
    int main() {
        char *utf8 = "some invalid\xFE\xFE\xFF\xFF stuff";
        char *converted = remove_invalid_utf8(utf8, strlen(utf8));
        printf("converted: %s to %s\n", utf8, converted);
        free(converted);
        return 0;
    }
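
The "//IGNORE" suffix is not available in every iconv implementation. A possible alternative, sketched here under the assumption that iconv stops with EILSEQ or EINVAL at the first byte it cannot convert, is to skip that byte by hand and resume. The function name drop_invalid_utf8_iconv is made up for this example, and it has only been reasoned through, not benchmarked.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated, NUL-terminated copy of the
 * first 'len' bytes of 'utf8' without the bytes iconv rejects, or NULL on
 * failure. The caller frees the result.
 */
char *drop_invalid_utf8_iconv(const char *utf8, size_t len) {
  char *copy = malloc(len + 1);   /* iconv takes a non-const input pointer */
  char *result = malloc(len + 1);
  iconv_t cd = iconv_open("UTF-8", "UTF-8");
  if (copy == NULL || result == NULL || cd == (iconv_t)-1) {
    free(copy);
    free(result);
    if (cd != (iconv_t)-1) iconv_close(cd);
    return NULL;
  }
  memcpy(copy, utf8, len);

  char *in = copy;
  size_t inleft = len;
  char *out = result;
  size_t outleft = len;           /* the output is never longer than the input */

  while (inleft > 0) {
    if (iconv(cd, &in, &inleft, &out, &outleft) != (size_t)-1) break;
    if (errno != EILSEQ && errno != EINVAL) break; /* unexpected error: stop */
    /* 'in' points at the byte that could not be converted: skip it. */
    ++in;
    --inleft;
    iconv(cd, NULL, NULL, NULL, NULL);             /* reset the conversion state */
  }
  *out = '\0';

  iconv_close(cd);
  free(copy);
  return result;
}
```

On systems where iconv lives in a separate library, this needs -liconv at link time, as in the build command used for the benchmarks further down.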
    
        3
  •  2
  •   Evandro Coan    7 years ago

    I also managed to fix it by trimming/cutting out all non-ASCII characters.

    This one takes about 2.6 seconds to parse 319MB:

    #include <stdio.h>
    #include <stdlib.h>
    #include <clocale>
    #include <iostream>
    
    int main(int argc, char const *argv[])
    {
        FILE* cfilestream = fopen( "./test.txt", "r" );
        size_t linebuffersize = 131072;
    
        if( cfilestream == NULL ) {
            perror( "fopen cfilestream" );
            return -1;
        }
    
        char* readline = (char*) malloc( linebuffersize );
        char* fixedreadline = (char*) malloc( linebuffersize );
    
        if( readline == NULL ) {
            perror( "malloc readline" );
            return -1;
        }
    
        if( fixedreadline == NULL ) {
            perror( "malloc fixedreadline" );
            return -1;
        }
    
        char* source;
        if( ( source = std::setlocale( LC_ALL, "en_US.utf8" ) ) == NULL ) {
            perror( "setlocale" );
        }
        else {
            std::cerr << "locale='" << source << "'" << std::endl;
        }
    
        int index;
        int charsread;
        int invalidcharsoffset;
        unsigned int fixedchar;
    
        while( true )
        {
            if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
            {
                invalidcharsoffset = 0;
                for( index = 0; index < charsread; ++index )
                {
                    fixedchar = static_cast<unsigned int>( readline[index] );
                    // std::cerr << "index " << std::setw(3) << index
                    //         << " readline " << std::setw(10) << fixedchar
                    //         << " -> '" << readline[index] << "'" << std::endl;
    
                    if( 31 < fixedchar && fixedchar < 128 ) {
                        fixedreadline[index-invalidcharsoffset] = readline[index];
                    }
                    else {
                        ++invalidcharsoffset;
                    }
                }
    
                fixedreadline[index-invalidcharsoffset] = '\0';
                // std::cerr << "fixedreadline=" << fixedreadline << std::endl;
            }
            else {
                break;
            }
        }
        std::cerr << "fixedreadline=" << fixedreadline << std::endl;
    
        free( readline );
        free( fixedreadline );
    
        fclose( cfilestream );
        return 0;
    }
    

    Using memcpy

    Using memmove does not improve the speed much, so you could use either one.

    This one takes about 3.1 seconds to parse 319MB:

    #include <stdio.h>
    #include <stdlib.h>
    #include <clocale>
    #include <iostream>
    #include <cstring>
    #include <iomanip>
    
    int main(int argc, char const *argv[])
    {
        FILE* cfilestream = fopen( "./test.txt", "r" );
        size_t linebuffersize = 131072;
    
        if( cfilestream == NULL ) {
            perror( "fopen cfilestream" );
            return -1;
        }
    
        char* readline = (char*) malloc( linebuffersize );
        char* fixedreadline = (char*) malloc( linebuffersize );
    
        if( readline == NULL ) {
            perror( "malloc readline" );
            return -1;
        }
    
        if( fixedreadline == NULL ) {
            perror( "malloc fixedreadline" );
            return -1;
        }
    
        char* source;
        char* destination;
        char* finalresult;
    
        int index;
        int lastcopy;
        int charsread;
        int charstocopy;
        int invalidcharsoffset;
    
        bool hasignoredbytes;
        unsigned int fixedchar;
    
        if( ( source = std::setlocale( LC_ALL, "en_US.utf8" ) ) == NULL ) {
            perror( "setlocale" );
        }
        else {
            std::cerr << "locale='" << source << "'" << std::endl;
        }
    
        while( true )
        {
            if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
            {
                hasignoredbytes = false;
                source = readline;
                destination = fixedreadline;
                lastcopy = 0;
                invalidcharsoffset = 0;
    
                for( index = 0; index < charsread; ++index )
                {
                    fixedchar = static_cast<unsigned int>( readline[index] );
                    // std::cerr << "fixedchar " << std::setw(10)
                    //           << fixedchar << " -> '"
                    //           << readline[index] << "'" << std::endl;
    
                    if( 31 < fixedchar && fixedchar < 128 ) {
                        if( hasignoredbytes ) {
                            charstocopy = index - lastcopy - invalidcharsoffset;
                            memcpy( destination, source, charstocopy );
    
                            source += index - lastcopy;
                            lastcopy = index;
                            destination += charstocopy;
    
                            invalidcharsoffset = 0;
                            hasignoredbytes = false;
                        }
                    }
                    else {
                        ++invalidcharsoffset;
                        hasignoredbytes = true;
                    }
                }
    
                if( destination != fixedreadline ) {
                    charstocopy = charsread - static_cast<int>( source - readline )
                                   - invalidcharsoffset;
    
                    memcpy( destination, source, charstocopy );
                    destination += charstocopy - 1;
    
                    if( *destination == '\n' ) {
                        *destination = '\0';
                    }
                    else {
                        *++destination = '\0';
                    }
                    finalresult = fixedreadline;
                }
                else {
                    finalresult = readline;
                }
    
                // std::cerr << "finalresult=" << finalresult << std::endl;
            }
            else {
                break;
            }
        }
        std::cerr << "finalresult=" << finalresult << std::endl;
    
        free( readline );
        free( fixedreadline );
    
        fclose( cfilestream );
        return 0;
    }
    

    Using iconv

    This takes about 4.6 seconds to parse 319MB of text.

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <clocale>
    #include <iostream>
    
    // Compile it with:
    //     g++ -o main test.cpp -O3 -liconv
    int main(int argc, char const *argv[])
    {
        FILE* cfilestream = fopen( "./test.txt", "r" );
        size_t linebuffersize = 131072;
    
        if( cfilestream == NULL ) {
            perror( "fopen cfilestream" );
            return -1;
        }
    
        char* readline = (char*) malloc( linebuffersize );
        char* fixedreadline = (char*) malloc( linebuffersize );
    
        if( readline == NULL ) {
            perror( "malloc readline" );
            return -1;
        }
    
        if( fixedreadline == NULL ) {
            perror( "malloc fixedreadline" );
            return -1;
        }
    
        char* source;
        char* destination;
    
        int charsread;
        size_t inchars;
        size_t outchars;
    
        if( ( source = std::setlocale( LC_ALL, "en_US.utf8" ) ) == NULL ) {
            perror( "setlocale" );
        }
        else {
            std::cerr << "locale='" << source << "'" << std::endl;
        }
    
        iconv_t conversiondescriptor = iconv_open("UTF-8//IGNORE", "UTF-8");
        if( conversiondescriptor == (iconv_t)-1 ) {
            perror( "iconv_open conversiondescriptor" );
        }
    
        while( true )
        {
            if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
            {
                source = readline;
                inchars = charsread;
    
                destination = fixedreadline;
                outchars = charsread;
    
                if( iconv( conversiondescriptor, &source, &inchars, &destination, &outchars ) )
                {
                    perror( "iconv" );
                }
    
                // Trim the trailing "\r\n" or "\n" without stepping before the buffer start
                while( destination > fixedreadline
                        && ( destination[-1] == '\n' || destination[-1] == '\r' ) ) {
                    --destination;
                }
                *destination = '\0';
    
                // std::cerr << "fixedreadline='" << fixedreadline << "'" << std::endl;
            }
            else {
                break;
            }
        }
        std::cerr << "fixedreadline='" << fixedreadline << "'" << std::endl;
    
        free( readline );
        free( fixedreadline );
    
        if( fclose( cfilestream ) ) {
            perror( "fclose cfilestream" );
        }
    
        if( iconv_close( conversiondescriptor ) ) {
            perror( "iconv_close conversiondescriptor" );
        }
    
        return 0;
    }
    

    The slowest of all the solutions, using mbtowc

    This takes about 24.2 seconds to parse 319MB of text.

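    If you comment out the line fixedchar = mbtowc(NULL, source, charsread); and uncomment the line fixedchar = 1; (breaking the invalid character removal), this takes 1.9 seconds instead of 24.2 seconds (also compiled with the -O3 optimization level).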

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #include <clocale>
    #include <iostream>
    #include <cstring>
    #include <iomanip>
    
    int main(int argc, char const *argv[])
    {
        FILE* cfilestream = fopen( "./test.txt", "r" );
        size_t linebuffersize = 131072;
    
        if( cfilestream == NULL ) {
            perror( "fopen cfilestream" );
            return -1;
        }
    
        char* readline = (char*) malloc( linebuffersize );
        if( readline == NULL ) {
            perror( "malloc readline" );
            return -1;
        }
    
        char* source;
        char* lineend;
        char* destination;
        int charsread;
        int fixedchar;
    
        if( ( source = std::setlocale( LC_ALL, "en_US.utf8" ) ) == NULL ) {
            perror( "setlocale" );
        }
        else {
            std::cerr << "locale='" << source << "'" << std::endl;
        }
    
        while( true )
        {
            if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
            {
                lineend = readline + charsread;
                destination = readline;
                for( source = readline; source != lineend; )
                {
                    // fixedchar = 1;
                    fixedchar = mbtowc(NULL, source, charsread);

                    // std::ostringstream contents;
                    // for( int index = 0; index < fixedchar; ++index )
                    //         contents << source[index];

                    // std::cerr << "fixedchar=" << std::setw(10)
                    //         << fixedchar << " -> '"
                    //         << contents.str().c_str() << "'" << std::endl;

                    if( fixedchar > 0 ) {
                        memmove( destination, source, fixedchar );
                        source += fixedchar;
                        destination += fixedchar;
                        charsread -= fixedchar;
                    }
                    else if( fixedchar < 0 ) {
                        // Skip only the byte that starts the invalid sequence,
                        // and count it as consumed (subtracting -1 would add one)
                        source += 1;
                        charsread -= 1;
                        // std::cerr << "errno=" << strerror( errno ) << std::endl;
                    }
                    else {
                        break;
                    }
                }
    
                // Trim the trailing "\r\n" or "\n" without stepping before the buffer start
                while( destination > readline
                        && ( destination[-1] == '\n' || destination[-1] == '\r' ) ) {
                    --destination;
                }
                *destination = '\0';
    
                // std::cerr << "readline='" << readline << "'" << std::endl;
            }
            else {
                break;
            }
        }
        std::cerr << "readline='" << readline << "'" << std::endl;
    
        if( fclose( cfilestream ) ) {
            perror( "fclose cfilestream" );
        }
    
        free( readline );
        return 0;
    }
    

    The fastest version of all the above, using memmove

    You cannot use memcpy here because the memory regions overlap!

    This takes about 2.4 seconds to parse 319MB.

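    If you comment out the line *destination = *source or memmove( destination, source, 1 ) (breaking the invalid character removal), the performance is still almost the same as when memmove is called. Here, calling memmove( destination, source, 1 ) is a little slower than doing *destination = *source; directly.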

    #include <stdio.h>
    #include <stdlib.h>
    #include <clocale>
    #include <iostream>
    #include <cstring>
    #include <iomanip>
    
    int main(int argc, char const *argv[])
    {
        FILE* cfilestream = fopen( "./test.txt", "r" );
        size_t linebuffersize = 131072;
    
        if( cfilestream == NULL ) {
            perror( "fopen cfilestream" );
            return -1;
        }
    
        char* readline = (char*) malloc( linebuffersize );
        if( readline == NULL ) {
            perror( "malloc readline" );
            return -1;
        }
    
        char* source;
        char* lineend;
        char* destination;
    
        int charsread;
        unsigned int fixedchar;
    
        if( ( source = std::setlocale( LC_ALL, "en_US.utf8" ) ) == NULL ) {
            perror( "setlocale" );
        }
        else {
            std::cerr << "locale='" << source << "'" << std::endl;
        }
    
    
        while( true )
        {
            if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
            {
                lineend = readline + charsread;
                destination = readline;
                for( source = readline; source != lineend; ++source )
                {
                    fixedchar = static_cast<unsigned int>( *source );
                    // std::cerr << "fixedchar=" << std::setw(10)
                    //         << fixedchar << " -> '" << *source << "'" << std::endl;
    
                    if( 31 < fixedchar && fixedchar < 128 ) {
                        *destination = *source;
                        ++destination;
                    }
                }
    
                // Trim out the new line character
                if( *source == '\n' ) {
                    *--destination = '\0';
                }
                else {
                    *destination = '\0';
                }
    
                // std::cerr << "readline='" << readline << "'" << std::endl;
            }
            else {
                break;
            }
        }
        std::cerr << "readline='" << readline << "'" << std::endl;
    
        if( fclose( cfilestream ) ) {
            perror( "fclose cfilestream" );
        }
    
        free( readline );
        return 0;
    }
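
The filtering loop of this version can also be pulled out into a small function, which makes it easier to test on its own. This is just a sketch in plain C; keep_printable_ascii is a hypothetical name, and the condition is the same 31 < byte < 128 test used above.

```c
#include <stddef.h>

/* Hypothetical helper: keeps, in place, only the bytes of the first 'len'
 * bytes of 'line' whose value is between 32 and 127, NUL-terminates the
 * result and returns its length.
 */
size_t keep_printable_ascii(char *line, size_t len) {
  char *destination = line;
  for (const char *source = line; source != line + len; ++source) {
    /* Compare as unsigned so bytes >= 0x80 are not seen as negative. */
    unsigned char byte = (unsigned char)*source;
    if (31 < byte && byte < 128) {
      *destination = *source;
      ++destination;
    }
  }
  *destination = '\0';
  return (size_t)(destination - line);
}
```

Since \r and \n are below 32 they are dropped together with the non-ASCII bytes, so no separate line-end trimming is needed. Valid multibyte characters such as é are removed as well, which is why this is trimming to ASCII rather than removing only invalid UTF-8.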
    

    Bonus

    You can also use Python C Extensions (API).

    It takes about 2.3 seconds to parse 319MB without converting the lines to a cached UTF-8 char* .

    It takes about 3.2 seconds to parse 319MB when converting them to a UTF-8 char* , and also about 3.2 seconds when converting them to a cached ASCII char* .

    #define PY_SSIZE_T_CLEAN
    #include <Python.h>
    #include <iostream>
    
    typedef struct
    {
        PyObject_HEAD
    }
    PyFastFile;
    
    static PyModuleDef fastfilepackagemodule =
    {
        // https://docs.python.org/3/c-api/module.html#c.PyModuleDef
        PyModuleDef_HEAD_INIT,
        "fastfilepackage", /* name of module */
        "Example module that wrapped a C++ object", /* module documentation, may be NULL */
        -1, /* size of per-interpreter state of the module, or 
                    -1 if the module keeps state in global variables. */
    
        NULL, /* PyMethodDef* m_methods */
        NULL, /* PyModuleDef_Slot* m_slots */
        NULL, /* traverseproc m_traverse */
        NULL, /* inquiry m_clear */
        NULL, /* freefunc m_free */
    };
    
    // initialize PyFastFile Object
    static int PyFastFile_init(PyFastFile* self, PyObject* args, PyObject* kwargs) {
        char* filepath;
    
        if( !PyArg_ParseTuple( args, "s", &filepath ) ) {
            return -1;
        }
    
        int linecount = 0;
        PyObject* iomodule;
        PyObject* openfile;
        PyObject* fileiterator;
    
        iomodule = PyImport_ImportModule( "builtins" );
        if( iomodule == NULL ) {
            std::cerr << "ERROR: FastFile failed to import the io module '"
                    "(and open the file " << filepath << "')!" << std::endl;
            PyErr_PrintEx(100);
            return -1;
        }
        PyObject* openfunction = PyObject_GetAttrString( iomodule, "open" );
    
        if( openfunction == NULL ) {
            std::cerr << "ERROR: FastFile failed get the io module open "
                    << "function (and open the file '" << filepath << "')!" << std::endl;
            PyErr_PrintEx(100);
            return -1;
        }
        openfile = PyObject_CallFunction( 
                openfunction, "ssiss", filepath, "r", -1, "ASCII", "ignore" );
    
        if( openfile == NULL ) {
            std::cerr << "ERROR: FastFile failed to open the file'"
                    << filepath << "'!" << std::endl;
            PyErr_PrintEx(100);
            return -1;
        }
        PyObject* iterfunction = PyObject_GetAttrString( openfile, "__iter__" );
        Py_DECREF( openfunction );
    
        if( iterfunction == NULL ) {
            std::cerr << "ERROR: FastFile failed get the io module iterator" 
                    << "function (and open the file '" << filepath << "')!" << std::endl;
            PyErr_PrintEx(100);
            return -1;
        }
        PyObject* openiteratorobject = PyObject_CallObject( iterfunction, NULL );
        Py_DECREF( iterfunction );
    
        if( openiteratorobject == NULL ) {
            std::cerr << "ERROR: FastFile failed get the io module iterator object"
                    << " (and open the file '" << filepath << "')!" << std::endl;
            PyErr_PrintEx(100);
            return -1;
        }
        fileiterator = PyObject_GetAttrString( openfile, "__next__" );
        Py_DECREF( openiteratorobject );
    
        if( fileiterator == NULL ) {
            std::cerr << "ERROR: FastFile failed get the io module iterator "
                    << "object (and open the file '" << filepath << "')!" << std::endl;
            PyErr_PrintEx(100);
            return -1;
        }
    
        PyObject* readline;
        while( ( readline = PyObject_CallObject( fileiterator, NULL ) ) != NULL ) {
            linecount += 1;
            PyUnicode_AsUTF8( readline );
            Py_DECREF( readline );
            // std::cerr << "linecount " << linecount << " readline '" << readline
            //         << "' '" << PyUnicode_AsUTF8( readline ) << "'" << std::endl;
        }
        std::cerr << "linecount " << linecount << std::endl;
    
        // PyErr_PrintEx(100);
        PyErr_Clear();
        PyObject* closefunction = PyObject_GetAttrString( openfile, "close" );
    
        if( closefunction == NULL ) {
            std::cerr << "ERROR: FastFile failed get the close file function for '"
                    << filepath << "')!" << std::endl;
            PyErr_PrintEx(100);
            return -1;
        }
    
        PyObject* closefileresult = PyObject_CallObject( closefunction, NULL );
        Py_DECREF( closefunction );
    
        if( closefileresult == NULL ) {
            std::cerr << "ERROR: FastFile failed close open file '"
                    << filepath << "')!" << std::endl;
            PyErr_PrintEx(100);
            return -1;
        }
        Py_DECREF( closefileresult );
    
        Py_XDECREF( iomodule );
        Py_XDECREF( openfile );
        Py_XDECREF( fileiterator );
    
        return 0;
    }
    
    // destruct the object
    static void PyFastFile_dealloc(PyFastFile* self) {
        Py_TYPE(self)->tp_free( (PyObject*) self );
    }
    
    static PyTypeObject PyFastFileType =
    {
        PyVarObject_HEAD_INIT( NULL, 0 )
        "fastfilepackage.FastFile" /* tp_name */
    };
    
    // create the module
    PyMODINIT_FUNC PyInit_fastfilepackage(void)
    {
        PyObject* thismodule;
    
        // https://docs.python.org/3/c-api/typeobj.html
        PyFastFileType.tp_new = PyType_GenericNew;
        PyFastFileType.tp_basicsize = sizeof(PyFastFile);
        PyFastFileType.tp_dealloc = (destructor) PyFastFile_dealloc;
        PyFastFileType.tp_flags = Py_TPFLAGS_DEFAULT;
        PyFastFileType.tp_doc = "FastFile objects";
        PyFastFileType.tp_init = (initproc) PyFastFile_init;
    
        if( PyType_Ready( &PyFastFileType) < 0 ) {
            return NULL;
        }
    
        thismodule = PyModule_Create(&fastfilepackagemodule);
        if( thismodule == NULL ) {
            return NULL;
        }
    
        // Add FastFile class to thismodule allowing the use to create objects
        Py_INCREF( &PyFastFileType );
        PyModule_AddObject( thismodule, "FastFile", (PyObject*) &PyFastFileType );
        return thismodule;
    }
    

    To build it, create the file source/fastfilewrapper.cpp with the contents of the file above, and a setup.py with the following contents:

    #! /usr/bin/env python
    # -*- coding: utf-8 -*-
    from setuptools import setup, Extension
    
    myextension = Extension(
        language = "c++",
        extra_link_args = ["-std=c++11"],
        extra_compile_args = ["-std=c++11"],
        name = 'fastfilepackage',
        sources = [
            'source/fastfilewrapper.cpp'
        ],
        include_dirs = [ 'source' ],
    )
    
    setup(
            name = 'fastfilepackage',
            ext_modules= [ myextension ],
        )
    

    To run the example, use the following Python script:

    import time
    import datetime
    import fastfilepackage
    
    testfile = './test.txt'
    timenow = time.time()
    iterable = fastfilepackage.FastFile( testfile )
    
    fastfile_time = time.time() - timenow
    timedifference = datetime.timedelta( seconds=fastfile_time )
    print( 'FastFile timedifference', timedifference, flush=True )
    

    Example:

    user@user-pc$ /usr/bin/pip3.6 install .
    Processing /fastfilepackage
    Building wheels for collected packages: fastfilepackage
      Building wheel for fastfilepackage (setup.py) ... done
      Stored in directory: /pip-ephem-wheel-cache-j313cpzc/wheels/e5/5f/bc/52c820
    Successfully built fastfilepackage
    Installing collected packages: fastfilepackage
      Found existing installation: fastfilepackage 0.0.0
        Uninstalling fastfilepackage-0.0.0:
          Successfully uninstalled fastfilepackage-0.0.0
    Successfully installed fastfilepackage-0.0.0
    
    user@user-pc$ /usr/bin/python3.6 fastfileperformance.py
    linecount 820800
    FastFile timedifference 0:00:03.204614
    

    Using std::getline

    This takes about 4.7 seconds to parse 319MB.

    If you remove the UTF-8 removal algorithm, which is borrowed from the fastest benchmark using stdio.h getline() , it takes 1.7 seconds.

    #include <stdio.h>
    #include <stdlib.h>
    #include <clocale>
    #include <iostream>
    #include <locale>
    #include <fstream>
    #include <iomanip>
    
    int main(int argc, char const *argv[])
    {
        unsigned int fixedchar;
        int linecount = -1;
    
        char* source;
        char* lineend;
        char* destination;
    
        if( ( source = setlocale( LC_ALL, "en_US.ascii" ) ) == NULL ) {
            perror( "setlocale" );
            return -1;
        }
        else {
            std::cerr << "locale='" << source << "'" << std::endl;
        }
    
        std::ifstream fileifstream{ "./test.txt" };
        if( fileifstream.fail() ) {
            std::cerr << "ERROR: FastFile failed to open the file!" << std::endl;
            return -1;
        }
    
        size_t linebuffersize = 131072;
        char* readline = (char*) malloc( linebuffersize );
    
        if( readline == NULL ) {
            perror( "malloc readline" );
            return -1;
        }
    
        while( true )
        {
            if( !fileifstream.eof() )
            {
                linecount += 1;
                fileifstream.getline( readline, linebuffersize );
                lineend = readline + fileifstream.gcount();
                destination = readline;
    
                for( source = readline; source != lineend; ++source )
                {
                    fixedchar = static_cast<unsigned int>( *source );
                    // std::cerr << "fixedchar=" << std::setw(10)
                    //         << fixedchar << " -> '" << *source << "'" << std::endl;
    
                    if( 31 < fixedchar && fixedchar < 128 ) {
                        *destination = *source;
                        ++destination;
                    }
                }
    
                // Trim out the new line character
                if( *source == '\n' ) {
                    *--destination = '\0';
                }
                else {
                    *destination = '\0';
                }
    
                // std::cerr << "readline='" << readline << "'" << std::endl;
            }
            else {
                break;
            }
        }
        std::cerr << "linecount='" << linecount << "'" << std::endl;
    
        if( fileifstream.is_open() ) {
            fileifstream.close();
        }
    
        free( readline );
        return 0;
    }
    

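    Summary

    1. 2.6 seconds trimming UTF-8 using two buffers with indexing
    2. 3.1 seconds trimming UTF-8 using two buffers with memcpy
    3. 4.6 seconds removing invalid UTF-8 with iconv
    4. 24.2 seconds removing invalid UTF-8 with mbtowc
    5. 2.4 seconds trimming UTF-8 using one buffer with direct pointer assignment

    Bonus

    1. 2.3 seconds removing invalid UTF-8 without converting it into a cached UTF-8 char*
    2. 3.2 seconds removing invalid UTF-8, converting it into a cached UTF-8 char*
    3. 3.2 seconds trimming UTF-8 and caching it as an ASCII char*
    4. 4.7 seconds trimming UTF-8 with std::getline() using one buffer with direct pointer assignment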

    The file used, ./test.txt , had 820,800 lines, each equal to:

    id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char\r\n

    All versions were built and run with:

    1. g++ (GCC) 7.4.0
    2. iconv (GNU libiconv 1.14)
    3. g++ -o main test.cpp -O3 -liconv && time ./main