
Automatically detect character encoding when reading a file [duplicate]

  •  0
  • mscha  · 6 years ago

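    Is there a convenient way to read these files with the encoding detected automatically, the way Vim does?

    I was hoping for something simple like: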

    open(my $f, '<:encoding(autodetect)', 'foo.txt') or die "Oops: $!";
    

    Note that Encode::Guess does not do this reliably for me.

    Example:

    #!/usr/bin/env perl
    
    use 5.020;
    use warnings;
    
    use Encode;
    use Encode::Guess qw(utf-8 cp1252);
    
    binmode STDOUT, ':encoding(UTF-8)';
    
    my $utf8 = "H\x{C3}\x{A9}llo, W\x{C3}\x{B8}rld!"; # "Héllo, Wørld!" in UTF-8
    my $latin = "H\x{E9}llo, W\x{F8}rld!";            # "Héllo, Wørld!" in CP-1252
    
    # Version 1
    my $enc1 = Encode::Guess->guess($latin);
    if (ref($enc1)) {
        say $enc1->name, ': ', $enc1->decode($latin);
    }
    else {
        say "Oops: $enc1";
    }
    my $enc2 = Encode::Guess->guess($utf8);
    if (ref($enc2)) {
        say $enc2->name, ': ', $enc2->decode($utf8);
    }
    else {
        say "Oops: $enc2";
    }
    
    # Version 2
    say decode("Guess", $latin);
    say decode("Guess", $utf8);
    

    Output:

    cp1252: Héllo, Wørld!
    Oops: utf-8-strict or utf8 or cp1252
    Héllo, Wørld!
    cp1252 or utf-8-strict or utf8 at ./guesstest line 32.
    

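    The version under "Update" in Borodin's answer works only for UTF-8 data; it fails on Latin-1 data, so Encode::Guess still does not solve this.

    This is not the same question as this one: I am looking for a way to detect the encoding automatically when opening a file.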

    2 answers
        1
  •  2
  •   mscha    6 years ago

    This is my current workaround. It works at least for UTF-8 and Latin-1 (or Windows-1252) files.

    use 5.024;
    use experimental 'signatures';
    use Encode qw(decode);
    
    sub slurp($file)
    {
        # Read the raw bytes
        local $/;
        open (my $fh, '<:raw', $file) or return undef();
        my $raw = <$fh>;
        close($fh);
    
        my $content;
        # LEAVE_SRC keeps decode() from modifying $raw in place
        my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    
        # Try to interpret the content as UTF-8
        eval { $content = decode('utf-8', $raw, $check) };
    
        # If this failed, interpret as windows-1252 (covers ascii and the printable part of iso-8859-1)
        if (!defined $content) {
            eval { $content = decode('windows-1252', $raw, $check) };
        }
    
        # If this failed, give up and use the raw bytes
        if (!defined $content) {
            $content = $raw;
        }
    
        return $content;
    }
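
    For the original wish of getting a filehandle that already has the right layer, the same UTF-8-first test could be wrapped as in the sketch below. The names detect_layer and open_autodetect are hypothetical, and the code assumes every file is either UTF-8 or Windows-1252; it reads the whole file once just to decide.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: pick a PerlIO layer by testing whether the
# file's bytes are well-formed UTF-8. Assumes anything else is cp1252.
sub detect_layer {
    my ($file) = @_;
    open(my $fh, '<:raw', $file) or return;
    my $raw = do { local $/; <$fh> };
    close($fh);
    $raw = '' unless defined $raw;

    # decode() dies on malformed input because of FB_CROAK
    my $is_utf8 = eval { decode('UTF-8', $raw, FB_CROAK | LEAVE_SRC); 1 };
    return $is_utf8 ? ':encoding(UTF-8)' : ':encoding(cp1252)';
}

# Hypothetical helper: open the file for reading with the detected layer.
# Returns a filehandle, or nothing if the file cannot be opened.
sub open_autodetect {
    my ($file) = @_;
    my $layer = detect_layer($file) or return;
    open(my $fh, "<$layer", $file) or return;
    return $fh;
}
```

    A call such as my $fh = open_autodetect('foo.txt') or die "Oops: $!"; then yields decoded characters from <$fh> for either kind of file.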
    
        2
  •  2
  •   Borodin    6 years ago

    Take a look at Encode::Guess.

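    You can reliably tell when you don't have an ASCII file: ASCII code points are 7-bit, so any byte above 127 means the data is not ASCII. You can also reliably tell when your file isn't UTF-8, because multi-byte characters have a specific pattern in their most significant bits. Anything beyond that is much less reliable, but may still be possible.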
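
    The two reliable negative tests can be sketched as follows. The function names is_ascii and is_valid_utf8 are made up for illustration, and the arguments are assumed to be raw byte strings (for example, read through a :raw layer).

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# True if every byte is in the 7-bit range; any byte above 127 rules out ASCII.
sub is_ascii {
    my ($bytes) = @_;
    return $bytes !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# True if the bytes form well-formed UTF-8. A strict decode with FB_CROAK
# dies on the first malformed sequence, which rules out UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}
```

    Passing these tests does not prove the encoding: every ASCII string is also valid UTF-8, and some Latin-1 byte sequences happen to be valid UTF-8 as well.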

    Encode::Guess is part of the Encode distribution, so there is no need to install it separately.

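    use feature 'say';
    use Encode::Guess;
    
    # guess_encoding returns an encoding object, or an error string if the guess fails or is ambiguous
    my $enc = guess_encoding($data, qw/ ascii cp1252 iso-8859-1 utf-8 /);
    say ref $enc ? $enc->name : $enc;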
    

    Alternatively, you can do a best-guess decode without checking what the module chose:

    use Encode;
    use Encode::Guess qw/ ascii cp1252 iso-8859-1 utf-8 /;
    
    my $chars = decode("Guess", $data);
    

    The fewer encodings you offer, the more likely the guess is to be accurate. You should read the module documentation carefully.
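
    As a small illustration of that point, the sketch below guesses the same UTF-8 bytes twice: once with only the module's default suspects, and once with cp1252 added. The describe helper is a hypothetical name used only for printing.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

my $utf8_bytes = "H\xC3\xA9llo, W\xC3\xB8rld!";    # "Héllo, Wørld!" as UTF-8 bytes

# Default suspects only: the bytes are not ASCII but are valid UTF-8
my $narrow = guess_encoding($utf8_bytes);

# Adding cp1252: the same bytes are also valid cp1252, so the guess is ambiguous
my $wide = guess_encoding($utf8_bytes, 'cp1252');

# Hypothetical helper: an object means a unique guess, a string means no decision
sub describe {
    my ($result) = @_;
    return ref($result) ? 'unique: ' . $result->name : "no unique guess: $result";
}

print describe($narrow), "\n";
print describe($wide),   "\n";
```

    The first call returns an encoding object, while the second returns a plain string listing the candidates it could not choose between.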


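    Update

    Regarding "doesn't work":

    Note that, as documented, guess_encoding may sometimes return a string such as utf-8 or iso-8859-1 instead of an encoding object. In the example below, both guess_encoding and decode('guess', ...) return the correct result.

    You can try it with any byte string you choose: just modify $raw.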

    use strict;
    use warnings 'all';
    use feature 'say';
    use open qw/ :std encoding(UTF-8) /;
    
    use Encode;
    use Encode::Guess;
    use Data::Dump;
    
    my $raw = qq/H\x{C3}\x{A9}llo, W\x{C3}\x{B8}rld!/;
    
    my $enc = guess_encoding($raw);
    
    if ( my $class = ref $enc ) {
        printf qq{Guessed encoding \$enc is an %s object "%s"\n}, $class, $enc->name;
    }
    else {
        printf qq{Guessed encoding \$enc is a scalar "%s"\n}, $enc;
    }
    
    my $chars = decode('guess', $raw);
    
    printf "Decoded characters: %s\n", $chars;
    dd $chars;
    

    Output

    Guessed encoding $enc is an Encode::utf8 object "utf8"
    Decoded characters: Héllo, Wørld!
    "H\xE9llo, W\xF8rld!"