
Automatically detect character encoding when reading a file [duplicate]

character-encoding io perl

mscha · Tech Community · 7 years ago

Is there a convenient way to read these files with automatic encoding detection, the way Vim does it?

I was hoping for something as simple as:

open(my $f, '<:encoding(autodetect)', 'foo.txt') or die "Oops: $!";

Note that I am aware of Encode::Guess, but it does not seem to handle this case.

Example:

#!/usr/bin/env perl

use 5.020;
use warnings;

use Encode;
use Encode::Guess qw(utf-8 cp1252);

binmode(STDOUT, ':encoding(UTF-8)');

my $utf8 = "H\x{C3}\x{A9}llo, W\x{C3}\x{B8}rld!"; # "Héllo, Wørld!" in UTF-8
my $latin = "H\x{E9}llo, W\x{F8}rld!";            # "Héllo, Wørld!" in CP-1252

# Version 1
my $enc1 = Encode::Guess->guess($latin);
if (ref($enc1)) {
    say $enc1->name, ': ', $enc1->decode($latin);
}
else {
    say "Oops: $enc1";
}
my $enc2 = Encode::Guess->guess($utf8);
if (ref($enc2)) {
    say $enc2->name, ': ', $enc2->decode($utf8);
}
else {
    say "Oops: $enc2";
}

# Version 2
say decode("Guess", $latin);
say decode("Guess", $utf8);

Output:

cp1252: Héllo, Wørld!
Oops: utf-8-strict or utf8 or cp1252
Héllo, Wørld!
cp1252 or utf-8-strict or utf8 at ./guesstest line 32.

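The version under "Update" in Borodin's answer only works for UTF-8 data; it fails for Latin-1 data. Encode::Guess therefore does not appear to be usable when the candidates are UTF-8 and Latin-1/CP-1252.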

This is not the same question as this one: I am looking for a way to auto-detect the encoding when opening a file.

2 answers | up to 7 years ago

mscha 7 years ago

This is my current workaround. It works at least for UTF-8 and Latin-1 (or Windows-1252) files.

use 5.024;
use experimental 'signatures';
use Encode qw(decode);

sub slurp($file)
{
    # Read the raw bytes
    local $/;
    open(my $fh, '<:raw', $file) or return;   # undef (or empty list) if the file cannot be opened
    my $raw = <$fh>;
    close($fh);

    my $content;

    # Try to interpret the content as UTF-8
    eval { my $text = decode('utf-8', $raw, Encode::FB_CROAK); $content = $text };

    # If this failed, interpret as windows-1252 (a superset of iso-8859-1 and ascii).
    # Test with defined(): an empty file or the content "0" is false but still valid.
    if (!defined $content) {
        eval { my $text = decode('windows-1252', $raw, Encode::FB_CROAK); $content = $text };
    }

    # If this failed, give up and use the raw bytes
    if (!defined $content) {
        $content = $raw;
    }

    return $content;
}
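
If a filehandle is wanted rather than a slurped string, which is closer to what the question asked for, the same idea can be sketched as follows. The helper name open_autodetect is hypothetical and not part of any module. It assumes, like the workaround above, that anything that is not valid UTF-8 is Windows-1252, and it reads the whole file once just to decide.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from any module): returns a read handle whose
# :encoding layer is chosen by test-decoding the file's raw bytes.
sub open_autodetect {
    my ($file) = @_;

    # Read the raw bytes once to decide on an encoding
    open(my $raw_fh, '<:raw', $file) or return;
    my $raw = do { local $/; <$raw_fh> };
    close($raw_fh);
    $raw = '' unless defined $raw;

    # Assumption: valid UTF-8 is treated as UTF-8, anything else as cp1252
    my $is_utf8 = eval {
        Encode::decode('UTF-8', $raw, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    my $encoding = $is_utf8 ? 'UTF-8' : 'cp1252';

    # Reopen the file with the chosen decoding layer
    open(my $fh, "<:encoding($encoding)", $file) or return;
    return ($fh, $encoding);
}

# Usage: perl script.pl foo.txt
if (@ARGV) {
    my ($fh, $encoding) = open_autodetect($ARGV[0])
        or die "Cannot open $ARGV[0]: $!";
    binmode(STDOUT, ':encoding(UTF-8)');
    print "Detected $encoding\n";
    print while <$fh>;
    close($fh);
}
```

Reading the file twice is wasteful for large files; checking only a leading chunk would be cheaper but could cut a multi-byte character in half, so this sketch keeps the simpler approach.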

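Borodin 7 years ago

Take a look at Encode::Guess.

It is easy to tell that a file is not ASCII: ASCII code points are 7-bit, so any byte above 127 means the file is not ASCII. It is also possible to tell reliably that a file is not UTF-8, because multi-byte characters follow a specific pattern in their most significant bits. Anything else is less reliable, but may still be possible.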
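
The two reliable checks described above can be sketched like this. The helper names looks_ascii and looks_utf8 are made up for illustration. The UTF-8 check relies on Encode's strict 'UTF-8' decoder croaking on malformed input rather than inspecting the bit patterns by hand.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helpers, for illustration only.

# True if every byte is in the 7-bit ASCII range (0..127).
sub looks_ascii {
    my ($bytes) = @_;
    return $bytes !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# True if the bytes decode cleanly as strict UTF-8.
sub looks_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# Plain ASCII, UTF-8 encoded "Héllo", and Latin-1 encoded "Héllo"
for my $sample ("Hello", "H\xC3\xA9llo", "H\xE9llo") {
    printf "ascii=%d utf8=%d\n", looks_ascii($sample), looks_utf8($sample);
}
```

Note the asymmetry: a failed check rules an encoding out, but a passed check does not prove it. Pure ASCII data passes both checks, and a byte sequence that is valid UTF-8 is also decodable as Windows-1252.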

Encode::Guess is part of the Encode distribution, so there is nothing extra to install.

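use feature 'say';
use Encode::Guess;

# $data holds the raw bytes to examine
my $enc = guess_encoding($data, qw/ ascii cp1252 iso-8859-1 utf-8 /);
say ref($enc) ? $enc->name : $enc;   # encoding object on success, message string otherwise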

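Alternatively, you can do a best-guess decode without checking what the module chose:

use Encode;
use Encode::Guess qw/ ascii cp1252 iso-8859-1 utf-8 /;

my $chars = decode("Guess", $data);   # croaks if the guess is not unique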

The fewer encodings you offer, the more accurate the guess will be. You should read the module documentation carefully.
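
A small sketch of why the suspect list matters, using the same sample bytes as the question. The describe helper is hypothetical. By default Encode::Guess checks only ascii, utf8 and UTF-16/32 with a BOM; suspects passed to guess_encoding are added on top of that.

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding

my $utf8_bytes  = "H\xC3\xA9llo, W\xC3\xB8rld!";   # "Héllo, Wørld!" encoded as UTF-8
my $latin_bytes = "H\xE9llo, W\xF8rld!";           # "Héllo, Wørld!" encoded as cp1252

# Hypothetical display helper: an object means a unique guess, a plain
# string means the guess failed or was ambiguous.
sub describe {
    my ($guess) = @_;
    return ref($guess)
        ? 'unique guess: ' . $guess->name
        : "no unique guess: $guess";
}

print describe(guess_encoding($utf8_bytes)), "\n";             # default suspects only
print describe(guess_encoding($utf8_bytes, 'cp1252')), "\n";   # cp1252 added as a suspect
print describe(guess_encoding($latin_bytes, 'cp1252')), "\n";  # not valid UTF-8
```

With only the defaults, the UTF-8 bytes should yield a unique utf8 guess. Once cp1252 is added, the same bytes decode under both encodings and the result is a string listing the candidates. The Latin-1 bytes are not valid UTF-8, so cp1252 is the only suspect left.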

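Update

Regarding "doesn't work":

Note that, as documented, guess_encoding may sometimes return a string such as "utf-8 or iso-8859-1" instead of an encoding object. In the example below, both guess_encoding and decode('guess', ...) return the correct result.

You can try it with any byte string you like: just modify $raw.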

use strict;
use warnings 'all';
use feature 'say';
use open qw/ :std encoding(UTF-8) /;

use Encode;
use Encode::Guess;
use Data::Dump;

my $raw = qq/H\x{C3}\x{A9}llo, W\x{C3}\x{B8}rld!/;

my $enc = guess_encoding($raw);

if ( my $class = ref $enc ) {
    printf qq{Guessed encoding \$enc is an %s object "%s"\n}, $class, $enc->name
}
else {
    printf qq{Guessed encoding \$enc is a scalar "%s"\n}, $enc;
}

my $chars = decode('guess', $raw);

printf "Decoded characters: %s\n", $chars;
dd $chars;

Output

Guessed encoding $enc is an Encode::utf8 object "utf8"
Decoded characters: Héllo, Wørld!
"H\xE9llo, W\xF8rld!"