代码之家  ›  专栏  ›  技术社区  ›  David Tonhofer

Perl:将字节序列打包为字符串

  •  0
  • David Tonhofer  · 技术社区  · 5 年前

    我正在尝试运行一个简单的测试,其中我希望有不同格式的二进制字符串并将其打印出来。事实上,我正在调查一个问题 sprintf 无法处理为占位符传入的宽字符串 %s

    在这种情况下,二进制字符串应仅包含西里尔文“”(因为它高于ISO-8859-1)

    当我直接在源代码中使用字符时,下面的代码可以工作。

    但什么都不能通过 pack 作品

    • 对于UTF-8情况,我需要在字符串上设置UTF-8标志 $ch ,但如何。
    • UCS-2案例失败了,我想这是因为ISO-8859-1中的Perl UCS-2没有办法,所以测试可能是胡说八道,对吧?

    代码:

    #!/usr/bin/perl
    
    use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"
    
    # https://perldoc.perl.org/open.html
    
    use open qw(:std :encoding(UTF-8));
    
    sub showme {
       my ($name,$ch) = @_;
       print "-------\n";
       print "This is test: $name\n";
    
       my $ord = ord($ch); # ordinal computed outside of "use bytes"; actually should yield the unicode codepoint
    
       {
          # https://perldoc.perl.org/bytes.html
          use bytes;
          my $mark = (utf8::is_utf8($ch) ? "yes" : "no");
          my $txt  = sprintf("Received string of length: %i byte, contents: %vd, ordinal x%04X, utf-8: %s\n", length($ch), $ch, $ord, $mark);
          print $txt,"\n";
       }
    
       print $ch, "\n";
       print "Combine: $ch\n";
       print "Concat: " . $ch . "\n";
       print "Sprintf: " . sprintf("%s",$ch) . "\n";
       print "-------\n";
    }
    
    
    showme("Cryillic direct" , "д");
    showme("Cyrillic UTF-8"  , pack("HH","D0","B4"));  # UTF-8 of д is D0B4
    showme("Cyrillic UCS-2"  , pack("HH","04","34"));  # UCS-2 of д is 0434
    

    电流输出:

    看起来不错

    -------
    This is test: Cryillic direct
    Received string of length: 2 byte, contents: 208.180, ordinal x0434, utf-8: yes
    
    д
    Combine: д
    Concat: д
    Sprintf: д
    -------
    

    没有。176是从哪里来的??

    -------
    This is test: Cyrillic UTF-8
    Received string of length: 2 byte, contents: 208.176, ordinal x00D0, utf-8: no
    
    а
    Combine: а
    Concat: а
    Sprintf: а
    -------
    

    这更糟。

    -------
    This is test: Cyrillic UCS-2
    Received string of length: 2 byte, contents: 0.48, ordinal x0000, utf-8: no
    
    0
    Combine: 0
    Concat: 0
    Sprintf: 0
    -------
    
    0 回复  |  直到 5 年前
        1
  •  4
  •   ikegami Gilles Quénot    5 年前

    你有两个问题。


    您的呼叫 pack 不正确

    每个 H 表示一个十六进制数字。

    $ perl -e'printf "%vX\n", pack("HH", "D0", "B4")'       # XXX
    D0.B0
    
    $ perl -e'printf "%vX\n", pack("H2H2", "D0", "B4")'     # Ok
    D0.B4
    
    $ perl -e'printf "%vX\n", pack("(H2)2", "D0", "B4")'    # Ok
    D0.B4
    
    $ perl -e'printf "%vX\n", pack("(H2)*", "D0", "B4")'    # Better
    D0.B4
    
    $ perl -e'printf "%vX\n", pack("H*", "D0B4")'           # Alternative
    D0.B4
    

    标准输出需要解码文本,但您提供的是编码文本

    首先,让我们看看您正在生成的字符串(一旦上述问题得到解决)。你所需要的就是 %vX 格式,以十六进制提供每个字符的句点分隔值。

    • "д" 生成一个单字符字符串。此字符是的Unicode代码点 д

      $ perl -e'use utf8; printf("%vX\n", "д");'
      434
      
    • pack("H*", "D0B4") 生成两个字符的字符串。这些字符是的UTF-8编码 д´

      $ perl -e'printf("%vX\n", pack("H*", "D0B4"));'
      D0.B4
      
    • pack("H*", "0434") 生成两个字符的字符串。这些字符是的UCS-2be和UTF-16be编码 д´

      $ perl -e'printf("%vX\n", pack("H*", "0434"));'
      4.34
      

    通常,文件句柄需要打印一个字节字符串(值为0..255的字符)。这些字节是逐字输出的。 [1] [2]

    当编码层(例如。 :encoding(UTF-8) )如果将添加到文件句柄中,则需要将一串Unicode代码点(又名解码文本)打印到该句柄中。

    您的程序将编码层添加到 STDOUT (通过使用 use open pragma),因此您必须向 print say 。您可以使用例如Encode的 decode 作用

    use utf8;
    use open qw( :std :encoding(UTF-8) );
    use feature qw( say );
    
    use Encode qw( decode );
    
    say "д";                   # ok  (UCP of "д")
    say pack("H*", "D0B4");    # XXX (UTF-8 encoding of "д")
    say pack("H*", "0434");    # XXX (UCS-2be and UTF-16be encoding of "д")
    
    say decode("UTF-8",    pack("H*", "D0B4"));   # ok (UCP of "д")
    say decode("UCS-2be",  pack("H*", "0434"));   # ok (UCP of "д")
    say decode("UTF-16be", pack("H*", "0434"));   # ok (UCP of "д")
    

    对于UTF-8情况,我需要将UTF-8标志设置为

    不,您需要解码字符串。

    UTF-8标志不相关。最初是否设置了标志无关紧要。解码字符串后是否设置标志无关紧要。该标志指示字符串在内部的存储方式,这是您不应该关心的。

    例如,以

    use strict;
    use warnings;
    use open qw( :std :encoding(UTF-8) );
    use feature qw( say );
    
    my $x = chr(0xE9);
    
    utf8::downgrade($x);   # Tell Perl to use the UTF8=0 storage format.
    say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;
    
    utf8::upgrade($x);   # Tell Perl to use the UTF8=1 storage format.
    say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;
    

    It输出

    UTF8=0 E9 é
    UTF8=1 E9 é
    

    无论 UTF8 标志,UTF-8编码( C3 A9 )输出所提供UCP(U+00E9)的。


    我想这是因为ISO-8859-1中的Perl UCS-2无法实现,所以测试可能是胡说八道,对吧?

    充其量,可以使用启发式来猜测字符串是使用iso-latin-1还是UCS-2be编码的。我怀疑可以得到相当准确的结果(比如 those 您将获得iso-latin-1和UTF-8。)

    我不知道你为什么要提出iso-latin-1,因为你的问题中没有其他与iso-latin-1相关的内容。


    1. 除了在Windows上,其中 :crlf 默认情况下,层添加到句柄。

    2. 你得到一个 Wide character 如果提供的字符串包含非字节字符,并且 utf8 而是输出字符串的编码。

        2
  •  1
  •   David Tonhofer    5 年前

    请查看下面的演示代码是否有任何帮助

    use strict;
    use warnings;
    use feature 'say';
    
    use utf8;     # https://perldoc.perl.org/utf8.html
    use Encode;   # https://perldoc.perl.org/Encode.html
    
    my $str;
    
    my $utf8   = 'Привет Москва';
    my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004';    # Little Endian
    my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430';    # Big Endian
    my $utf16  = '041f044004380432043504420020041c043e0441043a04320430';
    my $utf32  = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';
    
    # https://perldoc.perl.org/functions/binmode.html
    
    binmode STDOUT, ':utf8'; 
    
    # https://perldoc.perl.org/feature.html#The-'say'-feature
    
    say 'UTF-8:   ' . $utf8;  
    
    # https://perldoc.perl.org/Encode.html#THE-PERL-ENCODING-API
    
    $str = pack('H*',$ucs2be);
    say 'UCS-2BE: ' . decode('UCS-2BE',$str);  
    
    $str = pack('H*',$ucs2le);
    say 'UCS-2LE: ' . decode('UCS-2LE',$str);
    
    $str = pack('H*',$utf16);
    say 'UTF-16:  '. decode('UTF16',$str);
    
    $str = pack('H*',$utf32);
    say 'UTF-32:  ' . decode('UTF32',$str);
    

    输出

    UTF-8:   Привет Москва
    UCS-2BE: Привет Москва
    UCS-2LE: Привет Москва
    UTF-16:  Привет Москва
    UTF-32:  Привет Москва
    

    支持的西里尔文编码

    use strict;
    use warnings;
    use feature 'say';
    
    use Encode;
    use utf8;
    
    binmode STDOUT, ':utf8';
    
    my $utf8 = 'Привет Москва';
    my @encodings = qw/UCS-2 UCS-2LE UCS-2BE UTF-16 UTF-32 ISO-8859-5 CP855 CP1251 KOI8-F KOI8-R KOI8-U/;
    
    say '
    :: Supported Cyrillic encoding
    ---------------------------------------------
    UTF-8       ', $utf8;
    
    for (@encodings) {
        printf "%-11s %s\n", $_, unpack('H*', encode($_,$utf8));
    }
    

    输出

    :: Supported Cyrillic encoding
    ---------------------------------------------
    UTF-8       Привет Москва
    UCS-2       041f044004380432043504420020041c043e0441043a04320430
    UCS-2LE     1f044004380432043504420420001c043e0441043a0432043004
    UCS-2BE     041f044004380432043504420020041c043e0441043a04320430
    UTF-16      feff041f044004380432043504420020041c043e0441043a04320430
    UTF-32      0000feff0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430
    ISO-8859-5  bfe0d8d2d5e220bcdee1dad2d0
    CP855       dde1b7eba8e520d3d6e3c6eba0
    CP1251      cff0e8e2e5f220cceef1eae2e0
    KOI8-F      f0d2c9d7c5d420edcfd3cbd7c1
    KOI8-R      f0d2c9d7c5d420edcfd3cbd7c1
    KOI8-U      f0d2c9d7c5d420edcfd3cbd7c1
    

    文档 Encode::Supported

        3
  •  1
  •   David Tonhofer    5 年前

    两者都是很好的答案。以下是北极熊代码的一个小小扩展,用于打印字符串的详细信息:

    use strict;
    use warnings;
    use feature 'say';
    
    use utf8;
    use Encode;
    
    sub about {
       my($str) = @_;
       # https://perldoc.perl.org/bytes.html
       my $charlen = length($str);
       my $txt;
       {
          use bytes;
          my $mark = (utf8::is_utf8($str) ? "yes" : "no");
          my $bytelen = length($str);
          $txt  = sprintf("Length: %d byte, %d chars, utf-8: %s, contents: %vd\n", 
                          $bytelen,$charlen,$mark,$str);
       }
       return $txt;
    }
    
    my $str;
    
    my $utf8   = 'Привет Москва';
    my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004';    # Little Endian
    my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430';    # Big Endian
    my $utf16  = '041f044004380432043504420020041c043e0441043a04320430';
    my $utf32  = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';
    
    binmode STDOUT, ':utf8';
    
    say 'UTF-8:   ' . $utf8;
    say about($utf8);
    
    {
       my $str = pack('H*',$ucs2be);
       say 'UCS-2BE: ' . decode('UCS-2BE',$str);
       say about($str);
    }
    
    {
       my $str = pack('H*',$ucs2le);
       say 'UCS-2LE: ' . decode('UCS-2LE',$str);
       say about($str);
    }
    
    {
       my $str = pack('H*',$utf16);
       say 'UTF-16:  '. decode('UTF16',$str);
       say about($str);
    }
    
    {
       my $str = pack('H*',$utf32);
       say  'UTF-32:  ' . decode('UTF32',$str);
       say about($str);
    }
    
    # Try identity transcoding
    
    {
       my $str_encoded_in_utf16 = encode('UTF16',$utf8);
       my $str = decode('UTF16',$str_encoded_in_utf16);
       say 'The same: ' . $str;
       say about($str);
    }
    

    运行此选项可提供:

    UTF-8:   Привет Москва
    Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176
    
    UCS-2BE: Привет Москва
    Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48
    
    UCS-2LE: Привет Москва
    Length: 26 byte, 26 chars, utf-8: no, contents: 31.4.64.4.56.4.50.4.53.4.66.4.32.0.28.4.62.4.65.4.58.4.50.4.48.4
    
    UTF-16:  Привет Москва
    Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48
    
    UTF-32:  Привет Москва
    Length: 52 byte, 52 chars, utf-8: no, contents: 0.0.4.31.0.0.4.64.0.0.4.56.0.0.4.50.0.0.4.53.0.0.4.66.0.0.0.32.0.0.4.28.0.0.4.62.0.0.4.65.0.0.4.58.0.0.4.50.0.0.4.48
    
    The same: Привет Москва
    Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176
    

    我做了一个小图作为下一次的概述,包括 encode ,则, decode pack .因为最好为下次做好准备。

    perl_strings_and_encode_decode

    (上图及其 graphml 文件可用 here )