代码之家 › 专栏 › 技术社区 › David Tonhofer

Perl:将字节序列打包为字符串

pack utf-8 perl

David Tonhofer · 技术社区 · 5 年前

我正在尝试运行一个简单的测试,其中我希望有不同格式的二进制字符串并将其打印出来。事实上,我正在调查一个问题 sprintf 无法处理为占位符传入的宽字符串 %s 。

在这种情况下,二进制字符串应仅包含西里尔文“”(因为它高于ISO-8859-1)

当我直接在源代码中使用字符时,下面的代码可以工作。

但什么都不能通过 pack 作品

对于UTF-8情况,我需要在字符串上设置UTF-8标志 $ch ,但如何。
UCS-2案例失败了,我想这是因为ISO-8859-1中的Perl UCS-2没有办法,所以测试可能是胡说八道,对吧?

代码:

#!/usr/bin/perl

use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"

# https://perldoc.perl.org/open.html

use open qw(:std :encoding(UTF-8));

sub showme {
   my ($name,$ch) = @_;
   print "-------\n";
   print "This is test: $name\n";

   my $ord = ord($ch); # ordinal computed outside of "use bytes"; actually should yield the unicode codepoint

   {
      # https://perldoc.perl.org/bytes.html
      use bytes;
      my $mark = (utf8::is_utf8($ch) ? "yes" : "no");
      my $txt  = sprintf("Received string of length: %i byte, contents: %vd, ordinal x%04X, utf-8: %s\n", length($ch), $ch, $ord, $mark);
      print $txt,"\n";
   }

   print $ch, "\n";
   print "Combine: $ch\n";
   print "Concat: " . $ch . "\n";
   print "Sprintf: " . sprintf("%s",$ch) . "\n";
   print "-------\n";
}


showme("Cryillic direct" , "Ð´");
showme("Cyrillic UTF-8"  , pack("HH","D0","B4"));  # UTF-8 of Ð´ is D0B4
showme("Cyrillic UCS-2"  , pack("HH","04","34"));  # UCS-2 of Ð´ is 0434

电流输出:

看起来不错

-------
This is test: Cryillic direct
Received string of length: 2 byte, contents: 208.180, ordinal x0434, utf-8: yes

Ð´
Combine: Ð´
Concat: Ð´
Sprintf: Ð´
-------

没有。176是从哪里来的??

-------
This is test: Cyrillic UTF-8
Received string of length: 2 byte, contents: 208.176, ordinal x00D0, utf-8: no

ÃÂ°
Combine: ÃÂ°
Concat: ÃÂ°
Sprintf: ÃÂ°
-------

这更糟。

-------
This is test: Cyrillic UCS-2
Received string of length: 2 byte, contents: 0.48, ordinal x0000, utf-8: no

0
Combine: 0
Concat: 0
Sprintf: 0
-------

0 回复 | 直到 5 年前

ikegami Gilles Quénot 5 年前

你有两个问题。

您的呼叫 pack 不正确

每个 H 表示一个十六进制数字。

$ perl -e'printf "%vX\n", pack("HH", "D0", "B4")'       # XXX
D0.B0

$ perl -e'printf "%vX\n", pack("H2H2", "D0", "B4")'     # Ok
D0.B4

$ perl -e'printf "%vX\n", pack("(H2)2", "D0", "B4")'    # Ok
D0.B4

$ perl -e'printf "%vX\n", pack("(H2)*", "D0", "B4")'    # Better
D0.B4

$ perl -e'printf "%vX\n", pack("H*", "D0B4")'           # Alternative
D0.B4

标准输出需要解码文本,但您提供的是编码文本

首先,让我们看看您正在生成的字符串(一旦上述问题得到解决)。你所需要的就是 %vX 格式,以十六进制提供每个字符的句点分隔值。

"Ð´" 生成一个单字符字符串。此字符是的Unicode代码点 Ð´ 。
```
$ perl -e'use utf8; printf("%vX\n", "Ð´");'
434
```
pack("H*", "D0B4") 生成两个字符的字符串。这些字符是的UTF-8编码 Ð´´ 。
```
$ perl -e'printf("%vX\n", pack("H*", "D0B4"));'
D0.B4
```
pack("H*", "0434") 生成两个字符的字符串。这些字符是的UCS-2be和UTF-16be编码 Ð´´ 。
```
$ perl -e'printf("%vX\n", pack("H*", "0434"));'
4.34
```

通常,文件句柄需要打印一个字节字符串(值为0..255的字符)。这些字节是逐字输出的。 ^[1] ^[2]

当编码层(例如。 :encoding(UTF-8) )如果将添加到文件句柄中,则需要将一串Unicode代码点(又名解码文本)打印到该句柄中。

您的程序将编码层添加到 STDOUT (通过使用 use open pragma),因此您必须向 print 和 say 。您可以使用例如Encode的 decode 作用

use utf8;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );

use Encode qw( decode );

say "Ð´";                   # ok  (UCP of "Ð´")
say pack("H*", "D0B4");    # XXX (UTF-8 encoding of "Ð´")
say pack("H*", "0434");    # XXX (UCS-2be and UTF-16be encoding of "Ð´")

say decode("UTF-8",    pack("H*", "D0B4"));   # ok (UCP of "Ð´")
say decode("UCS-2be",  pack("H*", "0434"));   # ok (UCP of "Ð´")
say decode("UTF-16be", pack("H*", "0434"));   # ok (UCP of "Ð´")

对于UTF-8情况,我需要将UTF-8标志设置为

不,您需要解码字符串。

UTF-8标志不相关。最初是否设置了标志无关紧要。解码字符串后是否设置标志无关紧要。该标志指示字符串在内部的存储方式,这是您不应该关心的。

例如,以

use strict;
use warnings;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );

my $x = chr(0xE9);

utf8::downgrade($x);   # Tell Perl to use the UTF8=0 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;

utf8::upgrade($x);   # Tell Perl to use the UTF8=1 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;

It输出

UTF8=0 E9 Ã©
UTF8=1 E9 Ã©

无论 UTF8 标志,UTF-8编码( C3 A9 )输出所提供UCP(U+00E9)的。

我想这是因为ISO-8859-1中的Perl UCS-2无法实现,所以测试可能是胡说八道,对吧?

充其量,可以使用启发式来猜测字符串是使用iso-latin-1还是UCS-2be编码的。我怀疑可以得到相当准确的结果(比如 those 您将获得iso-latin-1和UTF-8。)

我不知道你为什么要提出iso-latin-1,因为你的问题中没有其他与iso-latin-1相关的内容。

除了在Windows上,其中 :crlf 默认情况下,层添加到句柄。
你得到一个 Wide character 如果提供的字符串包含非字节字符,并且 utf8 而是输出字符串的编码。

David Tonhofer 5 年前

请查看下面的演示代码是否有任何帮助

use strict;
use warnings;
use feature 'say';

use utf8;     # https://perldoc.perl.org/utf8.html
use Encode;   # https://perldoc.perl.org/Encode.html

my $str;

my $utf8   = 'ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004';    # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430';    # Big Endian
my $utf16  = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32  = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';

# https://perldoc.perl.org/functions/binmode.html

binmode STDOUT, ':utf8'; 

# https://perldoc.perl.org/feature.html#The-'say'-feature

say 'UTF-8:   ' . $utf8;  

# https://perldoc.perl.org/Encode.html#THE-PERL-ENCODING-API

$str = pack('H*',$ucs2be);
say 'UCS-2BE: ' . decode('UCS-2BE',$str);  

$str = pack('H*',$ucs2le);
say 'UCS-2LE: ' . decode('UCS-2LE',$str);

$str = pack('H*',$utf16);
say 'UTF-16:  '. decode('UTF16',$str);

$str = pack('H*',$utf32);
say 'UTF-32:  ' . decode('UTF32',$str);

输出

UTF-8:   ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°
UCS-2BE: ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°
UCS-2LE: ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°
UTF-16:  ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°
UTF-32:  ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°

支持的西里尔文编码

use strict;
use warnings;
use feature 'say';

use Encode;
use utf8;

binmode STDOUT, ':utf8';

my $utf8 = 'ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°';
my @encodings = qw/UCS-2 UCS-2LE UCS-2BE UTF-16 UTF-32 ISO-8859-5 CP855 CP1251 KOI8-F KOI8-R KOI8-U/;

say '
:: Supported Cyrillic encoding
---------------------------------------------
UTF-8       ', $utf8;

for (@encodings) {
    printf "%-11s %s\n", $_, unpack('H*', encode($_,$utf8));
}

输出

:: Supported Cyrillic encoding
---------------------------------------------
UTF-8       ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°
UCS-2       041f044004380432043504420020041c043e0441043a04320430
UCS-2LE     1f044004380432043504420420001c043e0441043a0432043004
UCS-2BE     041f044004380432043504420020041c043e0441043a04320430
UTF-16      feff041f044004380432043504420020041c043e0441043a04320430
UTF-32      0000feff0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430
ISO-8859-5  bfe0d8d2d5e220bcdee1dad2d0
CP855       dde1b7eba8e520d3d6e3c6eba0
CP1251      cff0e8e2e5f220cceef1eae2e0
KOI8-F      f0d2c9d7c5d420edcfd3cbd7c1
KOI8-R      f0d2c9d7c5d420edcfd3cbd7c1
KOI8-U      f0d2c9d7c5d420edcfd3cbd7c1

文档 Encode::Supported

David Tonhofer 5 年前

两者都是很好的答案。以下是北极熊代码的一个小小扩展,用于打印字符串的详细信息:

use strict;
use warnings;
use feature 'say';

use utf8;
use Encode;

sub about {
   my($str) = @_;
   # https://perldoc.perl.org/bytes.html
   my $charlen = length($str);
   my $txt;
   {
      use bytes;
      my $mark = (utf8::is_utf8($str) ? "yes" : "no");
      my $bytelen = length($str);
      $txt  = sprintf("Length: %d byte, %d chars, utf-8: %s, contents: %vd\n", 
                      $bytelen,$charlen,$mark,$str);
   }
   return $txt;
}

my $str;

my $utf8   = 'ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004';    # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430';    # Big Endian
my $utf16  = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32  = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';

binmode STDOUT, ':utf8';

say 'UTF-8:   ' . $utf8;
say about($utf8);

{
   my $str = pack('H*',$ucs2be);
   say 'UCS-2BE: ' . decode('UCS-2BE',$str);
   say about($str);
}

{
   my $str = pack('H*',$ucs2le);
   say 'UCS-2LE: ' . decode('UCS-2LE',$str);
   say about($str);
}

{
   my $str = pack('H*',$utf16);
   say 'UTF-16:  '. decode('UTF16',$str);
   say about($str);
}

{
   my $str = pack('H*',$utf32);
   say  'UTF-32:  ' . decode('UTF32',$str);
   say about($str);
}

# Try identity transcoding

{
   my $str_encoded_in_utf16 = encode('UTF16',$utf8);
   my $str = decode('UTF16',$str_encoded_in_utf16);
   say 'The same: ' . $str;
   say about($str);
}

运行此选项可提供:

UTF-8:   ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176

UCS-2BE: ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48

UCS-2LE: ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°
Length: 26 byte, 26 chars, utf-8: no, contents: 31.4.64.4.56.4.50.4.53.4.66.4.32.0.28.4.62.4.65.4.58.4.50.4.48.4

UTF-16:  ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48

UTF-32:  ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°
Length: 52 byte, 52 chars, utf-8: no, contents: 0.0.4.31.0.0.4.64.0.0.4.56.0.0.4.50.0.0.4.53.0.0.4.66.0.0.0.32.0.0.4.28.0.0.4.62.0.0.4.65.0.0.4.58.0.0.4.50.0.0.4.48

The same: ÐÑÐ¸Ð²ÐµÑ ÐÐ¾ÑÐºÐ²Ð°
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176

我做了一个小图作为下一次的概述,包括 encode ,则, decode 和 pack .因为最好为下次做好准备。

(上图及其 graphml 文件可用 here )