
Using string match and regexp to test for a character in a right-to-left language; and encoding?

tcl character-encoding

Gary · 3 years ago

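I am trying to use Tcl to test for a character at the end of right-to-left Hebrew text. I think this may be further complicated by the Hebrew being passed as JSON, but I am not sure, because I am still confused about encodings.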

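I tried the code below on a few test strings. I think I understand why the regular expressions at the bottom "work", but I do not understand why {*Ö¾} in string match gives the desired result.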

set hebJoined ×¢Ö·×Ö¾×Ö¸×Ö¸×¨Ö¶×¥
set hebSep ×¢Ö·×Ö¾
chan puts stdout [string match {*Ö¾} $hebJoined]; #=> 0
chan puts stdout [string match {^*Ö¾} $hebJoined]; #=> 0
chan puts stdout [string match {*^Ö¾} $hebJoined]; #=> 0
chan puts stdout [string match {*Ö¾^} $hebJoined]; #=> 0
chan puts stdout [string match {$*Ö¾} $hebJoined]; #=> 0
chan puts stdout [string match {*Ö¾$} $hebJoined]; #=> 0
chan puts stdout [string match {*$Ö¾} $hebJoined]; #=> 0

chan puts stdout [string match {*Ö¾} $hebSep]; #=> 1
chan puts stdout [string match {^*Ö¾} $hebSep]; #=> 0
chan puts stdout [string match {*^Ö¾} $hebSep]; #=> 0
chan puts stdout [string match {*Ö¾^} $hebSep]; #=> 0
chan puts stdout [string match {$*Ö¾} $hebSep]; #=> 0
chan puts stdout [string match {*Ö¾$} $hebSep]; #=> 0
chan puts stdout [string match {*$Ö¾} $hebSep]; #=> 0

chan puts stdout [regexp {(.*Ö¾)(.*)} $hebJoined {\1\2} heb1 heb2]
chan puts stdout $heb1; # => ×¢Ö·×Ö¾
chan puts stdout $heb2; # => ×Ö¸×Ö¸×¨Ö¶×¥

chan puts stdout [regexp {(.*Ö¾)$} $hebJoined]; # 0
chan puts stdout [regexp {(.*Ö¾)$} $hebSep]; # 1

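There is also a bigger issue: I am working with data passed as JSON, and the regular expressions above do not give the desired result there, whereas doing a string match of

string match {*Ö¾} [encoding convertto iso8859-1 $hebrew] appears to find all words that end in the hyphen (that is, with the hyphen on the left) and returns no matches for hyphens in the middle of a string. I do not understand why it does that. I do not know how to provide an example, because the stored Hebrew data looks like ÃÂ¢ÃÂ·ÃÅ .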

Can string match or regexp test for a Unicode value such as \u05BE, which I think is what this hyphen is?

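Can you tell me why the code I am using appears to work, and how I should correct it so that it works properly? If I change the encoding to utf-8, string match finds no matches.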

Thank you very much.

Edit:

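I think this is what is needed. I was confused for a while, partly because the file I was looking at had the hyphens deliberately removed. This code produces the correct results, but it is ugly and probably not the best approach.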

chan puts stdout [regexp {(ÃÂ¢ÃÂ·ÃÃÂ¾)} [encoding convertto utf-8 $hebJoined] {\1} h1]; # => 1
chan puts stdout [regexp {(ÃÂ¾)$} [encoding convertto utf-8 $hebJoined] {\1} h2]; # => 0
chan puts stdout [regexp {(ÃÂ¾)$} [encoding convertto utf-8 $hebSep] {\1} h2]; # => 1

chan configure stdout -encoding iso8859-1 -translation crlf
chan puts stdout $h1; # => ×¢Ö·×Ö¾
chan puts stdout $h2; # => Ö¾ the desired hyphen.

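Another edit: I made a serious mistake and read this data into Tcl as iso8859-1 rather than utf-8. Once the encoding of the channel receiving the data is changed to utf-8, most of these problems disappear completely, and testing with a Unicode value such as \U05BE works well. In this particular case, my mistake of reading utf-8 as iso8859-1 appears to have caused each multi-byte character to be read as several separate single-byte characters, which complicated matching in string match and regexp.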
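For anyone who wants to see the effect described in this edit, here is a minimal sketch. It assumes the hyphen is U+05BE (the Hebrew maqaf), as guessed above, and uses escapes instead of literal characters so that it does not depend on how the script file itself is encoded. The variable names are made up for illustration.

```tcl
# U+05BE as one logical character (assumed to be the hyphen in question).
set maqaf \u05BE

# Its UTF-8 encoding is two bytes.
set bytes [encoding convertto utf-8 $maqaf]
binary scan $bytes H* hex
puts $hex                              ;# d6be

# Decoding those bytes as iso8859-1 turns each byte into its own
# character: U+00D6 followed by U+00BE, i.e. the two-character
# sequence used in the patterns above.
set wrong [encoding convertfrom iso8859-1 $bytes]
puts [string length $wrong]            ;# 2

# Decoding the same bytes as utf-8 restores the single character.
set right [encoding convertfrom utf-8 $bytes]
puts [string length $right]            ;# 1
puts [string equal $right $maqaf]      ;# 1
```

For data arriving on a channel, the equivalent fix is to configure the channel before reading, for example chan configure $chan -encoding utf-8, where $chan stands for whatever channel delivers the JSON.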


Donal Fellows · 3 years ago

Can string match or regexp test for a Unicode value such as \u05BE, which I think is what this hyphen is?

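Of course. That is just a normal feature of both of those matching systems. In fact, they work on the logical sequence of characters, so that is what you should always use; how the characters are displayed is an output rendering matter, and Tcl itself has nothing to say about it. (Your terminal, or a GUI toolkit such as Tk, would care more.)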
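As a rough sketch of what that looks like once the data is decoded correctly: the test strings below are built from code points purely for illustration (an assumption, not the asker's actual data), and the variable names mirror the question. Note that string match takes glob patterns, which always match the whole string and give ^ and $ no special meaning, while regexp accepts \uXXXX escapes inside the pattern itself.

```tcl
# Illustrative strings built from code points (assumed test data):
# a short word ending in U+05BE, and the same word joined to another.
set maqaf     \u05BE
set hebSep    "\u05E2\u05B7\u05DC$maqaf"
set hebJoined "${hebSep}\u05D4\u05B8\u05D0\u05B8\u05E8\u05B6\u05E5"

# Glob patterns are anchored at both ends already. The pattern is in
# double quotes so that Tcl substitutes $maqaf before matching.
puts [string match "*$maqaf"  $hebSep]      ;# 1
puts [string match "*$maqaf"  $hebJoined]   ;# 0
puts [string match "*$maqaf*" $hebJoined]   ;# 1

# regexp interprets \u05BE itself, so the pattern can stay in braces.
puts [regexp {\u05BE$} $hebSep]             ;# 1
puts [regexp {\u05BE$} $hebJoined]          ;# 0

# Split the joined string after the hyphen, in logical order.
puts [regexp {^(.*\u05BE)(.*)$} $hebJoined -> heb1 heb2]  ;# 1
puts [string length $heb1]                  ;# 4
puts [string length $heb2]                  ;# 7
```

"End of the string" here means the last character in logical order. A terminal that renders right-to-left text will draw that character at the left edge, but that does not change the match.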