
How do I get an array of Unicode code points from a .NET string?

astral-plane char unicode string c#

19

Neil C. Obremski · Tech Community · 16 years ago

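I have a list of character range restrictions that I need to check a string against, but the char values in a string are UTF-16 code units, so enumerating them does not give me 32-bit Unicode code points.

How would you convert a string to an array ( int[] ) of 32-bit Unicode code points?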

4 replies | latest 10 years ago

1

21

Community CDub 8 years ago

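You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:

The character is from the Basic Multilingual Plane and is encoded by a single code unit.
The character is outside the BMP and is encoded using a high-low surrogate pair of code units.

So, assuming the string is valid, this returns an array of code points for a given string: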

public static int[] ToCodePoints(string str)
{
    if (str == null)
        throw new ArgumentNullException("str");

    var codePoints = new List<int>(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        codePoints.Add(Char.ConvertToUtf32(str, i));
        if (Char.IsHighSurrogate(str[i]))
            i += 1;
    }

    return codePoints.ToArray();
}
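
For the second case, the arithmetic that turns a high and a low surrogate into one code point can be sketched by hand. This is only an illustration of what Char.ConvertToUtf32 computes for a valid pair; the names SurrogateMath and Combine are hypothetical and not part of the answer.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: combines a valid UTF-16 surrogate pair into a code point.
    public static int Combine(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
            throw new ArgumentException("Expected a high surrogate followed by a low surrogate.");

        // Each surrogate carries 10 payload bits; the result is offset by 0x10000,
        // the first code point beyond the Basic Multilingual Plane.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

For example, SurrogateMath.Combine('\uD83C', '\uDF00') gives 0x1F300, the same value Char.ConvertToUtf32 returns for that pair.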

An example with a surrogate pair 🌀 and a composed character ñ:

ToCodePoints("\U0001F300 El Ni\u006E\u0303o");                        // 🌀 El Niño
// { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f } // 🌀   E l   N i n ◌̃ o

Here is another example. These two code points represent a thirty-second note with a staccato accent, both surrogate pairs:

ToCodePoints("\U0001D162\U0001D181");              // thirty-second note + combining accent-staccato
// { 0x1d162, 0x1d181 }

When C-normalized, they are decomposed into a notehead, combining stem, combining flag and combining accent-staccato, all surrogate pairs:

ToCodePoints("\U0001D162\U0001D181".Normalize());  // notehead + combining stem + combining flag + combining accent-staccato
// { 0x1d158, 0x1d165, 0x1d170, 0x1d181 }

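Note that leppie's solution is not correct. The question is about code points, not text elements. In the example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde ◌̃. Leppie's solution discards any combining characters that cannot be normalized into a single code point.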
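
To make the difference concrete, here is a minimal sketch comparing the two approaches on a sequence that has no precomposed form: q followed by a combining tilde. The names TextElementDemo, FirstCodePointPerTextElement and AllCodePoints are made up for this illustration; the first method mirrors the text-element approach, the second mirrors the ToCodePoints loop above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Mirrors the text-element approach: normalize, then keep only the
    // first code point of each text element.
    public static int[] FirstCodePointPerTextElement(string s)
    {
        s = s.Normalize();
        var result = new List<int>();
        var elements = StringInfo.GetTextElementEnumerator(s);
        while (elements.MoveNext())
            result.Add(char.ConvertToUtf32(elements.GetTextElement(), 0));
        return result.ToArray();
    }

    // Walks the string one code point at a time, skipping the low
    // surrogate that follows each high surrogate.
    public static int[] AllCodePoints(string s)
    {
        var result = new List<int>(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            result.Add(char.ConvertToUtf32(s, i));
            if (char.IsHighSurrogate(s[i]))
                i++;
        }
        return result.ToArray();
    }
}
```

For "q\u0303", AllCodePoints yields { 0x71, 0x303 }, while FirstCodePointPerTextElement yields only { 0x71 }: the tilde belongs to the same text element as the q, so it is dropped.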

2

7

leppie 10 years ago

This answer is not correct. See @Virtlink's answer for the correct one.

static int[] ExtractScalars(string s)
{
  if (!s.IsNormalized())
  {
    s = s.Normalize();
  }

  List<int> chars = new List<int>((s.Length * 3) / 2);

  var ee = StringInfo.GetTextElementEnumerator(s);

  while (ee.MoveNext())
  {
    string e = ee.GetTextElement();
    chars.Add(char.ConvertToUtf32(e, 0));
  }

  return chars.ToArray();
}

Note: Normalization is required to deal with composite characters.

3

4

Nicholas Carey 10 years ago

Seems like it shouldn't be much more complicated than this:

public static IEnumerable<int> Utf32CodePoints( this IEnumerable<char> s )
{
  bool      useBigEndian = !BitConverter.IsLittleEndian;
  Encoding  utf32        = new UTF32Encoding( useBigEndian , false , true ) ;
  byte[]    octets       = utf32.GetBytes( s.ToArray() ) ; // Encoding.GetBytes has no IEnumerable<char> overload; ToArray() requires System.Linq

  for ( int i = 0 ; i < octets.Length ; i+=4 )
  {
    int codePoint = BitConverter.ToInt32(octets,i);
    yield return codePoint;
  }

}

4

0

Community CDub 8 years ago

I came up with the same approach suggested by Nicholas (and Jeppe), just a bit shorter:

    public static IEnumerable<int> GetCodePoints(this string s) {
        var utf32 = new UTF32Encoding(!BitConverter.IsLittleEndian, false, true);
        var bytes = utf32.GetBytes(s);
        return Enumerable.Range(0, bytes.Length / 4).Select(i => BitConverter.ToInt32(bytes, i * 4));
    }

An enumeration was all I needed, but getting an array is trivial:

int[] codePoints = myString.GetCodePoints().ToArray();

5

0

eikuh 5 years ago

This solution produces the same results as the solution by Daniel A.A. Pelsmaeker but is a bit shorter:

public static int[] ToCodePoints(string s)
{
    byte[] utf32bytes = Encoding.UTF32.GetBytes(s);
    int[] codepoints = new int[utf32bytes.Length / 4];
    Buffer.BlockCopy(utf32bytes, 0, codepoints, 0, utf32bytes.Length);
    return codepoints;
}