Best way to shorten a UTF8 string based on byte length

ora-12899 utf-8 oracle c#

Michael La Voie Frederik Gheysels · Tech Community · 16 years ago

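I ran into a problem where inserting into a particular field produced this error message:

ORA-12899 value too large for column X

I had been using Field.Substring(0, MaxLength); but still got the error.

Eventually I saw what should have been obvious: my string was ANSI and the field was UTF8. The field's length is defined in bytes, not in characters.

This leads to my question. What is the best way to trim my string to fit the maximum length?

My substring code works by character length. Is there a simple C# function that can intelligently trim a UTF8 string by byte length (i.e. without hacking off half a character)?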
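
To make the character/byte mismatch concrete, here is a minimal sketch. The class name, the helper names and the sample value are illustrative assumptions, not part of the original question.

```csharp
using System;
using System.Text;

public static class ByteLengthDemo
{
    // Number of UTF-16 code units, which is what Substring counts.
    public static int CharLength(string s) => s.Length;

    // Number of bytes once the string is encoded as UTF-8,
    // which is what a column sized in bytes counts.
    public static int Utf8Length(string s) => Encoding.UTF8.GetByteCount(s);
}
```

For the hypothetical value "naïve café", CharLength gives 10 while Utf8Length gives 12, because ï and é each take two bytes in UTF-8. A Substring(0, 10) would therefore keep the whole value and still overflow a 10-byte column.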

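8 replies | until 15 years ago

Daniel Brückner 16 years ago

Here are two possible solutions: a LINQ one-liner that processes the input from left to right, and a traditional for loop that processes the input from right to left. Which direction is faster depends on the string length, the allowed byte length, and the number and distribution of multi-byte characters, so it is hard to give general advice. The choice between LINQ and traditional code is probably a matter of taste (or maybe of speed).

If speed matters, one could accumulate the byte length of each character until the maximum length is reached, instead of computing the byte length of the whole string in every iteration. But I am not sure whether that works, because I do not know the UTF-8 encoding well enough. I could theoretically imagine that the byte length of a string does not equal the sum of the byte lengths of all its characters.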

public static String LimitByteLength(String input, Int32 maxLength)
{
    return new String(input
        .TakeWhile((c, i) =>
            Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        .ToArray());
}

public static String LimitByteLength2(String input, Int32 maxLength)
{
    for (Int32 i = input.Length - 1; i >= 0; i--)
    {
        if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        {
            return input.Substring(0, i + 1);
        }
    }

    return String.Empty;
}
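
The accumulation idea mentioned above can be sketched as follows. In UTF-8 every code point is encoded independently, so the per-code-point byte counts do add up to the total; the catch in .NET is that a char is a UTF-16 code unit, so a surrogate pair has to be measured as one unit. The class and method names here are hypothetical, and this is a sketch rather than a tuned implementation.

```csharp
using System;
using System.Text;

public static class Utf8Trim
{
    // Keeps the longest prefix of input whose UTF-8 encoding fits in maxBytes,
    // adding up byte counts one code point at a time.
    public static string LimitByteLengthAccumulating(string input, int maxBytes)
    {
        char[] chars = input.ToCharArray();
        int usedBytes = 0;
        int i = 0;
        while (i < chars.Length)
        {
            // A surrogate pair is two chars but one code point (4 bytes in UTF-8),
            // so it is measured and kept or dropped as a unit.
            bool isPair = char.IsHighSurrogate(chars[i])
                && i + 1 < chars.Length
                && char.IsLowSurrogate(chars[i + 1]);
            int unitCount = isPair ? 2 : 1;

            int size = Encoding.UTF8.GetByteCount(chars, i, unitCount);
            if (usedBytes + size > maxBytes)
            {
                break; // the next code point does not fit
            }
            usedBytes += size;
            i += unitCount;
        }
        return input.Substring(0, i);
    }
}
```

Each code point is measured once, so the work grows with the length of the kept prefix instead of with its square.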

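Community Mohan Dere 9 years ago

I think we can do better than naively counting the total length of the string with each addition. LINQ is cool, but it can accidentally encourage inefficient code. What if I wanted the first 80,000 bytes of a giant UTF string? That is a lot of unnecessary counting. "I've got 1 byte. Now I've got 2. Now I've got 13... Now I have 52,384..."

That's silly. Most of the time, at least in English, we can cut exactly on that nth byte. Even in another language, we are less than 6 bytes away from a good cutting point.

So I am going to start from @Oren's suggestion, which is to key off the leading bits of a UTF8 char value. Let's start by cutting right at the n+1th byte, and use Oren's trick to figure out whether we need to cut a few bytes earlier.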

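Three possibilities

If the first byte after the cut has a 0 in the leading bit, I know I am cutting precisely before a single-byte (conventional ASCII) character, and can cut cleanly.

If I have a 11 following the cut, the next byte after the cut is the start of a multi-byte character, so that is a good place to cut too!

If I have a 10, however, I know I am in the middle of a multi-byte character, and need to go back to see where it really starts.

That is, though I want to cut the string after the nth byte, if that n+1th byte falls in the middle of a multi-byte character, cutting would produce an invalid UTF8 value. I need to back up until I reach a byte that starts with 11 and cut just before it.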

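Code

Convert.ToByte("11000000", 2) builds the mask here. &-ing with it returns what is in the byte's first two bits and brings back 0s for the rest. I then check the XX from XX000000 to see whether it is 10 or 11, as appropriate.

I found out today that C# 6.0 might actually support binary representations.

The PadLeft is just because I am overly OCD about output to the Console.

So here is a function that cuts a string down to n bytes long, or to the greatest number of bytes less than n that still ends on a complete UTF8 character.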

public static string CutToUTF8Length(string str, int byteLength)
{
    byte[] byteArray = Encoding.UTF8.GetBytes(str);
    string returnValue = string.Empty;

    if (byteArray.Length > byteLength)
    {
        int bytePointer = byteLength;

        // Check high bit to see if we're [potentially] in the middle of a multi-byte char
        if (bytePointer >= 0 
            && (byteArray[bytePointer] & Convert.ToByte("10000000", 2)) > 0)
        {
            // If so, keep walking back until we have a byte starting with `11`,
            // which means the first byte of a multi-byte UTF8 character.
            while (bytePointer >= 0 
                && Convert.ToByte("11000000", 2) != (byteArray[bytePointer] & Convert.ToByte("11000000", 2)))
            {
                bytePointer--;
            }
        }

        // See if we had 1s in the high bit all the way back. If so, we're toast. Return empty string.
        if (0 != bytePointer)
        {
            returnValue = Encoding.UTF8.GetString(byteArray, 0, bytePointer); // hat tip to @NealEhardt! Well played. ;^)
        }
    }
    else
    {
        returnValue = str;
    }

    return returnValue;
}
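
Regarding the remark about binary representations: binary literals (with digit separators) became available in C# 7.0. A condensed sketch of the same walk-back idea using them might look like this. The class name and constant names are hypothetical, and the loop is a simplification of the function above, not a copy of it.

```csharp
using System;
using System.Text;

public static class Utf8Cut
{
    private const byte TopTwoBits   = 0b1100_0000; // mask for the XX in XX000000
    private const byte Continuation = 0b1000_0000; // 10xxxxxx: middle of a character

    public static string CutToUtf8Length(string str, int byteLength)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(str);
        if (bytes.Length <= byteLength)
        {
            return str; // already short enough
        }

        int cut = Math.Max(byteLength, 0);
        // While the byte just after the cut is a continuation byte,
        // move the cut back toward the lead byte of that character.
        while (cut > 0 && (bytes[cut] & TopTwoBits) == Continuation)
        {
            cut--;
        }
        return Encoding.UTF8.GetString(bytes, 0, cut);
    }
}
```

The loop never has to step back more than three bytes, since a UTF-8 character has at most three continuation bytes.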

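I originally wrote this as a string extension. Add the this keyword back before string str to turn it into an extension method again; it is left out here so that the method can simply be dropped into Program.cs.

Here is a good test case, with the output it creates below, written on the assumption that it is the Main method in a simple console app's Program.cs.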

static void Main(string[] args)
{
    string testValue = "12345“”67890”";

    for (int i = 0; i < 15; i++)
    {
        string cutValue = Program.CutToUTF8Length(testValue, i);
        Console.WriteLine(i.ToString().PadLeft(2) +
            ": " + Encoding.UTF8.GetByteCount(cutValue).ToString().PadLeft(2) +
            ":: " + cutValue);
    }

    Console.WriteLine();
    Console.WriteLine();

    foreach (byte b in Encoding.UTF8.GetBytes(testValue))
    {
        Console.WriteLine(b.ToString().PadLeft(3) + " " + (char)b);
    }

    Console.WriteLine("Return to end.");
    Console.ReadLine();
}

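The output follows. Note that the "smart quotes" in testValue are three bytes long in UTF8 (though when we write the characters to the console in ASCII, it outputs dumb quotes). Also note the ?s output for the second and third bytes of each smart quote.

The first five characters of testValue are single bytes in UTF8, so byte values 0-5 should give 0-5 characters. Then we have a three-byte smart quote, which cannot be included in its entirety until 5 + 3 bytes. Sure enough, we see it pop out at the call for 8. The next smart quote pops out at 8 + 3 = 11, and after that we are back to single-byte characters through 14.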

 0:  0::
 1:  1:: 1
 2:  2:: 12
 3:  3:: 123
 4:  4:: 1234
 5:  5:: 12345
 6:  5:: 12345
 7:  5:: 12345
 8:  8:: 12345"
 9:  8:: 12345"
10:  8:: 12345"
11: 11:: 12345""
12: 12:: 12345""6
13: 13:: 12345""67
14: 14:: 12345""678


 49 1
 50 2
 51 3
 52 4
 53 5
226 â
128 ?
156 ?
226 â
128 ?
157 ?
 54 6
 55 7
 56 8
 57 9
 48 0
226 â
128 ?
157 ?
Return to end.

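So that was kind of fun, and I got in just before the question's five-year anniversary. Oren's description of the bits had a small error, but the trick itself is the right one to use.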

Community Mohan Dere 9 years ago

Here is a shorter version of ruffin's answer. It takes advantage of the design of UTF8:

    public static string LimitUtf8ByteCount(this string s, int n)
    {
        // quick test (we probably won't be trimming most of the time)
        if (Encoding.UTF8.GetByteCount(s) <= n)
            return s;
        // get the bytes
        var a = Encoding.UTF8.GetBytes(s);
        // if we are in the middle of a character (highest two bits are 10)
        if (n > 0 && ( a[n]&0xC0 ) == 0x80)
        {
            // remove all bytes whose two highest bits are 10
            // and one more (start of multi-byte sequence - highest bits should be 11)
            while (--n > 0 && ( a[n]&0xC0 ) == 0x80)
                ;
        }
        // convert back to string (with the limit adjusted)
        return Encoding.UTF8.GetString(a, 0, n);
    }

canton7 5 years ago

This functionality is already built into .NET, in the Encoder class:

public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    char[] messageChars = message.ToCharArray();
    encoder.Convert(
        chars: messageChars,
        charIndex: 0,
        charCount: messageChars.Length,
        bytes: buffer,
        byteIndex: 0,
        byteCount: buffer.Length,
        flush: false,
        charsUsed: out int charsUsed,
        bytesUsed: out int bytesUsed,
        completed: out bool completed);

    // I don't think we can return message.Substring(0, charsUsed)
    // as that's the number of UTF-16 chars, not the number of codepoints
    // (think about surrogate pairs). Therefore I think we need to
    // actually convert bytes back into a new string
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}

Where the span-based overload of Encoder.Convert is available (.NET Standard 2.1 and later), this can be simplified a little:

public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    encoder.Convert(message.AsSpan(), buffer.AsSpan(), false, out _, out int bytesUsed, out _);
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}

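None of the other answers account for extended grapheme clusters, such as 👩🏽‍🚒. This is composed of 4 Unicode scalars (👩, 🏽, a zero-width joiner, and 🚒), so you need knowledge of the Unicode standard to avoid splitting it in the middle and producing 👩 or 👩🏽.

In .NET 5 and later, you can write this as: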

public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }
    
    var enumerator = StringInfo.GetTextElementEnumerator(message);
    var result = new StringBuilder();
    int lengthBytes = 0;
    while (enumerator.MoveNext())
    {
        string element = enumerator.GetTextElement();
        lengthBytes += Encoding.UTF8.GetByteCount(element);
        if (lengthBytes > maxLength)
        {
            // Once the limit is exceeded nothing further can fit, so stop early.
            break;
        }
        result.Append(element);
    }
    
    return result.ToString();
}

(The same code runs on earlier versions of .NET, but due to a bug it does not produce the correct result before .NET 5.)

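Oren Trutner 16 years ago

A UTF-8 byte has a zero-valued high-order bit when it is the beginning of a character. If its high-order bit is 1, it is in the "middle" of a character. The ability to detect the beginning of a character was an explicit design goal of UTF-8.

See the Description section of the wikipedia article for more details.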
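
As a small illustration of the bit patterns, here is a sketch that uses the refinement given in another answer in this thread: a byte of the form 0xxxxxxx is a whole single-byte character, 11xxxxxx starts a multi-byte character, and only 10xxxxxx lies in the middle of one. The enum and method names are hypothetical.

```csharp
using System;

public enum Utf8ByteKind { SingleByte, LeadByte, ContinuationByte }

public static class Utf8Bytes
{
    // Classifies one byte of well-formed UTF-8 by its high-order bits.
    public static Utf8ByteKind Classify(byte b)
    {
        if ((b & 0x80) == 0x00) return Utf8ByteKind.SingleByte;       // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return Utf8ByteKind.ContinuationByte; // 10xxxxxx
        return Utf8ByteKind.LeadByte;                                  // 11xxxxxx
    }

    // Cutting just before a byte is safe when that byte starts a character.
    public static bool StartsCharacter(byte b) =>
        Classify(b) != Utf8ByteKind.ContinuationByte;
}
```

For example, the three bytes 226, 128, 156 of a left smart quote classify as LeadByte, ContinuationByte, ContinuationByte, so a cut is only safe just before the 226.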

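Justin Cave 16 years ago

Is there a reason you need the database column to be declared in terms of bytes? That is the default, but it is not a particularly useful default if the database character set is variable width. I would strongly prefer declaring the column in terms of characters.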

CREATE TABLE length_example (
  col1 VARCHAR2( 10 BYTE ),
  col2 VARCHAR2( 10 CHAR )
);

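Assuming you want all the tables you create to use character length semantics by default, you can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR. At that point, any table you create will default to character length semantics rather than byte length semantics if you do not specify CHAR or BYTE in the field length.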

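Community Mohan Dere 9 years ago

Following Oren Trutner's comment, here are two more solutions to the problem.
Here we count the number of bytes to remove from the end of the string according to each character at the end of the string, so we do not evaluate the entire string in every iteration.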

string str = "日本語のテキストをバイト長で切り詰めるサンプル文字列"; // sample text with multi-byte characters
int maxBytesLength = 30;
var bytesArr = Encoding.UTF8.GetBytes(str);
int bytesToRemove = 0;
int lastIndexInString = str.Length - 1;
// Drop characters from the end until the remaining bytes fit the limit
while (bytesArr.Length - bytesToRemove > maxBytesLength)
{
    bytesToRemove += Encoding.UTF8.GetByteCount(new char[] { str[lastIndexInString] });
    --lastIndexInString;
}
string trimmedString = Encoding.UTF8.GetString(bytesArr, 0, bytesArr.Length - bytesToRemove);
// Encoding.UTF8.GetByteCount(trimmedString); // get the actual length, will be <= maxBytesLength

And a more efficient (and maintainable) solution: get the string from the byte array according to the desired length, and cut off the last character, because it may be corrupted.

string str = "日本語のテキストをバイト長で切り詰めるサンプル文字列"; // sample text with multi-byte characters
int maxBytesLength = 30;
string trimmedWithDirtyLastChar = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(str), 0, maxBytesLength);
string trimmedString = trimmedWithDirtyLastChar.Substring(0, trimmedWithDirtyLastChar.Length - 1);

Thanks to Shhade, who thought of the second solution.

Afshin 9 years ago

Here is another solution, based on binary search:

public string LimitToUTF8ByteLength(string text, int size)
{
    if (size <= 0)
    {
        return string.Empty;
    }

    int maxLength = text.Length;
    int minLength = 0;
    int length = maxLength;

    while (maxLength >= minLength)
    {
        length = (maxLength + minLength) / 2;
        int byteLength = Encoding.UTF8.GetByteCount(text.Substring(0, length));

        if (byteLength > size)
        {
            maxLength = length - 1;
        }
        else if (byteLength < size)
        {
            minLength = length + 1;
        }
        else
        {
            return text.Substring(0, length); 
        }
    }

    // Round down the result
    string result = text.Substring(0, length);
    if (size >= Encoding.UTF8.GetByteCount(result))
    {
        return result;
    }
    else
    {
        return text.Substring(0, length - 1);
    }
}

Anwar 10 years ago

public static string LimitByteLength3(string input, Int32 maxLength)
{
    string result = input;

    int byteCount = Encoding.UTF8.GetByteCount(input);
    if (byteCount > maxLength)
    {
        // This cuts at a raw byte offset, so a multi-byte character that
        // straddles the limit decodes to a replacement character.
        var byteArray = Encoding.UTF8.GetBytes(input);
        result = Encoding.UTF8.GetString(byteArray, 0, maxLength);
    }

    return result;
}