UTF-16 decoder not working as expected

utf-16 decoding c

Delan Azabani · 14 years ago

Part of my Unicode library decodes UTF-16 into raw Unicode code points. However, it isn't working as expected.

typedef struct string {
    unsigned long length;
    unsigned *data;
} string;

string *upush(string *s, unsigned c) {
    if (!s->length) s->data = (unsigned *) malloc((s->length = 1) * sizeof(unsigned));
    else            s->data = (unsigned *) realloc(s->data, ++s->length * sizeof(unsigned));
    s->data[s->length - 1] = c;
    return s;
}

typedef struct string16 {
    unsigned long length;
    unsigned short *data;
} string16;

string u16tou(string16 old) {
    unsigned long i, cur = 0, need = 0;
    string new;
    new.length = 0;
    for (i = 0; i < old.length; i++)
        if (old.data[i] < 0xd800 || old.data[i] > 0xdfff) upush(&new, old.data[i]);
        else
            if (old.data[i] > 0xdbff && !need) {
                cur = 0; continue;
            } else if (old.data[i] < 0xdc00) {
                need = 1;
                cur = (old.data[i] & 0x3ff) << 10;
                printf("cur 1: %lx\n", cur);
            } else if (old.data[i] > 0xdbff) {
                cur |= old.data[i] & 0x3ff;
                upush(&new, cur);
                printf("cur 2: %lx\n", cur);
                cur = need = 0;
            }
    return new;
}

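How does it work?

`string` holds full Unicode code points (an array of `unsigned`), and `string16` is for 16-bit values such as UTF-16. All `upush` does is append a full Unicode code point to a `string`.

`u16tou` is the part I'm focusing on. It loops over the whole `string16`.

What's wrong?

Let's try the highest code point, shall we?

U+10FFFD, the last valid Unicode code point, is encoded as 0xDBFF 0xDFFD in UTF-16.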

string16 b;
b.length = 2;
b.data = (unsigned short *) malloc(2 * sizeof(unsigned short));
b.data[0] = 0xdbff;
b.data[1] = 0xdffd;
string a = u16tou(b);
puts(utoc(a));

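Using the `utoc` function (not shown; I know it works, see below) to convert it back to a UTF-8 `char *`, I get U+0FFFFD, not U+10FFFD, as the result.

Doing all the conversions by hand in gcalctool gives the same wrong answer. So my syntax itself isn't wrong, but the algorithm is. The algorithm looks right to me, though, and yet it produces the wrong result.

What am I doing wrong?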
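The hand calculation described above can be reproduced in a few lines of C. This is only a sketch of the arithmetic performed by the posted `u16tou`; the helper name `combine_surrogates_raw` is made up for illustration and is not part of the library.

```c
/* Hypothetical helper: combines the low 10 bits of each surrogate exactly
   as the posted u16tou does, with no further adjustment. */
unsigned long combine_surrogates_raw(unsigned short hi, unsigned short lo) {
    unsigned long cur = (unsigned long)(hi & 0x3ff) << 10; /* high 10 bits */
    cur |= lo & 0x3ff;                                     /* low 10 bits  */
    return cur;
}
```

For the pair 0xDBFF 0xDFFD this gives (0x3FF << 10) | 0x3FD = 0xFFFFD, which matches the U+0FFFFD seen in the output.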

2 answers | 14 years ago

JosephH · 14 years ago

You need to add 0x10000 when decoding a surrogate pair. To quote RFC 2781, the step you are missing is step 5:

    1) If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value
       of W1. Terminate.

    2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence
       is in error and no valid character can be obtained using W1.
       Terminate.

    3) If there is no W2 (that is, the sequence ends with W1), or if W2
       is not between 0xDC00 and 0xDFFF, the sequence is in error.
       Terminate.

    4) Construct a 20-bit unsigned integer U', taking the 10 low-order
       bits of W1 as its 10 high-order bits and the 10 low-order bits of
       W2 as its 10 low-order bits.

    5) Add 0x10000 to U' to obtain the character value U. Terminate.

That is, one fix is to add an extra line right after you read the first surrogate:

cur = (old.data[i] & 0x3ff) << 10;
cur += 0x10000;
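As a fuller illustration, the five RFC 2781 steps might be written as a standalone function along these lines. This is a minimal sketch rather than the asker's library code: the name `utf16_decode`, the caller-supplied output buffer, and the choice to silently skip unpaired surrogates (as the question's code does) are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the original library: decodes n UTF-16
   code units from `in` into code points in `out`, which must have room for
   at least n entries. Returns the number of code points written.
   Unpaired surrogates are skipped. */
size_t utf16_decode(const uint16_t *in, size_t n, uint32_t *out) {
    size_t i = 0, count = 0;
    while (i < n) {
        uint16_t w1 = in[i++];
        if (w1 < 0xD800 || w1 > 0xDFFF) {
            /* Step 1: not a surrogate, the code unit is the code point. */
            out[count++] = w1;
        } else if (w1 <= 0xDBFF && i < n &&
                   in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
            /* Steps 2-3: high surrogate followed by a low surrogate. */
            uint16_t w2 = in[i++];
            /* Step 4: build the 20-bit value U'. */
            uint32_t u = ((uint32_t)(w1 & 0x3FF) << 10) | (uint32_t)(w2 & 0x3FF);
            /* Step 5: add the 0x10000 offset. */
            out[count++] = u + 0x10000;
        }
        /* Otherwise: a lone surrogate, which this sketch drops. */
    }
    return count;
}
```

With the input 0xDBFF 0xDFFD, step 4 produces 0xFFFFD and step 5 turns it into 0x10FFFD.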

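Joey Gumbo · 14 years ago

You seem to be missing an offset of 0x10000.

According to this wiki page, UTF-16 surrogate pairs are constructed as follows:

UTF-16 represents non-BMP characters (U+10000 through U+10FFFF) using two code units, known as a surrogate pair. First 10000₁₆ is subtracted from the code point to give a 20-bit value. This is then split into two 10-bit values, each of which is represented as a surrogate, with the most significant half placed in the first surrogate.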
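The encoding direction described in that quote can be sketched in C as follows, which makes it easier to see why decoding has to add 0x10000 back. The function name `utf16_encode_pair` is hypothetical, and the sketch assumes the caller only passes code points in the range U+10000 to U+10FFFF.

```c
#include <stdint.h>

/* Hypothetical helper: splits a code point in U+10000..U+10FFFF into a
   UTF-16 surrogate pair. The caller must ensure cp is in that range. */
void utf16_encode_pair(uint32_t cp, uint16_t *hi, uint16_t *lo) {
    uint32_t v = cp - 0x10000;               /* subtract the offset: 20-bit value */
    *hi = (uint16_t)(0xD800 | (v >> 10));    /* most significant 10 bits  */
    *lo = (uint16_t)(0xDC00 | (v & 0x3FF));  /* least significant 10 bits */
}
```

For U+10FFFD the 20-bit value is 0xFFFFD, so the pair comes out as 0xDBFF 0xDFFD, the same code units used in the question.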
