代码之家 › 专栏 › 技术社区 › Basilevs

明文中UTF-8的C++ cType刻面

ctype mingw utf-8 c++

Basilevs · 技术社区 · 15 年前

在一个项目中,所有的内部字符串都使用utf-8编码。该项目被移植到linux和windows。现在需要一个更低的功能。

在POSIX OS上,我可以使用STD::CyType ByNeMe(“RuuRu.UTF-8”)。但是使用g++(debian 4.3.4-1),ctype::tolower()无法识别俄语utf-8字符(拉丁文小写)。

当STD::STD::RunTimeOrthError:LoaLe::FoeTe::SyCeCytAcLayLayle名称无效时,当我尝试用“RuuRu.UTF-8”参数构造STD::CyType ByNeMy时,Windows上的异常库抛出异常。

如何在Windows上实现/查找UTF-8的STD::CyType?这个项目已经依赖于libiconv(codecvt方面是基于它的),但是我没有看到一个明显的方法来实现它。

3 回复 | 直到 15 年前

Dmitriy 15 年前

如果你只需要降低西里尔字符,你可以自己写一个函数。

ÐÐÐÐÐÐÐ in UTF8  D0 90 D0 91 D0 92 D0 93 D0 94 D0 95 D0 96 0A
Ð°Ð±Ð²Ð³Ð´ÐµÐ¶ in UTF8  D0 B0 D0 B1 D0 B2 D0 B3 D0 B4 D0 B5 D0 B6 0A

但不要忘记utf8是多字节编码。

也可以尝试将字符串从utf8转换为wchar_t(使用libiconv),并使用windows特定的函数来实现降低。

Dmitriy 15 年前

尝试使用 STLport

  Here is a description of how you can use STLport to read/write utf8 files.
utf8 is a way of encoding wide characters. As so, management of encoding in
the C++ Standard library is handle by the codecvt locale facet which is part
of the ctype category. However utf8 only describe how encoding must be
performed, it cannot be used to classify characters so it is not enough info
to know how to generate the whole ctype category facets of a locale
instance.

In C++ it means that the following code will throw an exception to
signal that creation failed:

#include 
// Will throw a std::runtime_error exception.
std::locale loc(".utf8");

For the same reason building a locale with the ctype facets based on
UTF8 is also wrong:

// Will throw a std::runtime_error exception:
std::locale loc(locale::classic(), ".utf8", std::locale::ctype);

The only solution to get a locale instance that will handle utf8 encoding
is to specifically signal that the codecvt facet should be based on utf8
encoding:

// Will succeed if there is necessary platform support.
locale loc(locale::classic(), new codecvt_byname(".utf8"));

  Once you have obtain a locale instance you can inject it in a file stream to
read/write utf8 files:

std::fstream fstr("file.utf8");
fstr.imbue(loc);

You can also access the facet directly to perform utf8 encoding/decoding operations:

typedef std::codecvt codecvt_t;
const codecvt_t& encoding = use_facet(loc);

Notes:

1. The dot ('.') is mandatory in front of utf8. This is a POSIX convention, locale
names have the following format:
language[_country[.encoding]]

Ex: 'fr_FR'
    'french'
    'ru_RU.koi8r'

2. utf8 encoding is only supported for the moment under Windows. The less common
utf7 encoding is also supported.

dudewat 15 年前

有一些stl(例如,来自apache-stdcxx的stl)带有几个locale。但在其他情况下,区域设置仅依赖于系统。

如果您可以在一个操作系统上使用名称“ru_ru.utf-8”,这并不意味着其他系统对此区域设置具有相同的名称。Debian和Windows可能有其他名称,这就是出现运行时异常的原因。

您应该先在系统上安装所需的区域设置。或者使用已经具有此区域设置的stl。

我的美分…