How can I read Unicode characters (e.g., Hindi script) in C++, or is there a better way in another programming language?

nlp utf-8 perl python c++

boddhisattva · Tech Community · 15 years ago

I have a Hindi script file like this:

3.  भारत का इतिहास काफी समृद्ध एवं विस्तृत है।

I have to write a program that adds a position to each word in every sentence. The numbering restarts at 1 on every line, and each word's position is written in parentheses right after it. The output should look like this:

3.  भारत(1) का(2) इतिहास(3) काफी(4) समृद्ध(5) एवं(6) विस्तृत(7) है(8) ।(9)

The sentence above means:

3.  India has a long and rich history.

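Note that "।" (the full stop in Hindi, equivalent to "." in English) also gets a word position, and so will other special symbols. I am working on English-Hindi word alignment (part of Natural Language Processing, NLP), so the full stop "." in English should map to "।" in Hindi. The serial number stays as it is. I think reading character by character may be a solution. Could you help me do this in C++ if it is easy? Or, if it is easier, could you suggest a way to do it in another programming language such as Python or Perl?

The thing is, I can get the word positions for my English text in C++, because I can read the characters by their ASCII values, but I don't know how to do the same for the Hindi text.

The final aim of all this is to see which word position in the English text maps to which position in the Hindi text. That way I can do bidirectional alignment.

Thank you for your time... :)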

7 replies | latest 15 years ago

jsbueno 15 years ago

separators = ["।", ",", "."]
# Opening the file with an explicit encoding decodes the UTF-8 bytes into
# a str, where all characters are properly recognized as an entity:
with open("indiantext.txt", encoding="utf-8") as f:
    text = f.read()

# this breaks the text on the white spaces, yielding a list of words:
words = text.split()

counter = 1

output = ""
for word in words:
    # if the last char is a separator, and is joined to the word:
    if word[-1] in separators and len(word) > 1:
        # word up to the second to last char:
        output += word[:-1] + "(%d) " % counter
        counter += 1
        # last char
        output += word[-1] + "(%d) " % counter
    else:
        output += word + "(%d) " % counter
    counter += 1

print(output)

http://python.org
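
The question asks that the serial number stay as it is and that numbering restart on every line, whereas the loop above numbers every whitespace-separated token in the file, including "3.". A minimal sketch of a per-line variant follows. The function name `number_words` and the rule "a first token made of digits followed by a dot is the serial number" are assumptions for illustration, not part of the original answer.

```python
SEPARATORS = ("।", ",", ".")

def number_words(line):
    """Append (n) to every token of one line; a leading serial such as '3.' is kept as is."""
    tokens = line.split()
    prefix = ""
    # Assumption: a first token of digits plus a trailing '.' is the serial number.
    if tokens and tokens[0].endswith(".") and tokens[0][:-1].isdigit():
        prefix = tokens.pop(0) + "  "
    numbered = []
    for token in tokens:
        # Peel a trailing separator off the word so it gets its own position.
        if len(token) > 1 and token[-1] in SEPARATORS:
            pieces = [token[:-1], token[-1]]
        else:
            pieces = [token]
        for piece in pieces:
            numbered.append("%s(%d)" % (piece, len(numbered) + 1))
    return prefix + " ".join(numbered)

print(number_words("3.  India has a long and rich history."))
```

Calling it once per line of the Hindi file and once per line of the English file gives positions that restart at 1 in each sentence, which is what the alignment step needs.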

daxim Fayland Lam 15 years ago

use utf8; use strict; use warnings;
use Encode qw(decode);
binmode STDOUT, ':encoding(UTF-8)';   # emit the result as UTF-8
my $index;
# split on whitespace or just before a danda, then number each piece
print join(' ', map { $index++; "$_($index)" } split /\s+|(?=।)/, decode 'UTF-8', <>), "\n";
# prints भारत(1) का(2) इतिहास(3) काफी(4) समृद्ध(5) एवं(6) विस्तृत(7) है(8) ।(9)

The input line is read from STDIN.

jkp 15 years ago

utfcpp

#!/usr/bin/env python3
# encoding: utf-8

string = "भारत का इतिहास काफी समृद्ध एवं विस्तृत है।"
parts = []
for part in string.split():
    # splitting on the danda removes it and leaves an empty string in its place
    parts.extend(part.split("।"))
print("No of Parts: %d" % len(parts))
print("Parts: %s" % parts)

No of Parts: 9
Parts: ['भारत', 'का', 'इतिहास', 'काफी', 'समृद्ध', 'एवं', 'विस्तृत', 'है', '']
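
Note that the last part above is an empty string rather than the danda, because `split` discards the separator it splits on. If the danda needs its own position, as the question requires, one possible approach is to match tokens instead of splitting. The sketch below assumes a token is either the danda (U+0964) or a run of characters that are neither whitespace nor the danda; `tokenize` is a hypothetical helper name.

```python
import re

DANDA = "\u0964"  # the Devanagari full stop

# Assumption: a token is the danda itself, or a run of non-space, non-danda characters.
TOKEN = re.compile(rf"{DANDA}|[^\s{DANDA}]+")

def tokenize(sentence):
    """Return the words of the sentence, with each danda kept as a separate token."""
    return TOKEN.findall(sentence)

tokens = tokenize("भारत का इतिहास काफी समृद्ध एवं विस्तृत है।")
print("No of tokens: %d" % len(tokens))
print("Last token is the danda: %s" % (tokens[-1] == DANDA))
```

For the sample sentence this also yields nine tokens, but the ninth is the danda itself rather than an empty string.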

NLTK

Milan Babuškov 15 years ago

ICU - International Components for Unicode

cc. 15 years ago

http://site.icu-project.org/

MSalters 15 years ago

std::wstring wchar_t L' ' input.find_first_of(L" ।")

ravenspoint 15 years ago

FILE * fp = _wfopen( L"fname",L"r" );
wchar_t buf[1000];
while( fp && fgetws(buf,999, fp ) )   {
    // fwprintf takes the output stream as its first argument
    fwprintf(stdout, L"%ls",buf);
}

// convert UTF-8 to UNICODE (needs <windows.h> and <string>)

    void String2WString( std::wstring& ws, const std::string& s )
    {
        ws.clear();
        // CP_UTF8, not CP_ACP: the input bytes are UTF-8, not the ANSI code page
        int nLenOfWideCharStr = MultiByteToWideChar(CP_UTF8, 0,
            s.c_str(), (int)s.length(), NULL, 0);
        PWSTR pWideCharStr = (PWSTR)HeapAlloc(GetProcessHeap(), 0,
            nLenOfWideCharStr * sizeof(wchar_t)+2);
        if (pWideCharStr == NULL)
            return;
        MultiByteToWideChar(CP_UTF8, 0,
            s.c_str(), (int)s.length(),
            pWideCharStr, nLenOfWideCharStr);
        // s.length() excludes the terminator, so add it by hand
        *(pWideCharStr+nLenOfWideCharStr ) = L'\0';
        ws = pWideCharStr ;
        HeapFree(GetProcessHeap(), 0, pWideCharStr);
    }

// read UTF-8 (needs <cstdio>, <string> and <vector>)
FILE * fp = fopen( "fname","r" );
char buf[1000];
std::string aline;
std::wstring wline;
std::vector< std::wstring> vline;
while( fp && fgets(buf,999, fp ) )    {
    aline = buf;
    String2WString( wline, aline );   // one wide string per input line
    vline.push_back( wline );
}
if( fp ) fclose( fp );