ANTLR: Scanning Unicode characters

lexer antlr java

Shikhar Mishra · Tech Community · 15 years ago

Problem: Unicode characters are not printed correctly.

Here is my grammar:

options {
    k=1;
    filter=true;
    // Allow any char but \uFFFF (16 bit -1)
    charVocabulary='\u0000'..'\uFFFE';
}

ANYCHAR
    : '$'
    | '_' { System.out.println("Found underscore: "+getText()); }
    | 'a'..'z' { System.out.println("Found alpha: "+getText()); }
    | '\u0080'..'\ufffe' { System.out.println("Found unicode: "+getText()); }
    ;

Code snippet of the main method that invokes the lexer:

public static void main(String[] args) {
    SimpleLexer simpleLexer = new SimpleLexer(System.in);
    while(true) {
        try {
            Token t = simpleLexer.nextToken();
            System.out.println("Token : "+t);
        } catch(Exception e) {}
    }
}

For the input "γ", I get the following output:

Found unicode: 
Token : ["Ã ",<5>,line=1,col=7]
Found unicode: 
Token : ["Â¤",<5>,line=1,col=8]
Found unicode:  
Token : [" ",<5>,line=1,col=9]

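It seems the lexer treats the single Unicode input character as three separate characters. My goal is to scan and print it as one character.

1 answer | as of 15 years ago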


jpalecek · 15 years ago

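Your problem is not in the ANTLR-generated lexer, but in the Java stream you pass to it. The stream only reads bytes (it does not interpret them in any encoding), and what you are seeing is a UTF-8 byte sequence, with each byte reported as its own character.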
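To make the byte-versus-character point concrete, here is a minimal standard-library sketch. It does not use ANTLR at all. The class name `ByteVsCharDemo`, the helper `bytesAsChars`, and the sample character U+0920 (chosen only because it encodes to three UTF-8 bytes) are assumptions made for illustration. The helper widens each byte to one `char`, which is roughly what happens when a raw byte stream is consumed without decoding.

```java
import java.nio.charset.StandardCharsets;

public class ByteVsCharDemo {
    // Hypothetical helper: treat every byte as its own character,
    // the way an undecoded byte stream is effectively consumed.
    static String bytesAsChars(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) (b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u0920"; // illustrative three-byte character
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String misread = bytesAsChars(utf8);
        String decoded = new String(utf8, StandardCharsets.UTF_8);

        System.out.println("UTF-8 bytes: " + utf8.length);
        System.out.println("chars when each byte is one char: " + misread.length());
        System.out.println("chars when decoded as UTF-8: " + decoded.length());
    }
}
```

Read byte by byte, the one character turns into three characters in the '\u0080'..'\ufffe' range, so a rule covering that range would match three times. Decoded as UTF-8, it is a single character again.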

If this is ANTLR 3, you can use the ANTLRInputStream constructor that takes the encoding as a parameter:

ANTLRInputStream (InputStream input, String encoding) throws IOException
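The general idea behind passing an encoding is to decode the bytes into characters before the lexer sees them. The sketch below shows that step with the standard library only; it is not ANTLR code. The class name `DecodedInputDemo`, the helper `readAllUtf8`, and the sample bytes are assumptions for illustration, and a `ByteArrayInputStream` stands in for `System.in`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class DecodedInputDemo {
    // Hypothetical helper: read a byte stream through a Reader that
    // decodes UTF-8, so multi-byte sequences become single characters.
    static String readAllUtf8(InputStream in) {
        try {
            Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The three UTF-8 bytes of U+0920, standing in for typed input.
        byte[] utf8 = {(byte) 0xE0, (byte) 0xA4, (byte) 0xA0};
        String text = readAllUtf8(new ByteArrayInputStream(utf8));
        System.out.println("chars read: " + text.length());
    }
}
```

With a real lexer, the same decoding would come from handing it character input rather than raw bytes, for example through the encoding-aware constructor above. Whether a given generated lexer also accepts a `Reader` directly depends on the ANTLR version and should be checked against its documentation.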