Why can't Java read this Unicode character from a UTF-8 file?

io encoding java

1 JustOnly 1 · Tech Community · 7 years ago

The UTF-8 file "unicode.txt":

ð«¢¹ã¢à¤«à¨¸á¡ÅÃ¼abÃÄ°Ãâ©ã¢

The first character is 4 bytes long. When I run this code, I don't get the output I expect:

InputStream in = new FileInputStream("unicode.txt");
InputStreamReader inReader = new InputStreamReader(in, "UTF-8");
char ch = (char)inReader.read();
System.out.println(ch); // Writes '?' character to the console. Why ?

Why doesn't this code write the file's first character to the console? How can I print it?

My default encoding:

System.out.println(System.getProperty("file.encoding")); // output: "UTF-8"
System.out.println(Charset.defaultCharset()); // output: "UTF-8"

I think the problem is with the char data type.

Thanks.

1 reply | 7 years ago

Jared Stewart 7 years ago

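The char data type is based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow characters whose representation requires more than 16 bits. The range of legal Unicode code points is now U+0000 to U+10FFFF. The set of characters from U+0000 to U+FFFF is called the Basic Multilingual Plane (BMP), and characters whose code points are greater than U+FFFF are called supplementary characters. A char value therefore represents a BMP code point, including the surrogate code points, or a code unit of the UTF-16 encoding. An int value can represent any Unicode code point, including supplementary code points.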

https://wiki.sei.cmu.edu/confluence/plugins/servlet/mobile?contentId=88487813#content/view/88487813
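In practice this means a 4-byte UTF-8 character reaches Java as two char values (a surrogate pair). A single read() returns only the high surrogate, which cannot be encoded on its own, so the console prints '?'. Below is a minimal sketch of one way to read a whole code point instead. The class name FirstCodePoint and the helper readCodePoint are hypothetical names made up for this illustration, and the byte array stands in for "unicode.txt" on the assumption that the file starts with the bytes F0 AB A2 B9, so the example runs without the file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.PushbackReader;
import java.nio.charset.StandardCharsets;

public class FirstCodePoint {

    // Hypothetical helper: reads one full Unicode code point,
    // or returns -1 at end of input.
    public static int readCodePoint(PushbackReader reader) throws IOException {
        int first = reader.read();
        if (first == -1 || !Character.isHighSurrogate((char) first)) {
            return first; // end of input, or a plain BMP character
        }
        int second = reader.read();
        if (second != -1 && Character.isLowSurrogate((char) second)) {
            // Combine the surrogate pair into one supplementary code point.
            return Character.toCodePoint((char) first, (char) second);
        }
        if (second != -1) {
            reader.unread(second); // not part of a pair, give it back
        }
        return first; // unpaired high surrogate, returned as-is
    }

    public static void main(String[] args) throws IOException {
        // Assumption: stand-in for the first bytes of unicode.txt,
        // one 4-byte UTF-8 character followed by "ab".
        byte[] data = {(byte) 0xF0, (byte) 0xAB, (byte) 0xA2, (byte) 0xB9, 'a', 'b'};

        try (PushbackReader reader = new PushbackReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            int codePoint = readCodePoint(reader);

            // Print through a UTF-8 PrintStream so the pair is encoded as a unit.
            PrintStream out = new PrintStream(System.out, true, "UTF-8");
            out.println(new String(Character.toChars(codePoint)));
            out.println(String.format("U+%X", codePoint));
            out.println("char count: " + Character.charCount(codePoint));
        }
    }
}
```

To try it against the real file, replace the ByteArrayInputStream with new FileInputStream("unicode.txt"). Whether the glyph actually shows up still depends on the console's encoding and font, so the hex code point is printed as well.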
