
Script with UTF-8 text runs differently in RStudio and on the command line in Windows

utf-8 windows regex r


Rohit · Tech Community · 8 years ago

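I am working with files that contain Hindi text and parsing them. I wrote the code in RStudio and ran it there without many problems. Now I need to run the same script from the command line using R.exe/Rscript.exe, and it behaves differently. I ran a simple script in both RStudio and the terminal: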

n_p<-'नाम'

Encoding(n_p)

gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
sessionInfo()

Output in RStudio:

> n_p<-'नाम'
> 
> Encoding(n_p)
[1] "UTF-8"
> 
> gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1

[[2]]
[1] 1
attr(,"match.length")
[1] 3

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)

Matrix products: default

locale:
[1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252   
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C                  
[5] LC_TIME=English_India.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rJava_0.9-10

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0

Output with R.exe at the command line (used here for debugging; Rscript.exe gives similar, if not identical, output):

> n_p<-'Ã â¼"Ã â¼_Ã â¼r'
>
> Encoding(n_p)
[1] "latin1"
>
> gregexpr(n_p,c('adfdafc','Ã â¼"Ã â¼_Ã â¼r adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1

[[2]]
[1] 1
attr(,"match.length")
[1] 9

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)

Matrix products: default

locale:
[1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
[5] LC_TIME=English_India.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.0

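Sys.setlocale refuses to work properly. In some cases, gregexpr throws an error because it cannot parse the non-ASCII code. And when it finally does run without errors, it does not match the regular expression correctly. I cannot provide a reproducible example at the moment, but I will try to later.

Please help.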

1 reply | as of 8 years ago


wp78de · 8 years ago

You need to make sure R is running in an appropriate locale:

In Rterm, use Sys.getlocale() to find the current locale.
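Alongside Sys.getlocale(), it can help to look at how R has stored the string itself. The following is a small diagnostic sketch, assuming a three-character Devanagari pattern written with \u escapes so that the encoding of the script file does not affect the result; the variable name is only illustrative.

```r
# Build the pattern from \u escapes so the source file's encoding is irrelevant
n_p <- "\u0928\u093e\u092e"

cat("LC_CTYPE:   ", Sys.getlocale("LC_CTYPE"), "\n")
cat("Marked as:  ", Encoding(n_p), "\n")
cat("Characters: ", nchar(n_p, type = "chars"), "\n")
cat("Bytes:      ", nchar(n_p, type = "bytes"), "\n")

# Reports whether the session is multibyte, UTF-8 or Latin-1
str(l10n_info())
```

If the character and byte counts of a pattern come out equal, as with the match length of 9 in the command-line output above, R is probably treating each byte as a separate single-byte character.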

Sys.setlocale(category = "LC_ALL", locale = "hi-IN")

# Try "hi-IN.UTF-8" too...
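Which locale names are accepted depends on the Windows installation, so it may be worth trying several. A minimal sketch under that assumption: the helper `try_locales` and the candidate names are hypothetical, and the code relies only on the fact that Sys.setlocale() returns an empty string (with a warning) when a request cannot be honoured.

```r
# Hypothetical helper: try each candidate locale name in turn and return
# the first value the OS accepts, or NA if none of them is accepted.
try_locales <- function(candidates, category = "LC_CTYPE") {
  for (loc in candidates) {
    # Sys.setlocale() returns "" (and warns) when the request fails
    res <- suppressWarnings(Sys.setlocale(category, loc))
    if (nzchar(res)) return(res)
  }
  NA_character_
}

cat("Current locale:", Sys.getlocale(), "\n")
chosen <- try_locales(c("hi-IN", "Hindi"))  # candidate names are assumptions
cat("Locale now in use:", chosen, "\n")
```

An NA result means none of the candidates was available and the session locale is unchanged.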

You can find locale names here, and on MSDN here

Once you have the right value, put your Sys.setlocale() command in ~/.Rprofile
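As a rough sketch, the entry in ~/.Rprofile could look like the following. The locale value is the one suggested above and may need to be replaced with whatever your system accepts; the fallback message is an illustrative addition.

```r
# Sketch for ~/.Rprofile; "hi-IN" is the value suggested above (adjust as needed).
# local() keeps the temporary variable out of the global workspace.
local({
  res <- suppressWarnings(Sys.setlocale("LC_ALL", "hi-IN"))
  if (!nzchar(res)) {
    message("Requested locale not available; keeping ",
            Sys.getlocale("LC_CTYPE"))
  }
})
```

Note that Rscript.exe and R.exe skip the user profile when started with --vanilla or --no-init-file, so the setting only takes effect without those flags.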
