I am working on a task where I need to split a paragraph into sentences. For example, given the paragraph:
"This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool."
I want to get the following sentences:
This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence.
Sometimes there are problems, i.e. in this one.
here and abbr at the end x.y..
cool
This is very similar to this task, which was solved in JavaScript:
var re = /\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g;
var str = 'This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn\'t split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.';
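// Keep abbreviations such as "e.g." unchanged (g1); otherwise mark the sentence boundary with "\r" (g2).
var result = str.replace(re, function(m, g1, g2){
    return g1 ? g1 : g2+"\r";
});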
var arr = result.split("\r");
document.body.innerHTML = "<pre>" + JSON.stringify(arr, 0, 4) + "</pre>";
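I am trying to implement the same thing in Java, with the help of this link. The part I am unsure about is how to port the replace step. My attempt so far: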
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {
    String content = "This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.";
    Pattern p = Pattern.compile("/\\b(\\w\\.\\w\\.)|([.?!])\\s+(?=[A-Za-z])/g");
    Matcher m = p.matcher(content);
    List<String> tokens = new LinkedList<String>();
    while (m.find()) {
        String token = m.group(1); // group 0 is always the entire match
        tokens.add(token);
    }
    System.out.println(tokens);
}
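For comparison, here is a minimal sketch of how the JavaScript replace-with-callback step might be ported. It relies on two assumptions that are not from the attempt above. First, the leading "/" and trailing "/g" are JavaScript regex-literal syntax and should be dropped, since Java's Pattern.compile would treat them as literal characters. Second, Matcher.appendReplacement and Matcher.appendTail can play the role of the callback. The class name SentenceSplitter and the method name splitSentences are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not taken from the question.
public class SentenceSplitter {

    // Same pattern as the JavaScript version, without the /.../g delimiters.
    // Group 1: a two-letter abbreviation such as "e.g." (kept intact).
    // Group 2: sentence-ending punctuation followed by whitespace and a letter.
    private static final Pattern BOUNDARY =
            Pattern.compile("\\b(\\w\\.\\w\\.)|([.?!])\\s+(?=[A-Za-z])");

    static List<String> splitSentences(String text) {
        Matcher m = BOUNDARY.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Mirrors the JavaScript callback: g1 ? g1 : g2 + "\r"
            String replacement = m.group(1) != null
                    ? m.group(1)
                    : m.group(2) + "\r";
            // quoteReplacement guards against "$" or "\" in the matched text.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb); // copy whatever follows the last match
        return new ArrayList<>(Arrays.asList(sb.toString().split("\r")));
    }

    public static void main(String[] args) {
        String content = "This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.";
        for (String sentence : splitSentences(content)) {
            System.out.println(sentence);
        }
    }
}
```

On the sample paragraph this should produce four sentences. The last one keeps its trailing period ("cool."), because the final period is not followed by whitespace and so never matches group 2.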