
Splitting a paragraph into sentences with Java

  • learner  · Tech Community  · 6 years ago

    I am working on a task where I need to split a paragraph into sentences. For example, given the paragraph:

    "This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool."

    I would like to get the following sentences:
    

    This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence.
    
    Sometimes there are problems, i.e. in this one.
    
    here and abbr at the end x.y..
    
    cool
    

    This is very similar to this task, which was solved in JavaScript:

    var re = /\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g; 
    var str = 'This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn\'t split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.';
    var result = str.replace(re, function(m, g1, g2){
      return g1 ? g1 : g2+"\r";
    });
    var arr = result.split("\r");
    document.body.innerHTML = "<pre>" + JSON.stringify(arr, 0, 4) + "</pre>";
    

    I am trying to do the same in Java with the help of this link, but I cannot work out how to reproduce the replace step:

    public static void main(String[] args) {
        String content = "This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.";
        Pattern p = Pattern.compile("/\\b(\\w\\.\\w\\.)|([.?!])\\s+(?=[A-Za-z])/g");
        Matcher m = p.matcher(content);
        List<String> tokens = new LinkedList<String>();
        while (m.find()) {
            String token = m.group(1); // group 0 is always the entire match
            tokens.add(token);
        }
    
        System.out.println(tokens);
    }
    

    1 reply

  • geef  · 6 years ago
    import java.text.BreakIterator;

    public class Main {
        public static void main(String[] args) {
            String content = "This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.";
            // Sentence boundaries are determined by the default locale's rules.
            BreakIterator bi = BreakIterator.getSentenceInstance();
            bi.setText(content);
            int index = 0;
            // next() advances to the following boundary and returns DONE at the end.
            while (bi.next() != BreakIterator.DONE) {
                // The substring keeps any whitespace that trails the sentence.
                String sentence = content.substring(index, bi.current());
                System.out.println(sentence);
                index = bi.current();
            }
        }
    }
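
For comparison, here is a rough sketch of porting the JavaScript regex from the question directly, in case the boundaries chosen by BreakIterator do not match what you expect. Two things differ from the attempt in the question: a Java Pattern takes the bare expression, without the JavaScript `/.../g` delimiters, and the "replace, then split" trick is swapped for a loop over `find()` that cuts the string at each real boundary. The class and method names (`RegexSentenceSplitter`, `split`) are made up for illustration, and the regex still only recognises two-letter abbreviations such as "e.g." or "x.y.", so treat it as a starting point rather than a general solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a direct port of the JavaScript regex from the question.
class RegexSentenceSplitter {

    // Group 1: a two-letter abbreviation such as "e.g." (must NOT split).
    // Group 2: sentence-ending punctuation followed by whitespace and a letter.
    private static final Pattern BOUNDARY =
            Pattern.compile("\\b(\\w\\.\\w\\.)|([.?!])\\s+(?=[A-Za-z])");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = BOUNDARY.matcher(text);
        int start = 0;
        while (m.find()) {
            if (m.group(1) != null) {
                continue; // abbreviation matched: consume it, do not split here
            }
            // Keep the punctuation, drop the whitespace that follows it.
            sentences.add(text.substring(start, m.end(2)));
            start = m.end();
        }
        if (start < text.length()) {
            sentences.add(text.substring(start)); // trailing remainder
        }
        return sentences;
    }

    public static void main(String[] args) {
        String content = "This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.";
        for (String sentence : split(content)) {
            System.out.println(sentence);
        }
    }
}
```

On the sample paragraph this should give four pieces, with the abbreviations and the numbers left intact and "cool." as the last one.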