
Stanford CoreNLP 3.9.1 Chinese models not loading

  • George Wang · 7 years ago

    I have been using Stanford CoreNLP for Chinese text processing.

    After upgrading to the latest version, 3.9.1, I found that the Chinese segmenter (and the ssplit and pos annotators) no longer work.

    Here is my "StanfordCoreNLP.properties" file (located under the "resources" folder):

    # Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
    annotators = tokenize, ssplit, pos
    
    # segment
    tokenize.language = zh
    segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
    segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
    segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
    segment.sighanPostProcessing = true
    
    # sentence split
    ssplit.boundaryTokenRegex = [.\u3002]|[!?\uFF01\uFF1F]+
    
    # pos
    pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger
    
    # ner
    ner.language = chinese
    ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
    ner.applyNumericClassifiers = true
    ner.useSUTime = false
    
    # regexner
    ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab
    ner.fine.regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE
    
    # parse
    parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz
    
    # depparse
    depparse.model    = edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz
    depparse.language = chinese
    
    # coref
    coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
    coref.input.type = raw
    coref.postprocessing = true
    coref.calculateFeatureImportance = false
    coref.useConstituencyTree = true
    coref.useSemantics = false
    coref.algorithm = hybrid
    coref.path.word2vec =
    coref.language = zh
    coref.defaultPronounAgreement = true
    coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
    coref.print.md.log = false
    coref.md.type = RULE
    coref.md.liberalChineseMD = false
    
    # kbp
    kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
    kbp.tokensregex = edu/stanford/nlp/models/kbp/chinese/tokensregex
    kbp.language = zh
    kbp.model = none
    
    # entitylink
    entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz
    
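    (For reference, these settings largely mirror the StanfordCoreNLP-chinese.properties file that ships inside the Chinese models jar. Assuming the 3.9.1 Chinese models jar is on the classpath, a pipeline can also be built from that file directly:)

    StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");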

    Here is the code that uses Stanford CoreNLP:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.CoreDocument;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    
    public class CoreNlp {
    
        private static StanfordCoreNLP pipeline = new StanfordCoreNLP();
    
        // Penn Chinese Treebank POS tags for function words (particles, prepositions,
        // conjunctions, punctuation, etc.) whose tokens are dropped from the output
        private static HashSet<String> meaningless = new HashSet<>(Arrays.asList(
                "AD","AS","BA","CC","CS","DEC","DEG","DER","DEV","DT","ETC","IJ",
                "LB","LC","MSP","ON","P","PN","PU","SB","SP","VC","VE"));
    
        public static List<String> annotating(String linea) {
            List<String> words = new ArrayList<>();
    
            if (linea == null) {
                return words;
            }
    
            String text = clean(linea);
            if (Util.isNull(text)) { // Util.isNull: project-local null/empty check
                return words;
            }
    
            CoreDocument document = new CoreDocument(text);
            CoreNlp.pipeline.annotate(document);
    
            // Keep only tokens whose POS tag is not in the stop-tag set
            for (CoreLabel token : document.tokens()) {
                String word = token.word();
                String pos = token.tag();
                if (meaningless.contains(pos)) {
                    continue;
                }
                words.add(word);
            }
    
            return words;
        }
    
        // Removes control characters, unassigned code points, and all
        // space/line/paragraph separators before annotation
        private static String clean(String myString) {
            StringBuilder newString = new StringBuilder(myString.length());
            for (int offset = 0; offset < myString.length(); ) {
                int codePoint = myString.codePointAt(offset);
                offset += Character.charCount(codePoint);
                switch (Character.getType(codePoint)) {
                    case Character.CONTROL:             // \p{Cc}
                    case Character.FORMAT:              // \p{Cf}
                    case Character.PRIVATE_USE:         // \p{Co}
                    case Character.SURROGATE:           // \p{Cs}
                    case Character.UNASSIGNED:          // \p{Cn}
                    case Character.SPACE_SEPARATOR:     // \p{Zs}
                    case Character.LINE_SEPARATOR:      // \p{Zl}
                    case Character.PARAGRAPH_SEPARATOR: // \p{Zp}
                        break; // drop this code point
                    default:
                        newString.append(Character.toChars(codePoint));
                }
            }
            return newString.toString();
        }
    }
    
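    For illustration, here is a minimal way to exercise this class (the main method and sample sentence are mine, not part of the original project):

    public static void main(String[] args) {
        // With the Chinese models loaded, this should print the segmented
        // content words, e.g. [今天, 天气, 好] (很 tags as AD and 。 as PU,
        // so both are filtered out)
        List<String> tokens = CoreNlp.annotating("今天天气很好。");
        System.out.println(tokens);
    }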

    Here is the loading log:

    2018-03-13 16:22:54.178  INFO 1424 --- [io-10301-exec-5] e.stanford.nlp.pipeline.StanfordCoreNLP  : Searching for resource: StanfordCoreNLP.properties ... found.
    2018-03-13 16:22:54.179  INFO 1424 --- [io-10301-exec-5] e.stanford.nlp.pipeline.StanfordCoreNLP  : Adding annotator tokenize
    2018-03-13 16:22:54.194  INFO 1424 --- [io-10301-exec-5] e.s.nlp.pipeline.TokenizerAnnotator      : No tokenizer type provided. Defaulting to PTBTokenizer.
    2018-03-13 16:22:54.280  INFO 1424 --- [io-10301-exec-5] e.stanford.nlp.pipeline.StanfordCoreNLP  : Adding annotator ssplit
    2018-03-13 16:22:54.318  INFO 1424 --- [io-10301-exec-5] e.stanford.nlp.pipeline.StanfordCoreNLP  : Adding annotator pos
    2018-03-13 16:22:55.241  INFO 1424 --- [io-10301-exec-5] e.s.nlp.tagger.maxent.MaxentTagger       : Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.9 sec].
    

    It appears that the Chinese models have not been loaded.

    As a result, the default (English) models were used for segment, ssplit, and pos, causing Chinese processing to fail.
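
    One way to confirm which configuration was actually applied is to dump the pipeline's effective properties (a diagnostic sketch of mine, not from the original post, using StanfordCoreNLP's getProperties()):

    // Inside CoreNlp: if the custom file was not applied, pos.model will
    // name the English default instead of the Chinese tagger configured above
    java.util.Properties effective = pipeline.getProperties();
    System.out.println(effective.getProperty("pos.model"));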

    Please advise.

    Thanks.

    1 Reply  |  7 years ago
  • George Wang · 7 years ago

    I solved the problem with the following change to the "CoreNlp" class:

    private static StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP");
    
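    Passing the name makes the constructor load "StanfordCoreNLP.properties" from the classpath explicitly (the ".properties" suffix is appended, as the log below confirms), instead of relying on the no-arg constructor's default lookup, which evidently fell back to the English models here. An equivalent approach (a sketch, assuming the file sits at the classpath root) is to load the properties manually and hand them to the pipeline:

    // Imports needed: java.io.*, java.util.Properties
    Properties props = new Properties();
    try (InputStream in = CoreNlp.class.getClassLoader()
            .getResourceAsStream("StanfordCoreNLP.properties")) {
        props.load(in);
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);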

    The loading log now looks like this:

    2018-03-15 12:50:40.821  INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.pipeline.StanfordCoreNLP  : Searching for resource: StanfordCoreNLP.properties ... found.
    2018-03-15 12:50:41.185  INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.pipeline.StanfordCoreNLP  : Adding annotator tokenize
    2018-03-15 12:50:52.337  INFO 1460 --- [io-10301-exec-7] e.s.nlp.ie.AbstractSequenceClassifier    : Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... done [10.7 sec].
    2018-03-15 12:50:52.393  INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.pipeline.StanfordCoreNLP  : Adding annotator ssplit
    2018-03-15 12:50:52.419  INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.pipeline.StanfordCoreNLP  : Adding annotator pos
    2018-03-15 12:50:53.292  INFO 1460 --- [io-10301-exec-7] e.s.nlp.tagger.maxent.MaxentTagger       : Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [0.8 sec].
    2018-03-15 12:50:53.362  INFO 1460 --- [io-10301-exec-7] e.s.nlp.wordseg.ChineseDictionary        : Loading Chinese dictionaries from 1 file:
    2018-03-15 12:50:53.362  INFO 1460 --- [io-10301-exec-7] e.s.nlp.wordseg.ChineseDictionary        :   edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
    2018-03-15 12:50:53.657  INFO 1460 --- [io-10301-exec-7] e.s.nlp.wordseg.ChineseDictionary        : Done. Unique words in ChineseDictionary is: 423200.
    2018-03-15 12:50:53.797  INFO 1460 --- [io-10301-exec-7] edu.stanford.nlp.wordseg.CorpusChar      : Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list [done].
    2018-03-15 12:50:53.806  INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.wordseg.AffixDictionary   : Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb [done].