
NLTK: detect whether a sentence is a question?

  •  11
  • Freakant  · Tech Community  · 7 years ago

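    I want to create a Python script, using NLTK or whichever library is best suited, that correctly identifies whether a given sentence is interrogative (a question). I tried using regex, but it fails in the more complex cases, so I would like to use natural language processing instead. Can anyone help?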

    3 answers  |  last activity 3 years ago
        1
  •  15
  •   PolkaDot    7 years ago

    This might solve your problem.

    Here is the code:

    import nltk
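    nltk.download('nps_chat')
    # tokenizer models used by nltk.word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead
    nltk.download('punkt')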
    posts = nltk.corpus.nps_chat.xml_posts()[:10000]
    
    
    def dialogue_act_features(post):
        features = {}
        for word in nltk.word_tokenize(post):
            features['contains({})'.format(word.lower())] = True
        return features
    
    featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
    size = int(len(featuresets) * 0.1)
    train_set, test_set = featuresets[size:], featuresets[:size]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))
    

    This should print roughly 0.67, which is decent accuracy. To run your own string (here called line) through this classifier, try:

    print(classifier.classify(dialogue_act_features(line)))
    

    You can classify strings as ynQuestion, Statement, and so on, and extract whatever you need.
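
    To make it clearer what the classifier actually receives, here is a minimal sketch of the feature dictionary. It assumes a simple regex tokenizer as a stand-in for nltk.word_tokenize so that it runs without downloading anything; the tokenize parameter is hypothetical and not part of the answer's code.

```python
import re


def dialogue_act_features(post, tokenize=None):
    # Stand-in tokenizer (an assumption, not nltk.word_tokenize):
    # runs of word characters, or single punctuation marks.
    if tokenize is None:
        tokenize = lambda text: re.findall(r"\w+|[^\w\s]", text)
    features = {}
    for word in tokenize(post):
        # One boolean "this word is present" feature per distinct token.
        features['contains({})'.format(word.lower())] = True
    return features


question = dialogue_act_features("Is it raining?")
statement_like = dialogue_act_features("It is raining?")
print(question)
# Word order is not recorded, so both sentences give the same features.
print(question == statement_like)
```

    Because the features only record which words are present, subject-verb inversion ("Is it" versus "It is") is invisible to the model; the question mark and words such as "what" or "do" carry most of the signal.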

    This approach uses Naive Bayes, which in my opinion is the simplest option, but there are certainly many other ways to handle this. Hope this helps!
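
    For readers unfamiliar with Naive Bayes, the idea can be sketched in a few lines of plain Python. This is a toy illustration only, not NLTK's implementation: the six training sentences, the tokenize helper, and the train/classify function names are all made up for the example, and only words present in the input are scored.

```python
import math
from collections import Counter, defaultdict

# Toy training data (invented for this sketch), labelled with NPS Chat style classes.
TOY_DATA = [
    ("what time is it?", "whQuestion"),
    ("where are you from?", "whQuestion"),
    ("do you like pizza?", "ynQuestion"),
    ("is it raining?", "ynQuestion"),
    ("i like pizza", "Statement"),
    ("it is raining today", "Statement"),
]


def tokenize(text):
    # Stand-in tokenizer (assumption): lowercase, split "?" off as its own token.
    return text.lower().replace("?", " ?").split()


def train(examples):
    """Count, per label, how many training texts contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        label_counts[label] += 1
        for word in set(tokenize(text)):
            word_counts[label][word] += 1
    return label_counts, word_counts


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts = model
    total = sum(label_counts.values())
    vocabulary = set().union(*word_counts.values())
    words = set(tokenize(text)) & vocabulary  # words never seen in training are ignored
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the label
        for word in words:
            # Add-one smoothing for a present/absent feature.
            score += math.log((word_counts[label][word] + 1) / (n + 2))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TOY_DATA)
print(classify(model, "what are you doing?"))
print(classify(model, "do you like it?"))
print(classify(model, "i like pizza today"))
```

    The "naive" part is the assumption that words are independent given the label, which is why the per-word log-probabilities can simply be added.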

        2
  •  4
  •   Sunil Garg    4 years ago

    Based on @PolkaDot's answer, I wrote a function that uses NLTK and then applies some custom code on top to get better accuracy.

    import nltk
    
    posts = nltk.corpus.nps_chat.xml_posts()[:10000]
    
    def dialogue_act_features(post):
        features = {}
        for word in nltk.word_tokenize(post):
            features['contains({})'.format(word.lower())] = True
        return features
    
    featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
    
    # 10% of the total data
    size = int(len(featuresets) * 0.1)
    
    # first 10% for test_set to check the accuracy, and rest 90% after the first 10% for training
    train_set, test_set = featuresets[size:], featuresets[:size]
    
    # get the classifier from the training set
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    # to check the accuracy - 0.67
    # print(nltk.classify.accuracy(classifier, test_set))
    
    question_types = ["whQuestion","ynQuestion"]
    def is_ques_using_nltk(ques):
        question_type = classifier.classify(dialogue_act_features(ques)) 
        return question_type in question_types
    

    Then:

    question_pattern = ["do i", "do you", "what", "who", "is it", "why","would you", "how","is there",
                        "are there", "is it so", "is this true" ,"to know", "is that true", "are we", "am i", 
                       "question is", "tell me more", "can i", "can we", "tell me", "can you explain",
                       "question","answer", "questions", "answers", "ask"]
    
    helping_verbs = ["is","am","can", "are", "do", "does"]
    # if the NLTK classifier did not flag a question, fall back to the custom checks below
    def is_question(question):
        question = question.lower().strip()
        if not is_ques_using_nltk(question):
            is_ques = False
            # check if any of pattern exist in sentence
            for pattern in question_pattern:
                is_ques  = pattern in question
                if is_ques:
                    break
    
            # there could be multiple sentences so divide the sentence
            sentence_arr = question.split(".")
            for sentence in sentence_arr:
                if len(sentence.strip()):
                    # if question ends with ? or start with any helping verb
                    # word_tokenize will strip by default
                    first_word = nltk.word_tokenize(sentence)[0]
                    if sentence.endswith("?") or first_word in helping_verbs:
                        is_ques = True
                        break
            return is_ques    
        else:
            return True
    

    You only need to call is_question to check whether the sentence passed in is a question.
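
    The fallback logic can be tried on its own without training anything. The sketch below mirrors the three checks above (phrase match, trailing question mark, leading helping verb) under a few assumptions: it uses only a subset of the pattern list, swaps nltk.word_tokenize for a regex, splits sentences on ".", "?" and "!" rather than only ".", and takes the classifier as an optional callable. The names rule_based_is_question and the classify parameter are hypothetical.

```python
import re

# A subset of the phrase list from the answer above.
QUESTION_PATTERNS = ("do i", "do you", "what", "who", "why", "how",
                     "is there", "can you explain")
HELPING_VERBS = {"is", "am", "can", "are", "do", "does"}


def rule_based_is_question(text):
    text = text.lower().strip()
    # 1. Any known phrase anywhere in the text (plain substring match).
    if any(pattern in text for pattern in QUESTION_PATTERNS):
        return True
    # 2./3. Per sentence: ends with "?" or starts with a helping verb.
    for sentence in re.split(r"(?<=[.?!])\s+", text):
        words = re.findall(r"[a-z']+", sentence)
        if sentence.endswith("?") or (words and words[0] in HELPING_VERBS):
            return True
    return False


def is_question(text, classify=None):
    # classify is an optional callable returning True for questions,
    # for example a wrapper around a trained classifier.
    if classify is not None and classify(text):
        return True
    return rule_based_is_question(text)


print(is_question("Can we meet tomorrow"))
print(is_question("The sky is blue."))
print(is_question("Show me the door."))
```

    The last call returns True because "how" is a substring of "show"; substring matching is generous, so matching on word boundaries may be worth considering if false positives matter.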

        3
  •  3
  •   Jerry Fanelli    5 years ago

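    Using the sklearn library, you can improve on PolkaDot's solution with a simple gradient boosting classifier and reach roughly 86% accuracy. That would look something like this: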

    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import classification_report
    
    nltk.download('nps_chat')
    posts = nltk.corpus.nps_chat.xml_posts()
    
    
    posts_text = [post.text for post in posts]
    
    # split into 80% train / 20% test, with no overlap between the two
    split = int(len(posts_text) * 0.8)
    train_text = posts_text[:split]
    test_text = posts_text[split:]
    
    # get TF-IDF features over word 1- to 3-grams
    vectorizer = TfidfVectorizer(ngram_range=(1,3), 
                                 min_df=0.001, 
                                 max_df=0.7, 
                                 analyzer='word')
    
    X_train = vectorizer.fit_transform(train_text)
    X_test = vectorizer.transform(test_text)
    
    y = [post.get('class') for post in posts]
    
    y_train = y[:split]
    y_test = y[split:]
    
    # fit a gradient boosting classifier to the training set
    gb = GradientBoostingClassifier(n_estimators = 400, random_state=0)
    # can be improved with cross-validation
    
    gb.fit(X_train, y_train)
    
    predictions_rf = gb.predict(X_test)
    
    # The answer reports about 86% accuracy, but that was measured with a test
    # slice (posts_text[int(len(posts_text)*0.2):]) that overlapped the training
    # data, so this disjoint split may score differently.
    print(classification_report(y_test, predictions_rf))
    

    You can then use the model to make predictions on new data with gb.predict(vectorizer.transform(['new sentence here'])).
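
    When slicing a dataset by hand, it is easy to end up with train and test sets that overlap, which inflates the reported accuracy. A small helper, sketched here with the hypothetical name train_test_slices, keeps the "divide train and test in 80 20" intent explicit and lets you assert that the two parts are disjoint.

```python
def train_test_slices(n_items, train_fraction=0.8):
    """Return (train_slice, test_slice) that together cover 0..n_items without overlap."""
    cut = int(n_items * train_fraction)
    return slice(0, cut), slice(cut, n_items)


items = list(range(10))
train_idx, test_idx = train_test_slices(len(items))
train, test = items[train_idx], items[test_idx]

# The two parts share no element and together contain everything.
assert not set(train) & set(test)
assert train + test == items
print(len(train), len(test))

# By contrast, slicing from the 20% mark onward yields the last 80%,
# which overlaps heavily with the first 80%.
overlapping_test = items[int(len(items) * 0.2):]
print(len(set(train) & set(overlapping_test)))
```

    The same slices can be applied to both the texts and the labels so that X and y stay aligned. Note that this keeps the corpus order; shuffling first is a separate design choice.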