
How to detect aboutness using a Python POS tagger

  • lrrr  · Tech Community  · 9 years ago

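    I am working in Python to take Facebook statuses and tell what each status is about and what its sentiment is. Essentially, I need to know what the sentiment refers to. I have already written a working sentiment analyzer, so the difficulty is getting a POS tagger to work out what the sentiment is referring to.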

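    If you have any suggestions from experience, I would be grateful. I have read a few papers on computing aboutness from subject-object, NP-PP, and NP-NP relations, but I have not seen any good examples, and I have not found many papers.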

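    Lastly, if you have worked with POS taggers: what is my best option in Python as a non-computer scientist? I am a physicist, so I can hack code together, but I do not want to reinvent the wheel if a package already contains everything I need.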

    Thank you very much in advance!

    1 answer  |  as of 9 years ago
  •   lrrr    9 years ago

    This is what I found to be useful. I am going to edit it and use it with the NLTK POS tagger to see what results I can get.

    import csv  # used by main() to read fb_data1.csv
    
    import nltk
    from nltk.corpus import brown
    
    # http://thetokenizer.com/2013/05/09/efficient-way-to-extract-the-main-topics-of-a-sentence/
    
    
    # This is our fast Part of Speech tagger
    #############################################################################
    brown_train = brown.tagged_sents(categories='news')
    regexp_tagger = nltk.RegexpTagger(
        [(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # numbers; the decimal point is escaped
         (r'(-|:|;)$', ':'),
         (r'\'*$', 'MD'),
         (r'(The|the|A|a|An|an)$', 'AT'),
         (r'.*able$', 'JJ'),
         (r'^[A-Z].*$', 'NNP'),
         (r'.*ness$', 'NN'),
         (r'.*ly$', 'RB'),
         (r'.*s$', 'NNS'),
         (r'.*ing$', 'VBG'),
         (r'.*ed$', 'VBD'),
         (r'.*', 'NN')
    ])
    unigram_tagger = nltk.UnigramTagger(brown_train, backoff=regexp_tagger)
    bigram_tagger = nltk.BigramTagger(brown_train, backoff=unigram_tagger)
    #############################################################################
    
    
    # This is our semi-CFG; Extend it according to your own needs
    #############################################################################
    cfg = {}
    cfg["NNP+NNP"] = "NNP"
    cfg["NN+NN"] = "NNI"
    cfg["NNI+NN"] = "NNI"
    cfg["JJ+JJ"] = "JJ"
    cfg["JJ+NN"] = "NNI"
    #############################################################################
    
    
    class NPExtractor(object):
    
        def __init__(self, sentence):
            self.sentence = sentence
    
        # Split the sentence into single words/tokens
        def tokenize_sentence(self, sentence):
            tokens = nltk.word_tokenize(sentence)
            return tokens
    
        # Normalize brown corpus' tags ("NN", "NN-PL", "NNS" > "NN")
        def normalize_tags(self, tagged):
            n_tagged = []
            for t in tagged:
                if t[1] == "NP-TL" or t[1] == "NP":
                    n_tagged.append((t[0], "NNP"))
                    continue
                if t[1].endswith("-TL"):
                    n_tagged.append((t[0], t[1][:-3]))
                    continue
                if t[1].endswith("S"):
                    n_tagged.append((t[0], t[1][:-1]))
                    continue
                n_tagged.append((t[0], t[1]))
            return n_tagged
    
        # Extract the main topics from the sentence
        def extract(self):
    
            tokens = self.tokenize_sentence(self.sentence)
            tags = self.normalize_tags(bigram_tagger.tag(tokens))
    
            merge = True
            while merge:
                merge = False
                for x in range(0, len(tags) - 1):
                    t1 = tags[x]
                    t2 = tags[x + 1]
                    key = "%s+%s" % (t1[1], t2[1])
                    value = cfg.get(key, '')
                    if value:
                        merge = True
                        tags.pop(x)
                        tags.pop(x)
                        match = "%s %s" % (t1[0], t2[0])
                        pos = value
                        tags.insert(x, (match, pos))
                        break
    
            matches = []
            for t in tags:
                # To also keep plain nouns, add: or t[1] == "NN"
                if t[1] == "NNP" or t[1] == "NNI":
                    matches.append(t[0])
            return matches
    
    
    # Main method, just run "python np_extractor.py"
    Summary="""
    
    
    Verizon has not honored this appointment or notified me of the delay in an appropriate manner. It is now 1:20 PM and the only way I found out of a change is that I called their chat line and got a message saying my appointment is for 2 PM. My cell phone message says the original time as stated here.
    
    
    """
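    def main(topic):
        # Read every row of the CSV file; each row is a list of cells
        with open('fb_data1.csv', 'r', newline='') as csv_file:
            facebook_data = list(csv.reader(csv_file))
    
        relevant_sentences = []
        for row in facebook_data:
            # Assumption: the status text is the row's cells joined together
            status = ' '.join(row)
            for sentence in status.split('.'):
                if not sentence.strip():
                    continue  # skip empty fragments left by the split
                np_extractor = NPExtractor(sentence)
                result = np_extractor.extract()
                if topic in result:
                    relevant_sentences.append(sentence)
                print(sentence)
                print("This sentence is about: %s" % ", ".join(result))
        # Return only after every status has been processed
        return relevant_sentences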
    
    if __name__ == '__main__':
        result=main('Verizon')
    

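    Note that it only keeps the sentences that are relevant to the topic you define. So if I am analyzing statuses about cheese, I can pass that as the topic, extract all the sentences about cheese, and then run sentiment analysis on just those sentences. If you have comments or suggestions for improving this, please let me know!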
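
    To see the merge step in isolation, here is a minimal sketch that applies the same five rules from the cfg dictionary to tokens that are already tagged, so it runs without NLTK or the Brown corpus. The names RULES, merge_tags, topics and sentences_about are made up for this illustration, and the example tags are written by hand; a real tagger may well assign different ones. It also assumes the tags have already been normalized.

```python
# Rules copied from the cfg dictionary in the code above.
RULES = {
    "NNP+NNP": "NNP",
    "NN+NN": "NNI",
    "NNI+NN": "NNI",
    "JJ+JJ": "JJ",
    "JJ+NN": "NNI",
}


def merge_tags(tagged):
    """Repeatedly merge adjacent (word, tag) pairs that match a rule."""
    tags = list(tagged)
    merged = True
    while merged:
        merged = False
        for i in range(len(tags) - 1):
            (w1, t1), (w2, t2) = tags[i], tags[i + 1]
            new_tag = RULES.get(t1 + "+" + t2)
            if new_tag:
                # Replace the two entries with one combined phrase, then rescan
                tags[i:i + 2] = [(w1 + " " + w2, new_tag)]
                merged = True
                break
    return tags


def topics(tagged):
    """Return the merged phrases tagged as proper nouns (NNP) or noun phrases (NNI)."""
    return [word for word, tag in merge_tags(tagged) if tag in ("NNP", "NNI")]


def sentences_about(topic, tagged_sentences):
    """Keep only the tagged sentences whose extracted topics include `topic`."""
    return [s for s in tagged_sentences if topic in topics(s)]


# Hand-tagged tokens (hypothetical; not the output of a real tagger).
example = [("Verizon", "NNP"), ("missed", "VBD"), ("the", "AT"),
           ("original", "JJ"), ("appointment", "NN"), ("time", "NN")]

print(topics(example))  # ['Verizon', 'original appointment time']
```

    Here "original" + "appointment" first merges into an NNI phrase via the JJ+NN rule, and that phrase then absorbs "time" via NNI+NN. The sentences that sentences_about keeps are the ones you would pass on to the sentiment analyzer.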