
Converting averaged perceptron tagger POS tags to WordNet POS tags while avoiding a tuple error

  •  1
  • OverflowingTheGlass  ·  8 years ago

    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer
    from nltk import pos_tag
    from nltk.tokenize import word_tokenize
    
    string = 'dogs runs fast'
    
    tokens = word_tokenize(string)
    tokensPOS = pos_tag(tokens)
    print(tokensPOS)
    

    [('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]
    

    lemmatizedWords = []
    for w in tokensPOS:
        lemmatizedWords.append(WordNetLemmatizer().lemmatize(w))
    
    print(lemmatizedWords)
    

    Resulting error:

    Traceback (most recent call last):
    
      File "<ipython-input-30-462d7c3bdbb7>", line 15, in <module>
        lemmatizedWords = WordNetLemmatizer().lemmatize(w)
    
      File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
        lemmas = wordnet._morphy(word, pos)
    
      File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
        forms = apply_rules([form])
    
      File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
        for form in forms
    
      File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
        if form.endswith(old)]
    
    AttributeError: 'tuple' object has no attribute 'endswith'
    

    I think I have two problems:

    1. The POS tags are not being converted into tags that WordNet understands (I tried to implement something similar to this answer: wordnet lemmatization and pos tagging in python).
    2. The data structure is not formatted correctly for looping through each tuple (beyond that, I could not find much about this error, other than in OS-related code).

    How can I lemmatize with POS tagging to avoid these errors?

    1 Answer  |  8 years ago

  •  2
  •   Jakub Rakus  ·  8 years ago

    The Python interpreter is telling you clearly:

    AttributeError: 'tuple' object has no attribute 'endswith'
    

    tokensPOS is a list of tuples, so you can't pass its elements directly to the lemmatize() method (look at the code of the WordNetLemmatizer class here). That code calls endswith(), which only strings have, so you need to pass the first element of each tuple from tokensPOS:

    lemmatizedWords = []
    for w in tokensPOS:
        lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0]))   
    
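As a quick illustration (a sketch, independent of NLTK and its data files), the traceback boils down to calling a string method on a tuple; indexing out the word first is exactly what fixes it:

```python
# Minimal reproduction of the AttributeError: each item of tokensPOS is a
# (word, tag) tuple, while WordNet's _morphy() expects a string it can call
# methods such as endswith() on.
pair = ('dogs', 'NNS')
print(hasattr(pair, 'endswith'))   # False: tuples have no string methods
print(pair[0].endswith('s'))       # True: the word itself is a str
```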

    The lemmatize() method uses wordnet.NOUN as the default POS. Unfortunately, WordNet uses different tags than the other NLTK corpora, so you have to translate them manually (as shown in the link you provided) and pass the proper tag as the second argument to lemmatize(), using get_wordnet_pos() from this answer:

    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer
    from nltk import pos_tag
    from nltk.tokenize import word_tokenize
    
    def get_wordnet_pos(treebank_tag):
    
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return ''
    
    string = 'dogs runs fast'
    
    tokens = word_tokenize(string)
    tokensPOS = pos_tag(tokens)
    print(tokensPOS)
    
    lemmatizedWords = []
    for w in tokensPOS:
        lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0],get_wordnet_pos(w[1])))
    
    print(lemmatizedWords)