
How do I build a gensim dictionary that includes bigrams?

  •  2
  • fraxture  · asked 7 years ago

    I'm trying to build a Tf-Idf model with gensim. To do this, I build a gensim dictionary and then use that dictionary to create the bag-of-words representations of the corpus from which the model is built.

    import gensim

    dict = gensim.corpora.Dictionary(tokens)
    

    where tokens is a list of unigrams and bigrams like the following:

    [('restore',),
     ('diversification',),
     ('made',),
     ('transport',),
     ('The',),
     ('grass',),
     ('But',),
     ('distinguished', 'newspaper'),
     ('came', 'well'),
     ('produced',),
     ('car',),
     ('decided',),
     ('sudden', 'movement'),
     ('looking', 'glasses'),
     ('shapes', 'replaced'),
     ('beauties',),
     ('put',),
     ('college', 'days'),
     ('January',),
     ('sometimes', 'gives')]
    

    However, when I pass a bigram tuple to gensim.corpora.Dictionary(), it gets split into its individual words:

    test = gensim.corpora.Dictionary([(('happy', 'dog'))])
    [test[id] for id in test]
    => ['dog', 'happy']
    
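    This happens because Dictionary treats each element of the list it receives as a document, i.e. an iterable of string tokens, so the tuple ('happy', 'dog') is read as a two-word document. One workaround (a minimal sketch; joined_tokens is an illustrative name, not from the original post) is to join each n-gram tuple into a single string before building the dictionary:

    import gensim

    # join every n-gram tuple into one string token,
    # e.g. ('happy', 'dog') -> 'happy dog'
    joined_tokens = [' '.join(ngram) for ngram in tokens]

    # Dictionary expects a list of documents (each an iterable of tokens),
    # so the joined tokens are wrapped as a single document here
    test = gensim.corpora.Dictionary([joined_tokens])
    print([test[id] for id in test])  # bigrams now survive as single entries
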

    2 Answers  |  7 years ago
        1
  •  4
  •   EzLo tumao kaixin    6 years ago
    import gensim
    from gensim.models import Phrases
    from gensim.models.phrases import Phraser
    from gensim import models

    docs = ['new york is is united states', 'new york is most populated city in the world','i love to stay in new york']

    token_ = [doc.split(" ") for doc in docs]
    # low min_count/threshold so 'new york' is detected in this tiny corpus;
    # note: pre-4.0 gensim API; in gensim 4.x the delimiter should be a str such as ' '
    bigram = Phrases(token_, min_count=1, threshold=2, delimiter=b' ')

    # freeze the detected phrases into a lighter Phraser object
    bigram_phraser = Phraser(bigram)

    bigram_token = []
    for sent in token_:
        bigram_token.append(bigram_phraser[sent])
    

    The output is: [['new york', 'is', 'is', 'united', 'states'],['new york', 'is', 'most', 'populated', 'city', 'in', 'the', 'world'],['i', 'love', 'to', 'stay', 'in', 'new york']]

    # now you can build a dictionary from the bigram tokens
    dict = gensim.corpora.Dictionary(bigram_token)

    print(dict.token2id)
    # convert each document into a bag-of-words vector; now you can use the tfidf model from gensim
    corpus = [dict.doc2bow(text) for text in bigram_token]

    tfidf_model = models.TfidfModel(corpus)
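
    To see the resulting weights, the fitted model can be applied to any bag-of-words vector built with the same dictionary. A small illustrative follow-up (the query text is made up):

    # phrase a new document, map it to bag-of-words, then weight it
    query = bigram_phraser['i love new york'.split(' ')]
    print(tfidf_model[dict.doc2bow(query)])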
    
        2
  •  0
  •   fbparis    6 years ago

    You have to "phrase" your corpus to detect bigrams before creating the dictionary.

    I'd suggest you also stem or lemmatize the tokens before feeding them to the dictionary; here is an example using an nltk stemmer:

    import re
    from gensim.models.phrases import Phrases, Phraser
    from gensim.corpora.dictionary import Dictionary
    from gensim.models import TfidfModel
    from nltk.stem.snowball import SnowballStemmer as Stemmer
    
    stemmer = Stemmer("YOUR_LANG") # see nltk.stem.snowball doc
    
    stopWords = {"YOUR_STOPWORDS_FOR_LANG"} # as a set
    
    docs = ["LIST_OF_STR"]
    
    def tokenize(text):
        """
        return list of str from a str
        """
        # keep lowercase alphanums and "-" but not "_"
        return [w for w in re.split(r"_+|[^\w-]+", text.lower()) if w not in stopWords]
    
    docs = [tokenize(doc) for doc in docs]
    phrases = Phrases(docs)
    bigrams = Phraser(phrases)
    corpus = [[stemmer.stem(w) for w in bigrams[doc]] for doc in docs]
    dictionary = Dictionary(corpus)
    # and here is your tfidf model:
    tfidf = TfidfModel(dictionary=dictionary, normalize=True)
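
    Once fitted, scoring an unseen document requires running it through the same tokenize/phrase/stem pipeline before mapping it into the dictionary. An illustrative follow-up (the sample text is a placeholder):

    # preprocess an unseen document exactly like the training corpus
    new_doc = [stemmer.stem(w) for w in bigrams[tokenize("SOME_NEW_TEXT")]]
    # doc2bow maps it into the fitted dictionary; the tfidf model weights it
    print(tfidf[dictionary.doc2bow(new_doc)])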