
Code to create a reliable language model from my own corpus

language-model data-science lstm nlp python

Cranjis · Tech community · 7 years ago

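I have a corpus of sentences in a specific domain. I am looking for an open-source code/package that I can feed the data to and that will generate a good, reliable language model (meaning: given a context, it knows the probability of each word).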

Is there such a code/project?

I saw this GitHub repo: https://github.com/rafaljozefowicz/lm but it didn't work.

1 reply | active until 6 years ago

inkalchemist1994 7 years ago

I suggest writing your own basic implementation. First, we need some sentences:

import nltk
from nltk.corpus import brown
words = brown.words()
total_words = len(words)
sentences = list(brown.sents())

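sentences is now a list of lists. Each sublist represents a sentence, with each word as an element. Now you need to decide whether you want to include punctuation in your model. If you want to remove it, try something like this: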

punctuation = [",", ".", ":", ";", "!", "?"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = [word for word in sentence if word not in punctuation]
    sentences[i] = new_sentence

Next, you need to decide whether you care about capitalization. If you don't, you can remove it like this:

for i, sentence in enumerate(sentences.copy()):
    new_sentence = list()
    for j, word in enumerate(sentence.copy()):
        new_word = word.lower() # Lower case all characters in word
        new_sentence.append(new_word)
    sentences[i] = new_sentence

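Next, we need special start and end tokens to mark which words are valid at the beginning and end of a sentence. You should pick start and end tokens that do not occur in your training data.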

start = ["<<START>>"]
end = ["<<END>>"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = start + sentence + end
    sentences[i] = new_sentence
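The three preprocessing steps above can also be folded into one helper. This is only a minimal sketch: the function name preprocess and its two flags are made up for illustration, and the punctuation list and boundary tokens are simply copied from the snippets above.

```python
# Hypothetical helper combining the steps above; names are illustrative only.
PUNCTUATION = {",", ".", ":", ";", "!", "?"}
START, END = "<<START>>", "<<END>>"

def preprocess(sentence, strip_punctuation=True, lowercase=True):
    """Return a new token list with optional cleanup and boundary tokens."""
    tokens = list(sentence)
    if strip_punctuation:
        # Drop tokens that are exactly one of the punctuation marks.
        tokens = [t for t in tokens if t not in PUNCTUATION]
    if lowercase:
        tokens = [t.lower() for t in tokens]
    # Wrap the sentence so the model can learn sentence boundaries.
    return [START] + tokens + [END]

example = preprocess(["I", "am", "the", "walrus", "."])
print(example)  # ['<<START>>', 'i', 'am', 'the', 'walrus', '<<END>>']
```

With the Brown sentences from earlier this would be used as sentences = [preprocess(s) for s in sentences], which avoids mutating the list while looping over a copy of it.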

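Now, let's count unigrams. A unigram is a sequence of one word in a sentence. Yes, a unigram model is just a frequency distribution over the words in the corpus: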

new_words = list()
for sentence in sentences:
    for word in sentence:
        new_words.append(word)
unigram_fdist = nltk.FreqDist(new_words)

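Now it's time to count bigrams. A bigram is a sequence of two words in a sentence. So, for the sentence "i am the walrus", we have the following bigrams: "<<START>> i", "i am", "am the", "the walrus", and "walrus <<END>>".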

bigrams = list()
for sentence in sentences:
    new_bigrams = nltk.bigrams(sentence)
    bigrams += new_bigrams

Now we can create a frequency distribution:

bigram_fdist = nltk.ConditionalFreqDist(bigrams)
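To see what kind of structure that conditional frequency distribution holds, here is a rough standard-library equivalent on two toy sentences. The sentences are made up for illustration, and defaultdict(Counter) is a stand-in I am assuming behaves similarly for this purpose; it is not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Toy, already-preprocessed sentences (made up for this example).
sentences = [
    ["<<START>>", "i", "am", "the", "walrus", "<<END>>"],
    ["<<START>>", "i", "am", "the", "eggman", "<<END>>"],
]

# Nested counts: bigram_counts[word1][word2] = times word2 follows word1.
bigram_counts = defaultdict(Counter)
for sentence in sentences:
    # zip pairs each token with its successor, i.e. the bigrams.
    for word1, word2 in zip(sentence, sentence[1:]):
        bigram_counts[word1][word2] += 1

print(bigram_counts["i"]["am"])    # 2
print(dict(bigram_counts["the"]))  # {'walrus': 1, 'eggman': 1}
```

Each first word acts as the "condition", and the inner counter records which words followed it and how often.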

Finally, we want to know the probability of each word in the model:

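def getUnigramProbability(word):
    if word in unigram_fdist:
        # Normalize by the number of tokens actually counted (after the
        # preprocessing above), not by the raw corpus size in total_words,
        # so that the unigram probabilities sum to 1.
        return unigram_fdist[word] / unigram_fdist.N()
    else:
        return -1 # You should figure out how you want to handle out-of-vocabulary words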

def getBigramProbability(word1, word2):
    if word1 not in bigram_fdist:
        return -1 # You should figure out how you want to handle out-of-vocabulary words
    elif word2 not in bigram_fdist[word1]:
        # i.e. "word1 word2" never occurs in the corpus
        return getUnigramProbability(word2)
    else:
        bigram_frequency = bigram_fdist[word1][word2]
        unigram_frequency = unigram_fdist[word1]
        bigram_probability = bigram_frequency / unigram_frequency
        return bigram_probability
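One common way to deal with the out-of-vocabulary and unseen-bigram cases is add-one (Laplace) smoothing, which gives every possible next word a small non-zero probability. The following is a sketch under my own assumptions, not part of the code above: it uses only the standard library, the <<UNK>> token and all function names are invented for illustration, and the toy corpus is made up.

```python
import math
from collections import Counter, defaultdict

START, END, UNK = "<<START>>", "<<END>>", "<<UNK>>"  # UNK is a hypothetical token

def train(sentences):
    """Count bigrams and collect the vocabulary from tokenized sentences."""
    bigram_counts = defaultdict(Counter)
    vocab = {UNK}
    for sentence in sentences:
        tokens = [START] + list(sentence) + [END]
        vocab.update(tokens)
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram_counts[w1][w2] += 1
    return bigram_counts, vocab

def bigram_probability(w1, w2, bigram_counts, vocab):
    """Add-one estimate of P(w2 | w1); words outside the vocabulary map to UNK."""
    w1 = w1 if w1 in vocab else UNK
    w2 = w2 if w2 in vocab else UNK
    context = bigram_counts.get(w1, Counter())
    # (count + 1) / (context total + vocabulary size) sums to 1 over the vocabulary.
    return (context[w2] + 1) / (sum(context.values()) + len(vocab))

def sentence_log_probability(sentence, bigram_counts, vocab):
    """Sum of log P(w_i | w_{i-1}) over the sentence, including boundaries."""
    tokens = [START] + list(sentence) + [END]
    return sum(math.log(bigram_probability(w1, w2, bigram_counts, vocab))
               for w1, w2 in zip(tokens, tokens[1:]))

corpus = [["i", "am", "the", "walrus"], ["i", "am", "the", "eggman"]]
counts, vocab = train(corpus)
print(bigram_probability("am", "the", counts, vocab))
print(sentence_log_probability(["i", "am", "the", "walrus"], counts, vocab))
```

Add-one smoothing is crude and tends to give too much mass to unseen events on real corpora, so treat it as a starting point rather than a recommendation; it does at least guarantee that every returned value is a valid probability instead of -1.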

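Although this is not a framework/library that simply builds the model for you, I hope seeing this code has made it clearer what goes on inside a language model.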

Tomas P 6 years ago

You could try word_language_model from the PyTorch examples. There may be a problem if you have a large corpus, though: it loads all of the data into memory.
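If memory is the concern and a count-based model like the one in the other answer is enough, the corpus can be read lazily so that only the counts are kept in memory. This is a sketch under assumptions, and it says nothing about how word_language_model itself loads data: it assumes a plain-text file with one sentence per line, whitespace tokenization, and hypothetical function names.

```python
from collections import Counter, defaultdict

def iter_sentences(path, encoding="utf-8"):
    """Yield one whitespace-tokenized sentence per non-empty line of a text file."""
    with open(path, encoding=encoding) as handle:
        for line in handle:  # the file is read line by line, not all at once
            tokens = line.split()
            if tokens:
                yield tokens

def count_bigrams(sentences, start="<<START>>", end="<<END>>"):
    """Accumulate bigram counts from any iterable of token lists."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [start] + sentence + [end]
        for w1, w2 in zip(tokens, tokens[1:]):
            counts[w1][w2] += 1
    return counts
```

Usage would look like counts = count_bigrams(iter_sentences("corpus.txt")), where corpus.txt is a placeholder file name. The count table still grows with the number of distinct bigrams, so this only removes the cost of holding the raw text.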