I have the sample data frame below. I am running Python code in a Jupyter notebook.
No   category  problem_definition
175  2521      ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211  1438      ['galley', 'work', 'table', 'stuck']
912  2698      ['cloth', 'stuck']
572  2521      ['stuck', 'coffee']
I tokenize my text column with the code below:
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# Split each problem description into word tokens
df['problem_definition_tokenized'] = df['problem_definition'].apply(word_tokenize)
# Drop English stop words from each token list
stop_words = set(stopwords.words('english'))
df['problem_definition_stopwords'] = df['problem_definition_tokenized'].apply(lambda x: [i for i in x if i not in stop_words])
Next, I computed trigrams using the NLTK collocations package.
import nltk
from nltk.collocations import TrigramCollocationFinder

trigram_measures = nltk.collocations.TrigramAssocMeasures()
# Use the trigram finder so it matches the trigram PMI measure below
finder = TrigramCollocationFinder.from_documents(df['problem_definition_stopwords'])
# Ignore trigrams that occur fewer than 8 times
finder.apply_freq_filter(8)
# Top 100 trigrams ranked by pointwise mutual information
finder.nbest(trigram_measures.pmi, 100)
from collections import Counter
from nltk import ngrams

# Collect every trigram from every row, then count occurrences
ngram_list = [gram for row in df['problem_definition_stopwords'] for gram in ngrams(row, 3)]
counts = Counter(ngram_list).most_common()
# Store the result under a new name so the original df is not overwritten
trigram_df = pd.DataFrame.from_records(counts, columns=['gram', 'count'])
trigram_df
The result looks like this ("xxx" stands for a word):
gram count
(xxx, xxx, xxx) 23
(xxx, xxx, xxx) 14
(xxx, xxx, xxx) 63
(xxx, xxx, xxx) 28
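To make the counting step concrete, here is a minimal sketch that reproduces it without NLTK, using only the four sample rows shown at the top. The `trigrams` helper and the `rows` list are illustrative names, not part of my real code, and the helper is only assumed to behave like `ngrams(row, 3)` for plain lists.

```python
from collections import Counter

# Sample rows copied from the data frame above (already tokenized).
rows = [
    ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
    ['galley', 'work', 'table', 'stuck'],
    ['cloth', 'stuck'],
    ['stuck', 'coffee'],
]


def trigrams(tokens):
    """Return consecutive 3-token windows within one row (hypothetical helper)."""
    return zip(tokens, tokens[1:], tokens[2:])


# Trigrams never span two rows; rows shorter than 3 tokens contribute nothing.
trigram_counts = Counter(gram for row in rows for gram in trigrams(row))

for gram, count in trigram_counts.most_common():
    print(gram, count)
```

On this tiny sample every trigram occurs once, so a frequency filter of 8 like the one above would remove all of them; the real data is much larger.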
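I can run all of the code above in plain Python, but when I try to run it in a PySpark environment, it just keeps spinning.
Is there a way to convert the code I wrote into PySpark code? I searched Google but could not find anything definitive.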
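One possible direction for the counting part, as a rough sketch only: express the per-row trigram extraction as a plain function and let Spark do the counting with `flatMap` and `reduceByKey`. This assumes the data is already in a Spark DataFrame (called `sdf` here, a hypothetical name) whose `problem_definition_stopwords` column holds arrays of strings; it does not cover the tokenizing, stop-word, or PMI steps.

```python
from operator import add


def row_trigrams(tokens, n=3):
    """Return the n-grams of one token list as tuples (hypothetical helper)."""
    return list(zip(*(tokens[i:] for i in range(n))))


def count_trigrams_spark(sdf, col="problem_definition_stopwords"):
    """Count trigrams over a Spark DataFrame column of string arrays.

    `sdf` and `col` are assumptions for illustration. Returns an RDD of
    (trigram, count) pairs sorted by descending count.
    """
    return (
        sdf.select(col).rdd
        .flatMap(lambda row: row_trigrams(row[0]))  # one record per trigram
        .map(lambda gram: (gram, 1))
        .reduceByKey(add)                           # distributed analogue of Counter
        .sortBy(lambda pair: -pair[1])
    )
```

Assuming an active `SparkSession` named `spark`, something like `count_trigrams_spark(spark.createDataFrame(pandas_df)).take(100)` would then fetch the 100 most frequent trigrams, where `pandas_df` stands for the tokenized pandas frame.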