
Python Pandas to PySpark: How to tokenize, remove stop words, and compute trigrams in PySpark

nltk nlp pyspark pandas python

PineNuts0 · Tech Community · 7 years ago

I have the sample DataFrame below. I am running Python code in a Jupyter notebook.

No  category    problem_definition
175 2521       ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438       ['galley', 'work', 'table', 'stuck']
912 2698       ['cloth', 'stuck']
572 2521       ['stuck', 'coffee']

I use the code below to tokenize my text column and remove stop words:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd

# Split each problem description into word tokens.
df['problem_definition_tokenized'] = df['problem_definition'].apply(word_tokenize)

# Build the English stop word set once so each lookup is O(1).
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not stop words.
df['problem_definition_stopwords'] = df['problem_definition_tokenized'].apply(lambda x: [i for i in x if i not in stop_words])

Next, I compute trigrams using the collocations package.

import nltk
from nltk.collocations import TrigramCollocationFinder

trigram_measures = nltk.collocations.TrigramAssocMeasures()

# Build the finder over the stop-word-filtered token lists.
finder = TrigramCollocationFinder.from_documents(df['problem_definition_stopwords'])

# Ignore trigrams that occur fewer than 8 times.
finder.apply_freq_filter(8)

# Top 100 trigrams ranked by pointwise mutual information.
finder.nbest(trigram_measures.pmi, 100)

s = pd.Series(df['problem_definition_stopwords'])

from nltk import ngrams
from collections import Counter

# Flatten the trigrams of every row into a single list.
ngram_list = [gram for row in s for gram in ngrams(row, 3)]

# (trigram, count) pairs, most frequent first.
counts = Counter(ngram_list).most_common()

df = pd.DataFrame.from_records(counts, columns=['gram', 'count'])

df

The result looks like this, where "xxx" stands for a word:

gram               count 
(xxx, xxx, xxx)    23
(xxx, xxx, xxx)    14
(xxx, xxx, xxx)    63
(xxx, xxx, xxx)    28
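
For reference, the counting step can be reproduced with the standard library alone. This is a minimal sketch, not the code from the post: it uses the four token lists from the sample DataFrame, and the names `rows` and `trigrams` are made up for illustration. A `zip`-based sliding window stands in for `nltk.ngrams`.

```python
from collections import Counter

# The four token lists from the sample DataFrame above.
rows = [
    ["coffee", "maker", "brewing", "properly", "2", "420", "420", "420"],
    ["galley", "work", "table", "stuck"],
    ["cloth", "stuck"],
    ["stuck", "coffee"],
]


def trigrams(tokens):
    """Return an iterator over every run of three consecutive tokens."""
    return zip(tokens, tokens[1:], tokens[2:])


# Count each trigram across all rows.
counts = Counter(gram for row in rows for gram in trigrams(row))

for gram, count in counts.most_common(3):
    print(gram, count)
```

Rows with fewer than three tokens, such as `['cloth', 'stuck']`, contribute no trigrams at all.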

I can run all of the code above in plain Python, but when I try to run it in a PySpark environment, it just keeps spinning and never finishes.

Is there a way to convert the code I wrote into PySpark code? I have searched Google but could not find anything definitive.
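
One possible direction, offered as a sketch rather than a confirmed answer: Spark ML ships `Tokenizer`, `StopWordsRemover`, and `NGram` transformers that roughly mirror the three pandas/NLTK steps. The function name `count_trigrams` and the intermediate column names are hypothetical, and the sketch assumes `problem_definition` is a plain string column. Spark's `Tokenizer` lowercases and splits on whitespace, and its default English stop word list is not the same as NLTK's, so the counts may differ from the pandas version.

```python
def count_trigrams(spark_df, text_col="problem_definition"):
    """Return a Spark DataFrame of (gram, count) rows, most frequent first."""
    # Imported inside the function so the sketch can be loaded
    # on a machine where Spark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import NGram, StopWordsRemover, Tokenizer
    from pyspark.sql import functions as F

    pipeline = Pipeline(stages=[
        # Lowercases the text and splits it on whitespace.
        Tokenizer(inputCol=text_col, outputCol="tokens"),
        # Drops Spark's default English stop words.
        StopWordsRemover(inputCol="tokens", outputCol="filtered"),
        # Emits each trigram as a single space-joined string.
        NGram(n=3, inputCol="filtered", outputCol="trigrams"),
    ])
    transformed = pipeline.fit(spark_df).transform(spark_df)

    # One row per trigram occurrence, then count and sort.
    return (
        transformed
        .select(F.explode("trigrams").alias("gram"))
        .groupBy("gram")
        .count()
        .orderBy(F.desc("count"))
    )
```

Given an existing `SparkSession` named `spark`, something like `count_trigrams(spark.createDataFrame(df)).show()` would print the most frequent trigrams. If the column already holds arrays of tokens, as the sample rows suggest, the `Tokenizer` stage can be dropped and the column passed directly to `StopWordsRemover`.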

0 replies | as of 7 years ago
