
How do I get the number of occurrences of a list of words (substrings) in a pandas DataFrame?

  •  1
  • dnclem  · 7 years ago

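    I have a pandas DataFrame with approximately 1.5 million rows. I want to find the number of occurrences of specific, selected words (all known in advance) in a certain column. The following works for a single word: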

    d = df["Content"].str.contains("word").value_counts()
    

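    But I want to find the occurrences of multiple known words from a list, such as "word1" and "word2". In addition, word2 may appear as either word2 or wordtwo, and both spellings should be counted together, like so: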

    word1           40
    word2/wordtwo   120
    

    How can I do this?

    2 answers  |  7 years ago
        1
  •  3
  •   MaxU - stand with Ukraine    7 years ago

    IMO one of the most efficient approaches is to use sklearn.feature_extraction.text.CountVectorizer and pass it a vocabulary (the list of words to count).

    Demo (assuming numpy and pandas are already imported as np and pd):

    In [21]: text = """
        ...: I have a pandas data frame with approximately 1.5 million rows. I want to find the number of occurrences of specific, selected words in a certain colu
        ...: mn. This works for a single word. But I want to find out the occurrences of multiple, known words like "word1", "word2" from a list. Also word2 could
        ...: be word2 or wordtwo, like so"""
    
    In [22]: df = pd.DataFrame(text.split('. '), columns=['Content'])
    
    In [23]: df
    Out[23]:
                                                 Content
    0  \nI have a pandas data frame with approximatel...
    1  I want to find the number of occurrences of sp...
    2                       This works for a single word
    3  But I want to find out the occurrences of mult...
    4      Also word2 could be word2 or wordtwo, like so
    
    In [24]: from sklearn.feature_extraction.text import CountVectorizer
    
    In [25]: vocab = ['word', 'words', 'word1', 'word2', 'wordtwo']
    
    In [26]: vect = CountVectorizer(vocabulary=vocab)
    
    In [27]: res = pd.Series(np.ravel((vect.fit_transform(df['Content']).sum(axis=0))),
                             index=vect.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0
    
    In [28]: res
    Out[28]:
    word       1
    words      2
    word1      1
    word2      3
    wordtwo    1
    dtype: int64
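
    A minimal sketch of how the word2/wordtwo case from the question might be handled with the same vectorizer: count every spelling separately, then add the variants of each word together. The groups mapping and the sample rows are made up for illustration. Note that CountVectorizer with default settings lowercases the text and counts whole tokens, not substrings.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample data standing in for the real "Content" column.
df = pd.DataFrame({"Content": [
    "word1 appears here, and word1 again",
    "word2 is sometimes written wordtwo",
    "nothing relevant in this row",
]})

# Hypothetical mapping: output label -> spellings that count towards it.
groups = {"word1": ["word1"], "word2/wordtwo": ["word2", "wordtwo"]}

# Flatten the spellings into the fixed vocabulary for the vectorizer.
vocab = [w for variants in groups.values() for w in variants]

vect = CountVectorizer(vocabulary=vocab)
# Columns follow the order of `vocab`; sum over rows to get totals per spelling.
counts = np.asarray(vect.fit_transform(df["Content"]).sum(axis=0)).ravel()
per_word = pd.Series(counts, index=vocab)

# Add up the spellings that belong to the same label.
grouped = pd.Series({label: int(per_word[variants].sum())
                     for label, variants in groups.items()})
print(grouped)
```

    Passing a fixed vocabulary means the vectorizer only builds columns for the words of interest, which keeps the result small even with millions of rows.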
    
        2
  •  3
  •   Ami Tavory    7 years ago

    You can build a dictionary like this:

    {w: df["Content"].str.contains(w).sum() for w in words}
    

    where words is the list of words.
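
    A possible extension of the dictionary approach, sketched under some assumptions: alternative spellings are joined into one regular expression, and str.count is shown next to str.contains because the two answer different questions (total occurrences versus number of rows with at least one match). The groups mapping, the pattern helper, and the sample rows are hypothetical. Matching is by substring, so "word1" would also match inside "word10".

```python
import re
import pandas as pd

# Made-up sample data standing in for the real "Content" column.
df = pd.DataFrame({"Content": [
    "word1 and word1 again",
    "word2 is sometimes written wordtwo",
    "nothing relevant here",
]})

# Hypothetical mapping: output label -> spellings that count towards it.
groups = {"word1": ["word1"], "word2/wordtwo": ["word2", "wordtwo"]}

def pattern(variants):
    # Escape each spelling and join with "|" so that any variant matches.
    return "|".join(re.escape(v) for v in variants)

# Number of rows that contain at least one spelling of each word.
rows_with = {label: int(df["Content"].str.contains(pattern(v), regex=True).sum())
             for label, v in groups.items()}

# Total number of occurrences across all rows (a row can contribute several).
total = {label: int(df["Content"].str.count(pattern(v)).sum())
         for label, v in groups.items()}

print(rows_with)
print(total)
```

    If the column can contain missing values, passing na=False to str.contains avoids NaN entries in the boolean mask.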