
Python and n-grams

  •  0
  • Josh Chilton  · Tech Community  · 8 years ago

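    I am trying to replicate the output of ASTER ngram in Python, using nltk or another module. I need to be able to do this for n-grams of length 1 through 4, and write the result to CSV.

    Input data: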

    Unique_ID, Text_Narrative
    

    Required output:

    Unique_id, ngram(token), ngram(frequency)
    

    2 replies  |  up to 8 years ago
        1
  •  0
  •   Uri Goren    8 years ago

    I wrote this simple version using only Python's standard library, for educational purposes.

    Production code should use spaCy and pandas.

    import collections
    import csv

    # Read "id,text" rows; split on the first comma only, so commas inside the text survive.
    with open("input.csv", 'r') as f:
        data = [l.rstrip('\n').split(',', 1) for l in f if ',' in l]

    # Flatten every document into (id, word) pairs.
    unigrams = [(i, w) for i, d in data for w in d.split()]

    def ngrams(n):
        # Slide a window of n pairs over the flat list; keep it only if all n words share one id.
        for t in zip(*(unigrams[k:] for k in range(n))):
            if all(i == t[0][0] for i, _ in t):
                yield (t[0][0], ' '.join(w for _, w in t))

    with open("output.csv", 'w', newline='') as f:
        writer = csv.writer(f)
        for n in range(1, 5):  # n-grams of length 1 through 4
            counts = collections.Counter(ngrams(n))
            for t, count in counts.items():
                writer.writerow([t[0], t[1], count])
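
    To see what the shifted-slice zip in the code above produces, here is a minimal standard-library sketch on a single token list. The helper name windows and the sample sentence are assumptions made for illustration; they are not part of the original answer.

```python
from collections import Counter


def windows(tokens, n):
    """Return all runs of n consecutive tokens, joined with spaces."""
    # zip stops at the shortest slice, so incomplete windows are dropped.
    return [" ".join(w) for w in zip(*(tokens[k:] for k in range(n)))]


tokens = "the cat sat on the cat".split()  # hypothetical sample text
bigram_counts = Counter(windows(tokens, 2))
print(bigram_counts.most_common(1))  # [('the cat', 2)]
```

    Asking for a window longer than the token list simply yields an empty list, so short texts need no special handling.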
    
        2
  •  0
  •   Axle Max    8 years ago

    As others have said, the question is really vague, but since you are new here, this is a long-form guide:

    from collections import Counter
    
    # Your starting input: a phrase with an ID
    # I added some extra words to show the count
    dict1 = {'023345': 'I love Python love Python Python'}
    
    
    # Split the dict value into a list for counting
    dict1['023345'] = dict1['023345'].split()
    
    # Use Counter to count
    countlist = Counter(dict1['023345'])
    
    # countlist is now "Counter({'Python': 3, 'love': 2, 'I': 1})"
    
    # If you want to output it like you requested, iterate over the dict
    # (Python 3 syntax: items() and print() replace iteritems() and the print statement)
    for id1 in dict1:
        for token, count in countlist.items():
            print(id1, token, count)
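
    The snippet above counts single words only. One possible way to extend the same Counter idea to n-grams of length 1 through 4 and emit the requested id, token, frequency rows is sketched below. The function name ngram_rows, the max_n parameter, and the whitespace tokenization are assumptions for illustration, not something the question specifies.

```python
import csv
import io
from collections import Counter


def ngram_rows(records, max_n=4):
    """Yield (id, ngram, frequency) for every n-gram of length 1..max_n."""
    for uid, text in records.items():
        tokens = text.split()  # naive whitespace tokenization (assumption)
        for n in range(1, max_n + 1):
            # Every run of n consecutive tokens; empty when the text is shorter than n.
            grams = [" ".join(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]
            for gram, freq in Counter(grams).items():
                yield uid, gram, freq


records = {"023345": "I love Python love Python Python"}
rows = list(ngram_rows(records))

# Written to an in-memory buffer here; use open("output.csv", "w", newline="") for a real file.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

    Counting is done per ID, so the same n-gram appearing under two IDs gets two separate rows, which matches the Unique_id, ngram(token), ngram(frequency) layout asked for.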