
Confused by the return value of TfidfVectorizer.fit_transform

  •  4
  • Huzo  · Technical community  · 7 years ago

    I want to learn more about NLP, and I came across this code. I am familiar with what tf-idf is, but I am confused by the printed result of TfidfVectorizer.fit_transform — I don't understand what the numbers mean.

    import tensorflow as tf 
    import numpy as np 
    from sklearn.feature_extraction.text import TfidfVectorizer
    import os 
    import io
    import string 
    import requests 
    import csv 
    import nltk
    from zipfile import ZipFile 
    
    sess = tf.Session()
    
    batch_size = 100
    max_features = 1000
    
    save_file_name = os.path.join('smsspamcollection','SMSSpamCollection.csv')
    if os.path.isfile(save_file_name):
        text_data = []
        with open(save_file_name,'r') as temp_output_file:
            reader = csv.reader(temp_output_file)
            for row in reader:
                text_data.append(row)
    
    else:
        zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
        r = requests.get(zip_url)
        z = ZipFile(io.BytesIO(r.content))
        file = z.read('SMSSpamCollection')
    
        #Format data 
        text_data = file.decode()
        text_data = text_data.encode('ascii',errors='ignore')
        text_data = text_data.decode().split('\n')
        text_data = [x.split('\t') for x in text_data if len(x)>=1]
    
        #And write to csv 
        with open(save_file_name,'w') as temp_output_file:
            writer = csv.writer(temp_output_file)
            writer.writerows(text_data)
    
    texts = [x[1] for x in text_data]
    target = [x[0] for x in text_data]
    target = [1 if x=='spam' else 0 for x in target]
    
    
    #Normalize the text
    texts = [x.lower() for x in texts] #lower
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts] #remove punctuation
    texts = [''.join(c for c in x if c not in '0123456789') for x in texts] #remove numbers
    texts = [' '.join(x.split()) for x in texts] #trim extra whitespace
    
    def tokenizer(text):
        words = nltk.word_tokenize(text)
        return words
    
    tfidf = TfidfVectorizer(tokenizer=tokenizer, stop_words='english', max_features=max_features)
    sparse_tfidf_texts = tfidf.fit_transform(texts)
    print(sparse_tfidf_texts)
    

    The output is:

    (0, 630)	0.37172623140154337
    (0, 160)	0.36805562944957004
    (0, 38)	0.3613966215413548
    (0, 545)	0.2561101665717327
    (0, 326)	0.264528991765623
    (0, 967)	0.32774477602873963
    (0, 421)	0.3896274380321477
    (0, 227)	0.28102915589024796
    (0, 323)	0.22032541100275282
    (0, 922)	0.2709848154866997
    (1, 577)	0.4007895093299793
    (1, 425)	0.5970064521899725
    (1, 943)	0.6310763941180291
    (1, 878)	0.29102173465492637
    (2, 282)	0.1771481430848552
    (2, 43)	0.5517018054305785
    (2, 955)	0.2920174942032025
    (2, 138)	0.30143666813167863
    (2, 946)	0.2269933441326121
    (2, 165)	0.3051095293405041
    (2, 268)	0.2820392223588522
    (2, 780)	0.2411962642264894
    (2, 823)	0.1890454397278538
    (2, 674)	0.256251970757827
    (2, 874)	0.19343834015314287
    :	:
    (5569, 648)	0.241716524922226922
    (5569, 123)	0.2301190939432202
    (5569, 957)	0.24817919217662862
    (5569, 549)	0.28583789844730134
    (5569, 863)	0.3026729783085827
    (5569, 844)	0.20228305447951195
    (5569, 146)	0.2514415602877767
    (5569, 595)	0.246325987380789
    (5569, 511)	0.3091904754885042
    (5569, 230)	0.2872728684768659
    (5569, 638)	0.3415390143548765
    (5569, 83)	0.3464271621701711
    (5570, 370)	0.41999120000421362
    (5570, 46)	0.48234172093857797
    (5570, 317)	0.4171646676697801
    (5570, 281)	0.6456993475093024
    (5572, 282)	0.25540827228532487
    (5572, 385)	0.36945842040023935
    (5572, 448)	0.25540827228532487
    (5572, 931)	0.3031800542518209
    (5572, 192)	0.29866989620926737
    (5572, 303)	0.43990016711221736
    (5572, 87)	0.4521284173737176
    (5572, 332)	0.3924202767503492
    (5573, 866)	1.0

    I would be very happy if someone could explain this output.

    1 Answer  |  7 years ago
        1
  •  5
  •   Jan K    7 years ago

    Note that you are printing a sparse matrix, so the output looks different from printing a standard dense matrix. The main parts are:

    • Each tuple represents: (document_id, token_id)
    • The value following the tuple is the tf-idf score of the given token in the given document
    • Tuples that are not listed have a tf-idf score of 0

    If you want to find which token a token_id corresponds to, check the get_feature_names method.
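
    To illustrate, here is a minimal sketch on a tiny three-document corpus (the corpus and variable names are my own, not from the question). It prints the same kind of (document_id, token_id) score triples and then maps each token_id back to its token string. It assumes a recent scikit-learn, where get_feature_names was renamed get_feature_names_out:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer

    # toy corpus for illustration (not the SMS spam data from the question)
    corpus = ["spam offer now", "hello friend", "free spam offer"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(corpus)

    # prints sparse (document_id, token_id)  score triples,
    # in the same format as the output in the question
    print(X)

    # map each token_id back to its token string
    # (on older scikit-learn versions use vec.get_feature_names() instead)
    feature_names = vec.get_feature_names_out()
    for doc_id, token_id in zip(*X.nonzero()):
        print(doc_id, feature_names[token_id], X[doc_id, token_id])
    ```

    Pairs that never appear in the printout are simply zero entries of the matrix; calling X.toarray() shows them explicitly as a dense document-by-vocabulary array.
    
    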