
Reducing the pickle size of a vectorizer

  • anitasp  · 7 years ago

    from sklearn.feature_extraction.text import TfidfVectorizer
    tfidf = TfidfVectorizer(
            strip_accents = 'ascii', sublinear_tf=True, min_df=5, norm='l2',
            encoding='latin-1', ngram_range=(1, 2), stop_words=spanish_stopwords,
            token_pattern = r'\w+[a-z,ñ]')
    features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()
    
    features.shape
    

    (617, 22997)

    import pickle
    # Use a context manager so the file is flushed and closed after writing
    with open("vectorizer3.pickle", "wb") as f:
        pickle.dump(tfidf, f)
    

    1 reply
  •   Kalsi    4 years ago

    Try using gzip:

    import gzip
    import pickle
    
    # Write the pickled vectorizer through gzip; compression can make this
    # slower than a plain pickle.dump
    with gzip.open('tfidf.data', 'wb') as fp:
        pickle.dump(tfidf, fp)
    
    # Read it back; this assumes tfidf.data was written with gzip as above
    with gzip.open('tfidf.data', 'rb') as fp:
        tfidf = pickle.load(fp)
    

    This method is not guaranteed to bring the file below 10 MB, but it should reduce the size of the pickle file.
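
    To see how much gzip actually saves, one option is to write the same object both ways and compare the file sizes. The sketch below is only an illustration: it uses a made-up vocabulary-like dict as a stand-in for the fitted `tfidf` object (so it runs without scikit-learn), and the helper name `pickle_sizes` and the file names are hypothetical. The ratio for a real vectorizer will differ.

```python
import gzip
import os
import pickle
import tempfile


def pickle_sizes(obj, directory):
    """Pickle obj plain and through gzip; return (plain_bytes, gzip_bytes)."""
    plain_path = os.path.join(directory, "obj.pickle")
    gz_path = os.path.join(directory, "obj.pickle.gz")
    with open(plain_path, "wb") as fp:
        pickle.dump(obj, fp, protocol=pickle.HIGHEST_PROTOCOL)
    with gzip.open(gz_path, "wb") as fp:
        pickle.dump(obj, fp, protocol=pickle.HIGHEST_PROTOCOL)
    return os.path.getsize(plain_path), os.path.getsize(gz_path)


# Hypothetical stand-in for a fitted vectorizer: a term -> index mapping
# resembling a unigram/bigram vocabulary. Replace with your own object.
vocabulary = {f"token_{i} token_{i + 1}": i for i in range(20000)}

with tempfile.TemporaryDirectory() as tmp:
    plain_size, gz_size = pickle_sizes(vocabulary, tmp)
    # Round-trip check: the gzip-compressed pickle loads back unchanged
    with gzip.open(os.path.join(tmp, "obj.pickle.gz"), "rb") as fp:
        restored = pickle.load(fp)

print(f"plain: {plain_size} bytes, gzip: {gz_size} bytes")
```

    With a real vectorizer, pass `tfidf` in place of `vocabulary`. If the compressed file is still too large, the vectorizer's contents (for example, the vocabulary kept by `min_df` and `ngram_range`) are what determine the size, not the file format.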