
Dropping useless columns from a DataFrame

  • ItFreak  · 7 years ago

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plp
    from sklearn import preprocessing
    from sklearn.cluster import KMeans

    data = pd.read_csv('statistic.csv', parse_dates=True, index_col=['DATE'],
                       low_memory=False)
    data[['QUANTITY']] = data[['QUANTITY']].apply(pd.to_numeric, errors='coerce')
    data_extracted = data.groupby(['DATE', 'ARTICLENO'])['QUANTITY'].sum().unstack()
    # replace string nan with numpy data type
    data_extracted = data_extracted.fillna(value=np.nan)
    # remove footer of csv file
    data_extracted.index = pd.to_datetime(data_extracted.index.str[:-2],
                                          errors="coerce")
    # resample to a one-week rhythm
    # (loffset was removed in pandas 2.0, so the labels are shifted by one day instead)
    data_resampled = data_extracted.resample('W-MON', label='left').sum()
    data_resampled.index = data_resampled.index + pd.DateOffset(days=1)
    # reduce to one year
    data_extracted = data_extracted.loc['2015-01-01':'2015-12-31']
    # fill possible NaNs with 1 (not 0, because of division by zero when doing pct_change)
    data_extracted = data_extracted.replace([np.inf, -np.inf], np.nan).fillna(1)
    data_pct_change = (data_extracted.astype(float).pct_change(axis=0)
                       .replace([np.inf, -np.inf], np.nan).fillna(0))
    # actual dropping logic if column has no values at all
    # (Series.iteritems was removed in pandas 2.0; items is the replacement)
    data_pct_change.drop([col for col, val in data_pct_change.sum().items()
                          if val == 0], axis=1, inplace=True)
    # scale every column to unit L2 norm
    normalized_modeling_data = preprocessing.normalize(data_pct_change,
                                                       norm='l2', axis=0)
    normalized_data_headers = pd.DataFrame(normalized_modeling_data,
                                           columns=data_pct_change.columns)
    # transpose so that each article (column) becomes one sample for k-means
    normalized_modeling_data = normalized_modeling_data.transpose()
    kmeans = KMeans(n_clusters=3, random_state=0).fit(normalized_modeling_data)
    print(kmeans.labels_)
    np.savetxt('log_2016.txt', kmeans.labels_, newline="\n")
    for i, cluster_center in enumerate(kmeans.cluster_centers_):
        plp.plot(cluster_center, label='Center {0}'.format(i))
    plp.legend(loc='best')
    plp.show()
    

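    The problem is that my DataFrame contains a lot of zeros (the articles do not all start on the same day, so if article A starts in 2015 and B starts in 2016, B is 0 for the whole of 2015). Here is the grouped DataFrame: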

    ARTICLENO     205123430604  205321436644  405659844106  305336746308  
    DATE                                                                     
    2015-01-05            9.0            6.0          560.0         2736.0   
    2015-01-19            2.0            1.0          560.0         3312.0   
    2015-01-26            NaN            5.0          600.0         2196.0   
    2015-02-02            NaN            NaN           40.0         3312.0   
    2015-02-16            7.0            6.0          520.0         5004.0   
    2015-02-23           12.0            4.0          480.0         4212.0   
    2015-04-13           11.0            6.0          920.0         4230.0 
    

    ARTICLENO     205123430604   205321436644  405659844106  305336746308  
    DATE                                                                     
    2015-01-05       0.000000       0.000000       0.000000       0.000000   
    2015-01-19      -0.777778      -0.833333       0.000000       0.210526   
    2015-01-26      -0.500000       4.000000       0.071429      -0.336957   
    2015-02-02       0.000000      -0.800000      -0.933333       0.508197   
    2015-02-16       6.000000       5.000000      12.000000       0.510870   
    2015-02-23       0.714286      -0.333333      -0.076923      -0.158273 
    

    The factor of 12 for 405659844106 is "correct": the quantity really did rise from 40 to 520.

    ARTICLENO     305123446353  205423146377  305669846421  905135949255  
    DATE                                                                     
    2015-01-05         2175.0          200.0            NaN            NaN   
    2015-01-19         2550.0            NaN            NaN            NaN   
    2015-01-26          925.0            NaN            NaN            NaN   
    2015-02-02          675.0            NaN            NaN            NaN   
    2015-02-16         1400.0          200.0          120.0            NaN   
    2015-02-23         6125.0          320.0            NaN            NaN   
    

    The corresponding percentage changes:

    ARTICLENO      305123446353  205423146377  305669846421    905135949255  
    DATE                                                                  
    2015-01-05       0.000000       0.000000       0.000000    0.000000   
    2015-01-19       0.172414      -0.995000       0.000000   -0.058824   
    2015-01-26      -0.637255       0.000000       0.000000    0.047794   
    2015-02-02      -0.270270       0.000000       0.000000   -0.996491   
    2015-02-16       1.074074     199.000000     119.000000  279.000000   
    2015-02-23       3.375000       0.600000      -0.991667    0.310714   
    

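    As you can see, the changes by a factor of 200 to 300 come from the NaN values that were replaced with 1, not from real changes in quantity.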
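
To see where such a factor comes from, here is a minimal sketch that reproduces the 205423146377 column from the tables above in isolation. The variable names `quantity` and `filled` are only illustrative and do not appear in the original code.

```python
import numpy as np
import pandas as pd

# Quantities for article 205423146377, copied from the grouped table above
quantity = pd.Series(
    [200.0, np.nan, np.nan, np.nan, 200.0, 320.0],
    index=pd.to_datetime(["2015-01-05", "2015-01-19", "2015-01-26",
                          "2015-02-02", "2015-02-16", "2015-02-23"]),
)

# Same steps as in the question: fill gaps with 1, then take the percentage change
filled = quantity.fillna(1)
pct_change = filled.pct_change().replace([np.inf, -np.inf], np.nan).fillna(0)
print(pct_change)
```

The step from the placeholder value 1 back to a real quantity of 200 gives 200 / 1 - 1 = 199, which is the outlier visible in the table.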

    This data is used for k-means clustering, and such "nonsense" values ruin my k-means centers.

    Does anyone know how to remove such columns?

    1 Answer

  • ItFreak  · 7 years ago

    I removed the meaningless columns with the following statement:

    max_nan_value_count = 5
    # drop every column with more than max_nan_value_count missing values
    too_many_nans = data_extracted.isnull().sum() > max_nan_value_count
    data_extracted = data_extracted.drop(data_extracted.columns[too_many_nans], axis=1)
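
As a rough illustration of the same idea, the sketch below applies the NaN-count filter to a small made-up frame and compares it with `DataFrame.dropna` and its `thresh` parameter, which should select the same columns. The frame `df` and its column names are hypothetical, not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" is mostly empty, "a" and "c" are mostly filled
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 5.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})
max_nan_value_count = 1

# Variant 1: boolean mask over the per-column NaN counts
nan_counts = df.isnull().sum()
kept_by_mask = df.loc[:, nan_counts <= max_nan_value_count]

# Variant 2: dropna keeps columns with at least `thresh` non-NaN values
kept_by_dropna = df.dropna(axis=1, thresh=len(df) - max_nan_value_count)

print(list(kept_by_mask.columns))
print(list(kept_by_dropna.columns))
```

Either variant has to run before the `fillna(1)` step from the question, because after that step no NaN values are left to count.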