
How to forecast a multivariate time series with Python and sklearn when the future X values are unknown

  •  1
  • user7337539  · Tech Community  · 8 years ago

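    While trying to predict future bitcoin prices, I have run into the following problem:

    I can only predict the y label (for example, the opening price) when I supply all the X features that I used to train my model.

    Here is a snippet of my data (6 feature columns, 1 label):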

                           Open    High     Low    HL-PCT  PCT-change  \
    2016-01-01 00:00:00  430.89  432.58  429.82  0.642129   -0.030161
    2016-01-01 01:00:00  431.51  432.01  429.08  0.682856    0.348829
    2016-01-01 02:00:00  430.00  431.69  430.00  0.393023   -0.132383
    2016-01-01 03:00:00  430.50  433.37  430.03  0.776690   -0.662252
    2016-01-01 04:00:00  433.34  435.72  432.55  0.732863   -0.406794
    2016-01-01 05:00:00  435.11  436.00  434.47  0.352153   -0.066605
    2016-01-01 06:00:00  435.44  435.44  430.08  1.246280    0.440569
    2016-01-01 07:00:00  434.71  436.00  433.50  0.576701    0.126681
    2016-01-01 08:00:00  433.82  434.19  431.00  0.740139   -0.059897
    2016-01-01 09:00:00  433.99  433.99  431.23  0.640030    0.460648

                         Volume (BTC)   Label
    2016-01-01 00:00:00         41.32  434.87
    2016-01-01 01:00:00         31.21  434.44
    2016-01-01 02:00:00         12.25  433.47
    2016-01-01 03:00:00         74.98  431.80
    2016-01-01 04:00:00        870.80  433.28
    2016-01-01 05:00:00         78.53  433.31
    2016-01-01 06:00:00        177.11  433.39

    2016-01-01 08:00:00        210.59  432.80
    2016-01-01 09:00:00        129.68  432.17

    Here is my code:

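    # Imports used below; bf is assumed to be the poster's own data-loading module (not shown)
    import math
    import numpy as np
    import pandas as pd
    from sklearn import preprocessing
    from sklearn.linear_model import LinearRegression

    #First get my own data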
    symbols = ["bitstamp_hourly_2016"]
    timestamp = pd.date_range(start='2016-01-01 00:00', end='2016-12-23 09:00', 
                          freq='1h', periods=None)
    
    df_all = bf.get_data2(symbols, timestamp)    
    #Feature Slicing (copy so the column assignments below do not write to a view)
    df = df_all[['Open', 'High', 'Low', 'Close', 'Volume (BTC)']].copy()
    
    df.loc[:,'HL-PCT'] = (df['High'] - df['Low'])/df['Low']*100.0
    df.loc[:,'PCT-change'] = (df['Open'] - df['Close'])/df['Close']*100.0
    
    #only relevant features
    df = df[['Open', 'High', 'Low', 'HL-PCT', 'PCT-change', 'Volume (BTC)']].copy()
    
    df.fillna(-99999, inplace=True)
    
    #forecast horizon: 0.27% of the rows, which comes to 24 hours for this data set
    forecast_out = int(math.ceil(0.0027*len(df)))
    
    forecast_col = 'Open'
    df['Label'] = df[forecast_col].shift(-forecast_out)
    
    #X Features and y Label
    X = np.array(df.drop(['Label'], axis=1))
    X = preprocessing.scale(X)
    
    #Last 24 hours
    X_lately = X[-forecast_out:]
    X = X[:-forecast_out]
    y = np.array(df['Label'])
    y = y[:-forecast_out]
    
    #Train and Test set
    test_size= int(math.ceil(0.3*len(df)))
    X_train, y_train = X[:-test_size], y[:-test_size]
    X_test, y_test= X[-test_size:], y[-test_size:]
    
    #use linear regression
    clf = LinearRegression(n_jobs=-1)
    clf.fit(X_train, y_train)
    
    #BIG QUESTION: WHAT TO INSERT HERE TO GET THE REAL FUTURE VALUES
    prediction = clf.predict(X_lately)
    
    # The coefficients
    print('Coefficients: \n', clf.coef_)
    # The mean squared error
    print("Mean squared error: %.4f"
          % np.mean((clf.predict(X_test) - y_test) ** 2))
    # Explained variance score: 1 is perfect prediction
    print('Variance score: %.4f' % clf.score(X_test, y_test))
    

    Result:

    How many Hours were predicted:  24
    Coefficients: [  5.30676009e+00   1.05641430e+02   1.44632212e+01       1.47255264e+00
    -1.52247332e+00  -6.26777634e-03]
    Mean squared error: 133.4017
    Variance score: 0.9717
    

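    What I want to do: given only a new date, use the trained model and its knowledge of the past to give me a reasonable result for, say, the next 24 hours (the actual future, for which I have no data). So far I can only call clf.predict() on past data.

    This should be possible with a regression line, but how? I could also use the date as my X dataframe, but wouldn't that make my model useless?

    Thanks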

    1 reply  |  as of 8 years ago
        1
  •  0
  •   Community CDub    8 years ago

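    If you want to stick with linear regression rather than using only the date, you can try forecasting the regressors of your model (with whatever model you like) and then run the linear regression on the forecasted values.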
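    The two-stage idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic, the label is an invented linear function of the features, and `forecast_regressor` is a hypothetical helper that forecasts each regressor with a simple autoregressive linear model. Any other forecasting model could take its place.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def forecast_regressor(series, n_lags, horizon):
    """Hypothetical helper: recursively forecast one regressor with an AR(n_lags) linear model."""
    values = np.asarray(series, dtype=float)
    # Each training row holds n_lags consecutive values; the target is the value that follows.
    lagged = np.array([values[i:i + n_lags] for i in range(len(values) - n_lags)])
    targets = values[n_lags:]
    ar_model = LinearRegression().fit(lagged, targets)

    history = list(values[-n_lags:])
    forecasts = []
    for _ in range(horizon):
        window = np.array(history[-n_lags:]).reshape(1, -1)
        next_value = ar_model.predict(window)[0]
        forecasts.append(next_value)
        history.append(next_value)  # feed the forecast back in for the next step
    return np.array(forecasts)


# Synthetic hourly data standing in for the real features (assumption).
rng = np.random.default_rng(0)
index = pd.date_range("2016-01-01", periods=500, freq="h")
features = pd.DataFrame(
    {
        "Open": 430 + np.cumsum(rng.normal(0, 1, len(index))),
        "Volume (BTC)": rng.uniform(10, 200, len(index)),
    },
    index=index,
)
# Invented label that depends on the features observed at the same time step (assumption).
label = 0.9 * features["Open"] + 0.01 * features["Volume (BTC)"] + rng.normal(0, 0.5, len(index))

# Stage 2 model: ordinary linear regression of the label on the regressors.
model = LinearRegression().fit(features.values, label.values)

# Stage 1: forecast every regressor over the horizon, then feed the forecasts to the regression.
horizon = 24
future_index = pd.date_range(index[-1] + pd.Timedelta(hours=1), periods=horizon, freq="h")
future_features = pd.DataFrame(
    {col: forecast_regressor(features[col], n_lags=24, horizon=horizon) for col in features.columns},
    index=future_index,
)
future_prediction = pd.Series(model.predict(future_features.values), index=future_index, name="Forecast")
print(future_prediction.head())
```

    Errors in the regressor forecasts carry straight into the final prediction, so the result can only be as good as the stage-one forecasts.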

    In any case, the kind of advice you need does not seem to be about programming; I think your question is better suited to https://stats.stackexchange.com/
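    On the programming side, one detail of the question's code may be worth spelling out. Because Label is Open shifted back by forecast_out rows, the row at time t is trained to predict Open at t + forecast_out hours, so the rows in X_lately (whose label is unknown) map to timestamps after the end of the data. A minimal sketch with synthetic prices, where only the timestamp arithmetic is the point and the data and feature set are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic hourly prices standing in for the real data (assumption).
rng = np.random.default_rng(1)
index = pd.date_range("2016-01-01", periods=300, freq="h")
df = pd.DataFrame({"Open": 430 + np.cumsum(rng.normal(0, 1, len(index)))}, index=index)
df["High"] = df["Open"] + rng.uniform(0, 2, len(index))
df["Low"] = df["Open"] - rng.uniform(0, 2, len(index))

forecast_out = 24
df["Label"] = df["Open"].shift(-forecast_out)  # Open 24 hours later; NaN for the last 24 rows

X = df.drop(columns="Label").to_numpy()
X_lately = X[-forecast_out:]   # features whose label lies in the unknown future
X_known = X[:-forecast_out]
y_known = df["Label"].to_numpy()[:-forecast_out]

clf = LinearRegression().fit(X_known, y_known)

# The row at time t predicts Open at t + forecast_out hours, so shift the timestamps forward.
future_index = df.index[-forecast_out:] + pd.Timedelta(hours=forecast_out)
forecast = pd.Series(clf.predict(X_lately), index=future_index, name="Forecast")
print(forecast.head())
```

    Under this setup the first forecast timestamp is one hour after the last observed row, and the last one is forecast_out hours after it.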