GridSearchCV with time-indexed data

gridsearchcv python

Ishigami · Tech Community · 7 months ago

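I am trying to use GridSearchCV from sklearn.model_selection. My data is a set of classification records indexed by time, so during cross-validation I want each training set to contain only data from before the dates in the corresponding test set.

My training set X_train, y_train looks like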

Time        feature_1 feature_2 result
2020-01-30  3         6         1
2020-02-01  4         2         0
2021-03-02  7         1         0

and the test set X_test, y_test looks like

Time        feature_1 feature_2 result
2023-01-30  3         6         1
2023-02-01  4         2         0
2024-03-02  7         1         0

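Suppose the model I use is xgboost. To optimize the hyperparameters I used GridSearchCV, and the code looks like

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Hyperparameter values to search over
param_grid = {
    'max_depth': [1, 2, 3, 4, 5],
    'min_child_weight': [0, 1, 2, 3, 4, 5],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

# verbosity=0 and n_jobs=1 replace the deprecated silent=True and nthread=1
clf = XGBClassifier(learning_rate=0.02,
                    n_estimators=600,
                    objective='binary:logistic',
                    verbosity=0,
                    n_jobs=1)

# No cv argument is passed yet, so the default splitter is used
grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    scoring='accuracy',
    n_jobs=-1)

grid_search.fit(X_train, y_train)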

But how should I set cv in grid_search? Many thanks in advance.

Edit: I tried setting cv=0, because I want my training data to be strictly "earlier" than the test data, and I got the following error: InvalidParameterError: The 'cv' parameter of GridSearchCV must be an int in the range [2, inf), an object implementing 'split' and 'get_n_splits', an iterable or None. Got 0 instead.

1 reply | as of 7 months ago

Ganesh Bajaj 7 months ago

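The default cross-validation in GridSearchCV does not take time dependence into account when splitting. You can use TimeSeriesSplit from sklearn.model_selection instead of the default CV; it is built for exactly this use case. Pass an instance as the cv argument, for example cv=TimeSeriesSplit(n_splits=5). It splits by row position, so the rows must already be sorted by time.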
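
As a rough illustration of the suggestion above, here is a minimal sketch of passing TimeSeriesSplit to GridSearchCV. The data is synthetic and only stands in for the X_train, y_train from the question, and DecisionTreeClassifier with a tiny grid is an assumed stand-in for XGBClassifier so that the snippet runs with scikit-learn alone; the asker's clf and param_grid should slot into the same places.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic, time-indexed data (hypothetical; stands in for X_train / y_train).
rng = np.random.default_rng(0)
index = pd.date_range("2020-01-01", periods=60, freq="D")
X_train = pd.DataFrame(
    {
        "feature_1": rng.integers(0, 10, size=60),
        "feature_2": rng.integers(0, 10, size=60),
    },
    index=index,
)
y_train = pd.Series(rng.integers(0, 2, size=60), index=index, name="result")

# TimeSeriesSplit splits by row position, so make sure rows are in time order.
X_train = X_train.sort_index()
y_train = y_train.loc[X_train.index]

tscv = TimeSeriesSplit(n_splits=5)

# Each fold trains on an initial segment and validates on the rows right after it.
folds = list(tscv.split(X_train))
for train_idx, val_idx in folds:
    print(f"train rows 0..{train_idx[-1]}, validation rows {val_idx[0]}..{val_idx[-1]}")

grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),  # stand-in for XGBClassifier
    param_grid={"max_depth": [1, 2, 3]},               # reduced grid for the sketch
    scoring="accuracy",
    cv=tscv,                                           # time-ordered folds
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```

With this splitter, every validation fold lies strictly after its training fold, which is the property the question asks for. The held-out X_test, y_test from the question would still be used only for the final evaluation, after the search has finished.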