Here is a workflow that trains XGBoost models using tidyr list-columns, rsample folds, and purrr's map functions:
library(tidyverse) # dplyr verbs, purrr's map(), and the diamonds data (ggplot2)
library(rsample)
library(xgboost)
library(Metrics)
# keep just numeric features for this example
pdata_split <- initial_split(diamonds %>% select(-cut, -color, -clarity), 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)
train_cv <- vfold_cv(training_data, 5) %>%
  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)),
         validate = map(splits, ~testing(.x)))
# xgb across each fold
mod.xgb <- train_cv %>%
  # convert regression data to a dmatrix for xgb. Just simple price ~ carat for here and now
  mutate(train_dmatrix = map(train, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price)),
         validate_dmatrix = map(validate, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price))) %>%
  mutate(regression = map(train_dmatrix, ~xgboost(.x, objective = "reg:squarederror", nrounds = 100))) %>% # fit the model
  mutate(predictions = map2(.x = regression, .y = validate_dmatrix, ~predict(.x, .y))) %>% # predictions
  mutate(validation_actuals = map(validate, ~.x$price)) %>% # actuals (price, the label) for computing evaluation metrics
  mutate(mae = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::mae(actual = .x, predicted = .y))) %>% # mae
  mutate(rmse = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::rmse(actual = .x, predicted = .y))) # rmse
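My actual script and data use crossing() together with other models that have their own hyperparameters, so that I can select the best model. In practice, then, the block above lets me compare several models, because it really contains several of them.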
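To illustrate the crossing() idea, here is a minimal sketch with a made-up grid. The fold labels and the max_depth and eta values are placeholders chosen for illustration, not the ones from my actual script.

```r
library(tidyr)

# hypothetical fold labels and hyperparameter values, for illustration only
model_grid <- crossing(
  fold = paste0("Fold", 1:5),
  max_depth = c(3, 6),
  eta = c(0.1, 0.3)
)

# one row per combination of fold and hyperparameter values
nrow(model_grid)
```

Each row of such a grid can then get its own model through the same mutate() and map() pattern used above.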
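I like this workflow because, with dplyr verbs and the pipe operator, I can make changes as needed at each step and then apply them to every fold with the map functions.
Now that I am past the cross-validation stage and into the testing stage, I would like to mimic this "flow", except that I no longer have folds and so do not need the map_* functions.
However, I still need to do the same transformations as above and add an xgb.DMatrix, since I am using xgboost.
For example, here is what I wrote to test my chosen xgb model: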
library(tidyverse) # dplyr verbs and the diamonds data (ggplot2)
library(rsample)
library(xgboost)
library(Metrics)
# keep just numeric features for this example
pdata_split <- initial_split(diamonds %>% select(-cut, -color, -clarity), 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)
# create the xgb.DMatrix objects
training_data_xgb_matrix <- xgb.DMatrix(training_data %>% select(-price) %>% as.matrix(), label = training_data$price)
test_data_xgb_matrix <- xgb.DMatrix(testing_data %>% select(-price) %>% as.matrix(), label = testing_data$price)
# create a regression
model_xgb <- xgboost(training_data_xgb_matrix, nrounds = 100, objective = "reg:squarederror")
# predict on test data
xgb_predictions <- predict(model_xgb, test_data_xgb_matrix)
# evaluate using rmse
test_rmse <- rmse(actual = testing_data$price, predicted = xgb_predictions)
test_rmse
# 1370.185
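So that is doing it step by step. My question is: can I do this in a way similar to the approach I used above during cross-validation, in particular by adding new columns to an existing df / list-column?
What is the "tidy" way to evaluate a model on test data? Is it possible to start with training_data, append the test data in a new column, and then kick off a chain of mutate() calls that ends with the rmse in its own column, giving the same result?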
training_data %>%
(add test data in a new column) %>%
mutate(convert training data to a xgb.DMatrix) %>%
mutate(convert test data to a xgb.DMatrix) %>%
mutate(fit a regression model based on the training data xgb.DMatrix) %>%
mutate(predict with the regression model on test data xgb.DMatrix) %>%
mutate(calculate rmse)
Is this possible?
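To make the idea concrete, here is a minimal sketch of the shape I have in mind, not a confirmed solution. It assumes a one-row tibble whose list-columns hold the training and test sets. The helper to_dmatrix() and the name test_results are hypothetical names I made up, and I use xgb.train() here instead of xgboost() on the assumption that it accepts an xgb.DMatrix directly. Note that list-columns seem to need map() even with a single row.

```r
library(dplyr)    # mutate(), select(), %>%
library(purrr)    # map(), map2(), map2_dbl()
library(tibble)   # tibble()
library(rsample)  # initial_split(), training(), testing()
library(xgboost)  # xgb.DMatrix(), xgb.train()
library(Metrics)  # rmse()

# keep just numeric features, as in the code above (diamonds ships with ggplot2)
pdata_split <- initial_split(
  ggplot2::diamonds %>% select(-cut, -color, -clarity),
  prop = 0.9
)

# hypothetical helper: data frame -> xgb.DMatrix with price as the label
to_dmatrix <- function(df) {
  xgb.DMatrix(df %>% select(-price) %>% as.matrix(), label = df$price)
}

# one row; the training and test sets each live in a list-column
test_results <- tibble(
  train = list(training(pdata_split)),
  test  = list(testing(pdata_split))
) %>%
  mutate(
    train_dmatrix = map(train, to_dmatrix),
    test_dmatrix  = map(test, to_dmatrix),
    # fit on the training DMatrix
    regression = map(train_dmatrix, ~ xgb.train(
      params = list(objective = "reg:squarederror"),
      data = .x,
      nrounds = 100
    )),
    # predict on the test DMatrix
    predictions = map2(regression, test_dmatrix, ~ predict(.x, .y)),
    # compare predictions against the actual test prices
    rmse = map2_dbl(test, predictions,
                    ~ Metrics::rmse(actual = .x$price, predicted = .y))
  )

test_results$rmse
```

The split is random, so the rmse will differ from run to run unless a seed is set beforehand.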