Create a list-column containing only one item (no group by)

tidyr r

Doug Fir · Tech Community · 5 years ago

Here is a workflow that trains an XGB model using tidyr list-columns, rsample folds, and purrr's map functions:

library(tidyverse) # dplyr verbs, purrr map functions, and the diamonds data (ggplot2)
library(rsample)
library(xgboost)
library(Metrics)

# keep just numeric features for this example
pdata_split <- initial_split(diamonds %>% select(-cut, -color, -clarity), 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)

train_cv <- vfold_cv(training_data, 5) %>% 

  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)), 
         validate = map(splits, ~testing(.x)))

# xgb across each fold
mod.xgb <- train_cv %>%

  # convert regression data to a dmatrix for xgb. Just simple price ~ carat for here and now
  mutate(train_dmatrix = map(train, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price)),
         validate_dmatrix = map(validate, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price))) %>% 

  mutate(regression = map(train_dmatrix, ~xgboost(.x, objective = "reg:squarederror", nrounds = 100))) %>% # fit the model
  mutate(predictions =map2(.x = regression, .y = validate_dmatrix, ~predict(.x, .y))) %>% # predictions
  mutate(validation_actuals = map(validate, ~.x$price)) %>% # get the actuals (price is the label) for computing evaluation metrics
  mutate(mae = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::mae(actual = .x, predicted = .y))) %>% # mae
  mutate(rmse = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::rmse(actual = .x, predicted = .y))) # rmse

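My actual script and data use crossing() together with other models that have their own hyperparameters, so that I can select the best model. In practice, then, the real version of the block above lets me compare several models, because it contains several of them.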
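For readers unfamiliar with that pattern, here is a minimal sketch of how crossing() can turn a hyperparameter grid into one row per candidate model. It is not the actual script: the grid values and the names `param_grid` and `models` are assumptions made for illustration, the model is fitted on the full diamonds data rather than per fold, and xgb.train() is used in place of xgboost().

```r
library(dplyr)
library(purrr)
library(tidyr)
library(xgboost)

# simple price ~ carat setup, as in the cross-validation example
d <- ggplot2::diamonds
dtrain <- xgb.DMatrix(as.matrix(d["carat"]), label = d$price)

# hypothetical grid: every combination of the listed values, one per row
param_grid <- crossing(max_depth = c(2, 6), nrounds = c(20, 50))

# fit one model per row and keep it in a list-column
models <- param_grid %>%
  mutate(regression = map2(max_depth, nrounds,
                           ~xgb.train(params = list(objective = "reg:squarederror",
                                                    max_depth = .x),
                                      data = dtrain,
                                      nrounds = .y)))
```

Each row of `models` then carries its own hyperparameters and fitted model, so the later prediction and metric steps can be added with further mutate() calls.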

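I like this workflow because, with dplyr verbs and the pipe operator, I can make changes as needed at each step and then apply them to every fold with the map functions.

Now that I am past the cross-validation stage and into the testing stage, I would like to mimic this "flow", except that I have no folds and so do not need the map_* functions.

However, I still need to make the transformations shown above and add an xgb.DMatrix, since I am using xgboost.

As an example, here is what I created to test my chosen xgb model: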

library(rsample)
library(xgboost)
library(Metrics)

# keep just numeric features for this example
pdata_split <- initial_split(diamonds %>% select(-cut, -color, -clarity), 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)

# create the xgb.DMatrix objects
training_data_xgb_matrix <- xgb.DMatrix(training_data %>% select(-price) %>% as.matrix(), label = training_data$price)
test_data_xgb_matrix <- xgb.DMatrix(testing_data %>% select(-price) %>% as.matrix(), label = testing_data$price)

# create a regression
model_xgb <- xgboost(training_data_xgb_matrix, nrounds = 100, objective = "reg:squarederror")

# predict on test data
xgb_predictions <- predict(model_xgb, test_data_xgb_matrix)

# evaluate using rmse
test_rmse <- rmse(actual = testing_data$price, predicted = xgb_predictions)
test_rmse
# 1370.185

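So that is doing it step by step. My question is whether I can do this in a way similar to the approach I used during cross-validation above, specifically by adding new columns to an existing df / list-column.

What is the "tidy" way to evaluate a model on test data? Is it possible to start with training_data, append the test data in a new column, and then kick off a workflow of mutate() calls that arrives at the same result, with the rmse in its own column?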

training_data %>% 
  (add test data in a new column) %>%
  mutate(convert training data to a xgb.DMatrix) %>%
  mutate(convert test data to a xgb.DMatrix) %>%
  mutate(fit a regression model based on the training data xgb.DMatrix) %>%
  mutate(predict with the regression model on test data xgb.DMatrix) %>%
  mutate(calculate rmse)

Is this possible?
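One way this could look, as a minimal sketch rather than a definitive answer: wrap each data frame in list() so that a one-row tibble holds the training and test sets as list-columns, then reuse the same mutate() and map() steps as in the cross-validation block. The helper `to_dmatrix` and the name `test_results` are hypothetical, xgb.train() is used here instead of xgboost(), and set.seed() is added, so the rmse will not necessarily match the value shown earlier.

```r
library(dplyr)
library(purrr)
library(tibble)
library(rsample)
library(xgboost)
library(Metrics)

set.seed(42) # assumption: fixed seed so the split is reproducible
pdata_split <- initial_split(
  ggplot2::diamonds %>% select(-cut, -color, -clarity),
  prop = 0.9
)

# hypothetical helper: data frame -> xgb.DMatrix with price as the label
to_dmatrix <- function(df) {
  xgb.DMatrix(as.matrix(select(df, -price)), label = df$price)
}

# a single row whose list-columns each hold one whole data frame
test_results <- tibble(
  train = list(training(pdata_split)),
  test  = list(testing(pdata_split))
) %>%
  mutate(train_dmatrix = map(train, to_dmatrix),
         test_dmatrix  = map(test, to_dmatrix)) %>%
  # fit the model on the training DMatrix
  mutate(regression = map(train_dmatrix,
                          ~xgb.train(params = list(objective = "reg:squarederror"),
                                     data = .x, nrounds = 100))) %>%
  # predict on the test DMatrix
  mutate(predictions = map2(regression, test_dmatrix, ~predict(.x, .y))) %>%
  # rmse ends up in its own numeric column
  mutate(rmse = map2_dbl(test, predictions,
                         ~Metrics::rmse(actual = .x$price, predicted = .y)))
```

The map_* functions are still needed because each cell of a list-column is a list element, but with a single row each call runs exactly once. `tidyr::nest(training_data, train = everything())` should give a similar one-row starting point if nesting an existing data frame is preferred.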
