Introduction

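I am writing up how to use tidymodels across several notes. tidymodels is a meta-package that bundles the packages needed to build statistical and machine learning models, and it covers a wide range of them. This note covers the tune and dials packages. The mathematics behind the models and general machine learning terminology are out of scope here.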

This note draws on the official documentation listed below and on books about tidymodels.

Purpose of the `tune` and `dials` packages

The tune package runs hyperparameter tuning efficiently for models built with tidymodels, while the dials package creates and manages the values of those tuning parameters. Used together in a tidymodels workflow, they cover both defining the candidate values and evaluating them.

The official documentation for each package is listed below.

tune
dials

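I tried to name variables after the packages and modeling steps they relate to, and the names ended up rather confusing as a result. Please bear with them.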

Worked example with the `tune/dials` packages

Before looking at basic usage of tune/dials, define the objects the example needs. The goal is to show how the functions are used, so the recipe here is only as good as it needs to be to run.

library(tidymodels)
library(tidyverse)
df_past <- read_csv("https://raw.githubusercontent.com/SugiAki1989/statistical_note/main/note_TidyModels00/df_past.csv")

# rsample
set.seed(1989)
df_initial <- df_past %>% initial_split(prop = 0.8, strata = "Status")
df_train <- df_initial %>% training()
df_test <- df_initial %>% testing()

set.seed(1989)
df_train_stratified_splits <- 
  vfold_cv(df_train, v = 5, strata = "Status")

# recipes
recipe <- recipe(Status ~ ., data = df_train) %>% 
  step_impute_bag(Income, impute_with = imp_vars(Marital, Expenses)) %>% 
  step_impute_bag(Assets, impute_with = imp_vars(Marital, Expenses, Income)) %>% 
  step_impute_bag(Debt, impute_with = imp_vars(Marital, Expenses, Income, Assets)) %>% 
  step_impute_bag(Home, impute_with = imp_vars(Marital, Expenses, Income, Assets, Debt)) %>% 
  step_impute_bag(Marital, impute_with = imp_vars(Marital, Expenses, Income, Assets, Debt, Home)) %>% 
  step_impute_bag(Job, impute_with = imp_vars(Marital, Expenses, Income, Assets, Debt, Home, Marital))

To tune hyperparameters, the model specification must mark each parameter with tune() instead of assigning it a constant. The random forest parameters are briefly described below.

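mtry: the number of predictors randomly sampled as split candidates. In a random forest, each split of each tree considers only a random subset of the predictors, and mtry sets the size of that subset.
min_n: the minimum number of data points a node must contain to be split further. A smaller min_n lets each tree grow deeper and more complex, but the terminal nodes then rest on fewer samples, which raises the risk of overfitting.
trees: the number of decision trees in the forest. More trees make the ensemble's predictions more stable, at the cost of longer training time.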

# parsnip
model <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification")

workflow <- workflow() %>% 
  add_recipe(recipe) %>% 
  add_model(model)

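Next, build the search grid. The workflow combines the model specification and the recipe, and the parameter settings are then adjusted on top of it. Some parameters already have a default search range for the model, which extract_parameter_set_dials() shows.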

workflow %>% 
  extract_parameter_set_dials()

## Collection of 3 parameters for tuning
## 
##  identifier  type    object
##        mtry  mtry nparam[?]
##       trees trees nparam[+]
##       min_n min_n nparam[+]
## 
## Model parameters needing finalization:
##    # Randomly Selected Predictors ('mtry')
## 
## See `?dials::finalize` or `?dials::update.parameters` for more information.

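Three parameters are marked for tuning, and the object column shows either ? or +. A + means the range is fully specified, while a ? means part of it is still unknown. To see the details of each parameter, pass its name to extract_parameter_dials().

mtry shows [1, ?]: the upper bound on the number of predictors sampled at each split is not set yet, because it depends on how many predictors the dataset has. The other parameters have default lower and upper bounds.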

map(
  .x = c("mtry", "trees", "min_n"),
  .f = function(x){
    workflow %>% 
      extract_parameter_set_dials() %>% 
      extract_parameter_dials(x)
  }
)

## [[1]]
## # Randomly Selected Predictors (quantitative)
## Range: [1, ?]
## 
## [[2]]
## # Trees (quantitative)
## Range: [1, 2000]
## 
## [[3]]
## Minimal Node Size (quantitative)
## Range: [2, 40]
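
The same information can be checked without a workflow. The following is a minimal sketch that creates the dials parameter objects directly; the object names `p_mtry`, `p_trees`, and `trees_range` are introduced only for illustration.

```r
library(tidymodels)

# Parameter objects can be created on their own, without a model or workflow.
p_mtry  <- mtry()
p_trees <- trees()

# TRUE: the upper bound of mtry depends on the data and is still unknown.
has_unknowns(p_mtry)

# FALSE: trees comes with a complete default range.
has_unknowns(p_trees)

# The range is returned as a list with `lower` and `upper`.
trees_range <- range_get(p_trees)
trees_range
```

has_unknowns() is a quick way to find which parameters still need a range before a grid can be built.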

A range must therefore be supplied for any parameter that has no complete default. Use update() to change the range of a parameter.

map(
  .x = c("mtry", "trees", "min_n"),
  .f = function(x){
  workflow %>% 
    extract_parameter_set_dials() %>% 
    update(
      mtry = mtry(range = c(5, 10)),
      trees = trees(range = c(500, 1000)),
      min_n = min_n(range = c(50, 100)),
      ) %>%
    extract_parameter_dials(x)}
)

## [[1]]
## # Randomly Selected Predictors (quantitative)
## Range: [5, 10]
## 
## [[2]]
## # Trees (quantitative)
## Range: [500, 1000]
## 
## [[3]]
## Minimal Node Size (quantitative)
## Range: [50, 100]
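
update() also works on a parameter set that is not attached to a workflow. This sketch builds a set with dials::parameters() and changes only two of the three ranges; `param_set`, `updated`, and `new_trees` are illustrative names, not objects from the example above.

```r
library(tidymodels)

# A standalone parameter set with the three random forest parameters.
param_set <- parameters(mtry(), trees(), min_n())

# Only the parameters named in update() change; mtry is left untouched.
updated <- update(
  param_set,
  trees = trees(range = c(500L, 1000L)),
  min_n = min_n(range = c(50L, 100L))
)

new_trees <- range_get(extract_parameter_dials(updated, "trees"))
new_trees

# mtry was not updated, so its upper bound is still unknown.
has_unknowns(extract_parameter_dials(updated, "mtry"))
```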

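For a parameter like mtry, whose range should follow the data, pass the predictors to finalize() and it fills in the unknown bound.

The expression inside finalize() drops the outcome variable, applies the preprocessing defined in the recipe, and so fixes the number of columns. Only the column count matters here, so slice() cuts the data down to a few rows to keep bake() cheap when the preprocessing is heavy. (prep() still runs on the full training data. Going down to a single row may trigger warnings or errors, depending on the preprocessing steps.)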

map(
  .x = c("mtry", "trees", "min_n"),
  .f = function(x){
  workflow %>% 
    extract_parameter_set_dials() %>% 
    update(
      trees = trees(range = c(500, 1000)),
      min_n = min_n(range = c(50, 100)),
      ) %>%
    finalize(
      recipe %>% 
        prep() %>% 
        bake(df_train %>% select(-Status) %>% dplyr::slice(1:3))
    ) %>% 
    extract_parameter_dials(x)}
)

## [[1]]
## # Randomly Selected Predictors (quantitative)
## Range: [1, 13]
## 
## [[2]]
## # Trees (quantitative)
## Range: [500, 1000]
## 
## [[3]]
## Minimal Node Size (quantitative)
## Range: [50, 100]
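
To see what finalize() does in isolation, here is a small sketch using the built-in mtcars data instead of the credit data. Treating every column except `mpg` as a predictor is an assumption made for the illustration.

```r
library(tidymodels)

# Ten predictor columns remain after dropping the outcome column mpg.
predictors <- mtcars[, -1]

# finalize() sets the unknown upper bound of mtry to the number of columns.
mtry_final <- finalize(mtry(), predictors)
mtry_range <- range_get(mtry_final)
mtry_range
```

The upper bound becomes the column count of the data passed in, which is why the outcome has to be removed first.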

Use grid_regular() to build the parameter grid. The levels argument sets how many values each parameter takes. Here there are three parameters with two levels each, so the grid has 2^3 = 8 rows.

workflow %>% 
  extract_parameter_set_dials() %>% 
  update(
    mtry = mtry(range = c(5, 10)),
    trees = trees(range = c(500, 1000)),
    min_n = min_n(range = c(50, 100)),
    ) %>% 
  grid_regular(levels = 2)

## # A tibble: 8 × 3
##    mtry trees min_n
##   <int> <int> <int>
## 1     5   500    50
## 2    10   500    50
## 3     5  1000    50
## 4    10  1000    50
## 5     5   500   100
## 6    10   500   100
## 7     5  1000   100
## 8    10  1000   100
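
levels does not have to be the same for every parameter. As a sketch, the call below passes the parameter objects directly to grid_regular() and uses a named vector for levels; the choice of three levels for trees is arbitrary.

```r
library(tidymodels)

# 2 values of mtry x 3 values of trees x 2 values of min_n = 12 combinations.
grid_mixed <- grid_regular(
  mtry(range = c(5L, 10L)),
  trees(range = c(500L, 1000L)),
  min_n(range = c(50L, 100L)),
  levels = c(mtry = 2, trees = 3, min_n = 2)
)

nrow(grid_mixed)
sort(unique(grid_mixed$trees))
```

Because the row count is the product of the levels, a regular grid grows quickly as parameters or levels are added.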

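Other functions build the grid in different ways.

Random grid: grid_random(size = n)
Latin hypercube: grid_latin_hypercube(size = n)
Maximum entropy: grid_max_entropy(size = n)

As far as I know, recent dials releases deprecate grid_latin_hypercube() and grid_max_entropy() in favor of grid_space_filling(), so check which of these the installed version provides.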
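
As a rough illustration of a non-regular grid, the sketch below draws a random grid over the same ranges used in this note. The size of 5 and the name `grid_rand` are arbitrary choices; the seed only makes the draw repeatable.

```r
library(tidymodels)

set.seed(1989)

# Each row is one randomly drawn parameter combination.
# Duplicate combinations are dropped, so fewer than `size` rows are possible.
grid_rand <- grid_random(
  mtry(range = c(5L, 10L)),
  trees(range = c(500L, 1000L)),
  min_n(range = c(50L, 100L)),
  size = 5
)

grid_rand
```

Unlike grid_regular(), the number of candidates is set directly by size and does not depend on the number of parameters.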

Now create the grid and tune the hyperparameters. The grid is derived from the parameter information held in the workflow, which bundles the model and the recipe.

set.seed(1989)
hyper_parameter_grid <- workflow %>% 
  extract_parameter_set_dials() %>% 
  update(
    mtry = mtry(range = c(5, 10)),
    trees = trees(range = c(500, 1000)),
    min_n = min_n(range = c(50, 100)),
    ) %>% 
  grid_regular(levels = 2)
hyper_parameter_grid

## # A tibble: 8 × 3
##    mtry trees min_n
##   <int> <int> <int>
## 1     5   500    50
## 2    10   500    50
## 3     5  1000    50
## 4    10  1000    50
## 5     5   500   100
## 6    10   500   100
## 7     5  1000   100
## 8    10  1000   100

In the earlier notes the parameters were fixed, so fit_resamples() could run the cross-validation. Here the model specification uses tune() for its parameters, so fit_resamples() cannot be used.

workflow %>% 
  fit_resamples(
    resamples = df_train_stratified_splits,
    metrics = metric_set(accuracy),
    control = control_resamples(save_pred = TRUE)
  )

Error:
! 3 arguments have been tagged for tuning in these components: model_spec. 
Please use one of the tuning functions (e.g. `tune_grid()`) to optimize them.
Run `rlang::last_error()` to see where the error occurred.

As the error message suggests, use tune_grid() instead.

workflow_tuned <- 
  workflow %>% 
    tune_grid(
      resamples = df_train_stratified_splits,
      grid = hyper_parameter_grid,
      metrics = metric_set(accuracy),
      # control_grid() is the control function intended for tune_grid()
      control = control_grid(
        extract = extract_fit_engine, # fills .extracts (extract_model() is deprecated)
        save_pred = TRUE              # fills .predictions
        )
    )

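The result has several columns:

splits: the data split for each cross-validation fold
id: the fold identifier
.metrics: the metric values
.notes: errors, warnings, and other messages
.extracts: the fitted model for each parameter combination in the grid
.predictions: the observed values and model predictions on the assessment data used to compute the metrics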

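The output is a little involved because the grid has 8 parameter combinations and the cross-validation has 5 folds. The result itself has one row per fold, and each fold's .metrics holds 8 rows, for 40 metric rows in total.

Put differently, each parameter combination is evaluated on all 5 folds, giving 5 metric records per combination, and this is repeated for all 8 combinations. pluck(".metrics", 1) shows the results of the 8 combinations on the first fold.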

workflow_tuned %>% 
  pluck(".metrics", 1)

## # A tibble: 8 × 7
##    mtry trees min_n .metric  .estimator .estimate .config             
##   <int> <int> <int> <chr>    <chr>          <dbl> <chr>               
## 1     5   500    50 accuracy binary         0.796 Preprocessor1_Model1
## 2    10   500    50 accuracy binary         0.785 Preprocessor1_Model2
## 3     5  1000    50 accuracy binary         0.793 Preprocessor1_Model3
## 4    10  1000    50 accuracy binary         0.787 Preprocessor1_Model4
## 5     5   500   100 accuracy binary         0.794 Preprocessor1_Model5
## 6    10   500   100 accuracy binary         0.794 Preprocessor1_Model6
## 7     5  1000   100 accuracy binary         0.799 Preprocessor1_Model7
## 8    10  1000   100 accuracy binary         0.801 Preprocessor1_Model8

.extracts holds the fitted model objects.

workflow_tuned %>% 
  pluck(".extracts", 1)

## # A tibble: 8 × 5
##    mtry trees min_n .extracts .config             
##   <int> <int> <int> <list>    <chr>               
## 1     5   500    50 <ranger>  Preprocessor1_Model1
## 2    10   500    50 <ranger>  Preprocessor1_Model2
## 3     5  1000    50 <ranger>  Preprocessor1_Model3
## 4    10  1000    50 <ranger>  Preprocessor1_Model4
## 5     5   500   100 <ranger>  Preprocessor1_Model5
## 6    10   500   100 <ranger>  Preprocessor1_Model6
## 7     5  1000   100 <ranger>  Preprocessor1_Model7
## 8    10  1000   100 <ranger>  Preprocessor1_Model8

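.predictions holds the predictions: the predicted class .pred_class, the row number .row, the observed value Status, and .config, which identifies the parameter combination (not the fold) that produced the prediction.

There are 5,136 rows because the assessment set of one fold has 642 rows and predictions are made for each of the 8 parameter combinations: 642 × 8 = 5,136.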

workflow_tuned %>% 
  pluck(".predictions", 1)

## # A tibble: 5,136 × 7
##    .pred_class  .row  mtry trees min_n Status .config             
##    <fct>       <int> <int> <int> <int> <fct>  <chr>               
##  1 bad             2     5   500    50 bad    Preprocessor1_Model1
##  2 bad            11     5   500    50 bad    Preprocessor1_Model1
##  3 bad            14     5   500    50 bad    Preprocessor1_Model1
##  4 bad            18     5   500    50 bad    Preprocessor1_Model1
##  5 good           19     5   500    50 bad    Preprocessor1_Model1
##  6 good           24     5   500    50 bad    Preprocessor1_Model1
##  7 good           26     5   500    50 bad    Preprocessor1_Model1
##  8 good           30     5   500    50 bad    Preprocessor1_Model1
##  9 good           35     5   500    50 bad    Preprocessor1_Model1
## 10 bad            43     5   500    50 bad    Preprocessor1_Model1
## # … with 5,126 more rows

Stacking the per-fold results makes the structure described above easier to see.

map_dfr(
  .x = 1:nrow(df_train_stratified_splits),
  .f = function(x){workflow_tuned %>% pluck(".metrics", x)}
  ) %>% 
  arrange(.config) %>% 
  print(n = 50)

## # A tibble: 40 × 7
##     mtry trees min_n .metric  .estimator .estimate .config             
##    <int> <int> <int> <chr>    <chr>          <dbl> <chr>               
##  1     5   500    50 accuracy binary         0.796 Preprocessor1_Model1
##  2     5   500    50 accuracy binary         0.801 Preprocessor1_Model1
##  3     5   500    50 accuracy binary         0.777 Preprocessor1_Model1
##  4     5   500    50 accuracy binary         0.774 Preprocessor1_Model1
##  5     5   500    50 accuracy binary         0.766 Preprocessor1_Model1
##  6    10   500    50 accuracy binary         0.785 Preprocessor1_Model2
##  7    10   500    50 accuracy binary         0.801 Preprocessor1_Model2
##  8    10   500    50 accuracy binary         0.769 Preprocessor1_Model2
##  9    10   500    50 accuracy binary         0.780 Preprocessor1_Model2
## 10    10   500    50 accuracy binary         0.764 Preprocessor1_Model2
## 11     5  1000    50 accuracy binary         0.793 Preprocessor1_Model3
## 12     5  1000    50 accuracy binary         0.802 Preprocessor1_Model3
## 13     5  1000    50 accuracy binary         0.775 Preprocessor1_Model3
## 14     5  1000    50 accuracy binary         0.775 Preprocessor1_Model3
## 15     5  1000    50 accuracy binary         0.767 Preprocessor1_Model3
## 16    10  1000    50 accuracy binary         0.787 Preprocessor1_Model4
## 17    10  1000    50 accuracy binary         0.804 Preprocessor1_Model4
## 18    10  1000    50 accuracy binary         0.774 Preprocessor1_Model4
## 19    10  1000    50 accuracy binary         0.783 Preprocessor1_Model4
## 20    10  1000    50 accuracy binary         0.767 Preprocessor1_Model4
## 21     5   500   100 accuracy binary         0.794 Preprocessor1_Model5
## 22     5   500   100 accuracy binary         0.790 Preprocessor1_Model5
## 23     5   500   100 accuracy binary         0.764 Preprocessor1_Model5
## 24     5   500   100 accuracy binary         0.774 Preprocessor1_Model5
## 25     5   500   100 accuracy binary         0.764 Preprocessor1_Model5
## 26    10   500   100 accuracy binary         0.794 Preprocessor1_Model6
## 27    10   500   100 accuracy binary         0.798 Preprocessor1_Model6
## 28    10   500   100 accuracy binary         0.766 Preprocessor1_Model6
## 29    10   500   100 accuracy binary         0.780 Preprocessor1_Model6
## 30    10   500   100 accuracy binary         0.769 Preprocessor1_Model6
## 31     5  1000   100 accuracy binary         0.799 Preprocessor1_Model7
## 32     5  1000   100 accuracy binary         0.790 Preprocessor1_Model7
## 33     5  1000   100 accuracy binary         0.766 Preprocessor1_Model7
## 34     5  1000   100 accuracy binary         0.772 Preprocessor1_Model7
## 35     5  1000   100 accuracy binary         0.766 Preprocessor1_Model7
## 36    10  1000   100 accuracy binary         0.801 Preprocessor1_Model8
## 37    10  1000   100 accuracy binary         0.799 Preprocessor1_Model8
## 38    10  1000   100 accuracy binary         0.764 Preprocessor1_Model8
## 39    10  1000   100 accuracy binary         0.780 Preprocessor1_Model8
## 40    10  1000   100 accuracy binary         0.767 Preprocessor1_Model8

Averaging these values over the folds gives the estimated performance of each parameter combination.

map_dfr(
  .x = 1:nrow(df_train_stratified_splits),
  .f = function(x){workflow_tuned %>% pluck(".metrics", x)}
  ) %>% 
  # .config alone would be enough; the parameters are added to show their values
  group_by(.config, mtry, trees, min_n) %>% 
  summarise(mean_accuracy = mean(.estimate), n = n())

## # A tibble: 8 × 6
## # Groups:   .config, mtry, trees [8]
##   .config               mtry trees min_n mean_accuracy     n
##   <chr>                <int> <int> <int>         <dbl> <int>
## 1 Preprocessor1_Model1     5   500    50         0.783     5
## 2 Preprocessor1_Model2    10   500    50         0.780     5
## 3 Preprocessor1_Model3     5  1000    50         0.783     5
## 4 Preprocessor1_Model4    10  1000    50         0.783     5
## 5 Preprocessor1_Model5     5   500   100         0.777     5
## 6 Preprocessor1_Model6    10   500   100         0.781     5
## 7 Preprocessor1_Model7     5  1000   100         0.779     5
## 8 Preprocessor1_Model8    10  1000   100         0.782     5
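
The per-fold table can also be obtained with collect_metrics(summarize = FALSE) instead of mapping over .metrics. The sketch below is self-contained and deliberately small: it tunes a decision tree (rpart engine) on mtcars, which is an assumption made only so the example runs quickly and is unrelated to the credit data above.

```r
library(tidymodels)

set.seed(1989)
folds <- vfold_cv(mtcars, v = 3)

tree_spec <- decision_tree(min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_wf <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(tree_spec)

tree_tuned <- tune_grid(
  tree_wf,
  resamples = folds,
  grid = tibble(min_n = c(5L, 10L)),
  metrics = metric_set(rmse)
)

# One row per fold and parameter combination: 3 folds x 2 candidates.
per_fold <- collect_metrics(tree_tuned, summarize = FALSE)

# Averaging by hand over the folds ...
manual <- per_fold %>%
  group_by(.config) %>%
  summarise(mean = mean(.estimate), .groups = "drop") %>%
  arrange(.config)

# ... should agree with the summarized output of collect_metrics().
auto <- collect_metrics(tree_tuned) %>% arrange(.config)
all.equal(manual$mean, auto$mean)
```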

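There is no need to compute this by hand: collect_metrics() aggregates the cross-validation results for each parameter combination. Here the data were split into 5 folds, and the results of those folds are summarized. Preprocessor1_Model4 has the highest mean accuracy, so the preferred combination is mtry = 10, trees = 1000, min_n = 50. Note that the top three all print as 0.783 and the differences are far smaller than the standard errors (about 0.006), so the ranking among them is not decisive.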

workflow_tuned %>%
  collect_metrics() %>% 
  arrange(desc(mean))

## # A tibble: 8 × 9
##    mtry trees min_n .metric  .estimator  mean     n std_err .config             
##   <int> <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1    10  1000    50 accuracy binary     0.783     5 0.00624 Preprocessor1_Model4
## 2     5  1000    50 accuracy binary     0.783     5 0.00645 Preprocessor1_Model3
## 3     5   500    50 accuracy binary     0.783     5 0.00671 Preprocessor1_Model1
## 4    10  1000   100 accuracy binary     0.782     5 0.00765 Preprocessor1_Model8
## 5    10   500   100 accuracy binary     0.781     5 0.00643 Preprocessor1_Model6
## 6    10   500    50 accuracy binary     0.780     5 0.00642 Preprocessor1_Model2
## 7     5  1000   100 accuracy binary     0.779     5 0.00675 Preprocessor1_Model7
## 8     5   500   100 accuracy binary     0.777     5 0.00632 Preprocessor1_Model5

Instead of sorting manually, show_best() returns the aggregated results ranked from best to worst.

workflow_tuned %>% 
  show_best(n = 5, metric = "accuracy")

## # A tibble: 5 × 9
##    mtry trees min_n .metric  .estimator  mean     n std_err .config             
##   <int> <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1    10  1000    50 accuracy binary     0.783     5 0.00624 Preprocessor1_Model4
## 2     5  1000    50 accuracy binary     0.783     5 0.00645 Preprocessor1_Model3
## 3     5   500    50 accuracy binary     0.783     5 0.00671 Preprocessor1_Model1
## 4    10  1000   100 accuracy binary     0.782     5 0.00765 Preprocessor1_Model8
## 5    10   500   100 accuracy binary     0.781     5 0.00643 Preprocessor1_Model6

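Use autoplot() to visualize the tuning results. Because the plot is generated by autoplot() rather than built by hand, it is not obvious at first glance, but the vertical axis is accuracy and the horizontal axis is mtry. Start with the colors: green is trees = 1000 and red is trees = 500, and green is the better of the two. Except for the red line in the left panel, a larger mtry is better. In other words, when min_n is small and trees is low, increasing mtry lowers performance.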

autoplot(workflow_tuned) + theme_bw()

Use select_best() to pull out the parameter combination with the best metric. It returns the preferred combination mtry = 10, trees = 1000, min_n = 50.

better_paramters <- workflow_tuned %>% 
  select_best(metric = "accuracy")
better_paramters

## # A tibble: 1 × 4
##    mtry trees min_n .config             
##   <int> <int> <int> <chr>               
## 1    10  1000    50 Preprocessor1_Model4

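With tuning done, refit the model on the full training data using the selected parameters. There is no need to rewrite the model specification with those values: finalize_workflow() updates the workflow with the selected parameters.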

better_workflow <- workflow %>% 
  finalize_workflow(parameters = better_paramters)

better_workflow

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_impute_bag()
## • step_impute_bag()
## • step_impute_bag()
## • step_impute_bag()
## • step_impute_bag()
## • step_impute_bag()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 10
##   trees = 1000
##   min_n = 50
## 
## Engine-Specific Arguments:
##   importance = permutation
## 
## Computational engine: ranger

A workflow cannot hold two or more fixed values for a single parameter. To ensemble several models, for example, each parameter set needs its own workflow.

workflow %>% 
  finalize_workflow(
    parameters = workflow_tuned %>% show_best(n = 2, metric = "accuracy") %>% select(mtry, trees, min_n)
    )

Error in `check_final_param()`:
! The parameter tibble should have a single row.
Backtrace:
 1. workflow %>% ...
 2. tune::finalize_workflow(...)
 3. tune:::check_final_param(parameters)
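
One way around the single-row restriction is to finalize one workflow per row. The following is only a sketch: `wf_sketch` is a stand-in workflow that uses a formula instead of the recipe above, and the two rows of `top2` are typed in from the show_best() output shown earlier rather than computed.

```r
library(tidymodels)

rf_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Stand-in workflow; no data is needed until the workflow is fitted.
wf_sketch <- workflow() %>%
  add_formula(Status ~ .) %>%
  add_model(rf_spec)

# Hypothetical copy of the two best parameter sets.
top2 <- tibble(
  mtry  = c(10L, 5L),
  trees = c(1000L, 1000L),
  min_n = c(50L, 50L)
)

# finalize_workflow() accepts one row at a time, so map over the rows.
finalized <- map(
  seq_len(nrow(top2)),
  function(i) finalize_workflow(wf_sketch, top2[i, ])
)

length(finalized)
```

Each element of the list is an ordinary workflow and can be fitted separately.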

The workflow now carries the selected parameters, so fit it on the training data.

set.seed(1989)
model_trained_better_workflow <- 
  better_workflow %>% 
  fit(df_train)

model_trained_better_workflow

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_impute_bag()
## • step_impute_bag()
## • step_impute_bag()
## • step_impute_bag()
## • step_impute_bag()
## • step_impute_bag()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~10L,      x), num.trees = ~1000L, min.node.size = min_rows(~50L, x),      importance = ~"permutation", num.threads = 1, verbose = FALSE,      seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      3206 
## Number of independent variables:  13 
## Mtry:                             10 
## Target node size:                 50 
## Variable importance mode:         permutation 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1458116

Use the refit model to predict on the test data. Both the predicted class and the predicted probabilities are computed here.

model_predicted_better_workflow <- 
  tibble(
    model_trained_better_workflow %>% predict(df_test, type = "class"),
    model_trained_better_workflow %>% predict(df_test, type = "prob")
    ) %>% 
  bind_cols(obs = factor(df_test$Status, c(levels(.$.pred_class))))

model_predicted_better_workflow

## # A tibble: 802 × 4
##    .pred_class .pred_bad .pred_good obs  
##    <fct>           <dbl>      <dbl> <fct>
##  1 bad             0.757     0.243  bad  
##  2 bad             0.631     0.369  bad  
##  3 bad             0.709     0.291  bad  
##  4 good            0.488     0.512  bad  
##  5 good            0.477     0.523  bad  
##  6 bad             0.659     0.341  bad  
##  7 bad             0.932     0.0681 bad  
##  8 good            0.393     0.607  bad  
##  9 bad             0.758     0.242  bad  
## 10 good            0.257     0.743  bad  
## # … with 792 more rows

The final accuracy on the test data is shown below.

model_predicted_better_workflow %>% 
  yardstick::accuracy(truth = obs, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.802
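
The refit, predict, and score steps can also be done in one call with last_fit(), which fits on the training part of an initial split and evaluates on the test part. The sketch below uses iris and an untuned decision tree (rpart engine) purely to stay self-contained; those choices are assumptions and not part of the example above.

```r
library(tidymodels)

set.seed(1989)
iris_split <- initial_split(iris, prop = 0.8, strata = Species)

tree_wf <- workflow() %>%
  add_formula(Species ~ .) %>%
  add_model(
    decision_tree() %>%
      set_engine("rpart") %>%
      set_mode("classification")
  )

# Fit on training(iris_split), then predict and score on testing(iris_split).
final_res <- last_fit(tree_wf, iris_split, metrics = metric_set(accuracy))

test_metrics <- collect_metrics(final_res)
test_preds   <- collect_predictions(final_res)
test_metrics
```

In this note the equivalent call would take the finalized workflow and the df_initial split.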

The variable importance of the refit model is shown below.

library(vip)
model_trained_better_workflow %>%
  extract_fit_parsnip() %>%
  vip(aesthetics = list(alpha = 0.8, fill = "#006E4F")) + theme_bw()

Behavior when no grid is supplied

The example so far built a parameter grid, shown again below, before tuning. What happens when no grid is prepared?

hyper_parameter_grid

## # A tibble: 8 × 3
##    mtry trees min_n
##   <int> <int> <int>
## 1     5   500    50
## 2    10   500    50
## 3     5  1000    50
## 4    10  1000    50
## 5     5   500   100
## 6    10   500   100
## 7     5  1000   100
## 8    10  1000   100

In the current workflow, mtry has no upper bound yet, while trees and min_n have both lower and upper bounds.

map(
  .x = c("mtry", "trees", "min_n"),
  .f = function(x){
    workflow %>% 
      extract_parameter_set_dials() %>% 
      extract_parameter_dials(x)
  }
)

## [[1]]
## # Randomly Selected Predictors (quantitative)
## Range: [1, ?]
## 
## [[2]]
## # Trees (quantitative)
## Range: [1, 2000]
## 
## [[3]]
## Minimal Node Size (quantitative)
## Range: [2, 40]

According to the documentation, when no grid is supplied a grid is generated with a Latin hypercube design. The relevant passages of the tune_grid() help page, for the version used here, read:

A data frame of tuning combinations or a positive integer. The data frame should have columns for each parameter being tuned and rows for tuning parameter candidates. An integer denotes the number of candidate parameter sets to be created automatically.

Parameter Grids: If no tuning grid is provided, a semi-random grid (via dials::grid_latin_hypercube()) is created with 10 candidate parameter combinations.

The unknown upper bound of mtry is also resolved automatically in this case: in the result below, mtry goes up to 13, the number of predictors.

set.seed(1989)
workflow_tuned_no_grid <- 
  workflow %>% 
    tune_grid(
      resamples = df_train_stratified_splits,
      metrics = metric_set(accuracy),
      grid = 10,
      # control_grid() is the control function intended for tune_grid()
      control = control_grid(
        extract = extract_fit_engine, # fills .extracts (extract_model() is deprecated)
        save_pred = TRUE              # fills .predictions
        )
    )

workflow_tuned_no_grid %>%
  pluck(".extracts", 1) %>% 
  arrange(mtry)

## # A tibble: 10 × 5
##     mtry trees min_n .extracts .config              
##    <int> <int> <int> <list>    <chr>                
##  1     2  1238    25 <ranger>  Preprocessor1_Model10
##  2     3   715    37 <ranger>  Preprocessor1_Model02
##  3     4   553    34 <ranger>  Preprocessor1_Model06
##  4     5   216     3 <ranger>  Preprocessor1_Model08
##  5     6  1583    11 <ranger>  Preprocessor1_Model05
##  6     8  1170    30 <ranger>  Preprocessor1_Model03
##  7     9    81    15 <ranger>  Preprocessor1_Model09
##  8    10   934     9 <ranger>  Preprocessor1_Model01
##  9    12  1742    21 <ranger>  Preprocessor1_Model04
## 10    13  1901    19 <ranger>  Preprocessor1_Model07
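
The same behavior can be reproduced on a small scale. This sketch tunes two parameters of a decision tree (rpart engine) on iris with grid = 4, so tune generates the candidates itself; the dataset, model, and grid size are assumptions chosen only to keep the run short.

```r
library(tidymodels)

set.seed(1989)
folds <- vfold_cv(iris, v = 3)

tree_spec <- decision_tree(cost_complexity = tune(), min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_wf <- workflow() %>%
  add_formula(Sepal.Length ~ .) %>%
  add_model(tree_spec)

# An integer grid asks tune to create that many candidates automatically.
auto_tuned <- tune_grid(
  tree_wf,
  resamples = folds,
  grid = 4,
  metrics = metric_set(rmse)
)

candidates <- collect_metrics(auto_tuned) %>%
  select(cost_complexity, min_n, .config)
candidates
```

The generated values are spread over the default ranges of each parameter, which is why they look irregular compared with a regular grid.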
