UPDATE: 2022-12-28 10:08:47
modeltime
パッケージを使いながら時系列データの予測を行なうノートをいくつかに分けてまとめている。modeltime
パッケージの目的は、時系列モデルの構築を効率よく行うためのパッケージで、tidymodels
パッケージと組み合わせて時系列モデルを作成する際に使用される。時系列予測のための体系的なワークフローを提供し、時系列ブーストモデルなど新しいアルゴリズムもカバーしている点が特徴的。モデルの数理的な側面や機械学習の用語などは、このノートでは扱わない。
このノートでは、時系列データの特徴量エンジニアリングについてまとめる。ここではtimetk
パッケージを使って特徴量エンジニアリングを行なう。必要なパッケージを読み込んでおく。
library(tidyverse)
library(timetk)
library(tsibble)
library(tsibbledata)
library(fastDummies)
オーストラリアの小売の売上データセット(tsibbledata::aus_retail
)を利用する。データの各時系列は下記のキーで一意に識別される。
State
: オーストラリアの州Industry
: 業種ここではAustralian Capital Territory
の州と5つの業種の値のみに注目する。
::aus_retail %>%
tsibbledatafilter(State == "Australian Capital Territory") %>%
distinct(State, Industry)
## # A tibble: 20 × 2
## State Industry
## <chr> <chr>
## 1 Australian Capital Territory Cafes, restaurants and catering services
## 2 Australian Capital Territory Cafes, restaurants and takeaway food services
## 3 Australian Capital Territory Clothing retailing
## 4 Australian Capital Territory Clothing, footwear and personal accessory retai…
## 5 Australian Capital Territory Department stores
## 6 Australian Capital Territory Electrical and electronic goods retailing
## 7 Australian Capital Territory Food retailing
## 8 Australian Capital Territory Footwear and other personal accessory retailing
## 9 Australian Capital Territory Furniture, floor coverings, houseware and texti…
## 10 Australian Capital Territory Hardware, building and garden supplies retailing
## 11 Australian Capital Territory Household goods retailing
## 12 Australian Capital Territory Liquor retailing
## 13 Australian Capital Territory Newspaper and book retailing
## 14 Australian Capital Territory Other recreational goods retailing
## 15 Australian Capital Territory Other retailing
## 16 Australian Capital Territory Other retailing n.e.c.
## 17 Australian Capital Territory Other specialised food retailing
## 18 Australian Capital Territory Pharmaceutical, cosmetic and toiletry goods ret…
## 19 Australian Capital Territory Supermarket and grocery stores
## 20 Australian Capital Territory Takeaway food services
tsibble
クラスなので、通常のデータフレームに変換しておく。
<- c(
Industries "Cafes, restaurants and catering services",
"Clothing retailing",
"Clothing, footwear and personal accessory retailing",
"Department stores",
"Electrical and electronic goods retailing"
)
<- tsibbledata::aus_retail %>%
monthly_retail_tbl tk_tbl() %>%
filter(State == "Australian Capital Territory") %>%
filter(Industry %in% Industries) %>%
mutate(Month = as.Date(Month)) %>%
mutate(Industry = as_factor(Industry)) %>%
select(Month, Industry, Turnover)
%>%
monthly_retail_tbl distinct(Industry)
## # A tibble: 5 × 1
## Industry
## <fct>
## 1 Cafes, restaurants and catering services
## 2 Clothing retailing
## 3 Clothing, footwear and personal accessory retailing
## 4 Department stores
## 5 Electrical and electronic goods retailing
使用するデータは最終的に下記の通り。
%>%
monthly_retail_tbl group_by(Industry) %>%
summarise(
min_date = min(Month),
max_date = max(Month),
cnt = n()
)
## # A tibble: 5 × 4
## Industry min_date max_date cnt
## <fct> <date> <date> <int>
## 1 Cafes, restaurants and catering services 1982-04-01 2018-12-01 441
## 2 Clothing retailing 1982-04-01 2018-12-01 441
## 3 Clothing, footwear and personal accessory retaili… 1982-04-01 2018-12-01 441
## 4 Department stores 1982-04-01 2018-12-01 441
## 5 Electrical and electronic goods retailing 1982-04-01 2018-12-01 441
5つの業種の時系列を可視化しておく。基本的には上昇トレンドではあるが、各系列ごとに異なる特徴があ見られる。
%>%
monthly_retail_tbl group_by(Industry) %>%
plot_time_series(
Month,
Turnover,.facet_ncol = 3,
.smooth = FALSE,
.interactive = TRUE
)
各系列の月、四半期、年の特徴を可視化する。Department stores
に焦点を当てる。tk_seasonal_diagnostics
関数で、系列の月、四半期、年ごとの特徴を計算できる。
%>%
monthly_retail_tbl filter(Industry == Industries[4]) %>%
tk_seasonal_diagnostics(.date_var = Month,
.value = Turnover)
## # A tibble: 441 × 5
## Month .value month.lbl quarter year
## <date> <dbl> <fct> <fct> <fct>
## 1 1982-04-01 10.3 April 2 1982
## 2 1982-05-01 10.6 May 2 1982
## 3 1982-06-01 9.9 June 2 1982
## 4 1982-07-01 8.8 July 3 1982
## 5 1982-08-01 8.8 August 3 1982
## 6 1982-09-01 9.2 September 3 1982
## 7 1982-10-01 9.7 October 4 1982
## 8 1982-11-01 11.3 November 4 1982
## 9 1982-12-01 18.5 December 4 1982
## 10 1983-01-01 7.4 January 1 1983
## # … with 431 more rows
時系列プロットの激しいスパイクは12月に発生していることがわかる。
%>%
monthly_retail_tbl filter(Industry == Industries[4]) %>%
plot_seasonal_diagnostics(.date_var = Month,
.value = Turnover,
.title = Industries[4])
有効な特徴量かどうかは、線形回帰を実行して、調整済R2乗値Adjusted R-squared
が増加するかを調べればよい。手っ取り早い方法。Adjusted R-squared
は0.8637。
%>%
monthly_retail_tbl filter(Industry == Industries[[1]]) %>%
plot_time_series_regression(
.date_var = Month,
.formula = Turnover ~ Month,
.show_summary = TRUE,
.title = Industries[1]
)
##
## Call:
## stats::lm(formula = .formula, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7358 -2.5295 0.5654 2.5335 16.2649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.115e+01 6.247e-01 -17.86 <2e-16 ***
## Month 2.790e-03 5.284e-05 52.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.299 on 439 degrees of freedom
## Multiple R-squared: 0.864, Adjusted R-squared: 0.8637
## F-statistic: 2789 on 1 and 439 DF, p-value: < 2.2e-16
分散を小さくするために対数変換を利用できる。また、標準化も利用される。
%>%
monthly_retail_tbl filter(Industry == Industries[[1]]) %>%
mutate(Turnover = log1p(x = Turnover)) %>%
mutate(Turnover = standardize_vec(Turnover)) %>%
plot_time_series(
Month,
Turnover,.facet_ncol = 3,
.smooth = FALSE,
.interactive = TRUE
)
Time Series Signature
とも呼ばれ、timetk::tk_augment_timeseries_signature()
関数でデータセットに25以上の特徴量を追加できる。
%>%
monthly_retail_tbl filter(Industry == Industries[[1]]) %>%
mutate(Turnover = log1p(x = Turnover)) %>%
mutate(Turnover = standardize_vec(Turnover)) %>%
tk_augment_timeseries_signature(.date_var = Month)
## # A tibble: 441 × 31
## Month Industry Turno…¹ index…² diff year year.…³ half quarter month
## <date> <fct> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int>
## 1 1982-04-01 Cafes, … -2.01 3.86e8 NA 1982 1982 1 2 4
## 2 1982-05-01 Cafes, … -2.36 3.89e8 2592000 1982 1982 1 2 5
## 3 1982-06-01 Cafes, … -2.28 3.92e8 2678400 1982 1982 1 2 6
## 4 1982-07-01 Cafes, … -2.14 3.94e8 2592000 1982 1982 2 3 7
## 5 1982-08-01 Cafes, … -2.28 3.97e8 2678400 1982 1982 2 3 8
## 6 1982-09-01 Cafes, … -2.07 4.00e8 2678400 1982 1982 2 3 9
## 7 1982-10-01 Cafes, … -1.89 4.02e8 2592000 1982 1982 2 4 10
## 8 1982-11-01 Cafes, … -1.73 4.05e8 2678400 1982 1982 2 4 11
## 9 1982-12-01 Cafes, … -1.37 4.08e8 2592000 1982 1982 2 4 12
## 10 1983-01-01 Cafes, … -2.21 4.10e8 2678400 1983 1982 1 1 1
## # … with 431 more rows, 21 more variables: month.xts <int>, month.lbl <ord>,
## # day <int>, hour <int>, minute <int>, second <int>, hour12 <int>,
## # am.pm <int>, wday <int>, wday.xts <int>, wday.lbl <ord>, mday <int>,
## # qday <int>, yday <int>, mweek <int>, week <int>, week.iso <int>,
## # week2 <int>, week3 <int>, week4 <int>, mday7 <int>, and abbreviated
## # variable names ¹Turnover, ²index.num, ³year.iso
index.num
: “1970-01-01 00:00:00”からの秒数diff
: 直前のレコードのindex.numとの差year
: 年成分year.iso
:
ISO年成分。ISOカレンダーは、週の始まりを月曜日とし、1月4日が含まれる週をその年の第1週するカレンダー。half
: 1月から6月が1、7月から12月が2quarter
: 四半期成分month
: 月成分month.xts
: xtsパッケージの実装month.lbl
: 月成分のラベル。day
: 日成分hour
: 時間成分minute
: 分成分second
: 秒成分hour12
: 12時間スケールでの時間成分am.pm
: 午前(AM)=1、午後(PM)=2wday
: 日曜日=1、土曜日=7の曜日成分wday.xts
:
xtsパッケージの実装。日曜日は0、土曜日は6の曜日成分wday.lbl
:
日曜日から始まり土曜日で終わる順序付き因子としての曜日成分mday
: その年の日成分(1~366)qday
:
その年の四半期基準での日成分(1四半期で1~90という感じ)yday
: その年の日成分(1~366)mweek
: その年の週成分(1-52)week
: その年の週番号(日曜始まり1-52)。week.iso
: その年のISO週番号(月曜始まり1-52)。week2
: ?(隔週での頻度に対する係数)week3
: ?(3週間ごとの頻度に対する係数)week4
:
?(4週に一度の頻度に対する係数。1月から4月は1、5月から8月は2、9月から12月は3)mday7
:
?(その月の曜日の7分割の整数値で、その日がその月に出現した最初の、2番目の、3番目の、…例を返す。例えば、その月の最初の土曜日は
mday7 = 1 となる。2番目はmday7 = 2)系列に合わせて不要なカラムは削除する。
%>%
monthly_retail_tbl filter(Industry == Industries[[1]]) %>%
mutate(Turnover = log1p(x = Turnover)) %>%
mutate(Turnover = standardize_vec(Turnover)) %>%
tk_augment_timeseries_signature(.date_var = Month) %>%
select(-diff, -matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(day)|(week)|(am.pm)"))
## # A tibble: 441 × 9
## Month Industry Turno…¹ index…² year half quarter month month…³
## <date> <fct> <dbl> <dbl> <int> <int> <int> <int> <ord>
## 1 1982-04-01 Cafes, restaura… -2.01 3.86e8 1982 1 2 4 April
## 2 1982-05-01 Cafes, restaura… -2.36 3.89e8 1982 1 2 5 May
## 3 1982-06-01 Cafes, restaura… -2.28 3.92e8 1982 1 2 6 June
## 4 1982-07-01 Cafes, restaura… -2.14 3.94e8 1982 2 3 7 July
## 5 1982-08-01 Cafes, restaura… -2.28 3.97e8 1982 2 3 8 August
## 6 1982-09-01 Cafes, restaura… -2.07 4.00e8 1982 2 3 9 Septem…
## 7 1982-10-01 Cafes, restaura… -1.89 4.02e8 1982 2 4 10 October
## 8 1982-11-01 Cafes, restaura… -1.73 4.05e8 1982 2 4 11 Novemb…
## 9 1982-12-01 Cafes, restaura… -1.37 4.08e8 1982 2 4 12 Decemb…
## 10 1983-01-01 Cafes, restaura… -2.21 4.10e8 1983 1 1 1 January
## # … with 431 more rows, and abbreviated variable names ¹Turnover, ²index.num,
## # ³month.lbl
month.lbl
はこのままでは利用できないので、ダミー変数にワンホットエンコーディングする。
%>%
monthly_retail_tbl filter(Industry == Industries[[1]]) %>%
mutate(Turnover = log1p(x = Turnover)) %>%
mutate(Turnover = standardize_vec(Turnover)) %>%
tk_augment_timeseries_signature(.date_var = Month) %>%
select(-diff, -matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(day)|(week)|(am.pm)")) %>%
dummy_cols(select_columns = c("month.lbl")) %>%
select(-month.lbl) %>%
mutate(index.num = normalize_vec(x = index.num)) %>%
mutate(year = normalize_vec(x = year))
## # A tibble: 441 × 20
## Month Indus…¹ Turno…² index…³ year half quarter month month…⁴ month…⁵
## <date> <fct> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int>
## 1 1982-04-01 Cafes,… -2.01 0 0 1 2 4 0 0
## 2 1982-05-01 Cafes,… -2.36 0.00224 0 1 2 5 0 0
## 3 1982-06-01 Cafes,… -2.28 0.00455 0 1 2 6 0 0
## 4 1982-07-01 Cafes,… -2.14 0.00679 0 2 3 7 0 0
## 5 1982-08-01 Cafes,… -2.28 0.00911 0 2 3 8 0 0
## 6 1982-09-01 Cafes,… -2.07 0.0114 0 2 3 9 0 0
## 7 1982-10-01 Cafes,… -1.89 0.0137 0 2 4 10 0 0
## 8 1982-11-01 Cafes,… -1.73 0.0160 0 2 4 11 0 0
## 9 1982-12-01 Cafes,… -1.37 0.0182 0 2 4 12 0 0
## 10 1983-01-01 Cafes,… -2.21 0.0205 0.0278 1 1 1 1 0
## # … with 431 more rows, 10 more variables: month.lbl_March <int>,
## # month.lbl_April <int>, month.lbl_May <int>, month.lbl_June <int>,
## # month.lbl_July <int>, month.lbl_August <int>, month.lbl_September <int>,
## # month.lbl_October <int>, month.lbl_November <int>,
## # month.lbl_December <int>, and abbreviated variable names ¹Industry,
## # ²Turnover, ³index.num, ⁴month.lbl_January, ⁵month.lbl_February
Adjusted R-squared
は0.9028となっており、0.8637よりも説明力が上がっていることがわかる。
%>%
monthly_retail_tbl filter(Industry == Industries[[1]]) %>%
mutate(Turnover = log1p(x = Turnover)) %>%
mutate(Turnover = standardize_vec(Turnover)) %>%
tk_augment_timeseries_signature(.date_var = Month) %>%
select(-diff, -matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(day)|(week)|(am.pm)")) %>%
dummy_cols(select_columns = c("month.lbl")) %>%
select(-month.lbl) %>%
mutate(index.num = normalize_vec(x = index.num)) %>%
mutate(year = normalize_vec(x = year)) %>%
plot_time_series_regression(.date_var = Month,
.formula = Turnover ~ as.numeric(Month) +
+ year + half + quarter + month +
index.num + month.lbl_February + month.lbl_March + month.lbl_April +
month.lbl_January + month.lbl_June + month.lbl_July + month.lbl_August +
month.lbl_May + month.lbl_October + month.lbl_November + month.lbl_December,
month.lbl_September .show_summary = TRUE)
##
## Call:
## stats::lm(formula = .formula, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.99808 -0.14635 0.01804 0.21316 0.71995
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.15046 233.20946 0.241 0.8098
## as.numeric(Month) -0.01335 0.05360 -0.249 0.8034
## index.num NA NA NA NA
## year 178.78161 704.75959 0.254 0.7999
## half -0.07938 0.13649 -0.582 0.5612
## quarter -0.11990 0.19909 -0.602 0.5474
## month 0.50620 1.60957 0.314 0.7533
## month.lbl_January 0.22550 0.20559 1.097 0.2733
## month.lbl_February 0.19531 0.15070 1.296 0.1957
## month.lbl_March 0.23904 0.11591 2.062 0.0398 *
## month.lbl_April 0.20250 0.17068 1.186 0.2361
## month.lbl_May 0.14760 0.11566 1.276 0.2026
## month.lbl_June NA NA NA NA
## month.lbl_July 0.15123 0.19430 0.778 0.4368
## month.lbl_August 0.07558 0.11566 0.653 0.5138
## month.lbl_September NA NA NA NA
## month.lbl_October 0.09758 0.13649 0.715 0.4751
## month.lbl_November NA NA NA NA
## month.lbl_December NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3117 on 427 degrees of freedom
## Multiple R-squared: 0.9057, Adjusted R-squared: 0.9028
## F-statistic: 315.5 on 13 and 427 DF, p-value: < 2.2e-16
tk_augment_fourier
関数を利用することで、周期的な変動を取り込める。.K
は含めるべきsin項とcos項のペア数を指定する。m = 12
(1年)とK = 1
の例。
%>%
monthly_retail_tbl filter(Industry == Industries[[1]]) %>%
mutate(Turnover = log1p(x = Turnover)) %>%
mutate(Turnover = standardize_vec(Turnover)) %>%
tk_augment_fourier(.date_var = Month, .periods = 12, .K = 1) %>%
select(-Industry) %>%
pivot_longer(-Month) %>%
plot_time_series(Month, .value = value,
.facet_vars = name,
.smooth = FALSE)
Adjusted R-squared
は0.9037となっており、0.9028より微増したことがわかる。
%>%
monthly_retail_tbl filter(Industry == Industries[[1]]) %>%
mutate(Turnover = log1p(x = Turnover)) %>%
mutate(Turnover = standardize_vec(Turnover)) %>%
tk_augment_timeseries_signature(.date_var = Month) %>%
select(-diff, -matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(day)|(week)|(am.pm)")) %>%
dummy_cols(select_columns = c("month.lbl")) %>%
select(-month.lbl) %>%
mutate(index.num = normalize_vec(x = index.num)) %>%
mutate(year = normalize_vec(x = year)) %>%
tk_augment_fourier(.date_var = Month, .periods = 12, .K = 1) %>%
plot_time_series_regression(.date_var = Month,
.formula = Turnover ~ as.numeric(Month) +
+ year + half + quarter + month +
index.num + month.lbl_February + month.lbl_March + month.lbl_April +
month.lbl_January + month.lbl_June + month.lbl_July + month.lbl_August +
month.lbl_May + month.lbl_October + month.lbl_November + month.lbl_December +
month.lbl_September + Month_cos12_K1,
Month_sin12_K1 .show_summary = TRUE)
##
## Call:
## stats::lm(formula = .formula, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00307 -0.14673 0.02057 0.21026 0.67888
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.73199 232.13701 0.227 0.8204
## as.numeric(Month) -0.01257 0.05335 -0.236 0.8139
## index.num NA NA NA NA
## year 168.43506 701.51875 0.240 0.8104
## half -0.03566 0.13704 -0.260 0.7948
## quarter -0.11057 0.19829 -0.558 0.5774
## month 0.47144 1.60220 0.294 0.7687
## month.lbl_January 0.16452 0.20810 0.791 0.4296
## month.lbl_February 0.13657 0.15372 0.888 0.3748
## month.lbl_March 0.19052 0.11841 1.609 0.1084
## month.lbl_April 0.16058 0.17087 0.940 0.3479
## month.lbl_May 0.12559 0.11549 1.087 0.2775
## month.lbl_June NA NA NA NA
## month.lbl_July 0.12020 0.19392 0.620 0.5357
## month.lbl_August 0.06280 0.11530 0.545 0.5863
## month.lbl_September NA NA NA NA
## month.lbl_October 0.09590 0.13588 0.706 0.4807
## month.lbl_November NA NA NA NA
## month.lbl_December NA NA NA NA
## Month_sin12_K1 -0.03302 0.02295 -1.439 0.1510
## Month_cos12_K1 -0.04541 0.02296 -1.978 0.0486 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3103 on 425 degrees of freedom
## Multiple R-squared: 0.907, Adjusted R-squared: 0.9037
## F-statistic: 276.4 on 15 and 425 DF, p-value: < 2.2e-16
ラグ特徴量とは、過去に起こったことが未来に影響を与える、あるいはある種の本質的な情報を含んでいるという仮定に基づいて作成されるため有用な場合がある。PACFがラグ13と関連していることがわかる。
%>%
monthly_retail_tbl filter(Industry == Industries[[1]]) %>%
plot_acf_diagnostics(Month,
Turnover,.lags = "24 months",
.interactive = FALSE)
具体的なイメージはこちら。
<- seq(as.Date("2022-01-01"), as.Date("2022-01-30"), by = "day")
date <- tibble(date, value = 1:30)
d %>%
d tk_augment_lags(.value = value, .lags = c(12, 13)) %>%
print(n = 30)
## # A tibble: 30 × 4
## date value value_lag12 value_lag13
## <date> <int> <int> <int>
## 1 2022-01-01 1 NA NA
## 2 2022-01-02 2 NA NA
## 3 2022-01-03 3 NA NA
## 4 2022-01-04 4 NA NA
## 5 2022-01-05 5 NA NA
## 6 2022-01-06 6 NA NA
## 7 2022-01-07 7 NA NA
## 8 2022-01-08 8 NA NA
## 9 2022-01-09 9 NA NA
## 10 2022-01-10 10 NA NA
## 11 2022-01-11 11 NA NA
## 12 2022-01-12 12 NA NA
## 13 2022-01-13 13 1 NA
## 14 2022-01-14 14 2 1
## 15 2022-01-15 15 3 2
## 16 2022-01-16 16 4 3
## 17 2022-01-17 17 5 4
## 18 2022-01-18 18 6 5
## 19 2022-01-19 19 7 6
## 20 2022-01-20 20 8 7
## 21 2022-01-21 21 9 8
## 22 2022-01-22 22 10 9
## 23 2022-01-23 23 11 10
## 24 2022-01-24 24 12 11
## 25 2022-01-25 25 13 12
## 26 2022-01-26 26 14 13
## 27 2022-01-27 27 15 14
## 28 2022-01-28 28 16 15
## 29 2022-01-29 29 17 16
## 30 2022-01-30 30 18 17
Adjusted R-squared
は0.9237となっており、0.9037より微増したことがわかる。
%>%
monthly_retail_tbl filter(Industry == Industries[[1]]) %>%
mutate(Turnover = log1p(x = Turnover)) %>%
mutate(Turnover = standardize_vec(Turnover)) %>%
tk_augment_timeseries_signature(.date_var = Month) %>%
select(-diff, -matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(day)|(week)|(am.pm)")) %>%
dummy_cols(select_columns = c("month.lbl")) %>%
select(-month.lbl) %>%
mutate(index.num = normalize_vec(x = index.num)) %>%
mutate(year = normalize_vec(x = year)) %>%
tk_augment_fourier(.date_var = Month, .periods = 12, .K = 1) %>%
tk_augment_lags(.value = Turnover, .lags = c(12, 13)) %>%
plot_time_series_regression(.date_var = Month,
.formula = Turnover ~ as.numeric(Month) +
+ year + half + quarter + month +
index.num + month.lbl_February + month.lbl_March + month.lbl_April +
month.lbl_January + month.lbl_June + month.lbl_July + month.lbl_August +
month.lbl_May + month.lbl_October + month.lbl_November + month.lbl_December +
month.lbl_September + Month_cos12_K1 +
Month_sin12_K1 + Turnover_lag13,
Turnover_lag12 .show_summary = TRUE)
##
## Call:
## stats::lm(formula = .formula, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79554 -0.14124 -0.00007 0.17381 0.61169
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 266.848010 197.170496 1.353 0.177
## as.numeric(Month) -0.061529 0.045314 -1.358 0.175
## index.num NA NA NA NA
## year 810.581465 595.830984 1.360 0.174
## half 0.045880 0.117210 0.391 0.696
## quarter -0.006946 0.170880 -0.041 0.968
## month 1.892157 1.360976 1.390 0.165
## month.lbl_January 0.029223 0.178609 0.164 0.870
## month.lbl_February 0.099294 0.134229 0.740 0.460
## month.lbl_March 0.049029 0.104032 0.471 0.638
## month.lbl_April 0.025292 0.147450 0.172 0.864
## month.lbl_May 0.017459 0.100264 0.174 0.862
## month.lbl_June NA NA NA NA
## month.lbl_July -0.045562 0.166993 -0.273 0.785
## month.lbl_August -0.018226 0.098958 -0.184 0.854
## month.lbl_September NA NA NA NA
## month.lbl_October 0.001329 0.117352 0.011 0.991
## month.lbl_November NA NA NA NA
## month.lbl_December NA NA NA NA
## Month_sin12_K1 -0.009669 0.020029 -0.483 0.630
## Month_cos12_K1 -0.021998 0.020105 -1.094 0.275
## Turnover_lag12 0.426532 0.085389 4.995 8.72e-07 ***
## Turnover_lag13 0.063374 0.085430 0.742 0.459
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2617 on 410 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.9267, Adjusted R-squared: 0.9237
## F-statistic: 305.1 on 17 and 410 DF, p-value: < 2.2e-16
時系列データセットでローリングウィンドウ特徴量は、サンプル自体と事前に指定された数のサンプルを含む範囲を定義することによって、特定のデータサンプルの値に関する統計を計算することで特徴量を作成できる。いくつかのローリング期間の値を試した後、3か月、6か月、12か月の3つのローリング期間を設定する。
具体的なイメージはこちら。
%>%
d tk_augment_lags(.value = value, .lags = 12) %>%
tk_augment_slidify(.value = value_lag12,
.f = ~ mean(.x, na.rm = TRUE),
.period = c(3, 6, 9, 12),
.partial = TRUE,
.align = "center") %>%
print(n = 30)
## # A tibble: 30 × 7
## date value value_lag12 value_lag12_roll_3 value_lag12…¹ value…² value…³
## <date> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2022-01-01 1 NA NaN NaN NaN NaN
## 2 2022-01-02 2 NA NaN NaN NaN NaN
## 3 2022-01-03 3 NA NaN NaN NaN NaN
## 4 2022-01-04 4 NA NaN NaN NaN NaN
## 5 2022-01-05 5 NA NaN NaN NaN NaN
## 6 2022-01-06 6 NA NaN NaN NaN NaN
## 7 2022-01-07 7 NA NaN NaN NaN 1
## 8 2022-01-08 8 NA NaN NaN NaN 1.5
## 9 2022-01-09 9 NA NaN NaN 1 2
## 10 2022-01-10 10 NA NaN 1 1.5 2.5
## 11 2022-01-11 11 NA NaN 1.5 2 3
## 12 2022-01-12 12 NA 1 2 2.5 3.5
## 13 2022-01-13 13 1 1.5 2.5 3 4
## 14 2022-01-14 14 2 2 3 3.5 4.5
## 15 2022-01-15 15 3 3 3.5 4 5
## 16 2022-01-16 16 4 4 4.5 4.5 5.5
## 17 2022-01-17 17 5 5 5.5 5 6
## 18 2022-01-18 18 6 6 6.5 6 6.5
## 19 2022-01-19 19 7 7 7.5 7 7.5
## 20 2022-01-20 20 8 8 8.5 8 8.5
## 21 2022-01-21 21 9 9 9.5 9 9.5
## 22 2022-01-22 22 10 10 10.5 10 10.5
## 23 2022-01-23 23 11 11 11.5 11 11.5
## 24 2022-01-24 24 12 12 12.5 12 12.5
## 25 2022-01-25 25 13 13 13.5 13 13
## 26 2022-01-26 26 14 14 14.5 14 13.5
## 27 2022-01-27 27 15 15 15.5 14.5 14
## 28 2022-01-28 28 16 16 16 15 14.5
## 29 2022-01-29 29 17 17 16.5 15.5 15
## 30 2022-01-30 30 18 17.5 17 16 15.5
## # … with abbreviated variable names ¹value_lag12_roll_6, ²value_lag12_roll_9,
## # ³value_lag12_roll_12
Adjusted R-squared
は0.9481となっており、0.9237より微増したことがわかる。
%>%
monthly_retail_tbl filter(Industry == Industries[[1]]) %>%
mutate(Turnover = log1p(x = Turnover)) %>%
mutate(Turnover = standardize_vec(Turnover)) %>%
tk_augment_timeseries_signature(.date_var = Month) %>%
select(-diff, -matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(day)|(week)|(am.pm)")) %>%
dummy_cols(select_columns = c("month.lbl")) %>%
select(-month.lbl) %>%
mutate(index.num = normalize_vec(x = index.num)) %>%
mutate(year = normalize_vec(x = year)) %>%
tk_augment_fourier(.date_var = Month, .periods = 12, .K = 1) %>%
tk_augment_lags(.value = Turnover, .lags = c(12, 13)) %>%
tk_augment_slidify(.value = c(Turnover_lag12, Turnover_lag13),
.f = ~ mean(.x, na.rm = TRUE),
.period = c(3, 6, 9, 12),
.partial = TRUE,
.align = "center") %>%
plot_time_series_regression(.date_var = Month,
.formula = Turnover ~ as.numeric(Month) +
+ year + half + quarter + month +
index.num + month.lbl_February + month.lbl_March + month.lbl_April +
month.lbl_January + month.lbl_June + month.lbl_July + month.lbl_August +
month.lbl_May + month.lbl_October + month.lbl_November + month.lbl_December +
month.lbl_September + Month_cos12_K1 +
Month_sin12_K1 + Turnover_lag12_roll_3 + Turnover_lag12_roll_6 + Turnover_lag12_roll_9 + Turnover_lag12_roll_12 +
Turnover_lag12 + Turnover_lag13_roll_3 + Turnover_lag13_roll_6 + Turnover_lag13_roll_9 + Turnover_lag13_roll_12,
Turnover_lag13 .show_summary = TRUE)
##
## Call:
## stats::lm(formula = .formula, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65228 -0.12624 -0.00067 0.13133 0.78657
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.30640 163.30413 0.584 0.55981
## as.numeric(Month) -0.02211 0.03753 -0.589 0.55606
## index.num NA NA NA NA
## year 291.82727 493.48550 0.591 0.55461
## half 0.14258 0.11774 1.211 0.22661
## quarter -0.16283 0.18058 -0.902 0.36775
## month 0.75295 1.12805 0.667 0.50485
## month.lbl_January 0.20144 0.18917 1.065 0.28758
## month.lbl_February 0.18250 0.13264 1.376 0.16961
## month.lbl_March 0.16548 0.10107 1.637 0.10235
## month.lbl_April 0.14551 0.15349 0.948 0.34368
## month.lbl_May 0.10645 0.10276 1.036 0.30086
## month.lbl_June NA NA NA NA
## month.lbl_July 0.05333 0.16037 0.333 0.73963
## month.lbl_August 0.05699 0.09385 0.607 0.54407
## month.lbl_September NA NA NA NA
## month.lbl_October 0.12642 0.11781 1.073 0.28388
## month.lbl_November NA NA NA NA
## month.lbl_December NA NA NA NA
## Month_sin12_K1 -0.05217 0.01766 -2.953 0.00333 **
## Month_cos12_K1 -0.05522 0.01798 -3.071 0.00228 **
## Turnover_lag12 0.15671 0.19215 0.816 0.41521
## Turnover_lag12_roll_3 0.20053 0.38534 0.520 0.60307
## Turnover_lag12_roll_6 0.36903 0.43344 0.851 0.39505
## Turnover_lag12_roll_9 -0.79671 0.74807 -1.065 0.28750
## Turnover_lag12_roll_12 4.55934 0.58574 7.784 6.01e-14 ***
## Turnover_lag13 0.07664 0.19488 0.393 0.69434
## Turnover_lag13_roll_3 -0.17368 0.34252 -0.507 0.61239
## Turnover_lag13_roll_6 -0.43496 0.56519 -0.770 0.44199
## Turnover_lag13_roll_9 -1.43135 0.60602 -2.362 0.01866 *
## Turnover_lag13_roll_12 -1.87214 0.73649 -2.542 0.01140 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2159 on 402 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.9511, Adjusted R-squared: 0.9481
## F-statistic: 312.9 on 25 and 402 DF, p-value: < 2.2e-16