はじめに

縦断データの分析I―変化についてのマルチレベルモデリング―を利用してマルチレベルモデリングの勉強内容をまとめたもの。下記のサポートサイトにサンプルデータなどが保存されている。

TEXTBOOK EXAMPLES APPLIED LONGITUDINAL DATA ANALYSIS

分析内容について

分析内容は、前回に引き続き、青年期のアルコール摂取量の変化に関する分析。縦断データの分析I―変化についてのマルチレベルモデリング―のp75に詳細は書いてあるとおり、青年期のアルコール摂取量の変化に関する分析をここでもお借りする。82人(n)を対象に14歳,15歳,16歳の3回(age)の計測タイミングがある。alcuseはアルコール摂取に関する頻度の合成スコアで、予測変数として、友達の飲酒割合に関するpeerと親がアルコール依存症かを表すcoaが用意されている。

この分析の目的は、親がアルコール依存症であったり、自分の周囲の友だちが飲酒していれば、自分のアルコール摂取量も14歳,15歳,16歳の時間経過とともに、アルコール摂取量が増加していくのではないか、という仮説を検証すること。

library(tidyverse)
library(broom)
library(nlme)
library(DT)
library(patchwork)

alcohol1 <- read.table("https://stats.idre.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", header=T, sep=",")
alcohol1 <- alcohol1 %>% mutate(
  mean_peer = mean(peer), 
  flg_peer = ifelse(peer < mean_peer, "low peer", "high peer"),
  flg_peer = factor(flg_peer, levels = c("low peer", "high peer"))
  )
#Models
model.a <- lme(alcuse ~ 1, alcohol1, random = ~1 |id, method = "ML")
model.b <- lme(alcuse ~ age_14 , data = alcohol1, random= ~ age_14 | id, method = "ML")
model.c <- lme(fixed  = alcuse ~ coa*age_14, random = ~ age_14 | id, data = alcohol1, method = "ML")
model.d <- lme(fixed  = alcuse ~ coa * age_14 + peer * age_14, random = ~ age_14 | id, data = alcohol1, method = "ML")
model.e <- lme(fixed  = alcuse ~ coa + peer * age_14, random = ~ age_14 | id, data = alcohol1, method = "ML")
model.f <- lme(fixed = alcuse ~ coa  + cpeer * age_14, random = ~ age_14 | id, data = alcohol1, method = "ML")
model.g <- lme(fixed = alcuse ~ ccoa + cpeer * age_14, random = ~ age_14 | id, data = alcohol1, method = "ML")
datatable(alcohol1 %>% mutate_if(is.numeric, round, digit = 2))

ここでの目的は、モデルの妥当性であったりモデルの比較方法をまとめること。

情報量基準

マルチレベルモデルではパラメタを最尤法で推定する際に、対数尤度を計算する。一般的に、同じデータを使って複数のモデルを構築した際に、対数尤度が大きいモデルのほうが当てはまりが良くなる。仮に、対数尤度がマイナスであれば、0に近いほうが良い。よく使われる指標として、AIC、BICをここではおさらいしておく。

AICは、-2を対数尤度にかけ、パラメタの数で罰則をつけた指標。対数尤度が大きくなると、AICは小さくなるので、AICでは小さいモデル(AICが負であればマイナスに大きいモデル)のほうが、よりよい予測性能があるとわかる。真のモデルという話ではない。AICは、モデルに不要な変数を追加することで増加する対数尤度にペナルティをつける。パラメタを増やせば増やすほど、データへの当てはまりはよくなるが、モデルとしては使えないので、そのあたりを勘定してよいモデルを選ぶのにAICが使われる。BICは更にサンプルサイズ(=246)の大きさも考慮している。下記は、各モデルの対数尤度、AIC、BICを抜粋した表である。


	alcuse
	A	B	C	D	E	F_c	G_c

Observations	246	246	246	246	246	246	246
Log Likelihood	-335	-318	-310	-294	-294	-294	-294
Akaike Inf. Crit.	676	648	637	608	606	606	606
Bayesian Inf. Crit.	686	669	665	643	638	638	638

AIC基準でモデル選択を行うと、モデルE、F、GはAICが同じで、他のモデルよりもAICが小さいため、より良いモデルAと考えられる。モデルBを例に具体的な算出方法を確認しておく。

\[ \begin{eqnarray} AIC &=& (-2) lnL + 2(k) \\ &=& (-2)(-318) + 2(6) \\ &=& 648 \\ \\ BIC &=& (-2) lnL + 2(ln(n)/2) k \\ &=& (-2) (-318) + 2(ln(246)/2)6\\ &=& 663 \end{eqnarray} \]

モデルの仮定

回帰分析で最尤法を利用するのであれば、誤差が独立に同じ分散の正規分布に従うことを仮定するのと同じく、マルチレベルモデルでも構造的、確率的な部分に仮定を必要とする。構造的な部分では、レベル1では個人の変化の軌跡が線形なのか、非線形なのか、レベル2では、成長パラメタと時不変な予測変数との関係が定式化されている。確率的な部分は誤差の分布に対して正規分布、2変量正規分布を仮定している。つまり、このように仮定している限り、仮定を満たしているのかを調べることで、モデルに対する理解を深められる。通常の回帰分析と同じく、真のパラメタはわからないので、当てはめられた残差を調べることになる。

関数形の検証

まずは関数形の検証を行う。これはいくつかの個人に対して経験的プロットを可視化することで、レベル1の仮定された関数形が適切かどうかを調べることができる。これらのサンプルを見る限り、そこまで大きく線形の仮定が外れているとは考えられないため、ここでは線形の仮定は満たせているとする。

ggplot(alcohol1 %>% filter(id %in% c(4, 14, 23, 32, 41, 56, 65, 82)), aes(age_14, alcuse)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, colour = "tomato", size = 1) + 
  scale_x_continuous(breaks = c(14, 15, 16)) + 
  scale_x_continuous(breaks = c(0, 1, 2), label = c("14", "15", "16")) +
  ylim(-1, 5) + 
  facet_wrap( ~ id, scales = "free", nrow = 2) + 
  theme_bw()

レベル2の仮定に関しては、個人の成長パラメタの推定値とレベル2の予測変数との関係を可視化する。coaに関しては、親がアルコール依存症の個人の場合、初期値が高くなる傾向があることがわかる。また、peerに対しては、周囲の友人がアルコールを摂取する頻度が大きくなると、比例して初期値が大きくなる傾向があり、弱い関係ではあるが、変化率に対は、周囲の友人がアルコールを摂取する頻度が大きくなると、変化率は緩やかになることがわかる。つまり、使用している予測変数が初期値と変化率に関係があることがわかる。

df_fit <- alcohol1 %>% 
  group_by(id) %>% 
  nest() %>% 
  mutate(
    fit_lm = map(.x = data, .f = function(x){lm(alcuse ~ age_14, data = x)}),
    tidy_lm = map(.x = fit_lm, .f = function(x){tidy(x)}),
    glance_lm = map(.x = fit_lm, .f = function(x){glance(x)})
  ) %>% 
  unnest(tidy_lm) %>% 
  select(term, estimate, std.error) %>% 
  pivot_wider(names_from = term, values_from = c(estimate, std.error)) %>% 
  set_names(c("id", "inercept_coef", "slope_coef", "inercept_std", "slope_std")) %>% 
  select(id, inercept_coef, inercept_std, slope_coef, slope_std) %>% 
  ungroup() %>% 
  left_join(., alcohol1 %>% distinct(id, coa, peer), by = c("id" = "id"))


l1 <- ggplot(df_fit, aes(coa, inercept_coef)) + geom_jitter(width = 0.1) + geom_smooth(method = "lm", col = "tomato", se = FALSE) +  xlim(-0.5, 1.5)
l2 <- ggplot(df_fit, aes(coa, slope_coef))    + geom_jitter(width = 0.1) + geom_smooth(method = "lm", col = "tomato", se = FALSE) +  xlim(-0.5, 1.5)
r1 <- ggplot(df_fit, aes(peer, inercept_coef)) + geom_point() + geom_smooth(method = "lm", col = "tomato", se = FALSE) +  xlim(0, 3)
r2 <- ggplot(df_fit, aes(peer, slope_coef))    + geom_point() + geom_smooth(method = "lm", col = "tomato", se = FALSE) +  xlim(0, 3)

((l1 | r1) / (l2 | r2)) & theme_bw()

正規性の検証

正規性の検証は素残差\(\hat{\epsilon}_{ij}, \hat{\zeta}_{0j}, \hat{\zeta}_{1j}\)についていつもどおり行えばよい。まずはモデルから残差を抜き出しておく。

resid_eps <- as.numeric(residuals(model.f))
resid_eps_std <- resid_eps/sd(resid_eps)
zeta0 <- random.effects(model.f)[[1]]
zeta0_std <- zeta0/sd(zeta0)
zeta1 <- random.effects(model.f)[[2]]
zeta1_std <- zeta1/sd(zeta1)

df_eps <- tibble(
  id = alcohol1$id,
  age = alcohol1$age,
  resid_eps,
  resid_eps_std)

df_zeta <- tibble(
  id = unique(alcohol1$id),
  coa = alcohol1 %>% filter(age == 14) %>% pull(coa),
  peer = alcohol1 %>% filter(age == 14) %>% pull(peer),
  zeta0, 
  zeta0_std,
  zeta1,
  zeta1_std
)

list(df_eps, df_zeta)

## [[1]]
## # A tibble: 246 × 4
##       id   age resid_eps resid_eps_std
##    <int> <int>     <dbl>         <dbl>
##  1     1    14    0.266         0.575 
##  2     1    15    0.245         0.530 
##  3     1    16   -0.0444       -0.0962
##  4     2    14   -0.355        -0.768 
##  5     2    15   -0.580        -1.26  
##  6     2    16    0.195         0.423 
##  7     3    14   -0.177        -0.384 
##  8     3    15    0.0438        0.0949
##  9     3    16    0.581         1.26  
## 10     4    14   -1.08         -2.35  
## # … with 236 more rows
## 
## [[2]]
## # A tibble: 82 × 7
##       id   coa  peer  zeta0 zeta0_std   zeta1 zeta1_std
##    <int> <int> <dbl>  <dbl>     <dbl>   <dbl>     <dbl>
##  1     1     1 1.26   0.330     0.897  0.0558     0.202
##  2     2     1 0.894 -0.524    -1.43  -0.0644    -0.233
##  3     3     1 0.894  0.298     0.811  0.490      1.77 
##  4     4     1 1.79  -0.418    -1.14   0.198      0.717
##  5     5     1 0.894 -0.580    -1.58  -0.311     -1.12 
##  6     6     1 1.55   0.888     2.41   0.223      0.805
##  7     7     1 1.55   0.260     0.707 -0.175     -0.633
##  8     8     1 0     -0.272    -0.739 -0.278     -1.01 
##  9     9     1 0      0.109     0.297  0.633      2.29 
## 10    10     1 2     -0.400    -1.09  -0.179     -0.647
## # … with 72 more rows

正規性の検証であれば、QQプロットを利用して、プロットで線形の関係が確認できるかを調べる。標準化残差を利用した可視化方法で調べるのであれば、残差が正規分布ならばプラス・マイナス2シグマ以内に収まっているかを調べる。QQプロットだと、\(\hat{\epsilon}_{ij}, \hat{\zeta}_{1j}\)は少し外れているようにも見える一方で、\(\hat{\zeta}_{0j}\)は正規分布に従ってそうです。また、標準化残差プロットを見る限り、正規分布からの大きな逸脱はなさそうです。

l1 <- ggplot(df_eps,aes(sample = resid_eps)) + stat_qq() + stat_qq_line() + ggtitle("Normal Q-Q plot of epsilon") 
l2 <- ggplot(df_zeta, aes(sample = zeta0)) + stat_qq() + stat_qq_line() + ggtitle("Normal Q-Q plot of zeta0") 
l3 <- ggplot(df_zeta, aes(sample = zeta1)) + stat_qq() + stat_qq_line() + ggtitle("Normal Q-Q plot of zeta1") 

r1 <- ggplot(df_eps, aes(id, resid_eps_std)) + geom_point() + 
  geom_hline(yintercept = c(-2,2), linetype = "dotted", col = "tomato") + 
  ylim(-3, 3) + 
  ggtitle("Standardized residual plot of epsilon")

r2 <- ggplot(df_zeta, aes(id, zeta0_std)) + geom_point() + 
  geom_hline(yintercept = c(-2,2), linetype = "dotted", col = "tomato") + 
  ylim(-3, 3) + 
  ggtitle("Standardized residual plot of zeta0")

r3 <- ggplot(df_zeta, aes(id, zeta1_std)) + geom_point() + 
  geom_hline(yintercept = c(-2,2), linetype = "dotted", col = "tomato") + 
  ylim(-3, 3) + 
  ggtitle("Standardized residual plot of zeta1")


((l1 | r1) / (l2 | r2) / (l3 | r3)) & theme_bw()

等分散性の検証

予測変数と残差をプロットすることで等分散性の検証ができる。残差のサンプルサイズはレベル1とレベル2で異なるので、そのあたりは注意が必要。仮定が満たされるのであれば、予測変数の残差の変動は、おおよそ等しくなることが期待できる。

レベル1のageは上下の範囲が等しく、レベル2のcoa、peerでも問題なさそうに見える。気になる点としては、Zeta1 & Peerの右下のグラフにおいて、peerが2以上の右端の部分においては、変動が小さくなっており、この部分では等分散性が満たされていないかもしれない。

l <- ggplot(df_eps, aes(age, resid_eps)) + geom_jitter(width = 0.1) + geom_hline(yintercept = 0, linetype = "dotted", col = "tomato") + ggtitle("Epsilon & Age")
l1 <- ggplot(df_zeta, aes(coa , zeta0)) + geom_jitter(width = 0.1) + geom_hline(yintercept = 0, linetype = "dotted", col = "tomato") + ggtitle("Zeta0 & Coa")
r1 <- ggplot(df_zeta, aes(peer, zeta0)) + geom_point() + geom_hline(yintercept = 0, linetype = "dotted", col = "tomato") + ggtitle("Zeta0 & Peer")
l2 <- ggplot(df_zeta, aes(coa , zeta1)) + geom_jitter(width = 0.1) + geom_hline(yintercept = 0, linetype = "dotted", col = "tomato") + ggtitle("Zeta1 & Coa")
r2 <- ggplot(df_zeta, aes(peer, zeta1)) + geom_point() + geom_hline(yintercept = 0, linetype = "dotted", col = "tomato") + ggtitle("Zeta1 & Peer")

(l / (l1 | r1) / (l2 | r2)) & theme_bw()

マルチレベルモデリングChapter4:モデルの妥当性