はじめに

縦断データの分析I―変化についてのマルチレベルモデリング―を利用してマルチレベルモデリングの勉強内容をまとめたもの。下記のサポートサイトにサンプルデータなどが保存されている。

TEXTBOOK EXAMPLES APPLIED LONGITUDINAL DATA ANALYSIS

ここでの目的は、データの測定回数や時間が異なるデータに対するマルチレベルモデルについてまとめておく。

測定タイミングの異なる場合

これまで扱ってきたデータは、測定間隔が同じで、測定回数も同じ構造化されたデータを扱ってきたが、ここでは、測定回数が個人で異なり、個人の測定間隔もバラバラなデータを扱う。

今回のデータは、89人のアフリカ系アメリカ人の子どもの読む能力テスト(PIAT: Peabody Individual Achievement Teat)の得点に関するデータで、測定回数は3回。1986年(6歳)、1988年(8歳)、1990年(10歳)のときに測定が行われている。

このデータの特徴は、時間に関する変数が下記の通り3つも存在していること。

wave: 測定回数を表す
agegrp: 検査が期待される年齢のグループを表す
age: テスト受験時の実際の年齢を表す

library(tidyverse)
library(broom)
library(nlme)
library(DT)
library(patchwork)
library(stargazer)

reading_pp <- read_csv("~/Desktop//reading_pp.csv") %>% 
  mutate(agegrp_c = agegrp - 6.5,
         age_c    = age    - 6.5)
datatable(reading_pp %>% mutate_if(is.numeric, round, 2))

これらの変数の関係を可視化するとわかりやすい。1回目の測定は必ず6歳に行われているわけではなく、7歳ギリギリの子どももいる。2回目には8歳のはずが10歳近くで検査しており、3回目は10歳の予定が、平均的には11歳で検査を受けている。つまり、測定の感覚がバラバラなデータであることがわかる。この現象を測定間隔のずれ(Occasion Creep)と呼ぶ。

reading_pp %>% 
  ggplot(aes(x = wave, y = age)) +
  geom_jitter(alpha = .5, height = 0, width = 0.1) +
  scale_y_continuous(breaks = seq(6, 12, 1)) +
  scale_x_continuous(breaks = 1:3) + 
  theme_bw()

このようなケースでは、wave、age、agegrpのどれを使用するべきなのだろうか。より正確な情報という点ではageであるが、等間隔で解釈しやすいのはagegrpである。個人成長プロットをage、agegrpで可視化しても、どちらも似たような傾向を可視化しているので、可視化からでは判断しがたい。

reading_pp %>% 
  select(-wave) %>% 
  filter(id %in% c(4, 27, 31, 33, 41, 49, 69, 77, 87)) %>% 
  pivot_longer(cols = c(agegrp, age), names_to = "type", values_to = "x") %>% 
  ggplot(., aes(x, piat, col = type)) + 
  geom_point() + 
  stat_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(breaks = seq(6, 12, 1)) + 
  # scale_x_continuous(breaks = c(0, 1, 2), label = c("14", "15", "16")) +
  ylim(0, 80) + 
  facet_wrap( ~ id, scales = "free", nrow = 3) + 
  theme_bw()

可視化だけでは判断し難いため、実際にモデルに利用することでどのような違いが現れるのか、確認する。

無条件成長モデル

ここでは、無条件成長モデルをもとにTIMEにage、agegrpを利用することで、モデルにどのような違いが現れるのか、確認する。いつもどおり、6.5を引いて中心化すると、agegrpを使ったモデルでは、\(\gamma_{00}\)は6.5歳時点の初期値を表し、\(\gamma_{10}\)は6.5歳時点の変化率を表すことになる。

\[ \begin{eqnarray} Y_{ij} &=& \pi_{0i} + \pi_{1i} TIME_{ij} + \epsilon_{ij} \\ \pi_{0i} &=& \gamma_{00} + \zeta_{0i} \\ \pi_{1i} &=& \gamma_{10} + \zeta_{1i} \\ \begin{bmatrix} \zeta_{0i} \\ \zeta_{1i} \end{bmatrix} &\sim& N \begin{bmatrix} \sigma_{0}^{2} & \sigma_{01} \\ \sigma_{10} & \sigma_{1}^{2} \end{bmatrix},\quad \epsilon_{ij} \sim N(0, \sigma_{\epsilon}^{2}) \end{eqnarray} \]

下記のとおりagegrpを使ったモデルを推定する。

lme.agegrp <- lme(piat ~ agegrp_c, reading_pp, random= ~ agegrp | id, method = "ML")
lme.age <- lme(piat ~ age_c, reading_pp, random= ~ age | id, method = "ML")

list(
  agegrp = summary(lme.agegrp),
  age = summary(lme.age)
)

## $agegrp
## Linear mixed-effects model fit by maximum likelihood
##   Data: reading_pp 
##        AIC      BIC    logLik
##   1831.949 1853.473 -909.9746
## 
## Random effects:
##  Formula: ~agegrp | id
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept) 13.244933 (Intr)
## agegrp       2.096994 -0.97 
## Residual     5.200308       
## 
## Fixed effects:  piat ~ agegrp_c 
##                 Value Std.Error  DF  t-value p-value
## (Intercept) 21.162921 0.6165805 177 34.32304       0
## agegrp_c     5.030899 0.2967329 177 16.95430       0
##  Correlation: 
##          (Intr)
## agegrp_c -0.316
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -2.6585175 -0.5418435 -0.1434177  0.3871955  3.3133003 
## 
## Number of Observations: 267
## Number of Groups: 89 
## 
## $age
## Linear mixed-effects model fit by maximum likelihood
##   Data: reading_pp 
##        AIC      BIC    logLik
##   1815.896 1837.419 -901.9478
## 
## Random effects:
##  Formula: ~age | id
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept) 10.668938 (Intr)
## age          1.816985 -0.985
## Residual     5.238946       
## 
## Fixed effects:  piat ~ age_c 
##                 Value Std.Error  DF  t-value p-value
## (Intercept) 21.060816 0.5614178 177 37.51363       0
## age_c        4.540021 0.2616162 177 17.35375       0
##  Correlation: 
##       (Intr)
## age_c -0.287
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -3.0068407 -0.4948681 -0.1369712  0.4095973  3.7274108 
## 
## Number of Observations: 267
## Number of Groups: 89

パラメタの分散はこの通り。

list(
       agegrp = VarCorr(lme.agegrp),
       age = VarCorr(lme.age)
    )

## $agegrp
## id = pdLogChol(agegrp) 
##             Variance   StdDev    Corr  
## (Intercept) 175.428242 13.244933 (Intr)
## agegrp        4.397385  2.096994 -0.97 
## Residual     27.043202  5.200308       
## 
## $age
## id = pdLogChol(age) 
##             Variance   StdDev    Corr  
## (Intercept) 113.826238 10.668938 (Intr)
## age           3.301434  1.816985 -0.985
## Residual     27.446551  5.238946

モデルの情報を比較する。\(\gamma_{0i}\)はどちらも21あたりであり大きな違いは見られない。また、子供の個人内のばらつき\(\sigma_{\epsilon}^{2}\)も27であたりであり大きな違いは見られない。変化率を表す\(\gamma_{1i}\)には、0.5ポイントくらいの差が出ている。これは2年ごとに、1ポイント差がでるため、研究の4年間では2ポイントの差に積み上がってしまう。

今回のデータであれば、agegrp実際のはageの観測タイミングよりも早い値になるため、固定効果であるす\(\gamma_{1i}\)が大きくなってしまいやすい。また、分散成分も大きくなってしまいやすい。


	Dependent variable:

	piat
	agegrp_c	age_c

Constant\(\gamma_{0i}\)	21.163^***	21.061^***
	(0.617)	(0.561)
time\(\gamma_{1i}\)	5.031^***	4.540^***
	(0.297)	(0.262)


Level1 \(\sigma_{\epsilon}^{2}\)	27.04	27.44
Level2 \(\sigma_{0}^{2}\)	175.42	113.82
Level2 \(\sigma_{1}^{2}\)	4.39	3.30

Observations	267	267
Log Likelihood	-909	-901
Akaike Inf. Crit.	1,831	1,815
Bayesian Inf. Crit.	1,853	1,837

Note:	p<0.1; p<0.05; p<0.01

そのため、時間構造化されていないデータ(age)を時間構造化されたもの(agegrp)として扱うのはよくなく、実際の観測タイミングでの値を利用するほうがよい。

測定時点の異なる場合

男子高校生の中退者の労働市場経験と賃金の関係を追跡したデータを測定時点の異なる場合の例とする。このデータは下記の特徴を持っている。

最初の時点で年齢が14歳から17歳までばらついている
測定の間隔が1年後であったり、2年後の場合もある
調査月もバラバラな状態
面接調査のときに2つ以上の仕事に回答できる
異なる時期に退学して、異なる時期に就職して、異なる時期に転職できる

wages_pp <- read.table("https://stats.idre.ucla.edu/stat/r/examples/alda/data/wages_pp.txt", header=T, sep=",") %>% 
  select(id, exper, lnw, black, hgc, uerate, hgc.9)
datatable(wages_pp %>% filter(id %in% c(206, 332, 1028)))

lnwは自然対数変換された賃金で、\(e^{2.03}=7.6\)ドルとなる。experは時間的な変数でありlnwの観測値と関連付けられた特定の時点を表す。例えば、id=206は3回の測定回数をもっており、1.87, 2.81, 4.31年経過後を意味している。転職すると、調査があって、レコードが生成されるというイメージ。

各個人の測定回数の分布を可視化すると、9回観測されている個人が多く、最小1回、最大13回の個人が存在している。

wages_pp %>% 
  group_by(id) %>% count() %>% 
  group_by(n)  %>% count() %>% 
  ggplot(., aes(x = n, y = nn)) +
  geom_bar(stat = "identity") + 
  scale_x_continuous(breaks = seq(0, 15, 1)) + 
  theme_bw()

さらに可視化すると、このデータの測定回数や測定間隔がいかにばらついているものなのかがわかる。

wages_pp %>% 
  filter(id %in% c(134, 173, 206, 332, 537)) %>% 
  mutate(id = factor(id)) %>% 
  ggplot(aes(x = exper, y = lnw, color = id)) +
  geom_point() +
  geom_line() +
  theme_bw()

測定回数や測定間隔がばらついているからといって、マルチレベルモデルでパラメタを推定できない、ということはない。

モデルの比較

測定間隔や測定回数が異なるからと言って、パラメタを推定する場合に、特別な処理が必要というわけではない。ここでは、いくつかのモデルを構築した結果を解釈する。

まずは無条件成長モデルであるモデルAの結果をみると、experの係数\(\hat{\gamma}_{10}\)が有意であることから、時間の経過とともに上昇していることがわかる。ただ、自然対数変換を行っているので、片対数モデルであり、xが1変化したときに\(100(e^{0.0457}-1)=4.7\)となるため、平均的に年間で4.7％上昇していることがわかる。

model.a <- lme(lnw ~ exper,
               random = ~ exper | id,
               data = wages_pp,
               method = "ML")
summary(model.a)

## Linear mixed-effects model fit by maximum likelihood
##   Data: wages_pp 
##        AIC     BIC    logLik
##   4933.394 4973.98 -2460.697
## 
## Random effects:
##  Formula: ~exper | id
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev     Corr  
## (Intercept) 0.23295543 (Intr)
## exper       0.04154474 -0.301
## Residual    0.30839044       
## 
## Fixed effects:  lnw ~ exper 
##                 Value   Std.Error   DF   t-value p-value
## (Intercept) 1.7156039 0.010798215 5513 158.87848       0
## exper       0.0456807 0.002341936 5513  19.50555       0
##  Correlation: 
##       (Intr)
## exper -0.565
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -4.20113323 -0.52073077 -0.03159151  0.44063201  7.03696492 
## 
## Number of Observations: 6402
## Number of Groups: 888

モデルBには高校中退者の人種blackと中退までに修了した学年を中心化したhgc.9がモデルに取り込まれている。hgcは6年(小学6年)から12年(高校3年)までの値をとり、平均は8.8なので、9をひくことで、平均値に近い意味のある値の周辺に中心化されている。つまり、平均的な値を表すことで、モデルを解釈しやすいようにしている。

初期状態に関するhgc.9の\(\hat{\gamma}_{01}\)は0.034で有意である。つまり、終了した年数が大きいほうが、初期時点での賃金が高い。また、変化率に関するexper:blackの\(\hat{\gamma}_{12}\)は-0.018であるため、アフリカ系であると賃金の上昇傾向が緩やかになる。

model.b <- lme(lnw ~ exper * hgc.9 + exper * black,
               random = ~ exper | id,
               data = wages_pp,
               method = "ML")
summary(model.b)

## Linear mixed-effects model fit by maximum likelihood
##   Data: wages_pp 
##        AIC      BIC    logLik
##   4893.751 4961.395 -2436.876
## 
## Random effects:
##  Formula: ~exper | id
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev     Corr  
## (Intercept) 0.22748209 (Intr)
## exper       0.04044512 -0.31 
## Residual    0.30853484       
## 
## Fixed effects:  lnw ~ exper * hgc.9 + exper * black 
##                  Value   Std.Error   DF   t-value p-value
## (Intercept)  1.7171385 0.012548264 5511 136.84272  0.0000
## exper        0.0493428 0.002632917 5511  18.74074  0.0000
## hgc.9        0.0349201 0.007885174  885   4.42858  0.0000
## black        0.0153954 0.023937728  885   0.64315  0.5203
## exper:hgc.9  0.0012794 0.001723983 5511   0.74213  0.4580
## exper:black -0.0182129 0.005501727 5511  -3.31040  0.0009
##  Correlation: 
##             (Intr) exper  hgc.9  black  exp:.9
## exper       -0.575                            
## hgc.9        0.071 -0.020                     
## black       -0.523  0.301 -0.020              
## exper:hgc.9 -0.019 -0.003 -0.578  0.011       
## exper:black  0.275 -0.478  0.011 -0.573 -0.023
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -4.26907895 -0.51809174 -0.03347664  0.44516366  7.01940007 
## 
## Number of Observations: 6402
## Number of Groups: 888

モデルCはレベル2サブモデルの切片にhgc9が含まれ、変化率にblackが含まれているモデル。これも、hgc9が大きいほうが初期時点での賃金の値が高く、アフリカ系であると変化率が緩やかになることがわかる。

model.c <- lme(lnw ~ exper + exper:black + hgc.9,
               random = ~ exper | id,
               data = wages_pp,
               method = "ML")
summary(model.c)

## Linear mixed-effects model fit by maximum likelihood
##   Data: wages_pp 
##        AIC      BIC    logLik
##   4890.704 4944.818 -2437.352
## 
## Random effects:
##  Formula: ~exper | id
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev     Corr  
## (Intercept) 0.22766400 (Intr)
## exper       0.04057936 -0.312
## Residual    0.30850207       
## 
## Fixed effects:  lnw ~ exper + exper:black + hgc.9 
##                  Value   Std.Error   DF   t-value p-value
## (Intercept)  1.7214750 0.010700478 5512 160.87832   0e+00
## exper        0.0488470 0.002514195 5512  19.42849   0e+00
## hgc.9        0.0383608 0.006435381  886   5.96092   0e+00
## exper:black -0.0161150 0.004512766 5512  -3.57099   4e-04
##  Correlation: 
##             (Intr) exper  hgc.9 
## exper       -0.515              
## hgc.9        0.077 -0.023       
## exper:black -0.036 -0.391 -0.015
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -4.27879293 -0.51825166 -0.03471513  0.44262171  7.01696748 
## 
## Number of Observations: 6402
## Number of Groups: 888

各モデルの結果をまとめると、下記の通りとなる。


	Dependent variable:

	lnw
	modelA	modelB	modelC

Constant\(\gamma_{00}\)	1.716^***	1.717^***	1.721^***
	(0.011)	(0.013)	(0.011)
hgc.9\(\gamma_{01}\)		0.035^***	0.038^***
		(0.008)	(0.006)

black\(\gamma_{02}\)		0.015
		(0.024)

exper\(\gamma_{10}\)	0.046^***	0.049^***	0.049^***
	(0.002)	(0.003)	(0.003)

exper:hgc.9\(\gamma_{11}\)		0.001
		(0.002)

exper:black\(\gamma_{12}\)		-0.018^***	-0.016^***
		(0.006)	(0.005)



Observations	6,402	6,402	6,402
Log Likelihood	-2,460.697	-2,436.876	-2,437.352
Akaike Inf. Crit.	4,933.394	4,893.751	4,890.704
Bayesian Inf. Crit.	4,973.980	4,961.395	4,944.818

Note:	p<0.1; p<0.05; p<0.01

モデルの理解を深めるために下記の典型的な4人を可視化する。blackは0 or 1で、hgc.9は9を引いて中心化しているので、0は9年生、3は12年生を表す。

白人ラテン系(black=0)で9年生(hgc.9=0)で中退
白人ラテン系(black=0)で12年生(hgc.9=3)で中退
アフリカ系(black=1)で9年生(hgc.9=0)で中退
アフリカ系(black=1)で12年生(hgc.9=3)で中退

モデルの係数解釈通り、修了年数が大きいほうが初期時点で賃金が高く、アフリカ系であると変化率が緩やかになるため、「アフリカ系の12年生で中退した個人」は、最初は「白人ラテン系の12年生で中退した個人」と同じ賃金であるが、年数が経過するにつれて、「白人ラテン系の9年生で中退した個人」のほうが、「アフリカ系の12年生で中退した個人」よりも賃金が高くなっていることがわかる。

fixef.c <- fixef(model.c)
df_fit_plt_c <- expand_grid(black = 0:1, 
            hgc.9 = c(0, 3), 
            exper = seq(from = 0, to = 12, length.out = 2)) %>% 
  mutate(lnwage = fixef.c[[1]] + 
           fixef.c[[2]] * exper + 
           fixef.c[[3]] * hgc.9 + 
           fixef.c[[4]] * exper * black,
         type = paste0("black=", black, " & hgc.9=", hgc.9))

ggplot(df_fit_plt_c, aes(exper, lnwage, col = as.character(type))) + 
  geom_path(size = 1) +
  geom_text(aes(y = lnwage + 0.02, label = round(lnwage, 2))) +
  scale_x_continuous(breaks = seq(0, 12, 1)) +
  scale_color_manual(values = c("#a0d8ef", "#19448e", "#ec6d71", "#a22041"), name = "type") +
  xlab("exper") + 
  ggtitle("Model C for the effects of balck & hgc9") +
  theme_bw()

モデルの問題点

各個人の観測回数が少ないデータでモデルを当てはめると、繰り返し計算のアルゴリズムが収束せず、分散成分が計算できない場合がある。無条件成長モデルを使って考えてみる。

\[ \begin{eqnarray} Y_{ij} &=& \gamma_{00} + \gamma_{10} Time_{ij} + [\zeta_{0i} + \zeta_{1i}Time_{ij} + \epsilon_{ij}] \\ &=& \gamma_{00} + \gamma_{10} Time_{ij} + \epsilon^{*}_{ij} \\ \begin{bmatrix} \zeta_{0i} \\ \zeta_{1i} \end{bmatrix} &\sim& N \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_{0}^{2} & \sigma_{01} \\ \sigma_{10} & \sigma_{1}^{2} \end{bmatrix}\right),\quad \epsilon_{ij} \sim N(0, \sigma_{\epsilon}^{2}) \end{eqnarray} \]

普通の回帰モデルに似ているが、大きな違いは残差が合成されている残差であること。これまでどおり、レベル1サブモデルの残差は正規分布に従い、レベル2サブモデルは2変量正規分布に従う。

例えばこのモデルにおいて、\(\epsilon^{*}_{ij} \sim N(0, \sigma_{\epsilon^{*}}^{2})\)だとすると、レベル２の残差\(\zeta_{0i},\zeta_{1i}\)は0ということになる。つまり、\(\sigma_{0}^{2},\sigma_{1}^{2}\)も0となる。\(\sigma_{0}^{2},\sigma_{1}^{2}\)が0であるということは、初期値や変化率が個人間で同じで一定となる。

このような場合に、モデルに必要なのは\(Time_{ij}\)が十分であること。極端な話で厳密さにかけるが、2点しかないのであれば、点と線を結んでいるようなものなので、残差が0で分散が計算できない。

summary(lm(c(10,20) ~ c(1,2)))

## 
## Call:
## lm(formula = c(10, 20) ~ c(1, 2))
## 
## Residuals:
## ALL 2 residuals are 0: no residual degrees of freedom!
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.549e-15        NaN     NaN      NaN
## c(1, 2)     1.000e+01        NaN     NaN      NaN
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 1 and 0 DF,  p-value: NA

少なくとも2時点のデータで\(Y_{ij}\)がばらついているのであれば、残差も計算できる。3時点のデータで\(Y_{ij}\)がばらついているのであれば、各地点の縦縞に直線が通る。測定回数や頻度が異なるデータは、観測地点で観測が行われず、各地点の近辺に値がばらつくので、縦縞にはならず、横に広がったような関係になり、推定されやすくなる。

また、理論的に取りうることがないような値を制約するパラメタ境界制約という問題もある。例えば分散はマイナスにならないはずである。ただ、場合によってはマルチレベルモデルのパラメタ推定において、繰り返し計算を行っていると0やありえない値を推定することがある。つまり、分散が0になっている場合には、境界制約の問題に出くわしている可能性が高い。

意図的にサイズを小さくして問題が起こるようにしたデータセットを利用して実際に分散が0になるかを確認しておく。このデータの個人は3回以下の観測しか持たない。

wages.small <- read.table("https://stats.idre.ucla.edu/stat/r/examples/alda/data/wages_small_pp.txt", header=T, sep=",")
wages.small %>% 
  group_by(id) %>% count() %>% 
  group_by(n)  %>% count() %>% 
  ggplot(., aes(x = n, y = nn)) +
  geom_bar(stat = "identity") + 
  scale_x_continuous(breaks = seq(0, 15, 1)) + 
  theme_bw()