UPDATE: 2024-01-02 20:31:39.989642

Introduction

This note is about estimating the average contract duration needed to calculate SaaS LTV. Calculating LTV is useful for identifying customers who are valuable on an LTV basis, and it is also needed to check whether advertising is being run well against the conventional LTV/CAC > 3x rule of thumb. So this time I summarize a way to estimate the average contract duration required for the LTV estimate. Whether it is the appropriate way, I can't say for sure.

Problems with the average contract duration

Search the internet and you will find plenty of articles saying that the average contract duration can be computed as \(\frac{1}{ChurnRate}\). The ones that show \(\frac{1}{ChurnRate}\) under an explicit assumption are fine, but many just state the formula and nothing else.
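
For reference, the usual derivation behind that formula (the assumption those articles leave implicit): if the churn rate \(c\) is the same every period, the contract length \(T\) is geometrically distributed, so

\[ E(T) = \sum_{t=1}^{\infty} t \, c \, (1-c)^{t-1} = \frac{c}{\{1-(1-c)\}^2} = \frac{1}{c} \]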

\(\frac{1}{ChurnRate}\) equals the average contract duration only when the churn rate is fixed; for an actual service, where the churn rate moves around, it is not even clear that the fixed-churn assumption holds. And in the first place, if you draw a retention curve for a BtoB service, it is usually steep at the start and flattens out afterwards, so the churn rate is not constant.

Also, if you are calculating on a monthly basis, a one-point improvement in the churn rate, from 2% to 1%, changes 1/0.02 = 50 months into 1/0.01 = 100 months, adding 50 months to the average contract duration. Multiplying this by the gross profit per month gives LTV, so LTV computed monthly this way will not be stable.

To work around this, I previously wrote up a method that estimates LTV by applying the survival probabilities from a survival curve to the unit price as a present-value (PV) discount. That method does appear to give LTV by summing present values, but it has a problem: it can present "X yen for a given period", yet it cannot say "X yen" on average. Because there is no average contract duration, you do not know up to which period to sum.

So the idea of this note is to try using the Weibull distribution to estimate the average contract duration.

Weibull distribution

When the probability of an event occurring is not constant over time but changes, the time until the event occurs, treated as a random variable, is often modeled with the Weibull distribution.

\[ f(x) = \left( \frac{shape}{scale} \right)\left( \frac{x}{scale} \right)^{shape-1}e^{-\left( \frac{x}{scale} \right)^{shape}}, \quad x \ge 0, \ shape > 0, \ scale > 0 \]

Writing out the derivation of the Weibull expectation on a PC is a bit of a hassle, so here I just attach a handwritten note. For some reason I apparently derived it about a year ago, and the note also says that R's function takes the parameters in a different order. Presumably I had some reason to use the Weibull distribution back then and it wasn't going well in R.

(Handwritten note: derivation of E(X) and V(X) of the Weibull distribution)
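
For reference, the results of that derivation are the standard ones:

\[ E(X) = scale \cdot \Gamma\left(1 + \frac{1}{shape}\right), \quad V(X) = scale^2\left\{ \Gamma\left(1 + \frac{2}{shape}\right) - \Gamma\left(1 + \frac{1}{shape}\right)^2 \right\} \]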

Anyway, the Weibull distribution is often used to model the time until some product fails, with the bathtub curve as the usual example, and it does not treat the probability of failure as constant. Mapped onto a SaaS subscription, the churn rate can be regarded as non-constant over the period from signing up to cancelling, so computing the expected value of a Weibull distribution should give something like an average contract duration. Whether that is appropriate, I can't say for sure.
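
As an illustration with made-up numbers, the Weibull hazard \(h(t) = \frac{shape}{scale}\left(\frac{t}{scale}\right)^{shape-1}\) is decreasing when shape < 1, which matches the "steep at first, then flat" retention curve mentioned above (the shape and scale values below are hypothetical, not estimated from anything):

# Weibull hazard = density / survival; shape < 1 gives a decreasing hazard
hazard <- function(t, shape, scale) {
  dweibull(t, shape = shape, scale = scale) / (1 - pweibull(t, shape = shape, scale = scale))
}
t_month <- seq(1, 36, 1)  # hypothetical months since signup
plot(t_month, hazard(t_month, shape = 0.7, scale = 24), type = 'l', col = 'tomato',
     xlab = 'month', ylab = 'hazard (churn rate)')
lines(t_month, hazard(t_month, shape = 1.0, scale = 24), lty = 2)  # constant-churn case for comparison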

There is a catch, though: even if you have data that presumably follow a Weibull distribution, you cannot compute the expectation right away. You need the shape and scale parameters, so these are estimated first. Here that is done by fitting a parametric survival model under a Weibull assumption, obtaining the shape and scale parameters, and then computing the expectation.

As sample data, the lung dataset is used here.

library(tidyverse)
library(survival)
library(rstan)

# status: 1=censored, 2=dead -> recoded below to 0=censored, 1=dead
# sex: 1=male, 2=female -> recoded below to is_female (0=male, 1=female)
lung2 <- lung %>% 
  dplyr::mutate(
    status = status - 1,
    is_female = sex - 1
  ) %>% 
  dplyr::select(time, status, is_female) %>% 
  dplyr::filter(time != 0)

head(lung2)
##   time status is_female
## 1  306      1         0
## 2  455      1         0
## 3 1010      0         0
## 4  210      1         0
## 5  883      1         0
## 6 1022      0         0

With the survreg function, the model is fitted on the log scale, so the regression coefficients apply to log survival time. Since we will eventually compute the expected value for observations with is_female = 1, is_female is used as the explanatory variable.

Looking at the result, exp(0.3956) = 1.48, which means survival time is about 1.48 times longer for women.

lung_reg <- survreg(
  formula = Surv(time, status) ~ is_female,
  data = lung2,
  dist = 'weibull'
)

summary(lung_reg)
## 
## Call:
## survreg(formula = Surv(time, status) ~ is_female, data = lung2, 
##     dist = "weibull")
##               Value Std. Error     z       p
## (Intercept)  5.8842     0.0720 81.76 < 2e-16
## is_female    0.3956     0.1276  3.10  0.0019
## Log(scale)  -0.2809     0.0619 -4.54 5.7e-06
## 
## Scale= 0.755 
## 
## Weibull distribution
## Loglik(model)= -1148.7   Loglik(intercept only)= -1153.9
##  Chisq= 10.4 on 1 degrees of freedom, p= 0.0013 
## Number of Newton-Raphson Iterations: 5 
## n= 228
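
Since survreg with dist = 'weibull' is an accelerated failure time model, the exponentiated coefficient can be read directly as a time ratio; the 1.48 above comes straight out of the fitted object:

# time ratio for is_female = 1 relative to is_female = 0
exp(coef(lung_reg)["is_female"])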

Survival times can be predicted with the predict function, as below.

# predict survival time from the log survival time (linear predictor)
# as getS3method("predict", "survreg") shows, the default type already returns the exponentiated survival time
lung_pred_lp <- predict(lung_reg, type = 'lp')
lung_pred <- exp(lung_pred_lp)
head(lung_pred, 5)
## [1] 359.3015 359.3015 359.3015 359.3015 359.3015

Let's also plot the survival curve.

# visualize the survival curve
# Ref:https://stats.stackexchange.com/questions/177679/using-quantile-in-predict-for-survival
pct <- seq(0.01, 0.99, 0.01)  # avoid p = 0 and p = 1, whose quantiles are 0 and Inf
pred_female <- predict(
  lung_reg,
  type = 'quantile',
  newdata = data.frame(is_female = 1),
  se = TRUE,
  p = pct
  )
fit_female <- pred_female$fit
upr_female <- fit_female + 1.96 * pred_female$se.fit
lwr_female <- fit_female - 1.96 * pred_female$se.fit
plot(fit_female, 1-pct, type = 'l', col = 'tomato')
lines(upr_female, 1-pct, lty = 2, col = 'tomato')
lines(lwr_female, 1-pct, lty = 2, col = 'tomato')

The parameters can be extracted from this model. I referred to the following article by Prof. Yamamoto of Keio University.

Here we actually compute the expected value for the condition is_female = 1. Using the estimated shape and scale parameters, the result is 491 days.

# the Weibull shape is the reciprocal of survreg's scale (sigma)
shape_female <- 1/lung_reg$scale
# intercept plus the condition is_female = 1
x_female <- c(1, 1) 
# linear predictor = coefficients times covariates, summed
lp_female <- coef(lung_reg) %*% x_female
# exponentiate the linear predictor to get the Weibull scale
scale_female <- exp(lp_female)
# expected value of the Weibull distribution
expected_value <- scale_female * gamma(1 + 1/shape_female)
expected_value
##          [,1]
## [1,] 491.0803
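
As a sanity check (my own addition), the same number can be recovered by numerically integrating the fitted survival function, since \(E(T) = \int_0^\infty S(t)dt\):

# S(t) = exp(-(t/scale)^shape) for the Weibull; its integral from 0 to Inf equals the mean
sc <- as.numeric(scale_female)
integrate(function(t) exp(-(t / sc)^shape_female), lower = 0, upper = Inf)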

Let's also visualize the Weibull distribution using the estimated shape and scale parameters.

# visualize the Weibull density based on the estimated parameters
time <- seq(min(lung2$time), max(lung2$time), 1)
y_female <- dweibull(time, shape = shape_female, scale = scale_female)
plot(time, y_female, type = 'l', col = 'tomato')

Now that the expectation of 491 days (≈ average contract duration) has been obtained, it can serve as the horizon up to which to sum LTV. The lung dataset has long durations, hence 491 days, but if, for example, you compute in months and the expectation comes out to 30 months, then summing the PV (unit price times the survival curve) up to month 30 gives an LTV estimate in yen.
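
A minimal sketch of that calculation with made-up numbers (monthly periods, a gross margin of 5,000 yen per month, and a Weibull with shape = 0.8 and scale = 25 months; none of these come from the lung example above):

shape_m <- 0.8   # hypothetical shape estimated on monthly data
scale_m <- 25    # hypothetical scale in months
margin  <- 5000  # hypothetical gross margin per month (yen)
# expected contract duration in months = horizon for the summation
horizon <- ceiling(scale_m * gamma(1 + 1 / shape_m))
months  <- seq_len(horizon)
# survival probability at each month, used to discount the margin
surv <- 1 - pweibull(months, shape = shape_m, scale = scale_m)
# LTV in yen: margin times survival probability, summed up to the expected duration
sum(margin * surv)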

I don't know how to compute a 95% confidence interval for the Weibull expectation (it could probably be done with the bootstrap), but if it can be computed, LTV could be estimated with a range. Don't quote me on that, though.
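
A rough sketch of that bootstrap idea (my own addition: resample rows of lung2, refit survreg, recompute the expectation for is_female = 1, and take quantiles):

set.seed(1)
boot_ev <- replicate(500, {
  idx <- sample(nrow(lung2), replace = TRUE)
  fit <- survreg(Surv(time, status) ~ is_female, data = lung2[idx, ], dist = 'weibull')
  shape_b <- 1 / fit$scale
  scale_b <- exp(sum(coef(fit) * c(1, 1)))  # linear predictor for is_female = 1
  scale_b * gamma(1 + 1 / shape_b)          # expected contract duration (days)
})
quantile(boot_ev, c(0.025, 0.975))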

Appendix

As an appendix, here is an example of doing survival analysis and a Cox-style (proportional hazards) regression with Stan. I referred to the material below.

The sample data come from an experiment on recidivism among 432 male prisoners, who were followed for one year (52 weeks) after release from prison. The dataset records the time until re-arrest (in weeks) and is meant for examining, with Cox regression, how each variable affects re-arrest.

week is the time until re-arrest (in weeks), censored beyond 52 weeks; arrest is whether the person was re-arrested, and fin is whether they received financial aid. With the proportional hazards model, we examine whether financial aid lowers the occurrence of re-arrest.

url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
df <- read.table(url, header = TRUE)
df <- df %>% select(week, arrest, fin) %>% 
  mutate(fin = if_else(fin == "no", 0 , 1))
  
head(df)
##   week arrest fin
## 1   20      1   0
## 2   17      1   0
## 3   25      1   0
## 4   52      0   1
## 5   52      0   0
## 6   52      0   0

For how to handle censoring, Section 7.8 ("Censoring") in Chapter 7 of the "Ahiru book" (Matsuura's Stan book) is a good reference. Here, the likelihood contribution of an observation censored at 52 weeks is added using the cumulative distribution function, as below.

\[ \begin{align} Prob[52 \lt y] &= \int_{52}^{\infty}weibull(y|m, \eta)dy \\ &= 1 - \int_0^{52} weibull(y|m, \eta)dy \\ &= 1 - weibull\_cdf(52|m, \eta) \\ &= weibull\_ccdf(52|m, \eta)\end{align} \]

In the model block, the code checks whether an observation is censored (arrest = 0) and adds the log likelihood from either the density or the complementary distribution function accordingly.

data {
  int N;
  int week[N];
  int arrest[N];
}

parameters {
  // note: no <lower=0> bounds are declared here, which is why the sampler log
  // below shows "Rejecting initial value" messages when negative values are proposed
  real shape;
  real scale;
}

model {
  for(n in 1:N){
    if(arrest[n] == 0){
    // The log of the Weibull complementary cumulative distribution function
      target += weibull_lccdf(week[n]| shape, scale);
    }else{
    // The log of the Weibull density
      target += weibull_lpdf(week[n]| shape, scale);
    }
  }
}

generated quantities {
  real pred_Y[N];

  for(n in 1:N){
    pred_Y[n] = (1 - weibull_cdf(week[n], shape, scale));
  }
}

Prepare the data to pass to Stan.

standata <- list(N = nrow(df), 
                 week = df$week,
                 arrest = df$arrest)

fit <- stan(file = 'weibull.stan', data = standata, iter = 2000, chains = 4)
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## using C compiler: ‘Apple clang version 14.0.0 (clang-1400.0.29.202)’
## using SDK: ‘MacOSX13.1.sdk’
## clang -arch arm64 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppParallel/include/"  -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DUSE_STANC3 -DSTRICT_R_HEADERS  -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION  -D_HAS_AUTO_PTR_ETC=0  -include '/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1   -I/opt/R/arm64/include    -fPIC  -falign-functions=64 -Wall -g -O2  -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22:
## In file included from /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:16: error: expected ';' after top level declarator
## namespace Eigen {
##                ^
##                ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22:
## In file included from /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
##          ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Shape parameter is -0.49686, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Scale parameter is -1.86395, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Scale parameter is -1.11009, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Shape parameter is -1.55321, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Shape parameter is -0.369625, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Shape parameter is -0.950526, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Scale parameter is -0.595757, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Shape parameter is -1.46093, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Shape parameter is -0.605972, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Scale parameter is -0.0237462, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Shape parameter is -0.503671, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Scale parameter is -0.255635, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Shape parameter is -0.29396, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: weibull_lpdf: Shape parameter is -1.10761, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 1: 
## Chain 1: Gradient evaluation took 2.5e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.25 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.161 seconds (Warm-up)
## Chain 1:                0.149 seconds (Sampling)
## Chain 1:                0.31 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
## Chain 2: Rejecting initial value:
## Chain 2:   Error evaluating the log probability at the initial value.
## Chain 2: Exception: weibull_lpdf: Scale parameter is -1.27362, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 2: Rejecting initial value:
## Chain 2:   Error evaluating the log probability at the initial value.
## Chain 2: Exception: weibull_lpdf: Shape parameter is -1.94515, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 2: Rejecting initial value:
## Chain 2:   Error evaluating the log probability at the initial value.
## Chain 2: Exception: weibull_lpdf: Shape parameter is -0.521643, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 2: Rejecting initial value:
## Chain 2:   Error evaluating the log probability at the initial value.
## Chain 2: Exception: weibull_lpdf: Shape parameter is -1.83261, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 2: Rejecting initial value:
## Chain 2:   Error evaluating the log probability at the initial value.
## Chain 2: Exception: weibull_lpdf: Shape parameter is -0.400427, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 2: Rejecting initial value:
## Chain 2:   Error evaluating the log probability at the initial value.
## Chain 2: Exception: weibull_lpdf: Scale parameter is -0.276471, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 2: 
## Chain 2: Gradient evaluation took 1.6e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.287 seconds (Warm-up)
## Chain 2:                0.139 seconds (Sampling)
## Chain 2:                0.426 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
## Chain 3: Rejecting initial value:
## Chain 3:   Error evaluating the log probability at the initial value.
## Chain 3: Exception: weibull_lpdf: Scale parameter is -0.520892, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 3: Rejecting initial value:
## Chain 3:   Error evaluating the log probability at the initial value.
## Chain 3: Exception: weibull_lpdf: Shape parameter is -1.3579, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 3: Rejecting initial value:
## Chain 3:   Error evaluating the log probability at the initial value.
## Chain 3: Exception: weibull_lpdf: Shape parameter is -0.0495647, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 3: Rejecting initial value:
## Chain 3:   Error evaluating the log probability at the initial value.
## Chain 3: Exception: weibull_lpdf: Scale parameter is -1.48575, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 3: Rejecting initial value:
## Chain 3:   Error evaluating the log probability at the initial value.
## Chain 3: Exception: weibull_lpdf: Shape parameter is -1.00092, but must be positive finite! (in 'anon_model', line 15, column 6 to column 52)
## Chain 3: 
## Chain 3: Gradient evaluation took 1.6e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.197 seconds (Warm-up)
## Chain 3:                0.127 seconds (Sampling)
## Chain 3:                0.324 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 1.7e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.33 seconds (Warm-up)
## Chain 4:                0.133 seconds (Sampling)
## Chain 4:                0.463 seconds (Total)
## Chain 4:

The estimation results are as follows.

print(fit, pars = c("scale", "shape"))
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##         mean se_mean    sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
## scale 127.04    0.45 14.47 103.79 116.89 125.26 135.52 160.37  1035 1.00
## shape   1.36    0.00  0.12   1.13   1.27   1.35   1.44   1.60  1052 1.01
## 
## Samples were drawn using NUTS(diag_e) at Tue Jan  2 20:32:07 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
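
As a quick cross-check (my own addition, not part of the original analysis), an intercept-only survreg fit on the same data should land near the posterior means above once survreg's parameterization is converted (shape = 1/fit$scale, scale = exp(intercept)):

# maximum-likelihood counterpart of the Stan model above
fit_ml <- survreg(Surv(week, arrest) ~ 1, data = df, dist = 'weibull')
c(shape = 1 / fit_ml$scale, scale = unname(exp(coef(fit_ml))))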

Draw the survival function with credible intervals.

rstan::extract(fit)$pred_Y %>% 
  data.frame() %>% 
  apply(., 2, quantile, prob = c(0.025, 0.10, 0.5, 0.90, 0.975)) %>% 
  t() %>% 
  `colnames<-`(c("cred2.5", "cred10", "cred50", "cred90", "cred97.5")) %>% 
  cbind(df) %>% 
  ggplot() + 
  geom_line(aes(week, y = cred50), colour = "#4169e1") +
  geom_ribbon(aes(week, ymin = cred2.5, ymax = cred97.5), alpha = 0.1, fill = "#4169e1") +
  geom_ribbon(aes(week, ymin = cred10, ymax = cred90), alpha = 0.3, fill = "#4169e1") +
  theme_classic() +
  theme(axis.text  = element_text(size = 15),
        axis.title = element_text(size = 15))

Next, the proportional hazards model. The Weibull distribution is expressed with a shape parameter \(m\), which describes how the phenomenon evolves, and a scale parameter \(\eta\), which sets its time scale. The hazard function is written as follows.

\[ \begin{align} h(t|m, \eta) &= \frac{m}{\eta}\left(\frac{t}{\eta}\right)^{m-1} \\ &= \frac{m}{\eta}t^{m-1}\left( \frac{1}{\eta} \right)^{m} \left( \frac{1}{\eta} \right)^{-1} \\ &= \frac{m}{\eta}t^{m-1}\left( \frac{1}{\eta} \right)^{m} \eta \\ &= \frac{m}{\eta^m}t^{m-1} \\ &= m \frac{1}{\eta^m}t^{m-1} \\ &= m \lambda t^{m-1} \end{align} \]

As noted in the blog I referred to, here too we follow the material below and put the covariate effects on the scale parameter \(\eta\). Rather than adding them to \(\eta\) directly, they enter through \(\lambda\). Let \(\beta_{0}, \beta, F\) be the intercept for the baseline state, the covariate coefficients, and the dummy variables.

\[ \lambda = exp(\beta_0 + \beta^TF ) \]

Rearranging this for \(\eta\): since \(\lambda = 1/\eta^m\) from the hazard above, \(\eta = \lambda^{-1/m}\), which gives the expression below. This is what gets implemented in Stan.

\[ \eta = exp \left(-\frac{\beta_0 + \beta^TF}{m} \right) \]

fin indicates whether financial aid was received, so we examine whether financial aid lowers the occurrence of re-arrest.

standata <- list(N = nrow(df), 
                 week = df$week,
                 arrest = df$arrest,
                 fin = df$fin)
fit <- stan(file = 'coxreg.stan', data = standata, iter = 2000, chains = 4)
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## using C compiler: ‘Apple clang version 14.0.0 (clang-1400.0.29.202)’
## using SDK: ‘MacOSX13.1.sdk’
## clang -arch arm64 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppParallel/include/"  -I"/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DUSE_STANC3 -DSTRICT_R_HEADERS  -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION  -D_HAS_AUTO_PTR_ETC=0  -include '/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1   -I/opt/R/arm64/include    -fPIC  -falign-functions=64 -Wall -g -O2  -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22:
## In file included from /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:16: error: expected ';' after top level declarator
## namespace Eigen {
##                ^
##                ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22:
## In file included from /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
##          ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 5.9e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.59 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.906 seconds (Warm-up)
## Chain 1:                0.838 seconds (Sampling)
## Chain 1:                1.744 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 4.4e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.44 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.904 seconds (Warm-up)
## Chain 2:                0.963 seconds (Sampling)
## Chain 2:                1.867 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 4.2e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.42 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.858 seconds (Warm-up)
## Chain 3:                0.873 seconds (Sampling)
## Chain 3:                1.731 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 4.1e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.41 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.863 seconds (Warm-up)
## Chain 4:                0.817 seconds (Sampling)
## Chain 4:                1.68 seconds (Total)
## Chain 4:

The estimation results are as follows. beta[2], the coefficient of financial aid (fin), is -0.37 [-0.74, -0.01]; the 95% credible interval lies below zero, so financial aid appears to reduce the chance of re-arrest.

print(fit, pars = c("beta"))
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##          mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## beta[1] -6.48    0.01 0.49 -7.44 -6.82 -6.48 -6.13 -5.51  1324    1
## beta[2] -0.37    0.00 0.19 -0.74 -0.50 -0.37 -0.25 -0.01  1699    1
## 
## Samples were drawn using NUTS(diag_e) at Tue Jan  2 20:32:37 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Let's visualize it. When financial aid is received, the re-arrest rate is lower.

df_Y1 <- rstan::extract(fit)$pred_Y1 %>% 
  data.frame() %>% 
  apply(., 2, quantile, prob = c(0.025, 0.10, 0.5, 0.90, 0.975)) %>% 
  t() %>% 
  `colnames<-`(c("cred2.5", "cred10", "cred50", "cred90", "cred97.5")) %>% 
  cbind(df)

df_Y2 <- rstan::extract(fit)$pred_Y2 %>% 
  data.frame() %>% 
  apply(., 2, quantile, prob = c(0.025, 0.10, 0.5, 0.90, 0.975)) %>% 
  t() %>% 
  `colnames<-`(c("cred2.5", "cred10", "cred50", "cred90", "cred97.5")) %>% 
  cbind(df)

ggplot() + 
  geom_line(data = df_Y1, aes(week, y = cred50), colour = "#4169e1") +
  geom_ribbon(data = df_Y1, aes(week, ymin = cred2.5, ymax = cred97.5), alpha = 0.3, fill = "#4169e1") +
  geom_line(data = df_Y2, aes(week, y = cred50), colour = "#AF4F5C") +
  geom_ribbon(data = df_Y2, aes(week, ymin = cred2.5, ymax = cred97.5), alpha = 0.3, fill = "#AF4F5C") +
  theme_classic() +
  theme(axis.text  = element_text(size = 15),
        axis.title = element_text(size = 15))

# url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
# Rossi <- read.table(url, header=TRUE)
mod.allison <- coxph(Surv(week, arrest) ~ fin, data=df)

plot(survfit(mod.allison, newdata = data.frame(fin = c(0, 1))),
     conf.int = TRUE, ylim = c(0.6, 1), col = c(2, 4),
     xlab = "Weeks", ylab = "Proportion Not Rearrested")
legend("bottomleft", legend=c("fin = no", "fin = yes"), lty = c(1 ,1), col = c(2, 4), inset = 0.02)

To also compute the log likelihood, use the code below.

stancode <- "
data {
  int N ;
  int week[N] ;
  int arrest[N] ;
}

parameters {
  real shape ;
  real scale ;
}

model {
  for(n in 1:N){
    if(arrest[n] == 0){
      target += weibull_lccdf(week[n]| shape, scale) ;
    }else{
      target += weibull_lpdf(week[n]| shape, scale) ;
    }
  }
}

generated quantities {
  real log_lik[N] ;
  real pred_Y[N];

  for(n in 1:N){
    if(arrest[n] == 0){
      log_lik[n] = weibull_lccdf(week[n]| shape, scale) ;
    }else{
      log_lik[n] = weibull_lpdf(week[n]| shape, scale) ;
    }
  }

  for(n in 1:N){
    pred_Y[n] = (1 - weibull_cdf(week[n], shape, scale));
  }
}
"

stancode <- "
data {
  int N ;
  int week[N] ;
  int arrest[N] ;
  int fin[N] ;
}

parameters {
  real shape ;
  real beta[2] ;
}

model {
  for(n in 1:N){
    if(arrest[n] == 0){
      target += weibull_lccdf(week[n]| shape, exp(- (beta[1] + fin[n] * beta[2]) / shape)) ;
    }else{
      target += weibull_lpdf(week[n]| shape, exp(- (beta[1] + fin[n] * beta[2]) / shape)) ;
    }
  }
}

generated quantities {
  real log_lik[N] ;
  real pred_Y1[N];
  real pred_Y2[N];

  for(n in 1:N){
    if(arrest[n] == 0){
      log_lik[n] = weibull_lccdf(week[n]| shape, exp(- (beta[1] + fin[n] * beta[2]) / shape)) ;
    }else{
      log_lik[n] = weibull_lpdf(week[n]| shape, exp(- (beta[1] + fin[n] * beta[2]) / shape)) ;
    }
  }

  for(n in 1:N){
    pred_Y1[n] = (1 - weibull_cdf(week[n], shape, exp(- (beta[1] + beta[2]) / shape)));
    pred_Y2[n] = (1 - weibull_cdf(week[n], shape, exp(- (beta[1]) / shape)));
  }
}
"