๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ


์ž„์˜์ ˆํŽธ ๋ชจ๋ธ(Random intercept model)

R์˜ car ํŒจํ‚ค์ง€์˜ Leinhardt ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ ์ž„์˜์ ˆํŽธ ๋ชจ๋ธ(Random intercept model)์„ ๋ชจ๋ธ์„ ๋งŒ๋“ค์–ด๋ณด์ž. ์ˆ˜ํ–‰๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

1. ๋ฐ์ดํ„ฐ ํ™•์ธ

2. ๋ชจ๋ธ๋ง

3. ๋ชจ๋ธํ™•์ธ

 

1. ๋ฐ์ดํ„ฐ ํ™•์ธ

 

In:

library(car)

data('Leinhardt')

pairs(Leinhardt)

head(Leinhardt)

 

Out:

          income infant   region oil
Australia   3426   26.7     Asia  no
Austria     3350   23.7   Europe  no
Belgium     3346   17.0   Europe  no
Canada      4751   16.8 Americas  no
Denmark     5029   13.5   Europe  no
Finland     3312   10.1   Europe  no

 

โ–ท ์—ฐ์†ํ˜• ๋ณ€์ˆ˜๋Š” income๊ณผ infant, ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜๋กœ๋Š” region๊ณผ oil๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค.

 

โ–ท ๊ทธ๋ฆผ์—์„œ ๋ณด๋‹ค์‹œํ”ผ infant์™€ income์€ ๋น„์„ ํ˜•์ ์ธ ์Œ์˜ ๊ด€๊ณ„, region๋ณ„๋กœ infant๊ฐ€ ๋‹ฌ๋ผ์ง€๋Š” ๊ฒฝํ–ฅ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ oil์ด yes์ธ ๊ฒฝ์šฐ, infant๊ฐ€ ๋‚ฎ์•„์ง€๋Š” ๊ฒฝํ–ฅ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

 

2. ๋ชจ๋ธ๋ง

 

์ž„์˜์ ˆํŽธ ๋ชจ๋ธ์˜ ์„ ํƒ๋œ ๋ณ€์ˆ˜์˜ ์ˆ˜์ค€์— ๋”ฐ๋ผ ์ ˆํŽธ์ด ๋‹ฌ๋ผ์ง€๋Š” ๋ชจํ˜•์ด๋‹ค. ๋นˆ๋„์ฃผ์˜ ๊ด€์ ์—์„œ์˜ ์ž„์˜์ ˆํŽธ ๋ชจ๋ธ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

 

โ–ท beta 0๋Š” ์ ˆํŽธ์˜ ์ถ”์ •๋Ÿ‰, beta 1์€ ์„ค๋ช…๋ณ€์ˆ˜์˜ ์ถ”์ •๋Ÿ‰์ด๋‹ค. ๊ธฐ์กด์˜ ์„ ํ˜•ํšŒ๊ท€ ๋ถ„์„๊ณผ๋Š” ๋‹ค๋ฅด๊ฒŒ ์ˆ˜์ค€์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง€๋Š” ์ ˆํŽธ์˜ ์˜ค์ฐจ๊ฐ€ ์ถ”๊ฐ€๋œ๋‹ค.

 

โ–ถ ๋ฒ ์ด์ง€์•ˆ ๊ด€์ ์—์„œ์˜ ์ž„์˜์ ˆํŽธ ๋ชจ๋ธ์€ ๋ฒ ์ด์ง€์•ˆ ์„ ํ˜•ํšŒ๊ท€ ๋ถ„์„์—์„œ ์ ˆํŽธ์˜ ์˜ค์ฐจ์ธ u์˜ ๋‘ ๋ชจ์ˆ˜์— ์‚ฌ์ „๋ถ„ํฌ์— ๋Œ€ํ•œ ๊ฐ€์ •์„ ์ถ”๊ฐ€ํ•˜์—ฌ ๋ชจ๋ธ๋งํ•˜๊ฒŒ ๋œ๋‹ค.

 

์œ„์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ๋ชจ๋ธ์—์„œ๋Š” region์˜ ์ˆ˜์ค€์— ๋”ฐ๋ผ ์ ˆํŽธ์ด ๋‹ฌ๋ผ์ง€๋„๋ก ๋ชจ๋ธ๋งํ•  ๊ฒƒ์ด๋‹ค.

 

In:

library(rjags)

data = na.omit(Leinhardt)

data$log_income = log(data$income)
data$log_infant = log(data$infant)

mod_string = " model {
  for (i in 1:length(y)) {
    y[i] ~ dnorm(mu[i], prec)   # likelihood; JAGS dnorm takes a precision, prec = 1/sigma^2
    mu[i] = a[region[i]] + b[1]*log_income[i] + b[2]*is_oil[i]   # region-specific intercept
  }
  
  for (j in 1:max(region)) {
    a[j] ~ dnorm(a0, prec_a)    # random intercepts share a common normal prior
  }
  
  a0 ~ dnorm(0.0, 1.0/1.0e6)    # vague prior on the mean of the intercepts
  prec_a ~ dgamma(1/2.0, 1*10.0/2.0)
  tau = sqrt(1.0/prec_a)        # standard deviation of the region intercepts
  
  for (j in 1:2) {
    b[j] ~ dnorm(0.0, 1.0/1.0e6)   # vague priors on the coefficients
  }
  
  prec ~ dgamma(5/2.0, 5*10.0/2.0)
  sig = sqrt(1.0/prec)          # residual standard deviation
} "

data_jags = list(y = data$log_infant, 
                 log_income = data$log_income,
                 is_oil = as.numeric(data$oil == 'yes'), 
                 region = as.numeric(data$region))

params = c('a0', 'a', 'b', 'sig', 'tau')

mod = jags.model(textConnection(mod_string), 
                 data = data_jags, 
                 n.chains = 3)

update(mod, 1e3)

mod_sim = coda.samples(model = mod,
                       variable.names = params,
                       n.iter = 5e3)

mod_comb_sim = as.mcmc(do.call(rbind, mod_sim))

 

Out:

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 101
   Unobserved stochastic nodes: 9
   Total graph size: 622

Initializing model

  |**************************************************| 100%
  |**************************************************| 100%

 

โ–ท ๋ฐ์ดํ„ฐ ํ™•์ธ์œผ๋กœ๋ถ€ํ„ฐ ์–ป์€ ์ธ์‚ฌ์ดํŠธ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ income๊ณผ infant์— ๋กœ๊ทธ๋ฅผ ์ทจํ•˜์—ฌ ๋‘˜์˜ ๊ด€๊ณ„๊ฐ€ ์„ ํ˜•์ด ๋˜๋„๋ก ํ•˜์˜€๋‹ค.

 

โ–ท ์ด์™ธ์˜ ์ฝ”๋“œ์— ๋Œ€ํ•œ ๋‚ด์šฉ์€ ์ƒ๋žตํ•˜๋„๋ก ํ•˜๊ฒ ๋‹ค. JAGS์˜ ์‚ฌ์šฉ๋ฒ•์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๊ณ ์ž ํ•œ๋‹ค๋ฉด, ๋‹ค์Œ์˜ ํฌ์ŠคํŒ…์„ ์ฐธ๊ณ ํ•˜๊ธธ ๋ฐ”๋ž€๋‹ค.

 

 

How to use JAGS (Just Another Gibbs Sampler) — rooney-song.tistory.com

 

3. ๋ชจ๋ธ ํ™•์ธ

 

Gelman-Rubin diagostic์„ ํ†ตํ•ด ๋งˆ๋ฅด์ฝ”ํ”„ ์ฒด์ธ์ด ์ˆ˜๋ ดํ•˜์˜€๋Š”์ง€ ํ™•์ธํ•˜์—ฌ ๋ณด์ž.

 

In:

gelman.diag(mod_sim)

# plot(mod_sim)

 

Out:

Potential scale reduction factors:

     Point est. Upper C.I.
a[1]       1.04       1.12
a[2]       1.05       1.13
a[3]       1.05       1.13
a[4]       1.05       1.13
a0         1.01       1.03
b[1]       1.05       1.15
b[2]       1.00       1.01
sig        1.00       1.00
tau        1.01       1.02

Multivariate psrf

1.03

 

โ–ท ๋ชจ๋“  ๊ฒฐ๊ณผ๊ฐ€ 1 ๊ทผ์ฒ˜๋กœ ์ˆ˜๋ ดํ•˜๊ณ  ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

 

๋ชจ๋ธ์˜ DIC๋ฅผ ๊ตฌํ•˜์—ฌ ๋ณด์ž.

 

In:

dic.samples(mod, n.iter = 1e3)

 

Out:

  |**************************************************| 100%
Mean deviance:  213.7 
penalty 7.035 
Penalized deviance: 220.7 

 

โ–ท ๋ชจ๋ธ์˜ DIC๋Š” 221.7๋กœ ์œ„์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ฒ ์ด์ง€์•ˆ ์„ ํ˜•ํšŒ๊ท€ ๋ชจ๋ธ์˜ ๊ฒฐ๊ณผ(230.1)๋ณด๋‹ค ๋” ๋‚ฎ์€ ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค. ๋”ฐ๋ผ์„œ DIC๋ฅผ ๊ณ ๋ คํ•˜์˜€ ์„๋•Œ, ์ž„์˜์ ˆํŽธ ๋ชจ๋ธ์ด ๋” ๋‚˜์€ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

 

โ–ท ์œ„์˜ ๋ชจ๋ธ์˜ ์‹ค์ œ ๋ชจ์ˆ˜๋Š” 9๊ฐœ์ง€๋งŒ, Effective number of parameters๋Š” 7.452๊ฐœ์ธ ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค. ์ด๋Š” ๋ชจ์ˆ˜๋“ค๊ฐ„ ์ •๋ณด๋ฅผ ๊ณต์œ ํ•˜๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์‹ค์ œ ๋ชจ์ˆ˜์˜ ๊ฐฏ์ˆ˜๋ณด๋‹ค ๋” ์ ๊ฒŒ ๋‚˜ํƒ€๋‚œ ๊ฒƒ์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

์ถ”์ •๋œ ๋ชจ์ˆ˜์˜ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•˜์—ฌ ๋ณด์ž.

 

In:

summary(mod_sim)

 

Out:

Iterations = 1001:6000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

        Mean      SD  Naive SE Time-series SE
a[1]  6.5393 0.55904 0.0045646      0.0430400
a[2]  5.9943 0.70695 0.0057722      0.0559740
a[3]  5.8331 0.62960 0.0051407      0.0490414
a[4]  5.5144 0.86077 0.0070282      0.0682821
a0    5.9842 1.30529 0.0106576      0.0556278
b[1] -0.3383 0.10649 0.0008694      0.0086461
b[2]  0.6352 0.35149 0.0028699      0.0046075
sig   0.9197 0.06615 0.0005401      0.0006532
tau   2.0316 1.00292 0.0081888      0.0107208

2. Quantiles for each variable:

        2.5%     25%     50%     75%   97.5%
a[1]  5.4841  6.1575  6.5237  6.9064  7.6567
a[2]  4.6616  5.5152  5.9707  6.4650  7.4010
a[3]  4.6407  5.3993  5.8127  6.2531  7.1027
a[4]  3.8707  4.9235  5.4843  6.0860  7.2300
a0    3.4451  5.2072  5.9754  6.7482  8.5718
b[1] -0.5529 -0.4096 -0.3348 -0.2664 -0.1363
b[2] -0.0626  0.4034  0.6363  0.8676  1.3372
sig   0.8013  0.8733  0.9152  0.9623  1.0603
tau   0.9855  1.4085  1.7742  2.3407  4.6289

 

โ–ท ์œ„์˜ ๋ชจ๋ธ์˜ ๊ฒฐ๊ณผ๋กœ๋ถ€ํ„ฐ ์ˆ˜์ค€๋ณ„ ์ ˆํŽธ์— ๋”ฐ๋ฅธ ์‹ค์ œ์˜ ํ•ด์„์€ ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค. ์™œ๋ƒํ•˜๋ฉด ์ ˆํŽธ์€ ๋ชจ๋“  ์„ค๋ช…๋ณ€์ˆ˜๊ฐ€ 0์ธ ์ƒํƒœ์—์„œ์˜ ๊ฒฐ๊ณผ๋ฅผ ์˜๋ฏธํ•˜๋Š”๋ฐ, ์œ„์˜ ๋ฐ์ดํ„ฐ์˜ ๋ฒ”์œ„์—์„œ๋Š” ๊ทธ๋Ÿฐ ๊ฒฝ์šฐ๋Š” ์กด์žฌํ•  ์ˆ˜ ์—†๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

 

โ–ท a0์™€ tau๋Š” ์ ˆํŽธ์— ๋Œ€ํ•œ ํ‰๊ท ๊ณผ ํ‘œ์ค€ํŽธ์ฐจ๋กœ ํ•ด์„ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

โ–ท ์ด์™ธ์˜ ๊ฒฐ๊ณผ์— ๋Œ€ํ•œ ํ•ด์„์€ ๋ฒ ์ด์ง€์•ˆ ์„ ํ˜•ํšŒ๊ท€ ๋ถ„์„๊ณผ ๊ฐ™๋‹ค.

 


Reference:

"Bayesian Statistics: Techniques and Models," Coursera, https://www.coursera.org/learn/bayesian-statistics/.