๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

Statistics/Bayesian Statistics

ํ˜ผํ•ฉ ๋ชจ๋ธ(Mixture model)

๋‹ค์Œ์˜ ๋ฐ์ดํ„ฐ์˜ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ํ™•์ธํ•˜๊ณ , ์ด๋ฅผ ์ ํ•ฉํ•  ๋ถ„ํฌ์— ๋Œ€ํ•˜์—ฌ ์ƒ๊ฐํ•ด ๋ณด์ž.

 

In:

data = read.csv('../input/mixture.csv', header = F)

y = data$V1
n = length(y)

hist(y, breaks = 20)

 

Out:

 

โ–ท ์ผ๋ฐ˜์ ์ธ ๋ถ„ํฌ์™€๋Š” ๋‹ค๋ฅด๊ฒŒ -2์™€ 1 ๊ทผ์ฒ˜์— ๋‘ ๊ฐœ์˜ ๋ด‰์šฐ๋ฆฌ๋ฅผ ํ˜•์„ฑํ•˜๊ณ  ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ์šฐ๋ฆฌ๊ฐ€ ์•Œ๊ณ  ์žˆ๋Š” ์ •๊ทœ๋ถ„ํฌ, ์ง€์ˆ˜๋ถ„ํฌ, ๊ฐ๋งˆ๋ถ„ํฌ๋Š” ๋‹จ๋ด‰ํ˜•ํƒœ๋ฅผ ๋„๊ณ  ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ด๋ฅผ ์œ„์˜ ๋ฐ์ดํ„ฐ์— ์ ํ•ฉํ•  ๊ฒฝ์šฐ, ๋‘ ๊ฐœ์˜ ๋ด‰์šฐ๋ฆฌ์— ๋Œ€ํ•ด ์ ํ•ฉํ•œ ๋ถ„ํฌ๋ฅผ ์–ป์„ ์ˆ˜ ์—†๋‹ค.

 

โ–ถ ํ˜ผํ•ฉ ๋ชจ๋ธ(Mixture model)์€ ๋‘ ๊ฐœ์ด์ƒ์˜ ๋ถ„ํฌ๋ฅผ ํ•ฉ์ณ์„œ ๋งŒ๋“  ๋ชจ๋ธ๋กœ ๊ธฐ์กด ๋ถ„ํฌ์— ๋น„ํ•˜์—ฌ ์ž์œ ๋„ ๋†’์€ ์ ํ•ฉ์ด ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ์žฅ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. ๋˜ํ•œ ๋ณต์žกํ•œ ๋ถ„ํฌ์˜ ์ ํ•ฉ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ๊ตฐ์ง‘ํ™”์— ์‚ฌ์šฉ๋˜๊ธฐ๋„ ํ•œ๋‹ค.

 

์œ„์˜ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•˜์—ฌ ํ˜ผํ•ฉ ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ ๋ถ„ํฌ๋ฅผ ์ ํ•ฉ์‹œ์ผœ๋ณด์ž.

 

In:

library('rjags')

mod_string = " model {
  for (i in 1:length(y)) {
    y[i] ~ dnorm(mu[z[i]], prec[z[i]])
    z[i] ~ dcat(omega)
  }
  
  mu[1] ~ dnorm(-1.0, 1.0/100.0)
  mu[2] ~ dnorm(1.0, 1.0/100.0) T(mu[1],)
  
  prec[1] ~ dgamma(1.0/2.0, 1.0*1.0/2.0)
  prec[2] ~ dgamma(1.0/2.0, 1.0*1.0/2.0)
  
  sig[1] = sqrt(1.0/prec[1])
  sig[2] = sqrt(1.0/prec[2])
  
  omega ~ ddirich(c(1.0, 1.0))
} "

data_jags = list(y = y)

params = c('mu', 'sig', 'omega')

mod = jags.model(textConnection(mod_string), 
                 data = data_jags, 
                 n.chains = 3)

update(mod, 1e3)

mod_sim = coda.samples(model = mod,
                       variable.names = params,
                       n.iter = 5e3)

mod_comb_sim = as.mcmc(do.call(rbind, mod_sim))

colMeans(mod_comb_sim)

 

Out:

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 200
   Unobserved stochastic nodes: 205
   Total graph size: 817

Initializing model

  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
  |**************************************************| 100%
  |**************************************************| 100%
     mu[1]      mu[2]   omega[1]   omega[2]     sig[1]     sig[2] 
-2.0624518  1.5038066  0.3987204  0.6012796  1.1804035  1.1344625 

 

โ–ท ํ˜ผํ•ฉ ๋ชจ๋ธ์˜ ๊ตฌ์„ฑ์š”์†Œ๋กœ์จ ๋‘ ๊ฐœ์˜ ์ •๊ทœ๋ถ„ํฌ๋ฅผ ์ด์šฉํ•˜์˜€๋‹ค. ์œ„์˜ ๊ฒฐ๊ณผ๋กœ๋ถ€ํ„ฐ ์–ป์€ ํ˜ผํ•ฉ ๋ชจ๋ธ์˜ ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

์œ„์˜ ์‹์„ ์ด์šฉํ•˜์—ฌ ํ˜ผํ•ฉ ๋ชจ๋ธ์˜ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•˜์—ฌ ๋ณด์ž.

 

In:

dmixnorm = function(x) {
  return(0.40 * dnorm(x, -2.06, 1.18) + 0.60 * dnorm(x, 1.50, 1.13))
}

x = seq(-6, 6, by = 0.01)

plot(x, dmixnorm(x), type = 'l', col = 'blue')

hist(y, breaks = 20, freq = F, add = TRUE)

 

Out:

 

โ–ท ํ˜ผํ•ฉ ๋ชจ๋ธ์ด ๋‘ ๊ฐœ์˜ ๋ด‰์šฐ๋ฆฌ๋ฅผ ๊ฐ€์ง„ ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ์— ์ž˜ ์ ํ•ฉํ•œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

 


Reference:

"Bayesian Statistics: From Concept to Data AnalysisTechniques and Models," Coursera, https://www.coursera.org/learn/bayesian-statistics/.