๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

Statistics/Probabilistic Graphical Model

๋ฒ ์ด์ง€์•ˆ ๋„คํŠธ์›Œํฌ(Bayesian network)๋ฅผ ํ™œ์šฉํ•œ King County์˜ ์ง‘๊ฐ’ ๋ถ„์„

๋ฒ ์ด์ง€์•ˆ ๋„คํŠธ์›Œํฌ(Bayesian network)๋ฅผ ํ™œ์šฉํ•˜์—ฌ King County์˜ ์ง‘๊ฐ’์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๋‹ค์–‘ํ•œ ์š”์†Œ์˜ ์ธ๊ณผ๊ด€๊ณ„๋ฅผ ํ™•์ธํ•˜๊ณ , ๋ถ„์„ํ•˜๋Š” ๊ฒƒ์ด ์ด ํ”„๋กœ์ ํŠธ์˜ ๋ชฉ์ ์ด๋‹ค. ๋ฐ์ดํ„ฐ์˜ ์ถœ์ฒ˜๋Š” ์—ฌ๊ธฐ(www.kaggle.com/harlfoxem/housesalesprediction)์ด๊ณ , ์•„๋ž˜์˜ ๊ตฌ์„ฑ ์ˆœ์„œ๋Œ€๋กœ ๋ถ„์„ ๋ฐ ๋ชจ๋ธ๋ง ๊ณผ์ •์„ ์ˆ˜ํ–‰ํ•  ๊ฒƒ์ด๋‹ค. ๋ชจ๋“  ์ฝ”๋“œ๋Š” R๋กœ ์ž‘์„ฑ๋˜์—ˆ๋‹ค.

 

1. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

2. ์‹œ๊ฐํ™” ๋ฐ ์ƒ๊ด€๊ด€๊ณ„ ๋ถ„์„

3. ๋‹ค์ค‘ํšŒ๊ท€๋ถ„์„

4. ๋ฒ ์ด์ง€์•ˆ ๋„คํŠธ์›Œํฌ ๋ชจ๋ธ๋ง

 

1. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

 

In:

# Statistic
library(car)

# Data manipulation
library(dplyr)
library(tidyr)

# Visualization
library(ggplot2)
library(corrplot)
library(GGally)
library(scales)

# Bayesian network
library(bnlearn)
library(Rgraphviz)

# Setting for ggplot theme
theme_set(theme_minimal() + 
            theme(plot.title = element_text(face = 'bold', colour = 'grey10'), 
                  plot.subtitle = element_text(colour = 'grey25'), 
                  panel.grid.major = element_line(colour = 'grey90', size = 1), 
                  panel.grid.minor = element_line(colour = 'grey80', size = 0.5, linetype = 'dashed'), 
                  legend.position = 'top', 
                  legend.spacing.x = unit(0.125, 'cm'), 
                  legend.background = element_rect(fill = NULL, linetype = 'dotted'), 
                  strip.background = element_blank(), 
                  strip.text = element_text(face = 'bold', colour = 'grey25', size = 11.25)))
                  
data_kc = read.csv('../input/kc_house_data.csv')

str(data_kc)

summary(data_kc)

 

Out:

'data.frame':	21613 obs. of  21 variables:
 $ id           : num  7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
 $ date         : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
 $ price        : num  221900 538000 180000 604000 510000 ...
 $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
 $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
 $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
 $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
 $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
 $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
 $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
 $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
 $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
 $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
 $ yr_renovated : int  0 1991 0 0 0 0 0 0 0 0 ...
 $ zipcode      : int  98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
 $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
 $ long         : num  -122 -122 -122 -122 -122 ...
 $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
 $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
 
        id                         date           price            bedrooms        bathrooms      sqft_living       sqft_lot      
 Min.   :1.000e+06   20140623T000000:  142   Min.   :  75000   Min.   : 0.000   Min.   :0.000   Min.   :  290   Min.   :    520  
 1st Qu.:2.123e+09   20140625T000000:  131   1st Qu.: 321950   1st Qu.: 3.000   1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040  
 Median :3.905e+09   20140626T000000:  131   Median : 450000   Median : 3.000   Median :2.250   Median : 1910   Median :   7618  
 Mean   :4.580e+09   20140708T000000:  127   Mean   : 540088   Mean   : 3.371   Mean   :2.115   Mean   : 2080   Mean   :  15107  
 3rd Qu.:7.309e+09   20150427T000000:  126   3rd Qu.: 645000   3rd Qu.: 4.000   3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688  
 Max.   :9.900e+09   20150325T000000:  123   Max.   :7700000   Max.   :33.000   Max.   :8.000   Max.   :13540   Max.   :1651359  
                     (Other)        :20833                                                                                       
     floors        waterfront            view          condition         grade          sqft_above   sqft_basement       yr_built   
 Min.   :1.000   Min.   :0.000000   Min.   :0.0000   Min.   :1.000   Min.   : 1.000   Min.   : 290   Min.   :   0.0   Min.   :1900  
 1st Qu.:1.000   1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.: 7.000   1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951  
 Median :1.500   Median :0.000000   Median :0.0000   Median :3.000   Median : 7.000   Median :1560   Median :   0.0   Median :1975  
 Mean   :1.494   Mean   :0.007542   Mean   :0.2343   Mean   :3.409   Mean   : 7.657   Mean   :1788   Mean   : 291.5   Mean   :1971  
 3rd Qu.:2.000   3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.: 8.000   3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997  
 Max.   :3.500   Max.   :1.000000   Max.   :4.0000   Max.   :5.000   Max.   :13.000   Max.   :9410   Max.   :4820.0   Max.   :2015  
                                                                                                                                    
  yr_renovated       zipcode           lat             long        sqft_living15    sqft_lot15    
 Min.   :   0.0   Min.   :98001   Min.   :47.16   Min.   :-122.5   Min.   : 399   Min.   :   651  
 1st Qu.:   0.0   1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490   1st Qu.:  5100  
 Median :   0.0   Median :98065   Median :47.57   Median :-122.2   Median :1840   Median :  7620  
 Mean   :  84.4   Mean   :98078   Mean   :47.56   Mean   :-122.2   Mean   :1987   Mean   : 12768  
 3rd Qu.:   0.0   3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360   3rd Qu.: 10083  
 Max.   :2015.0   Max.   :98199   Max.   :47.78   Max.   :-121.3   Max.   :6210   Max.   :871200  

 

โ–ท KC County์˜ ์ง‘๊ฐ’ ๋ฐ์ดํ„ฐ๋Š” 21,613๊ฐœ์˜ ํ–‰๊ณผ 21๊ฐœ์˜ ์—ด๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๊ณ , ๊ฐ ์—ด์˜ ์˜๋ฏธ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

[id]: Unique ID for each home sold
[date]: Date of the home sale
[price]: Price of each home sold
[bedrooms]: Number of bedrooms
[bathrooms]: Number of bathrooms, where .5 accounts for a room with a toilet but no shower
[sqft_living]: Square footage of the apartments interior living space
[sqft_lot]: Square footage of the land space
[floors]: Number of floors
[waterfront]: A dummy variable for whether the apartment was overlooking the waterfront or not
[view]: An index from 0 to 4 of how good the view of the property was
[condition]: An index from 1 to 5 on the condition of the apartment,
[grade]: An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high-quality level of construction and design
[sqft_above]: The square footage of the interior housing space that is above ground level
[sqft_basement]: The square footage of the interior housing space that is below ground level
[yr_built]: The year the house was initially built
[yr_renovated]: The year of the house’s last renovation
[zipcode]: What zip code area the house is in
[lat]: Lattitude
[long]: Longitude
[sqft_living15]: The square footage of interior housing living space for the nearest 15 neighbors
[sqft_lot15]: The square footage of the land lots of the nearest 15 neighbors

 

๋ถ„์„์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์Œ์˜ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •์„ ๊ฑฐ์ณค๋‹ค.

 

In:

df_kc = data_kc %>% 
  select(-id, -date, -zipcode, -lat, -long)

df_kc$age = 2015 - df_kc$yr_built + 1

df_kc = df_kc %>% 
  select(-yr_built)

df_kc = df_kc %>% 
  mutate(yr_renovated = if_else(yr_renovated == 0, 'No Renovation', 
                                if_else(yr_renovated < 1980, 'Under 1980', 
                                        if_else(yr_renovated < 2000, '1980 - 2000', '2000 - 2015'))))

df_kc$yr_renovated = as.factor(df_kc$yr_renovated)

df_kc$sqft_above_ratio = with(df_kc, sqft_above/sqft_living))

df_kc = df_kc %>% 
  select(-sqft_above, -sqft_basement)

df_kc$waterfront = as.factor(df_kc$waterfront)

df_kc[, sapply(df_kc, function(x) is.integer(x))] = lapply(df_kc[, sapply(df_kc, function(x) is.integer(x))], as.double)

 

โ–ท id, date, zipcode, lat, long์€ ๋ถ„์„ ๋Œ€์ƒ์—์„œ ์ œ์™ธํ•˜์˜€๋‹ค.

 

โ–ท ์ง‘์˜ ๋‚˜์ด๋ฅผ ์˜๋ฏธํ•˜๋Š” age๋Š” ๋ฐ์ดํ„ฐ ์ธก์ • ์‹œ์ ์ธ 2015๋…„์— yr_built๋ฅผ ๋นผ์„œ ์ƒ์„ฑํ•˜๊ณ , yr_built๋ฅผ ์ œ์™ธํ•˜์˜€๋‹ค.

 

โ–ท yr_renovated์˜ ๊ฒฝ์šฐ, ๋ณด์ˆ˜๊ฐ€ ์•ˆ ๋œ ์ง‘์€ 0, ๋ณด์ˆ˜๊ฐ€ ๋œ ์ง‘์€ ๊ณต์‚ฌ๋œ ํ•ด๋กœ ๋‚˜ํƒ€๋‚˜ ์žˆ๋‹ค. ์ด๋ฅผ ๋ชจ๋ธ์— ํ•˜๋‚˜์˜ ๋ณ€์ˆ˜๋กœ ๋„ฃ๊ฒŒ๋˜๋ฉด ํ•ด์„์˜ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜๊ฒŒ ๋œ๋‹ค. ๋”ฐ๋ผ์„œ ๋„ค ๊ฐœ์˜ ์นดํ…Œ๊ณ ๋ฆฌ(No Renovation, Under 1980 - 2000, 2000 - 2015)๋กœ ๋ฐ”๊พธ์—ˆ๋‹ค.

 

โ–ท sqft_above์™€ sqft_basement์˜ ํ•ฉ์€ sqft_living์ด๋‹ค. ์ฆ‰, ์ด ์„ธ ์—ด์˜ ์ •๋ณด๋Š” ์ค‘๋ณต๋˜์–ด ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ sqft_living๋งŒ ๋‚จ๊ฒจ๋‘๊ณ , sqft_above๋ฅผ sqft_living์œผ๋กœ ๋‚˜๋ˆ„์–ด ๊ฑฐ์ฃผ ๊ณต๊ฐ„ ์ค‘ ์ง€์ƒ์˜ ์ฐจ์ง€ ๋น„์œจ์„ ์˜๋ฏธํ•˜๋Š” sqft_above_ratio๋ผ๋Š” ๋ณ€์ˆ˜๋ฅผ ์ƒ์„ฑํ•˜์˜€๋‹ค.

 

โ–ท ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜๋Š” factor๋กœ ๋ณ€ํ™˜ํ•˜๊ณ , ์ด์™ธ์˜ ๋ณ€์ˆ˜๋Š” ๋ชจ๋‘ ์‹ค์ˆ˜๋กœ ๋ณ€ํ™˜ํ•˜์˜€๋‹ค.

 

2. ์‹œ๊ฐํ™” ๋ฐ ์ƒ๊ด€๊ด€๊ณ„ ๋ถ„์„

 

In:

df_temp = df_kc %>% 
  select(-yr_renovated, -waterfront)

df_temp %>% 
  gather(key = key, value = value, bedrooms:view) %>% 
  ggplot(aes(value, price)) + 
  geom_bin2d() + 
  scale_x_continuous(label = comma) + 
  scale_fill_gradient(trans = 'log') + 
  guides(fill = F) + 
  facet_wrap(~key, scales = 'free', ncol = 2) + 
  labs(x = NULL)
  
df_temp %>% 
  gather(key = key, value = value, condition:sqft_above_ratio) %>% 
  ggplot(aes(value, price)) + 
  geom_bin2d() + 
  scale_x_continuous(label = comma) + 
  scale_fill_gradient(trans = 'log') + 
  guides(fill = F) + 
  facet_wrap(~key, scales = 'free', ncol = 2) + 
  labs(x = NULL)
  
df_temp = df_kc %>% 
  mutate(waterfront = if_else(waterfront == '1', 'TRUE', 'FALSE')) %>% 
  select(price, waterfront, yr_renovated)

df_temp %>% 
  gather(key = key, value = value, waterfront, yr_renovated) %>% 
  mutate(value = factor(value, levels = c('Under 1980', '1980 - 2000', '2000 - 2015', 'No Renovation', 'TRUE', 'FALSE'))) %>% 
  ggplot(aes(value, price, colour = value)) + 
  geom_boxplot(size = 1) + 
  scale_y_log10() + 
  facet_wrap(~ key, scale = 'free', ncol = 2) + 
  guides(colour = F)

 

Out:

 

โ–ท sqft_lot, sqft_lot15, sqft_living15๋Š” price์™€ ๋น„์„ ํ˜• ๊ด€๊ณ„๋ฅผ ๋ณด์ด๊ณ  ์žˆ๋‹ค.

 

โ–ท ์œ„์˜ ์‹œ๊ฐํ™”๋กœ๋ถ€ํ„ฐ ๋‹ค์Œ์˜ ๊ฒฝํ–ฅ์„ฑ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

 

- (์š•์‹ค์˜ ์ˆ˜ / ์นจ์‹ค์˜ ์ˆ˜ / ๋‚ด๋ถ€ ๊ณต๊ฐ„ / ํ‰๊ฐ€ ๋“ฑ๊ธ‰ / ์ฃผ๋ณ€ ์ด์›ƒ์˜ ๋‚ด๋ถ€ ๊ณต๊ฐ„)์ด ํด์ˆ˜๋ก ๊ฐ€๊ฒฉ์ด ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒฝํ–ฅ์„ ๋ณด์ธ๋‹ค.

- (๋•…์˜ ํฌ๊ธฐ / ์ฃผ๋ณ€ ์ด์›ƒ์˜ ๋•…์˜ ํฌ๊ธฐ)๊ฐ€ ํด์ˆ˜๋ก ๊ฐ€๊ฒฉ์ด ๊ฐ์†Œํ•˜๋Š” ๊ฒฝํ–ฅ์„ ๋ณด์ธ๋‹ค.

- (๋ฐ”๋‹ค ๋˜๋Š” ๊ฐ•์ด ๋ณด์ด๋Š” ๋ทฐ๋ฅผ ๊ฐ€์ง€๋Š” ๊ฒฝ์šฐ / ์ตœ๊ทผ์— ๋ณด์ˆ˜๊ณต์‚ฌ๋ฅผ ์ง„ํ–‰ํ•œ ๊ฒฝ์šฐ์—)์— ๊ฐ€๊ฒฉ์ด ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒฝํ–ฅ์„ ๋ณด์ธ๋‹ค.

 

In:

cor.mtest <- function(mat, conf.level = 0.95) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  diag(lowCI.mat) <- diag(uppCI.mat) <- 1
  
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], conf.level = conf.level)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
      lowCI.mat[i, j] <- lowCI.mat[j, i] <- tmp$conf.int[1]
      uppCI.mat[i, j] <- uppCI.mat[j, i] <- tmp$conf.int[2]
    }
  }
  
  return(list(p.mat, lowCI.mat, uppCI.mat))
}

res <- cor.mtest(df_kc %>% 
                   select(-waterfront, -yr_renovated), 0.95)

corrplot(df_kc %>% 
           select(-waterfront, -yr_renovated) %>% 
           cor(method = 'spearman'), p.mat = res[[1]], sig.level = 0.05, method = 'number', type = 'lower')

 

Out:

 

โ–ท ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜์ธ waterfont์™€ yr_renovated๋ฅผ ์ œ์™ธํ•˜๊ณ , ์—ฐ์†ํ˜• ๋ณ€์ˆ˜์— ๋Œ€ํ•˜์—ฌ ์ƒ๊ด€๊ด€๊ณ„ ๋ถ„์„์„ ์ˆ˜ํ–‰ํ•˜์˜€๋‹ค. ์œ„์˜ ๊ฒฐ๊ณผ๋Š” ๋น„์„ ํ˜• ๊ด€๊ณ„๋ฅผ ๊ณ ๋ คํ•  ์ˆ˜ ์žˆ๋Š” ์Šคํ”ผ์–ด๋งŒ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๋‚˜ํƒ€๋‚ธ ๊ฒƒ์ด๋‹ค.

 

โ–ท X ํ‘œ์‹œ๋Š” ์ƒ๊ด€๊ณ„์ˆ˜ ๊ฒ€์ •๊ฒฐ๊ณผ๊ฐ€ ์œ ์˜์ˆ˜์ค€ 0.05์—์„œ ์œ ์˜ํ•˜์ง€ ์•Š์€ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

 

โ–ท age์™€ sqft_above_ratio๋ฅผ ์ œ์™ธํ•œ ์—ฐ์†ํ˜• ๋ณ€์ˆ˜์™€ price๊ฐ€ ์–‘์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋„๋Š” ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค.

 

3. ๋‹ค์ค‘ํšŒ๊ท€๋ถ„์„

 

In:

model_lm = lm(log(price) ~ ., df_kc)

summary(model_lm)

 

Out:

Call:
lm(formula = log(price) ~ ., data = df_kc)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.38962 -0.20882  0.01414  0.20767  1.28996 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                1.058e+01  3.298e-02 320.814  < 2e-16 ***
bedrooms                  -2.382e-02  2.895e-03  -8.230  < 2e-16 ***
bathrooms                  7.230e-02  5.005e-03  14.445  < 2e-16 ***
sqft_living                1.078e-04  5.245e-06  20.551  < 2e-16 ***
sqft_lot                   2.391e-07  7.321e-08   3.266 0.001091 ** 
floors                     1.301e-01  5.390e-03  24.141  < 2e-16 ***
waterfront1                3.595e-01  2.664e-02  13.497  < 2e-16 ***
view                       3.975e-02  3.215e-03  12.364  < 2e-16 ***
condition                  4.368e-02  3.570e-03  12.236  < 2e-16 ***
grade                      2.065e-01  3.208e-03  64.386  < 2e-16 ***
yr_renovated2000 - 2015    7.886e-02  2.233e-02   3.532 0.000414 ***
yr_renovatedNo Renovation -7.368e-03  1.753e-02  -0.420 0.674227    
yr_renovatedUnder 1980    -1.155e-01  3.270e-02  -3.532 0.000413 ***
sqft_living15              1.141e-04  5.110e-06  22.338  < 2e-16 ***
sqft_lot15                -5.139e-07  1.119e-07  -4.592 4.41e-06 ***
age                        5.423e-03  1.013e-04  53.525  < 2e-16 ***
sqft_above_ratio          -2.597e-01  1.502e-02 -17.293  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3086 on 21596 degrees of freedom
Multiple R-squared:  0.657,	Adjusted R-squared:  0.6567 
F-statistic:  2585 on 16 and 21596 DF,  p-value: < 2.2e-16

 

โ–ท ์‹œ๊ฐํ™”๋ฅผ ํ†ตํ•ด price์™€ ๋‹ค๋ฅธ ์—ฐ์†ํ˜• ๋ณ€์ˆ˜์˜ ๋น„์„ ํ˜• ๊ด€๊ณ„๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋”ฐ๋ผ์„œ log๋ฅผ ์ทจํ•œ price๋ฅผ ๋ชจ๋ธ์˜ ๋ฐ˜์‘๋ณ€์ˆ˜๋กœ ์ •ํ•˜์˜€๋‹ค.

 

โ–ท ์ด ๋ชจ๋ธ์˜ R-squared๋Š” 0.6567๋กœ ์ „์ฒด ๋ณ€๋™์˜ ์•ฝ 66%๋ฅผ ์„ค๋ช…ํ•œ๋‹ค. F-Statistic๋ฅผ ํ†ตํ•ด ์œ ์˜์ˆ˜์ค€ 0.05์—์„œ ์ด ๋ชจ๋ธ์ด ์œ ์˜ํ•˜๋Š” ๊ฒฐ๋ก ์„ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค.

 

โ–ท yr_renovated ๋ณ€์ˆ˜์˜ No Renovation๋Š” ์œ ์˜์ˆ˜์ค€ 0.05์—์„œ ์œ ์˜ํ•˜์ง€ ์•Š์€ ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค. ์ด์™ธ์˜ ๋ชจ๋“  ๋ณ€์ˆ˜๋Š” ์œ ์˜์ˆ˜์ค€ 0.05์—์„œ ์œ ์˜ํ•œ ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค.

 

In:

plot(model_lm, 2)

 

Out:

 

โ–ท ์™ผ์ชฝ ์•„๋ž˜ ๋ ๋ถ€๋ถ„์„ ์ œ์™ธํ•œ ๋Œ€๋ถ€๋ถ„์˜ ๋ฐ์ดํ„ฐ๋Š” ์ง์„  ์ƒ์— ์œ„์น˜ํ•˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ •๊ทœ์„ฑ ๊ฐ€์ •์„ ๋งŒ์กฑํ•œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

In:

vif(model_lm)

 

Out:

                     GVIF Df GVIF^(1/(2*Df))
bedrooms         1.645304  1        1.282694
bathrooms        3.372189  1        1.836352
sqft_living      5.266456  1        2.294876
sqft_lot         2.087090  1        1.444676
floors           1.922529  1        1.386553
waterfront       1.205512  1        1.097958
view             1.377756  1        1.173778
condition        1.224774  1        1.106695
grade            3.226952  1        1.796372
yr_renovated     1.164329  3        1.025682
sqft_living15    2.783465  1        1.668372
sqft_lot15       2.118644  1        1.455556
age              2.010378  1        1.417878
sqft_above_ratio 1.496085  1        1.223146

 

โ–ท ๋ชจ๋“  ๋ณ€์ˆ˜์˜ VIF๊ฐ€ 10 ์ดํ•˜์ด๊ธฐ ๋•Œ๋ฌธ์—, ๋‹ค์ค‘๊ณต์„ ์„ฑ ๋ฌธ์ œ๊ฐ€ ๋‚˜ํƒ€๋‚˜์ง€ ์•Š๋Š”๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

4. ๋ฒ ์ด์ง€์•ˆ ๋„คํŠธ์›Œํฌ ๋ชจ๋ธ๋ง

 

In:

df_input = df_kc %>% 
  mutate(log_price = log(price)) %>% 
  select(-price)

model_bn_prototype = hc(df_input, score = 'bic-cg')

model_bn_prototype

plot_bn = layoutGraph(as.graphNEL(model_bn_prototype))
nodeRenderInfo(plot_bn) <- list(fontsize = 150)
renderGraph(plot_bn)

 

Out:

  Bayesian network learned via Score-based methods

  model:
   [waterfront][yr_renovated|waterfront][sqft_lot|yr_renovated][sqft_lot15|sqft_lot:waterfront:yr_renovated]
   [bathrooms|sqft_lot:waterfront:sqft_lot15][sqft_living|bathrooms:sqft_lot:waterfront:sqft_lot15]
   [sqft_above_ratio|bathrooms:sqft_living:sqft_lot:sqft_lot15][floors|bathrooms:sqft_living:sqft_lot:sqft_lot15:sqft_above_ratio]
   [age|bathrooms:sqft_living:floors:yr_renovated:sqft_lot15:sqft_above_ratio][condition|bathrooms:floors:yr_renovated:age:sqft_above_ratio]
   [sqft_living15|sqft_living:sqft_lot:floors:yr_renovated:sqft_lot15:age:sqft_above_ratio]
   [grade|bathrooms:sqft_living:floors:sqft_living15:sqft_lot15:age:sqft_above_ratio]
   [log_price|bathrooms:sqft_living:floors:waterfront:condition:grade:sqft_living15:age:sqft_above_ratio]
   [view|waterfront:yr_renovated:sqft_living15:age:sqft_above_ratio:log_price]
   [bedrooms|bathrooms:sqft_living:floors:view:condition:grade:sqft_lot15:age:sqft_above_ratio:log_price]
  nodes:                                 15 
  arcs:                                  71 
    undirected arcs:                     0 
    directed arcs:                       71 
  average markov blanket size:           12.80 
  average neighbourhood size:            9.47 
  average branching factor:              4.73 

  learning algorithm:                    Hill-Climbing 
  score:                                 BIC (cond. Gauss.) 
  penalization coefficient:              4.990525 
  tests used in the learning procedure:  1260 
  optimized:                             TRUE 

 

โ–ท Score๋Š” BIC๋กœ, ์ด๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๊ตฌ์กฐ๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•ด Hill-Climbing ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

 

โ–ท ์œ„์˜ ๊ทธ๋ฆผ์€ ํ•™์Šต๋œ ๋„คํŠธ์›Œํฌ์˜ ๊ตฌ์กฐ์ด๊ณ , ์šฐ๋ฆฌ์˜ ์ง๊ด€๊ณผ ๋งž์ง€ ์•Š๋Š” ๊ด€๊ณ„๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. ์œ„์˜ ๋„คํŠธ์›Œํฌ์— ๋”ฐ๋ฅด๋ฉด log_price๊ฐ€ bedrooms์— ์˜ํ–ฅ์„ ์ฃผ๋‹ค. ์ฆ‰, ๊ฐ€๊ฒฉ์ด ์ง‘์˜ ์นจ๋Œ€์˜ ๊ฐœ์ˆ˜์— ์˜ํ–ฅ์„ ์ค€๋‹ค๊ณ  ๋ณด์•„์•ผ ํ•œ๋‹ค. ์ด ์ธ๊ณผ๊ด€๊ณ„๋Š” ์šฐ๋ฆฌ์˜ ์ง๊ด€๊ณผ ๋งž์ง€ ์•Š๋‹ค. ๋”ฐ๋ผ์„œ ๋ฐ์ดํ„ฐ์˜ ํŠน์„ฑ์— ๋งž๋Š” ์ธ๊ณผ๊ด€๊ณ„๋ฅผ ๋ฏธ๋ฆฌ ์ง€์ •ํ•˜์—ฌ ํ•ฉ๋ฆฌ์ ์ธ ๋„คํŠธ์›Œํฌ๋ฅผ ๋งŒ๋“ค์–ด์•ผ ํ•œ๋‹ค.

 

๋ฐ์ดํ„ฐ์˜ ํŠน์„ฑ์„ ๊ณ ๋ คํ•œ ๋ฒ ์ด์ง€์•ˆ ๋„คํŠธ์›Œํฌ์˜ ๊ตฌ์„ฑ ์กฐ๊ฑด์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

- bedrooms, bathrooms, sqft_living, waterfront, view, condition, grade, yr_renovated๋Š” log_price์— ์˜ํ–ฅ์„ ์ค€๋‹ค.- floors, bedrooms, bathrooms, sqft_living, sqft_lot, sqft_above_ratio, sqft_living15, sqft_lot15, waterfront, view, age๋Š” ๋…ธ๋“œ์˜ ์ตœ์ƒ๋‹จ์— ์œ„์น˜ํ•˜๊ณ , ์„œ๋กœ ์˜ํ–ฅ์„ ์ฃผ๊ณ  ๋ฐ›์„ ์ˆ˜ ์—†๋‹ค.

- log_price๋Š” ์–ด๋Š ๋ณ€์ˆ˜์—๋„ ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์—†๋‹ค.

- condition, grade๋Š” ์ตœ์ƒ๋‹จ์— ์œ„์น˜ํ•œ ๋…ธ๋“œ์— ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์—†๋‹ค.- yr_renovated๋Š” ์ตœ์ƒ๋‹จ ๋…ธ๋“œ์— ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์—†๋‹ค.

 

์œ„์˜ ์กฐ๊ฑด์„ ๊ณ ๋ คํ•˜์—ฌ ๋„คํŠธ์›Œํฌ์˜ ๊ตฌ์กฐ๋ฅผ ํ•™์Šตํ•˜์—ฌ ๋ณด์ž.

 

In:

top_node = c('floors', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 
             'sqft_above_ratio', 'sqft_living15', 'sqft_lot15', 'waterfront', 'view', 
             'age')

df_wl = data.frame(from = c('bedrooms', 'bathrooms', 'sqft_living', 'waterfront', 'view', 'condition', 'grade', 'yr_renovated'), to = 'log_price')

df_bl = NULL

for (i in top_node) {
  df_bl = df_bl %>% 
    bind_rows(data.frame(from = top_node[top_node != i], to = i))
}

df_bl = df_bl %>% 
  bind_rows(data.frame(from = 'log_price', to = names(df_kc)[names(df_kc) != 'price']))

df_bl = df_bl %>% 
  bind_rows(data.frame(from = 'condition', to = top_node))

df_bl = df_bl %>% 
  bind_rows(data.frame(from = 'grade', to = top_node))

df_bl = df_bl %>%
  bind_rows(data.frame(from = 'yr_renovated', to = top_node))

model_bn = hc(df_input, blacklist = df_bl, whitelist = df_wl, score = 'bic-cg')

plot_bn = layoutGraph(as.graphNEL(model_bn))
nodeRenderInfo(plot_bn) <- list(fontsize = 50)
renderGraph(plot_bn)

 

Out:

 

โ–ท ์•ž์˜ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ ํ•™์Šต ๋ฐฉ๋ฒ•๊ณผ ์ฐจ์ด์ ์€ blacklist, whitelist ์ธ์ž์— ๋„คํŠธ์›Œํฌ ๊ตฌ์„ฑ ์กฐ๊ฑด์„ ์ค€ ๊ฒƒ์ด๋‹ค.

 

โ–ท sqft_lot๋Š” ๋„คํŠธ์›Œํฌ์—์„œ ๊ณ ๋ฆฝ๋˜์–ด ์–ด๋Š ๋…ธ๋“œ์—๋„ ์˜ํ–ฅ์„ ์ฃผ๊ณ  ๋ฐ›์ง€ ์•Š๋Š” ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค.

 

In:

fit_bn = bn.fit(model_bn, df_input)

fit_bn

 

Out:

  Bayesian network parameters

  Parameters of node bedrooms (Gaussian distribution)

Conditional density: bedrooms
Coefficients:
(Intercept)  
   3.370842  
Standard deviation of the residuals: 0.9300618 

  Parameters of node bathrooms (Gaussian distribution)

Conditional density: bathrooms
Coefficients:
(Intercept)  
   2.114757  
Standard deviation of the residuals: 0.7701632 

  Parameters of node sqft_living (Gaussian distribution)

Conditional density: sqft_living
Coefficients:
(Intercept)  
     2079.9  
Standard deviation of the residuals: 918.4409 

  Parameters of node sqft_lot (Gaussian distribution)

Conditional density: sqft_lot
Coefficients:
(Intercept)  
   15106.97  
Standard deviation of the residuals: 41420.51 

  Parameters of node floors (Gaussian distribution)

Conditional density: floors
Coefficients:
(Intercept)  
   1.494309  
Standard deviation of the residuals: 0.5399889 

  Parameters of node waterfront (multinomial distribution)

Conditional probability table:
           0           1 
0.992458243 0.007541757 

  Parameters of node view (Gaussian distribution)

Conditional density: view
Coefficients:
(Intercept)  
  0.2343034  
Standard deviation of the residuals: 0.7663176 

  Parameters of node condition (conditional Gaussian distribution)

Conditional density: condition | bathrooms + floors + yr_renovated + age + sqft_above_ratio
Coefficients:
                              0              1              2              3
(Intercept)        3.0563791874   3.1737338235   3.1973525083   3.8059657879
bathrooms          0.0312814109  -0.0095737742   0.1204390398   0.2832070755
floors             0.0269270784   0.0154644257  -0.1347846515  -0.0639222984
age                0.0024137791  -0.0007246165   0.0090552567  -0.0025437365
sqft_above_ratio  -0.0275911248  -0.0854984414  -0.2587305161  -0.4778581940
Standard deviation of the residuals:
        0          1          2          3  
0.6009327  0.2598014  0.5917577  0.6960968  
Discrete parents' configurations:
    yr_renovated
0    1980 - 2000
1    2000 - 2015
2  No Renovation
3     Under 1980

  Parameters of node grade (Gaussian distribution)

Conditional density: grade | bedrooms + bathrooms + sqft_living + floors + view + sqft_living15 + sqft_lot15 + age + sqft_above_ratio
Coefficients:
     (Intercept)          bedrooms         bathrooms       sqft_living            floors              view     sqft_living15        sqft_lot15  
    5.196123e+00     -1.262738e-01      1.360192e-01      6.117853e-04      2.420609e-01      7.028056e-02      4.541821e-04     -1.453307e-06  
             age  sqft_above_ratio  
   -4.804130e-03      3.204805e-01  
Standard deviation of the residuals: 0.6555349 

  Parameters of node yr_renovated (multinomial distribution)

Conditional probability table:
 
               waterfront
yr_renovated              0           1
  1980 - 2000   0.014358974 0.159509202
  2000 - 2015   0.020885781 0.061349693
  No Renovation 0.959347319 0.742331288
  Under 1980    0.005407925 0.036809816

  Parameters of node sqft_living15 (Gaussian distribution)

Conditional density: sqft_living15
Coefficients:
(Intercept)  
   1986.552  
Standard deviation of the residuals: 685.3913 

  Parameters of node sqft_lot15 (Gaussian distribution)

Conditional density: sqft_lot15
Coefficients:
(Intercept)  
   12768.46  
Standard deviation of the residuals: 27304.18 

  Parameters of node age (Gaussian distribution)

Conditional density: age
Coefficients:
(Intercept)  
   44.99486  
Standard deviation of the residuals: 29.37341 

  Parameters of node sqft_above_ratio (Gaussian distribution)

Conditional density: sqft_above_ratio
Coefficients:
(Intercept)  
  0.8754978  
Standard deviation of the residuals: 0.1709693 

  Parameters of node log_price (conditional Gaussian distribution)

Conditional density: log_price | bedrooms + bathrooms + sqft_living + waterfront + view + condition + grade + yr_renovated
Coefficients:
                         0              1              2              3              4              5              6              7
(Intercept)   1.066013e+01   1.139515e+01   1.121704e+01   1.244367e+01   1.083384e+01   1.153135e+01   1.097135e+01  -2.108439e+01
bedrooms      8.894919e-04   1.359918e-01  -2.512079e-02   1.356706e-01  -1.509238e-02   1.677411e-02  -4.520446e-02   1.120495e+00
bathrooms     5.170345e-02  -2.965593e-02   4.840945e-02   1.804738e-01  -9.093745e-04   8.195861e-02   5.097185e-02   2.878328e+01
sqft_living   1.569443e-04   3.193704e-04   1.891389e-04  -9.619998e-05   1.984457e-04   1.835918e-04   1.766649e-04  -1.798751e-03
view          4.760609e-02  -2.261726e-02   3.788562e-02  -5.117922e-01   7.375035e-02   8.165891e-02   1.121705e-01   2.645102e+01
condition     7.673433e-02   1.196717e-01  -1.013576e-01   6.484588e-02   1.019852e-01   4.940656e-02  -7.447056e-02  -4.044361e+01
grade         2.443866e-01   1.168730e-01   2.517358e-01   3.300950e-01   1.929564e-01   1.407172e-01   2.775896e-01   0.000000e+00
Standard deviation of the residuals:
        0          1          2          3          4          5          6          7  
0.3428534  0.3922664  0.3361898  0.5091354  0.3331445  0.3407674  0.3259330  0.0000000  
Discrete parents' configurations:
   waterfront   yr_renovated
0           0    1980 - 2000
1           1    1980 - 2000
2           0    2000 - 2015
3           1    2000 - 2015
4           0  No Renovation
5           1  No Renovation
6           0     Under 1980
7           1     Under 1980

 

โ–ท bn.fit ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ, ํ•™์Šต๋œ ๊ตฌ์กฐ๋กœ๋ถ€ํ„ฐ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋‹ค. ์œ„์˜ ๊ฒฐ๊ณผ๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ํ•™์Šต ๊ฒฐ๊ณผ์ด๋‹ค. ๊ฐ Conditional density์˜ ๊ณ„์ˆ˜๋กœ๋ถ€ํ„ฐ ๋ณ€์ˆ˜๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

 

In:

pred_output = predict(model_lm, df_kc)

rmse_lm = sum(abs(pred_output - log(df_kc$price))^2)/nrow(df_kc)

df_lm = data.frame(log_price = log(df_kc$price), pred_log_price = pred_output, model = 'Multiple Linear Regression')

pred_output = predict(fit_bn, node = 'log_price', data = df_input)

rmse_bn = sum(abs(pred_output - df_input$log_price)^2)/nrow(df_input)

df_bn = data.frame(log_price = df_input$log_price, pred_log_price = pred_output, model = 'Bayeisan Network')

df_total = df_lm %>% 
  bind_rows(df_bn)
  
print(rmse_lm)

print(rmse_bn)

df_total %>% 
  ggplot(aes(log_price, pred_log_price, colour = model)) + 
  geom_point(alpha = 0.05) + 
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed', size = 0.75) + 
  guides(colour = F) + 
  facet_wrap(~ model)

 

Out:

[1] 0.09514861

[1] 0.1108972

 

 

โ–ท ๋‹ค์ค‘ํšŒ๊ท€ ๋ชจ๋ธ๊ณผ ํ•™์Šต ๊ฒฐ๊ณผ์˜ RMSE๊ฐ€ ๋ฒ ์ด์ง€์•ˆ ๋„คํŠธ์›Œํฌ ๋ชจ๋ธ๋ณด๋‹ค ๋” ๋‚ฎ์€ ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ์ง€๋งŒ, ํฐ ์ฐจ์ด๋Š” ์—†๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

 

โ–ท ๋‘ ๋ชจ๋ธ์˜ ์ž”์ฐจ์˜ ๋ถ„ํฌ๊ฐ€ ๋น„์Šทํ•œ ํ˜•ํƒœ๋ฅผ ๋„๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.