๋ฒ ์ด์ง์ ๋คํธ์ํฌ(Bayesian network)๋ฅผ ํ์ฉํ์ฌ King County์ ์ง๊ฐ์ ์ํฅ์ ๋ฏธ์น๋ ๋ค์ํ ์์์ ์ธ๊ณผ๊ด๊ณ๋ฅผ ํ์ธํ๊ณ , ๋ถ์ํ๋ ๊ฒ์ด ์ด ํ๋ก์ ํธ์ ๋ชฉ์ ์ด๋ค. ๋ฐ์ดํฐ์ ์ถ์ฒ๋ ์ฌ๊ธฐ(www.kaggle.com/harlfoxem/housesalesprediction)์ด๊ณ , ์๋์ ๊ตฌ์ฑ ์์๋๋ก ๋ถ์ ๋ฐ ๋ชจ๋ธ๋ง ๊ณผ์ ์ ์ํํ ๊ฒ์ด๋ค. ๋ชจ๋ ์ฝ๋๋ R๋ก ์์ฑ๋์๋ค.
1. ๋ฐ์ดํฐ ์ ์ฒ๋ฆฌ
2. ์๊ฐํ ๋ฐ ์๊ด๊ด๊ณ ๋ถ์
3. ๋ค์คํ๊ท๋ถ์
4. ๋ฒ ์ด์ง์ ๋คํธ์ํฌ ๋ชจ๋ธ๋ง
1. ๋ฐ์ดํฐ ์ ์ฒ๋ฆฌ
In:
# Statistic
library(car)
# Data manipulation
library(dplyr)
library(tidyr)
# Visualization
library(ggplot2)
library(corrplot)
library(GGally)
library(scales)
# Bayesian network
library(bnlearn)
library(Rgraphviz)
# Setting for ggplot theme
theme_set(theme_minimal() +
theme(plot.title = element_text(face = 'bold', colour = 'grey10'),
plot.subtitle = element_text(colour = 'grey25'),
panel.grid.major = element_line(colour = 'grey90', size = 1),
panel.grid.minor = element_line(colour = 'grey80', size = 0.5, linetype = 'dashed'),
legend.position = 'top',
legend.spacing.x = unit(0.125, 'cm'),
legend.background = element_rect(fill = NULL, linetype = 'dotted'),
strip.background = element_blank(),
strip.text = element_text(face = 'bold', colour = 'grey25', size = 11.25)))
data_kc = read.csv('../input/kc_house_data.csv')
str(data_kc)
summary(data_kc)
Out:
'data.frame': 21613 obs. of 21 variables:
$ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
$ date : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
$ price : num 221900 538000 180000 604000 510000 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : num 1 2 1 1 1 1 2 1 1 2 ...
$ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
$ view : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 3 3 5 3 3 3 3 3 3 ...
$ grade : int 7 7 6 7 8 11 7 7 7 7 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
$ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
$ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
$ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
$ lat : num 47.5 47.7 47.7 47.5 47.6 ...
$ long : num -122 -122 -122 -122 -122 ...
$ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
$ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
id date price bedrooms bathrooms sqft_living sqft_lot
Min. :1.000e+06 20140623T000000: 142 Min. : 75000 Min. : 0.000 Min. :0.000 Min. : 290 Min. : 520
1st Qu.:2.123e+09 20140625T000000: 131 1st Qu.: 321950 1st Qu.: 3.000 1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040
Median :3.905e+09 20140626T000000: 131 Median : 450000 Median : 3.000 Median :2.250 Median : 1910 Median : 7618
Mean :4.580e+09 20140708T000000: 127 Mean : 540088 Mean : 3.371 Mean :2.115 Mean : 2080 Mean : 15107
3rd Qu.:7.309e+09 20150427T000000: 126 3rd Qu.: 645000 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688
Max. :9.900e+09 20150325T000000: 123 Max. :7700000 Max. :33.000 Max. :8.000 Max. :13540 Max. :1651359
(Other) :20833
floors waterfront view condition grade sqft_above sqft_basement yr_built
Min. :1.000 Min. :0.000000 Min. :0.0000 Min. :1.000 Min. : 1.000 Min. : 290 Min. : 0.0 Min. :1900
1st Qu.:1.000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 7.000 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951
Median :1.500 Median :0.000000 Median :0.0000 Median :3.000 Median : 7.000 Median :1560 Median : 0.0 Median :1975
Mean :1.494 Mean :0.007542 Mean :0.2343 Mean :3.409 Mean : 7.657 Mean :1788 Mean : 291.5 Mean :1971
3rd Qu.:2.000 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.: 8.000 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997
Max. :3.500 Max. :1.000000 Max. :4.0000 Max. :5.000 Max. :13.000 Max. :9410 Max. :4820.0 Max. :2015
yr_renovated zipcode lat long sqft_living15 sqft_lot15
Min. : 0.0 Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399 Min. : 651
1st Qu.: 0.0 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490 1st Qu.: 5100
Median : 0.0 Median :98065 Median :47.57 Median :-122.2 Median :1840 Median : 7620
Mean : 84.4 Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987 Mean : 12768
3rd Qu.: 0.0 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360 3rd Qu.: 10083
Max. :2015.0 Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210 Max. :871200
โท KC County์ ์ง๊ฐ ๋ฐ์ดํฐ๋ 21,613๊ฐ์ ํ๊ณผ 21๊ฐ์ ์ด๋ก ๊ตฌ์ฑ๋์ด ์๊ณ , ๊ฐ ์ด์ ์๋ฏธ๋ ๋ค์๊ณผ ๊ฐ๋ค.
[id]: Unique ID for each home sold
[date]: Date of the home sale
[price]: Price of each home sold
[bedrooms]: Number of bedrooms
[bathrooms]: Number of bathrooms, where .5 accounts for a room with a toilet but no shower
[sqft_living]: Square footage of the apartments interior living space
[sqft_lot]: Square footage of the land space
[floors]: Number of floors
[waterfront]: A dummy variable for whether the apartment was overlooking the waterfront or not
[view]: An index from 0 to 4 of how good the view of the property was
[condition]: An index from 1 to 5 on the condition of the apartment,
[grade]: An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high-quality level of construction and design
[sqft_above]: The square footage of the interior housing space that is above ground level
[sqft_basement]: The square footage of the interior housing space that is below ground level
[yr_built]: The year the house was initially built
[yr_renovated]: The year of the house’s last renovation
[zipcode]: What zip code area the house is in
[lat]: Lattitude
[long]: Longitude
[sqft_living15]: The square footage of interior housing living space for the nearest 15 neighbors
[sqft_lot15]: The square footage of the land lots of the nearest 15 neighbors
๋ถ์์ ์ํํ๊ธฐ ์ํด ๋ค์์ ์ ์ฒ๋ฆฌ ๊ณผ์ ์ ๊ฑฐ์ณค๋ค.
In:
df_kc = data_kc %>%
select(-id, -date, -zipcode, -lat, -long)
df_kc$age = 2015 - df_kc$yr_built + 1
df_kc = df_kc %>%
select(-yr_built)
df_kc = df_kc %>%
mutate(yr_renovated = if_else(yr_renovated == 0, 'No Renovation',
if_else(yr_renovated < 1980, 'Under 1980',
if_else(yr_renovated < 2000, '1980 - 2000', '2000 - 2015'))))
df_kc$yr_renovated = as.factor(df_kc$yr_renovated)
df_kc$sqft_above_ratio = with(df_kc, sqft_above/sqft_living))
df_kc = df_kc %>%
select(-sqft_above, -sqft_basement)
df_kc$waterfront = as.factor(df_kc$waterfront)
df_kc[, sapply(df_kc, function(x) is.integer(x))] = lapply(df_kc[, sapply(df_kc, function(x) is.integer(x))], as.double)
โท id, date, zipcode, lat, long์ ๋ถ์ ๋์์์ ์ ์ธํ์๋ค.
โท ์ง์ ๋์ด๋ฅผ ์๋ฏธํ๋ age๋ ๋ฐ์ดํฐ ์ธก์ ์์ ์ธ 2015๋ ์ yr_built๋ฅผ ๋นผ์ ์์ฑํ๊ณ , yr_built๋ฅผ ์ ์ธํ์๋ค.
โท yr_renovated์ ๊ฒฝ์ฐ, ๋ณด์๊ฐ ์ ๋ ์ง์ 0, ๋ณด์๊ฐ ๋ ์ง์ ๊ณต์ฌ๋ ํด๋ก ๋ํ๋ ์๋ค. ์ด๋ฅผ ๋ชจ๋ธ์ ํ๋์ ๋ณ์๋ก ๋ฃ๊ฒ๋๋ฉด ํด์์ ๋ฌธ์ ๊ฐ ๋ฐ์ํ๊ฒ ๋๋ค. ๋ฐ๋ผ์ ๋ค ๊ฐ์ ์นดํ ๊ณ ๋ฆฌ(No Renovation, Under 1980 - 2000, 2000 - 2015)๋ก ๋ฐ๊พธ์๋ค.
โท sqft_above์ sqft_basement์ ํฉ์ sqft_living์ด๋ค. ์ฆ, ์ด ์ธ ์ด์ ์ ๋ณด๋ ์ค๋ณต๋์ด ์๋ค. ๋ฐ๋ผ์ sqft_living๋ง ๋จ๊ฒจ๋๊ณ , sqft_above๋ฅผ sqft_living์ผ๋ก ๋๋์ด ๊ฑฐ์ฃผ ๊ณต๊ฐ ์ค ์ง์์ ์ฐจ์ง ๋น์จ์ ์๋ฏธํ๋ sqft_above_ratio๋ผ๋ ๋ณ์๋ฅผ ์์ฑํ์๋ค.
โท ๋ฒ์ฃผํ ๋ณ์๋ factor๋ก ๋ณํํ๊ณ , ์ด์ธ์ ๋ณ์๋ ๋ชจ๋ ์ค์๋ก ๋ณํํ์๋ค.
2. ์๊ฐํ ๋ฐ ์๊ด๊ด๊ณ ๋ถ์
In:
df_temp = df_kc %>%
select(-yr_renovated, -waterfront)
df_temp %>%
gather(key = key, value = value, bedrooms:view) %>%
ggplot(aes(value, price)) +
geom_bin2d() +
scale_x_continuous(label = comma) +
scale_fill_gradient(trans = 'log') +
guides(fill = F) +
facet_wrap(~key, scales = 'free', ncol = 2) +
labs(x = NULL)
df_temp %>%
gather(key = key, value = value, condition:sqft_above_ratio) %>%
ggplot(aes(value, price)) +
geom_bin2d() +
scale_x_continuous(label = comma) +
scale_fill_gradient(trans = 'log') +
guides(fill = F) +
facet_wrap(~key, scales = 'free', ncol = 2) +
labs(x = NULL)
df_temp = df_kc %>%
mutate(waterfront = if_else(waterfront == '1', 'TRUE', 'FALSE')) %>%
select(price, waterfront, yr_renovated)
df_temp %>%
gather(key = key, value = value, waterfront, yr_renovated) %>%
mutate(value = factor(value, levels = c('Under 1980', '1980 - 2000', '2000 - 2015', 'No Renovation', 'TRUE', 'FALSE'))) %>%
ggplot(aes(value, price, colour = value)) +
geom_boxplot(size = 1) +
scale_y_log10() +
facet_wrap(~ key, scale = 'free', ncol = 2) +
guides(colour = F)
Out:
โท sqft_lot, sqft_lot15, sqft_living15๋ price์ ๋น์ ํ ๊ด๊ณ๋ฅผ ๋ณด์ด๊ณ ์๋ค.
โท ์์ ์๊ฐํ๋ก๋ถํฐ ๋ค์์ ๊ฒฝํฅ์ฑ์ ํ์ธํ ์ ์๋ค.
- (์์ค์ ์ / ์นจ์ค์ ์ / ๋ด๋ถ ๊ณต๊ฐ / ํ๊ฐ ๋ฑ๊ธ / ์ฃผ๋ณ ์ด์์ ๋ด๋ถ ๊ณต๊ฐ)์ด ํด์๋ก ๊ฐ๊ฒฉ์ด ์ฆ๊ฐํ๋ ๊ฒฝํฅ์ ๋ณด์ธ๋ค.
- (๋ ์ ํฌ๊ธฐ / ์ฃผ๋ณ ์ด์์ ๋ ์ ํฌ๊ธฐ)๊ฐ ํด์๋ก ๊ฐ๊ฒฉ์ด ๊ฐ์ํ๋ ๊ฒฝํฅ์ ๋ณด์ธ๋ค.
- (๋ฐ๋ค ๋๋ ๊ฐ์ด ๋ณด์ด๋ ๋ทฐ๋ฅผ ๊ฐ์ง๋ ๊ฒฝ์ฐ / ์ต๊ทผ์ ๋ณด์๊ณต์ฌ๋ฅผ ์งํํ ๊ฒฝ์ฐ์)์ ๊ฐ๊ฒฉ์ด ์ฆ๊ฐํ๋ ๊ฒฝํฅ์ ๋ณด์ธ๋ค.
In:
cor.mtest <- function(mat, conf.level = 0.95) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
diag(p.mat) <- 0
diag(lowCI.mat) <- diag(uppCI.mat) <- 1
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], conf.level = conf.level)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
lowCI.mat[i, j] <- lowCI.mat[j, i] <- tmp$conf.int[1]
uppCI.mat[i, j] <- uppCI.mat[j, i] <- tmp$conf.int[2]
}
}
return(list(p.mat, lowCI.mat, uppCI.mat))
}
res <- cor.mtest(df_kc %>%
select(-waterfront, -yr_renovated), 0.95)
corrplot(df_kc %>%
select(-waterfront, -yr_renovated) %>%
cor(method = 'spearman'), p.mat = res[[1]], sig.level = 0.05, method = 'number', type = 'lower')
Out:
โท ๋ฒ์ฃผํ ๋ณ์์ธ waterfont์ yr_renovated๋ฅผ ์ ์ธํ๊ณ , ์ฐ์ํ ๋ณ์์ ๋ํ์ฌ ์๊ด๊ด๊ณ ๋ถ์์ ์ํํ์๋ค. ์์ ๊ฒฐ๊ณผ๋ ๋น์ ํ ๊ด๊ณ๋ฅผ ๊ณ ๋ คํ ์ ์๋ ์คํผ์ด๋ง ์๊ด๊ณ์๋ฅผ ๋ํ๋ธ ๊ฒ์ด๋ค.
โท X ํ์๋ ์๊ด๊ณ์ ๊ฒ์ ๊ฒฐ๊ณผ๊ฐ ์ ์์์ค 0.05์์ ์ ์ํ์ง ์์ ์๊ด๊ด๊ณ๋ฅผ ์๋ฏธํ๋ค.
โท age์ sqft_above_ratio๋ฅผ ์ ์ธํ ์ฐ์ํ ๋ณ์์ price๊ฐ ์์ ์๊ด๊ด๊ณ๋ฅผ ๋๋ ๊ฒ์ผ๋ก ๋ํ๋ฌ๋ค.
3. ๋ค์คํ๊ท๋ถ์
In:
model_lm = lm(log(price) ~ ., df_kc)
summary(model_lm)
Out:
Call:
lm(formula = log(price) ~ ., data = df_kc)
Residuals:
Min 1Q Median 3Q Max
-1.38962 -0.20882 0.01414 0.20767 1.28996
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.058e+01 3.298e-02 320.814 < 2e-16 ***
bedrooms -2.382e-02 2.895e-03 -8.230 < 2e-16 ***
bathrooms 7.230e-02 5.005e-03 14.445 < 2e-16 ***
sqft_living 1.078e-04 5.245e-06 20.551 < 2e-16 ***
sqft_lot 2.391e-07 7.321e-08 3.266 0.001091 **
floors 1.301e-01 5.390e-03 24.141 < 2e-16 ***
waterfront1 3.595e-01 2.664e-02 13.497 < 2e-16 ***
view 3.975e-02 3.215e-03 12.364 < 2e-16 ***
condition 4.368e-02 3.570e-03 12.236 < 2e-16 ***
grade 2.065e-01 3.208e-03 64.386 < 2e-16 ***
yr_renovated2000 - 2015 7.886e-02 2.233e-02 3.532 0.000414 ***
yr_renovatedNo Renovation -7.368e-03 1.753e-02 -0.420 0.674227
yr_renovatedUnder 1980 -1.155e-01 3.270e-02 -3.532 0.000413 ***
sqft_living15 1.141e-04 5.110e-06 22.338 < 2e-16 ***
sqft_lot15 -5.139e-07 1.119e-07 -4.592 4.41e-06 ***
age 5.423e-03 1.013e-04 53.525 < 2e-16 ***
sqft_above_ratio -2.597e-01 1.502e-02 -17.293 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3086 on 21596 degrees of freedom
Multiple R-squared: 0.657, Adjusted R-squared: 0.6567
F-statistic: 2585 on 16 and 21596 DF, p-value: < 2.2e-16
โท ์๊ฐํ๋ฅผ ํตํด price์ ๋ค๋ฅธ ์ฐ์ํ ๋ณ์์ ๋น์ ํ ๊ด๊ณ๋ฅผ ํ์ธํ ์ ์์๋ค. ๋ฐ๋ผ์ log๋ฅผ ์ทจํ price๋ฅผ ๋ชจ๋ธ์ ๋ฐ์๋ณ์๋ก ์ ํ์๋ค.
โท ์ด ๋ชจ๋ธ์ R-squared๋ 0.6567๋ก ์ ์ฒด ๋ณ๋์ ์ฝ 66%๋ฅผ ์ค๋ช ํ๋ค. F-Statistic๋ฅผ ํตํด ์ ์์์ค 0.05์์ ์ด ๋ชจ๋ธ์ด ์ ์ํ๋ ๊ฒฐ๋ก ์ ์ป์ ์ ์๋ค.
โท yr_renovated ๋ณ์์ No Renovation๋ ์ ์์์ค 0.05์์ ์ ์ํ์ง ์์ ๊ฒ์ผ๋ก ๋ํ๋ฌ๋ค. ์ด์ธ์ ๋ชจ๋ ๋ณ์๋ ์ ์์์ค 0.05์์ ์ ์ํ ๊ฒ์ผ๋ก ๋ํ๋ฌ๋ค.
In:
plot(model_lm, 2)
Out:
โท ์ผ์ชฝ ์๋ ๋ ๋ถ๋ถ์ ์ ์ธํ ๋๋ถ๋ถ์ ๋ฐ์ดํฐ๋ ์ง์ ์์ ์์นํ๋ ๊ฒ์ ํ์ธํ ์ ์๋ค. ๋ฐ๋ผ์ ์ ๊ท์ฑ ๊ฐ์ ์ ๋ง์กฑํ๋ค๊ณ ๋ณผ ์ ์๋ค.
In:
vif(model_lm)
Out:
GVIF Df GVIF^(1/(2*Df))
bedrooms 1.645304 1 1.282694
bathrooms 3.372189 1 1.836352
sqft_living 5.266456 1 2.294876
sqft_lot 2.087090 1 1.444676
floors 1.922529 1 1.386553
waterfront 1.205512 1 1.097958
view 1.377756 1 1.173778
condition 1.224774 1 1.106695
grade 3.226952 1 1.796372
yr_renovated 1.164329 3 1.025682
sqft_living15 2.783465 1 1.668372
sqft_lot15 2.118644 1 1.455556
age 2.010378 1 1.417878
sqft_above_ratio 1.496085 1 1.223146
โท ๋ชจ๋ ๋ณ์์ VIF๊ฐ 10 ์ดํ์ด๊ธฐ ๋๋ฌธ์, ๋ค์ค๊ณต์ ์ฑ ๋ฌธ์ ๊ฐ ๋ํ๋์ง ์๋๋ค๊ณ ๋ณผ ์ ์๋ค.
4. ๋ฒ ์ด์ง์ ๋คํธ์ํฌ ๋ชจ๋ธ๋ง
In:
df_input = df_kc %>%
mutate(log_price = log(price)) %>%
select(-price)
model_bn_prototype = hc(df_input, score = 'bic-cg')
model_bn_prototype
plot_bn = layoutGraph(as.graphNEL(model_bn_prototype))
nodeRenderInfo(plot_bn) <- list(fontsize = 150)
renderGraph(plot_bn)
Out:
Bayesian network learned via Score-based methods
model:
[waterfront][yr_renovated|waterfront][sqft_lot|yr_renovated][sqft_lot15|sqft_lot:waterfront:yr_renovated]
[bathrooms|sqft_lot:waterfront:sqft_lot15][sqft_living|bathrooms:sqft_lot:waterfront:sqft_lot15]
[sqft_above_ratio|bathrooms:sqft_living:sqft_lot:sqft_lot15][floors|bathrooms:sqft_living:sqft_lot:sqft_lot15:sqft_above_ratio]
[age|bathrooms:sqft_living:floors:yr_renovated:sqft_lot15:sqft_above_ratio][condition|bathrooms:floors:yr_renovated:age:sqft_above_ratio]
[sqft_living15|sqft_living:sqft_lot:floors:yr_renovated:sqft_lot15:age:sqft_above_ratio]
[grade|bathrooms:sqft_living:floors:sqft_living15:sqft_lot15:age:sqft_above_ratio]
[log_price|bathrooms:sqft_living:floors:waterfront:condition:grade:sqft_living15:age:sqft_above_ratio]
[view|waterfront:yr_renovated:sqft_living15:age:sqft_above_ratio:log_price]
[bedrooms|bathrooms:sqft_living:floors:view:condition:grade:sqft_lot15:age:sqft_above_ratio:log_price]
nodes: 15
arcs: 71
undirected arcs: 0
directed arcs: 71
average markov blanket size: 12.80
average neighbourhood size: 9.47
average branching factor: 4.73
learning algorithm: Hill-Climbing
score: BIC (cond. Gauss.)
penalization coefficient: 4.990525
tests used in the learning procedure: 1260
optimized: TRUE
โท Score๋ BIC๋ก, ์ด๋ฅผ ์ต๋ํํ๋ ๊ตฌ์กฐ๋ฅผ ์ฐพ๊ธฐ ์ํด Hill-Climbing ์๊ณ ๋ฆฌ์ฆ์ ์ฌ์ฉํ์๋ค.
โท ์์ ๊ทธ๋ฆผ์ ํ์ต๋ ๋คํธ์ํฌ์ ๊ตฌ์กฐ์ด๊ณ , ์ฐ๋ฆฌ์ ์ง๊ด๊ณผ ๋ง์ง ์๋ ๊ด๊ณ๋ฅผ ๊ฐ์ง๊ณ ์๋ค. ์์ ๋คํธ์ํฌ์ ๋ฐ๋ฅด๋ฉด log_price๊ฐ bedrooms์ ์ํฅ์ ์ฃผ๋ค. ์ฆ, ๊ฐ๊ฒฉ์ด ์ง์ ์นจ๋์ ๊ฐ์์ ์ํฅ์ ์ค๋ค๊ณ ๋ณด์์ผ ํ๋ค. ์ด ์ธ๊ณผ๊ด๊ณ๋ ์ฐ๋ฆฌ์ ์ง๊ด๊ณผ ๋ง์ง ์๋ค. ๋ฐ๋ผ์ ๋ฐ์ดํฐ์ ํน์ฑ์ ๋ง๋ ์ธ๊ณผ๊ด๊ณ๋ฅผ ๋ฏธ๋ฆฌ ์ง์ ํ์ฌ ํฉ๋ฆฌ์ ์ธ ๋คํธ์ํฌ๋ฅผ ๋ง๋ค์ด์ผ ํ๋ค.
๋ฐ์ดํฐ์ ํน์ฑ์ ๊ณ ๋ คํ ๋ฒ ์ด์ง์ ๋คํธ์ํฌ์ ๊ตฌ์ฑ ์กฐ๊ฑด์ ๋ค์๊ณผ ๊ฐ๋ค.
- bedrooms, bathrooms, sqft_living, waterfront, view, condition, grade, yr_renovated๋ log_price์ ์ํฅ์ ์ค๋ค.- floors, bedrooms, bathrooms, sqft_living, sqft_lot, sqft_above_ratio, sqft_living15, sqft_lot15, waterfront, view, age๋ ๋ ธ๋์ ์ต์๋จ์ ์์นํ๊ณ , ์๋ก ์ํฅ์ ์ฃผ๊ณ ๋ฐ์ ์ ์๋ค.
- log_price๋ ์ด๋ ๋ณ์์๋ ์ํฅ์ ์ค ์ ์๋ค.
- condition, grade๋ ์ต์๋จ์ ์์นํ ๋ ธ๋์ ์ํฅ์ ์ค ์ ์๋ค.- yr_renovated๋ ์ต์๋จ ๋ ธ๋์ ์ํฅ์ ์ค ์ ์๋ค.
์์ ์กฐ๊ฑด์ ๊ณ ๋ คํ์ฌ ๋คํธ์ํฌ์ ๊ตฌ์กฐ๋ฅผ ํ์ตํ์ฌ ๋ณด์.
In:
top_node = c('floors', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
'sqft_above_ratio', 'sqft_living15', 'sqft_lot15', 'waterfront', 'view',
'age')
df_wl = data.frame(from = c('bedrooms', 'bathrooms', 'sqft_living', 'waterfront', 'view', 'condition', 'grade', 'yr_renovated'), to = 'log_price')
df_bl = NULL
for (i in top_node) {
df_bl = df_bl %>%
bind_rows(data.frame(from = top_node[top_node != i], to = i))
}
df_bl = df_bl %>%
bind_rows(data.frame(from = 'log_price', to = names(df_kc)[names(df_kc) != 'price']))
df_bl = df_bl %>%
bind_rows(data.frame(from = 'condition', to = top_node))
df_bl = df_bl %>%
bind_rows(data.frame(from = 'grade', to = top_node))
df_bl = df_bl %>%
bind_rows(data.frame(from = 'yr_renovated', to = top_node))
model_bn = hc(df_input, blacklist = df_bl, whitelist = df_wl, score = 'bic-cg')
plot_bn = layoutGraph(as.graphNEL(model_bn))
nodeRenderInfo(plot_bn) <- list(fontsize = 50)
renderGraph(plot_bn)
Out:
โท ์์ ๋คํธ์ํฌ ๊ตฌ์กฐ ํ์ต ๋ฐฉ๋ฒ๊ณผ ์ฐจ์ด์ ์ blacklist, whitelist ์ธ์์ ๋คํธ์ํฌ ๊ตฌ์ฑ ์กฐ๊ฑด์ ์ค ๊ฒ์ด๋ค.
โท sqft_lot๋ ๋คํธ์ํฌ์์ ๊ณ ๋ฆฝ๋์ด ์ด๋ ๋ ธ๋์๋ ์ํฅ์ ์ฃผ๊ณ ๋ฐ์ง ์๋ ๊ฒ์ผ๋ก ๋ํ๋ฌ๋ค.
In:
fit_bn = bn.fit(model_bn, df_input)
fit_bn
Out:
Bayesian network parameters
Parameters of node bedrooms (Gaussian distribution)
Conditional density: bedrooms
Coefficients:
(Intercept)
3.370842
Standard deviation of the residuals: 0.9300618
Parameters of node bathrooms (Gaussian distribution)
Conditional density: bathrooms
Coefficients:
(Intercept)
2.114757
Standard deviation of the residuals: 0.7701632
Parameters of node sqft_living (Gaussian distribution)
Conditional density: sqft_living
Coefficients:
(Intercept)
2079.9
Standard deviation of the residuals: 918.4409
Parameters of node sqft_lot (Gaussian distribution)
Conditional density: sqft_lot
Coefficients:
(Intercept)
15106.97
Standard deviation of the residuals: 41420.51
Parameters of node floors (Gaussian distribution)
Conditional density: floors
Coefficients:
(Intercept)
1.494309
Standard deviation of the residuals: 0.5399889
Parameters of node waterfront (multinomial distribution)
Conditional probability table:
0 1
0.992458243 0.007541757
Parameters of node view (Gaussian distribution)
Conditional density: view
Coefficients:
(Intercept)
0.2343034
Standard deviation of the residuals: 0.7663176
Parameters of node condition (conditional Gaussian distribution)
Conditional density: condition | bathrooms + floors + yr_renovated + age + sqft_above_ratio
Coefficients:
0 1 2 3
(Intercept) 3.0563791874 3.1737338235 3.1973525083 3.8059657879
bathrooms 0.0312814109 -0.0095737742 0.1204390398 0.2832070755
floors 0.0269270784 0.0154644257 -0.1347846515 -0.0639222984
age 0.0024137791 -0.0007246165 0.0090552567 -0.0025437365
sqft_above_ratio -0.0275911248 -0.0854984414 -0.2587305161 -0.4778581940
Standard deviation of the residuals:
0 1 2 3
0.6009327 0.2598014 0.5917577 0.6960968
Discrete parents' configurations:
yr_renovated
0 1980 - 2000
1 2000 - 2015
2 No Renovation
3 Under 1980
Parameters of node grade (Gaussian distribution)
Conditional density: grade | bedrooms + bathrooms + sqft_living + floors + view + sqft_living15 + sqft_lot15 + age + sqft_above_ratio
Coefficients:
(Intercept) bedrooms bathrooms sqft_living floors view sqft_living15 sqft_lot15
5.196123e+00 -1.262738e-01 1.360192e-01 6.117853e-04 2.420609e-01 7.028056e-02 4.541821e-04 -1.453307e-06
age sqft_above_ratio
-4.804130e-03 3.204805e-01
Standard deviation of the residuals: 0.6555349
Parameters of node yr_renovated (multinomial distribution)
Conditional probability table:
waterfront
yr_renovated 0 1
1980 - 2000 0.014358974 0.159509202
2000 - 2015 0.020885781 0.061349693
No Renovation 0.959347319 0.742331288
Under 1980 0.005407925 0.036809816
Parameters of node sqft_living15 (Gaussian distribution)
Conditional density: sqft_living15
Coefficients:
(Intercept)
1986.552
Standard deviation of the residuals: 685.3913
Parameters of node sqft_lot15 (Gaussian distribution)
Conditional density: sqft_lot15
Coefficients:
(Intercept)
12768.46
Standard deviation of the residuals: 27304.18
Parameters of node age (Gaussian distribution)
Conditional density: age
Coefficients:
(Intercept)
44.99486
Standard deviation of the residuals: 29.37341
Parameters of node sqft_above_ratio (Gaussian distribution)
Conditional density: sqft_above_ratio
Coefficients:
(Intercept)
0.8754978
Standard deviation of the residuals: 0.1709693
Parameters of node log_price (conditional Gaussian distribution)
Conditional density: log_price | bedrooms + bathrooms + sqft_living + waterfront + view + condition + grade + yr_renovated
Coefficients:
0 1 2 3 4 5 6 7
(Intercept) 1.066013e+01 1.139515e+01 1.121704e+01 1.244367e+01 1.083384e+01 1.153135e+01 1.097135e+01 -2.108439e+01
bedrooms 8.894919e-04 1.359918e-01 -2.512079e-02 1.356706e-01 -1.509238e-02 1.677411e-02 -4.520446e-02 1.120495e+00
bathrooms 5.170345e-02 -2.965593e-02 4.840945e-02 1.804738e-01 -9.093745e-04 8.195861e-02 5.097185e-02 2.878328e+01
sqft_living 1.569443e-04 3.193704e-04 1.891389e-04 -9.619998e-05 1.984457e-04 1.835918e-04 1.766649e-04 -1.798751e-03
view 4.760609e-02 -2.261726e-02 3.788562e-02 -5.117922e-01 7.375035e-02 8.165891e-02 1.121705e-01 2.645102e+01
condition 7.673433e-02 1.196717e-01 -1.013576e-01 6.484588e-02 1.019852e-01 4.940656e-02 -7.447056e-02 -4.044361e+01
grade 2.443866e-01 1.168730e-01 2.517358e-01 3.300950e-01 1.929564e-01 1.407172e-01 2.775896e-01 0.000000e+00
Standard deviation of the residuals:
0 1 2 3 4 5 6 7
0.3428534 0.3922664 0.3361898 0.5091354 0.3331445 0.3407674 0.3259330 0.0000000
Discrete parents' configurations:
waterfront yr_renovated
0 0 1980 - 2000
1 1 1980 - 2000
2 0 2000 - 2015
3 1 2000 - 2015
4 0 No Renovation
5 1 No Renovation
6 0 Under 1980
7 1 Under 1980
โท bn.fit ํจ์๋ฅผ ์ด์ฉํ์ฌ, ํ์ต๋ ๊ตฌ์กฐ๋ก๋ถํฐ ํ๋ผ๋ฏธํฐ๋ฅผ ํ์ตํ ์ ์๋ค. ์์ ๊ฒฐ๊ณผ๋ ํ๋ผ๋ฏธํฐ์ ํ์ต ๊ฒฐ๊ณผ์ด๋ค. ๊ฐ Conditional density์ ๊ณ์๋ก๋ถํฐ ๋ณ์๊ฐ์ ๊ด๊ณ๋ฅผ ํ์ธํ ์ ์๋ค.
In:
pred_output = predict(model_lm, df_kc)
rmse_lm = sum(abs(pred_output - log(df_kc$price))^2)/nrow(df_kc)
df_lm = data.frame(log_price = log(df_kc$price), pred_log_price = pred_output, model = 'Multiple Linear Regression')
pred_output = predict(fit_bn, node = 'log_price', data = df_input)
rmse_bn = sum(abs(pred_output - df_input$log_price)^2)/nrow(df_input)
df_bn = data.frame(log_price = df_input$log_price, pred_log_price = pred_output, model = 'Bayeisan Network')
df_total = df_lm %>%
bind_rows(df_bn)
print(rmse_lm)
print(rmse_bn)
df_total %>%
ggplot(aes(log_price, pred_log_price, colour = model)) +
geom_point(alpha = 0.05) +
geom_abline(slope = 1, intercept = 0, linetype = 'dashed', size = 0.75) +
guides(colour = F) +
facet_wrap(~ model)
Out:
[1] 0.09514861
[1] 0.1108972
โท ๋ค์คํ๊ท ๋ชจ๋ธ๊ณผ ํ์ต ๊ฒฐ๊ณผ์ RMSE๊ฐ ๋ฒ ์ด์ง์ ๋คํธ์ํฌ ๋ชจ๋ธ๋ณด๋ค ๋ ๋ฎ์ ๊ฒ์ผ๋ก ๋ํ๋ฌ์ง๋ง, ํฐ ์ฐจ์ด๋ ์๋ ๊ฒ์ ์ ์ ์๋ค.
โท ๋ ๋ชจ๋ธ์ ์์ฐจ์ ๋ถํฌ๊ฐ ๋น์ทํ ํํ๋ฅผ ๋๋ ๊ฒ์ ํ์ธํ ์ ์๋ค.
'Statistics > Probabilistic Graphical Model' ์นดํ ๊ณ ๋ฆฌ์ ๋ค๋ฅธ ๊ธ
๋ฒ ์ด์ง์ ๋คํธ์ํฌ(Bayesian network) (3) (0) | 2020.09.19 |
---|---|
๋ฒ ์ด์ง์ ๋คํธ์ํฌ(Bayesian network) (2) (0) | 2020.09.12 |
๋ฒ ์ด์ง์ ๋คํธ์ํฌ(Bayesian network) (1) (1) | 2020.09.09 |