How to Handle Missing Values

Let's look at how to handle missing values.

 

□ Checking for missing values

 

In:

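library(MASS)   # provides the Cars93 dataset
library(dplyr)  # provides the %>% pipe

df_car = Cars93

# Count the missing values in each column
df_car %>% 
  sapply(function(x) sum(is.na(x)))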

 

Out:

      Manufacturer              Model               Type 
                 0                  0                  0 
         Min.Price              Price          Max.Price 
                 0                  0                  0 
          MPG.city        MPG.highway            AirBags 
                 0                  0                  0 
        DriveTrain          Cylinders         EngineSize 
                 0                  0                  0 
        Horsepower                RPM       Rev.per.mile 
                 0                  0                  0 
   Man.trans.avail Fuel.tank.capacity         Passengers 
                 0                  0                  0 
            Length          Wheelbase              Width 
                 0                  0                  0 
       Turn.circle     Rear.seat.room       Luggage.room 
                 0                  2                 11 
            Weight             Origin               Make 
                 0                  0                  0 

 

▷ We will use the Cars93 dataset from the MASS package for these experiments.

 

▷ The result above uses is.na and sum to count the missing values in each column. Only Luggage.room (11) and Rear.seat.room (2) contain missing values.
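As a compact alternative, the same per-column count can be sketched with colSums() applied to the logical matrix that is.na() returns. The data frame `toy` below is a hypothetical stand-in for Cars93, not data from the post.

```r
# Hypothetical toy data frame (not from the post) with known missing values
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", NA),
  c = c(TRUE, FALSE, TRUE)
)

# is.na() on a data frame returns a logical matrix;
# colSums() counts the TRUE entries in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Names of the columns that contain at least one missing value
na_cols <- names(na_counts)[na_counts > 0]
print(na_cols)
```

Filtering on `na_counts > 0` is handy for wide data frames, where most columns print as 0.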

 

In:

complete.cases(df_car)

 

Out:

 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[11]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE
[21]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[31]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[41]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[51]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
[61]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
[71]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[81]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
[91]  TRUE  TRUE  TRUE

 

▷ complete.cases returns TRUE for each row of the data frame that contains no missing values and FALSE otherwise.
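The logical vector from complete.cases can also be used as a row index. The sketch below, on a hypothetical data frame `toy`, lists the incomplete rows and keeps only the complete ones.

```r
# Hypothetical toy data frame (not from the post)
toy <- data.frame(
  x = c(1, NA, 3, 4),
  y = c("a", "b", NA, "d")
)

# One logical value per row: TRUE when the row has no NA
ok <- complete.cases(toy)

# Row numbers of the incomplete rows
incomplete_rows <- which(!ok)
print(incomplete_rows)

# Keep only the complete rows by logical indexing
toy_complete <- toy[ok, ]
print(toy_complete)
```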

 

□ Removing missing values

 

In:

cnt_1 = sum(complete.cases(df_car))

cnt_2 = df_car %>% 
  na.omit() %>% 
  nrow()

cnt_1 == cnt_2

 

Out:

[1] TRUE

 

▷ cnt_1 is the number of rows that complete.cases marks as complete, and cnt_2 is the number of rows left after applying na.omit. As the result shows, the two are equal. In other words, na.omit drops every row that contains a missing value.
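Because na.omit looks at every column, it can drop more rows than intended. A minimal sketch, assuming only some columns matter for the analysis, is to run complete.cases on just those columns. The data frame `toy` and its column names are hypothetical.

```r
# Hypothetical toy data frame (not from the post)
toy <- data.frame(
  id = 1:4,
  x  = c(10, NA, 30, 40),
  y  = c(NA, 2, 3, 4)
)

# na.omit() drops a row if ANY column is missing
all_complete <- na.omit(toy)
print(all_complete)

# Drop a row only when x is missing; NAs in y are tolerated
x_complete <- toy[complete.cases(toy[, c("id", "x")]), ]
print(x_complete)
```

Here na.omit keeps rows 3 and 4, while the column-restricted version also keeps row 1, whose only missing value is in y.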

 

□ Operations on vectors containing missing values

 

In:

sum(df_car$Luggage.room)
sum(df_car$Luggage.room, na.rm = TRUE)

median(df_car$Luggage.room)
median(df_car$Luggage.room, na.rm = TRUE)

 

Out:

[1] NA

[1] 1139

[1] NA

[1] 14

 

▷ When computing on a vector that contains missing values, set the function's na.rm argument to TRUE. Otherwise the result is NA.
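The reason is that NA propagates through arithmetic: any calculation involving an unknown value is itself unknown. The sketch below illustrates this on a hypothetical vector `v` and shows that na.rm = TRUE is equivalent to filtering out the NA values first.

```r
# Hypothetical vector (not from the post) with one missing value
v <- c(4, NA, 10, 6)

# NA propagates: the total is unknown if any element is unknown
total_na <- sum(v)
print(total_na)

# na.rm = TRUE removes the NA values before computing
total <- sum(v, na.rm = TRUE)
mean_rm <- mean(v, na.rm = TRUE)

# Equivalent manual approach: filter the NA values out explicitly
mean_manual <- mean(v[!is.na(v)])

print(total)
print(mean_rm)
print(mean_manual)
```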

 

□ Imputing missing values

 

In:

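# Impute numeric columns only; factor columns such as Manufacturer are left untouched
# (across() and where() require dplyr 1.0.0 or later)
df_car_compl = df_car %>% 
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))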

 

▷ This code replaces each missing value with the mean of its column.
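Mean imputation only makes sense for numeric columns. Taking the mean of a factor or character column yields NA with a warning. A minimal base-R sketch of the same idea, restricted to numeric columns, is shown below. The helper `impute_mean` and the data frame `toy` are hypothetical names introduced for illustration.

```r
# Hypothetical toy data frame (not from the post)
toy <- data.frame(
  name = c("a", "b", "c", "d"),
  x    = c(1, NA, 3, 5),
  y    = c(10, 20, NA, 30),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill NA with the column mean, numeric columns only
impute_mean <- function(col) {
  if (is.numeric(col)) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
  }
  col
}

# Assigning into toy_imputed[] keeps the data frame structure and column types
toy_imputed <- toy
toy_imputed[] <- lapply(toy, impute_mean)
print(toy_imputed)
```

Using lapply with `[]` assignment, rather than sapply, avoids collapsing the data frame into a single-type matrix.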

 

In:

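library(dplyr)  # provides group_by(), mutate(), and %>%

group = c(rep('a', 5), rep('b', 5))
value = c(1, 2, 3, NaN, 6, 2, 4, NaN, 10, 8)
df = data.frame(group, value)

df

# Replace each missing value with the mean of its own group
df %>%
  group_by(group) %>%
  mutate(value = ifelse(is.na(value), mean(value, na.rm = TRUE), value))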

 

Out:

   group value
1      a     1
2      a     2
3      a     3
4      a   NaN
5      a     6
6      b     2
7      b     4
8      b   NaN
9      b    10
10     b     8

# A tibble: 10 x 2
# Groups:   group [2]
   group value
   <fct> <dbl>
 1 a         1
 2 a         2
 3 a         3
 4 a         3
 5 a         6
 6 b         2
 7 b         4
 8 b         6
 9 b        10
10 b         8

 

▷ This code replaces each missing value in the value column with the mean of its group. The difference from the previous code is that group_by restricts the scope of the calculation to each group before it runs. Without group_by, the missing values would be replaced with the overall mean of the value column.
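For comparison, the same group-wise imputation can be sketched in base R with ave(), which applies a function within each group and returns a vector as long as its input. The column name `value_filled` is a hypothetical addition; the data are the same as in the example above.

```r
# Same data as in the example above
group <- c(rep("a", 5), rep("b", 5))
value <- c(1, 2, 3, NaN, 6, 2, 4, NaN, 10, 8)
df <- data.frame(group, value)

# ave() computes the per-group mean and repeats it for each row of the group
group_mean <- ave(df$value, df$group,
                  FUN = function(v) mean(v, na.rm = TRUE))

# is.na() is TRUE for NaN as well as NA, so both are replaced
df$value_filled <- ifelse(is.na(df$value), group_mean, df$value)
print(df)
```

The group means are 3 for group a and 6 for group b, so the filled values agree with the dplyr output above.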