How to Use caret

Let's walk through how to apply various machine learning methods with the caret package.

 

In:

df_iris = iris

str(df_iris)

 

Out:

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

 

▷ The data used in this experiment is the Iris dataset that ships with R.

 

1. ๋ฐ์ดํ„ฐ ๋‚˜๋ˆ„๊ธฐ

 

In:

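library(caret)  # createDataPartition() comes from caret, so load it before splitting

# Draw 70% of the row indices, stratified by Species; list = FALSE returns a matrix of indices
tr_idx = createDataPartition(y = df_iris$Species, p = 0.7, list = FALSE)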

train_iris = df_iris[tr_idx, ]
test_iris = df_iris[-tr_idx, ]

table(train_iris$Species)

 

Out:

    setosa versicolor  virginica 
        35         35         35 

 

▷ The createDataPartition function draws row indices in the proportion given by the p argument. When the value passed to y is a categorical variable, the draw is stratified so that the class proportions are preserved. The result above confirms this: with p = 0.7, 35 of the 50 rows of each Species label end up in the training set.
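
▷ To make the idea of a stratified split concrete, here is a minimal base-R sketch that samples 70% of the rows within each class. It only illustrates the concept and is not caret's internal implementation; the seed value is an arbitrary assumption added so the sketch is repeatable.

```r
# Base-R sketch of a stratified 70/30 split (illustrative; not caret's internal code).
set.seed(42)  # arbitrary seed, assumed here only for repeatability

df_iris <- iris
p <- 0.7

# Group the row indices by class, then sample a fraction p inside each group.
idx_by_class <- split(seq_len(nrow(df_iris)), df_iris$Species)
tr_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(length(idx) * p))
}), use.names = FALSE)

train_iris <- df_iris[tr_idx, ]
test_iris  <- df_iris[-tr_idx, ]

table(train_iris$Species)  # 35 rows per class
table(test_iris$Species)   # 15 rows per class
```

Because the sampling happens per class, the class balance of the training set is fixed by construction, whichever rows are drawn.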

 

2. ๋ฐ์ดํ„ฐ ํ•™์Šตํ•˜๊ธฐ

 

In:

library(caret)

model_rf = train(Species ~ ., data = train_iris, method = 'rf')

print(model_rf)

 

Out:

Random Forest 

105 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 105, 105, 105, 105, 105, 105, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
  2     0.9571786  0.9348322
  3     0.9585120  0.9368726
  4     0.9562244  0.9334105

Accuracy was used to select the optimal model using the
 largest value.
The final value used for the model was mtry = 3.

 

▷ The train function fits a machine learning model. The first argument is a formula naming the target variable and the explanatory variables, the data argument takes the training data, and the method argument takes the name of the model to fit.
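
▷ The output above shows that train falls back to bootstrap resampling (25 reps) and picks its own mtry candidates when nothing else is specified. As a sketch of how these could be set explicitly, the example below passes a trainControl object and a tuneGrid. The 5-fold setting, the mtry grid, and the seed are assumptions chosen for illustration, and the randomForest package is assumed to be installed.

```r
library(caret)

set.seed(1)  # arbitrary seed, assumed here only for repeatability

# Same 70/30 stratified split as in the post.
tr_idx <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
train_iris <- iris[tr_idx, ]

# Assumed settings: 5-fold cross-validation instead of the default bootstrap.
ctrl <- trainControl(method = "cv", number = 5)

# Candidate values for mtry, the tuning parameter of method = "rf".
grid <- expand.grid(mtry = 2:4)

model_cv <- train(Species ~ ., data = train_iris, method = "rf",
                  trControl = ctrl, tuneGrid = grid)

print(model_cv)     # resampling summary, one row per mtry value
model_cv$bestTune   # the mtry value chosen by accuracy
```

The accuracy figures will differ from run to run and from the output shown earlier, since both the split and the resampling are random.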

 

▷ This page (topepo.github.io/caret/train-models-by-tag.html) lists detailed information on the machine learning methods that can be used through the train function of the caret package.

 

In:

varImp(model_rf)

 

Out:

rf variable importance

              Overall
Petal.Width  100.0000
Petal.Length  81.6058
Sepal.Width    0.4088
Sepal.Length   0.0000

 

▷ Passing the trained model to the varImp function returns the variable importance.

 

3. Prediction and performance evaluation

 

In:

pred = predict(object = model_rf, newdata = test_iris)

print(pred)

confusionMatrix(data = pred, reference = test_iris$Species)

 

Out:

 [1] setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa    
[13] setosa     setosa     setosa     versicolor versicolor versicolor versicolor versicolor versicolor virginica  versicolor versicolor
[25] virginica  versicolor versicolor versicolor versicolor versicolor virginica  virginica  virginica  virginica  virginica  virginica 
[37] virginica  versicolor virginica  virginica  virginica  versicolor virginica  virginica  virginica 
Levels: setosa versicolor virginica

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         13         2
  virginica       0          2        13

Overall Statistics
                                          
               Accuracy : 0.9111          
                 95% CI : (0.7878, 0.9752)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 8.467e-16       
                                          
                  Kappa : 0.8667          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8667           0.8667
Specificity                 1.0000            0.9333           0.9333
Pos Pred Value              1.0000            0.8667           0.8667
Neg Pred Value              1.0000            0.9333           0.9333
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2889           0.2889
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            0.9000           0.9000

 

▷ The predict function makes predictions with the trained model. Pass the trained model as the object argument and the data to predict on as the second argument, and that is all there is to it.
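
▷ Besides class labels, predict on a caret model can also return class probabilities through type = "prob". The sketch below retrains a model so that it runs on its own; the seed is an arbitrary assumption, and the randomForest package is assumed to be installed.

```r
library(caret)

set.seed(1)  # arbitrary seed, assumed here only for repeatability

tr_idx <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
train_iris <- iris[tr_idx, ]
test_iris  <- iris[-tr_idx, ]

model_rf <- train(Species ~ ., data = train_iris, method = "rf")

# Default (type = "raw"): a factor of predicted class labels.
pred_class <- predict(model_rf, newdata = test_iris)

# type = "prob": a data frame with one probability column per class.
pred_prob <- predict(model_rf, newdata = test_iris, type = "prob")

head(pred_prob)
```

Each row of pred_prob corresponds to one test observation, so the probabilities can be used to flag predictions the model is unsure about.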

 

โ–ท ์œ„์˜ ์ฒซ ๋ฒˆ์งธ ๊ฒฐ๊ณผ๋Š” ์˜ˆ์ธก ๊ฒฐ๊ณผ์ด๋‹ค. ๋‘ ๋ฒˆ์งธ ๊ฒฐ๊ณผ๋Š” confusionMatrix ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•œ ๊ฒฐ๊ณผ์ด๋‹ค. data ์ธ์ž์—๋Š” ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ, reference ์ธ์ž์—๋Š” ์‹ค์ œ ๊ฐ’์„ ์ฃผ์–ด ์„ฑ๋Šฅ์— ๋Œ€ํ•œ ์š”์•ฝ๋œ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.