Anime Recommendation Using Collaborative Filtering

The goal is to decide which of the anime a user has not yet seen should be recommended, using users' anime rating data. The anime data comes from Kaggle (www.kaggle.com/CooperUnion/anime-recommendations-database). We implement three representative collaborative filtering approaches directly in R and apply them to make a recommendation. The implementation follows the papers listed under Reference below.

 

1. Data preprocessing

2. Collaborative filtering

     2-1. User-based collaborative filtering

     2-2. Item-based collaborative filtering

     2-3. Matrix factorization collaborative filtering

3. Performance comparison

4. Recommendation results

 

1. Data preprocessing

 

In:

library(dplyr)
library(tidyr)

data_anime = read.csv('../input/anime.csv')
data_rating = read.csv('../input/rating.csv')

str(data_anime)
str(data_rating)

 

Out:

'data.frame':	12294 obs. of  7 variables:
 $ anime_id: int  32281 5114 28977 9253 9969 32935 11061 820 15335 15417 ...
 $ name    : Factor w/ 12292 levels "Hidamari Sketch x \xe2์ฟ\xe2์ฟ\xe2์ฟ Recap",..: 5520 2992 3480 10246 3471 3759 4424 3413 3468 3472 ...
 $ genre   : Factor w/ 3269 levels "","Action","Action, Adventure",..: 2687 163 536 3244 536 1820 365 2624 536 536 ...
 $ type    : Factor w/ 18 levels "","1","12","13",..: 12 17 17 17 17 17 17 15 12 17 ...
 $ episodes: Factor w/ 218 levels "","1","10","100",..: 2 166 141 86 141 3 43 14 2 28 ...
 $ rating  : num  9.37 9.26 9.25 9.17 9.16 9.15 9.13 9.11 9.1 9.11 ...
 $ members : int  200630 793665 114262 673572 151266 93351 425855 80679 72534 81109 ...

'data.frame':	7813737 obs. of  3 variables:
 $ user_id : int  1 1 1 1 1 1 1 1 1 1 ...
 $ anime_id: int  20 24 79 226 241 355 356 442 487 846 ...
 $ rating  : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...

 

▷ There are 12,294 anime and 7,813,737 user ratings of them.

 

▷ The anime information and the users' rating information are stored in two separate datasets.

 

From this data we use only the 10 most popular anime (ranked by the members column) and the 10 users who rated the most anime. Rows with a rating of -1 (in this dataset, watched but not rated) are dropped first.

 

In:

data_rating = data_rating %>% 
  filter(rating != -1)

sel_anime_id = data_anime %>% 
  arrange(-members) %>% 
  head(10) %>% 
  select(anime_id) %>% 
  unlist()

sel_user_id = data_rating %>% 
  group_by(user_id) %>% 
  count() %>% 
  arrange(-n) %>% 
  head(10) %>% 
  select(user_id) %>% 
  unlist()

df_anime_rating = data_rating %>% 
  filter(user_id %in% sel_user_id, anime_id %in% sel_anime_id) %>% 
  spread(anime_id, rating)

names(df_anime_rating)[2:ncol(df_anime_rating)] = paste0('anime_', names(df_anime_rating)[2:ncol(df_anime_rating)])

print(df_anime_rating)

 

Out:

   user_id anime_20 anime_1535 anime_1575 anime_4224 anime_5114 anime_6547 anime_9253 anime_10620 anime_11757 anime_16498
1     7345        6         10          7          9          9          6          9           5           4           6
2    12431       NA          6         NA          7          7          7          7           6           7           6
3    22434        8         10         10         10         10         10         10           8           9           9
4    42635        7          9          7          7          6          6          8           7           7           7
5    45659        8          8          9         10         10          7          9           9           8           9
6    51693       NA          9          9          9         10          8          8           7           8           8
7    53698       NA          7          9          8         NA          6          8           7           9           8
8    57620        6         10          8         10         10         10         10           8           9           9
9    59643        7         10         10          9          9          9          9           7           8           9
10   65840        8         10         10          9          8          9         10           7           9           8

 

▷ The data frame above holds each user's rating of each anime. A few cells are NA, which means the user has not yet rated that anime.

 

User 53698 has not rated anime 20 or anime 5114. The three methods that follow are used to decide which of these two unseen anime to recommend.
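As a minimal sketch of how the unrated anime per user could be listed, the snippet below uses a made-up toy table in the same wide layout as df_anime_rating. The names toy and unrated and all values are illustrative assumptions, not part of the original analysis.

```r
# Toy ratings table in the same wide layout as df_anime_rating (values are made up).
toy = data.frame(
  user_id    = c(1, 2, 3),
  anime_20   = c(6, NA, NA),
  anime_5114 = c(9, 7, NA)
)

# For each user, collect the rating columns that are still NA (not yet rated).
unrated = lapply(seq_len(nrow(toy)), function(r) {
  names(toy)[-1][is.na(unlist(toy[r, -1]))]
})
names(unrated) = toy$user_id

print(unrated)
```

Users with an empty entry have rated everything in the table; the others are the candidates for a recommendation.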

 

2. Collaborative filtering

 

In:

mean_user_rating = df_anime_rating %>% 
  select(-user_id) %>% 
  apply(1, function(x) mean(x, na.rm = TRUE))

df_anime_rating_adj = df_anime_rating[, -1] - mean_user_rating

row.names(df_anime_rating_adj) = df_anime_rating$user_id

 

▷ mean_user_rating is each user's mean rating. Subtracting it from that user's ratings gives df_anime_rating_adj, the data frame preprocessed in advance so that the user-user and item-item matrices can be built with adjusted cosine similarity.
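The mean-centering step can be checked on a tiny example. This is a sketch with made-up values; toy, user_mean, and centered are hypothetical names. It relies on the fact that subtracting a vector of length nrow from a data frame recycles the vector down each column, so row r is shifted by the r-th mean.

```r
# Two users, three items; user 2 has not rated item b (values are made up).
toy = data.frame(a = c(8, 6), b = c(10, NA), c = c(6, 8))

# Per-user mean over the observed ratings only.
user_mean = apply(toy, 1, function(x) mean(x, na.rm = TRUE))

# Row-wise centering: NA cells stay NA.
centered = toy - user_mean

print(user_mean)
print(centered)
```

After centering, a positive value means the user liked the item more than their own average, which removes differences in how generously each user rates.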

 

▶ For reference, in the experiments of paper [1], among Pearson correlation, cosine similarity, and adjusted cosine similarity, adjusted cosine similarity gave the lowest MAE (Mean Absolute Error).
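For orientation, the adjusted cosine similarity between two items i and j, as defined in [1], can be written as follows. Here U is the set of users who rated both items and \bar{R}_u is user u's mean rating.

```latex
\mathrm{sim}(i, j) =
  \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}
       {\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2}\,
        \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}
```

In this post the same user-mean-centered data is also used when comparing two users, in which case the sums run over the items both users rated.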

 

In:

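# Cosine similarity of x and y, using only positions where both are observed
get.cos = function(x, y) {
  is_completed = !is.na(x*y)
  
  x_hat = x[is_completed]
  y_hat = y[is_completed]
  
  return (sum(x_hat*y_hat)/(sqrt(sum(x_hat*x_hat))*sqrt(sum(y_hat*y_hat))))
}

# Weighted average of ratings r with weights w, normalized by sum(|w|);
# positions where either value is NA are skipped
get.pred = function(w, r) {
  is_completed = !is.na(w*r)
  
  w_hat = w[is_completed]
  r_hat = r[is_completed]
  
  return (sum(w_hat*r_hat)/sum(abs(w_hat)))
}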

 

▷ get.cos computes the similarity of two given vectors, using only the positions where neither value is NA.

 

▷ get.pred takes a row of the user-user or item-item matrix as weights, together with the matching ratings from the user-item matrix, and returns their weighted average as the prediction.
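A small worked example may make the NA handling concrete. The functions cos_sim and weighted_pred below are hypothetical stand-ins that mirror get.cos and get.pred for plain numeric vectors; the input values are made up.

```r
# Cosine similarity over the positions where both vectors are observed.
cos_sim = function(x, y) {
  ok = !is.na(x * y)
  sum(x[ok] * y[ok]) / (sqrt(sum(x[ok]^2)) * sqrt(sum(y[ok]^2)))
}

# Weighted average of r with weights w, normalized by the sum of |w|.
weighted_pred = function(w, r) {
  ok = !is.na(w * r)
  sum(w[ok] * r[ok]) / sum(abs(w[ok]))
}

x = c(1, -1, NA, 0)
y = c(2, -2, 1, NA)
# Only positions 1 and 2 are shared: (2 + 2) / (sqrt(2) * sqrt(8)) = 1
sim_xy = cos_sim(x, y)

w = c(1.0, 0.5, -0.5)
r = c(2, NA, -2)
# Position 2 is skipped: (1*2 + (-0.5)*(-2)) / (1 + 0.5) = 2
p = weighted_pred(w, r)

print(sim_xy)
print(p)
```

Note that a negative weight paired with a negative centered rating pushes the prediction up, which is why the denominator uses absolute values.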

 

2-1. User-based collaborative filtering

 

In:

num_user = nrow(df_anime_rating_adj)
num_item = ncol(df_anime_rating_adj)

mtx_user_cos = matrix(rep(NA, num_user^2), ncol = num_user)

for (i in seq(num_user)) {
  for (j in seq(num_user)) {
    mtx_user_cos[i, j] = get.cos(df_anime_rating_adj[i, ], df_anime_rating_adj[j, ])
  }
}

print(mtx_user_cos)

 

Out:

             [,1]        [,2]       [,3]        [,4]        [,5]        [,6]        [,7]        [,8]       [,9]      [,10]
 [1,]  1.00000000  0.09169554 0.68976745  0.43242557  0.42224758  0.67850468 -0.12893837  0.53502020 0.63713467  0.4888088
 [2,]  0.09169554  1.00000000 0.54232614 -0.45243741  0.06657796  0.33785211  0.24345389  0.44721360 0.07770873  0.3328898
 [3,]  0.68976745  0.54232614 1.00000000  0.09028939  0.16666667  0.71874058  0.00000000  0.79056942 0.89553347  0.7399500
 [4,]  0.43242557 -0.45243741 0.09028939  1.00000000 -0.09363344 -0.06787895  0.07660643  0.09517337 0.27551332  0.5160468
 [5,]  0.42224758  0.06657796 0.16666667 -0.09363344  1.00000000  0.41982254  0.48774393  0.17568209 0.12161566 -0.1814437
 [6,]  0.67850468  0.33785211 0.71874058 -0.06787895  0.41982254  1.00000000  0.28088450  0.40406102 0.66355525  0.3273810
 [7,] -0.12893837  0.24345389 0.00000000  0.07660643  0.48774393  0.28088450  1.00000000 -0.37267800 0.10263385  0.2531848
 [8,]  0.53502020  0.44721360 0.79056942  0.09517337  0.17568209  0.40406102 -0.37267800  1.00000000 0.62931678  0.4034358
 [9,]  0.63713467  0.07770873 0.89553347  0.27551332  0.12161566  0.66355525  0.10263385  0.62931678 1.00000000  0.7515111
[10,]  0.48880881  0.33288978 0.73995003  0.51604685 -0.18144368  0.32738095  0.25318484  0.40343577 0.75151113  1.0000000

 

▷ The result above is the matrix of similarities between users.

 

In:

mtx_pred_user = matrix(rep(NA, num_user*num_item), ncol = num_item)

for (u in seq(num_user)) {
  for (i in seq(num_item)) {
    mtx_pred_user[u, i] = get.pred(mtx_user_cos[u, ], df_anime_rating_adj[, i]) + mean_user_rating[u]
  }
}

print(mtx_pred_user %>% 
        round(2))

 

Out:

      [,1]  [,2] [,3]  [,4]  [,5] [,6]  [,7] [,8] [,9] [,10]
 [1,] 5.76  8.32 7.38  7.89  7.87 6.92  7.85 5.84 6.23  6.74
 [2,] 5.33  6.63 6.98  7.12  7.28 6.89  6.98 5.69 6.64  6.29
 [3,] 7.87 10.43 9.86 10.09 10.07 9.52 10.10 7.96 8.80  8.99
 [4,] 6.37  8.56 7.42  7.34  6.82 6.68  7.88 6.34 6.61  6.90
 [5,] 8.20  8.61 9.00  9.56 10.02 7.60  8.92 8.31 8.15  8.69
 [6,] 7.03  9.21 8.94  9.20  9.39 8.22  8.96 7.16 7.82  8.13
 [7,] 8.13  7.33 8.63  7.95  7.90 6.78  7.80 7.36 8.17  7.78
 [8,] 7.22 10.04 9.00  9.69  9.78 9.34  9.70 7.83 8.34  8.62
 [9,] 7.33  9.91 9.27  9.32  9.24 8.68  9.40 7.22 8.06  8.35
[10,] 7.70  9.86 9.36  9.21  8.97 8.82  9.51 7.51 8.47  8.44

 

▷ The matrix above contains the ratings predicted from the user-user matrix and the user-item matrix.
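Read off the loop above, the user-based prediction can be summarized as the formula below. The notation is an assumption introduced here for illustration: w_{u,v} is the entry of mtx_user_cos for users u and v, \bar{r}_u is mean_user_rating for user u, and V_i is the set of users with an observed rating for item i.

```latex
\hat{r}_{u,i} = \bar{r}_u +
  \frac{\sum_{v \in V_i} w_{u,v}\,(r_{v,i} - \bar{r}_v)}
       {\sum_{v \in V_i} \lvert w_{u,v} \rvert}
```

As implemented, V_i includes user u itself (with weight 1) whenever u has already rated item i, so predictions for observed cells are pulled toward the known rating.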

 

2-2. Item-based collaborative filtering

 

In:

mtx_item_cos = matrix(rep(NA, num_item^2), ncol = num_item)

for (i in seq(num_item)) {
  for (j in seq(num_item)) {
    mtx_item_cos[i, j] = get.cos(df_anime_rating_adj[, i], df_anime_rating_adj[, j])
  }
}

print(mtx_item_cos)

 

Out:

            [,1]        [,2]         [,3]       [,4]         [,5]         [,6]       [,7]        [,8]        [,9]        [,10]
 [1,]  1.0000000 -0.59655876 -0.127425475 -0.7145774 -0.559161425 -0.190553983 -0.7032623  0.76802948  0.41580844  0.284760714
 [2,] -0.5965588  1.00000000  0.108903774  0.5440575  0.279612025 -0.092961648  0.7985666 -0.71991117 -0.72351806 -0.615899135
 [3,] -0.1274255  0.10890377  1.000000000  0.1205407 -0.004867953 -0.338901961  0.1513502 -0.55001434  0.05718941 -0.133941989
 [4,] -0.7145774  0.54405749  0.120540678  1.0000000  0.856596191 -0.375601503  0.7357022 -0.68216080 -0.72575202 -0.529802914
 [5,] -0.5591614  0.27961202 -0.004867953  0.8565962  1.000000000 -0.263372752  0.3173627 -0.51372087 -0.70414826 -0.363148480
 [6,] -0.1905540 -0.09296165 -0.338901961 -0.3756015 -0.263372752  1.000000000 -0.2241341  0.07062369  0.21530554  0.004757735
 [7,] -0.7032623  0.79856661  0.151350184  0.7357022  0.317362711 -0.224134105  1.0000000 -0.72675628 -0.57761661 -0.673156513
 [8,]  0.7680295 -0.71991117 -0.550014345 -0.6821608 -0.513720867  0.070623690 -0.7267563  1.00000000  0.49322086  0.688973125
 [9,]  0.4158084 -0.72351806  0.057189415 -0.7257520 -0.704148258  0.215305541 -0.5776166  0.49322086  1.00000000  0.544393334
[10,]  0.2847607 -0.61589914 -0.133941989 -0.5298029 -0.363148480  0.004757735 -0.6731565  0.68897313  0.54439333  1.000000000

 

▷ The result above is the matrix of similarities between items.

 

 

In:

mtx_pred_item = matrix(rep(NA, num_user*num_item), ncol = num_item)

for (u in seq(num_user)) {
  for (i in seq(num_item)) {
    mtx_pred_item[u, i] = get.pred(mtx_item_cos[i, ], df_anime_rating_adj[u, ]) + mean_user_rating[u]
  }
}

print(mtx_pred_item %>% 
        round(2))

 

Out:

      [,1]  [,2]  [,3]  [,4]  [,5] [,6]  [,7] [,8] [,9] [,10]
 [1,] 5.31  9.14  8.01  9.00  9.06 5.81  9.01 5.33 5.02  5.18
 [2,] 6.40  6.73  6.87  6.80  6.82 6.69  6.79 6.40 6.53  6.39
 [3,] 8.56 10.12 10.02 10.07 10.07 9.37 10.11 8.62 8.78  8.70
 [4,] 6.89  7.58  7.36  7.31  7.13 6.69  7.52 6.83 6.82  6.78
 [5,] 8.36  8.94  9.02  9.27  9.40 7.75  9.04 8.39 8.23  8.48
 [6,] 7.83  8.99  9.09  9.10  9.23 7.90  8.96 7.79 7.83  7.86
 [7,] 7.82  7.61  8.68  7.81  7.74 6.87  7.76 7.63 7.96  7.85
 [8,] 7.81  9.90  8.99  9.87  9.90 9.32  9.90 8.14 8.21  8.26
 [9,] 7.80  9.51  9.66  9.38  9.35 8.53  9.45 7.84 8.06  8.04
[10,] 8.10  9.58  9.83  9.31  9.12 8.66  9.56 8.00 8.34  8.04

 

▷ The matrix above contains the ratings predicted from the item-item matrix and the user-item matrix.

 

2-3. Matrix factorization collaborative filtering

 

In:

k = 5
lambda = 0.1
learning_rate = 0.01
epoch = 1e3

R = as.matrix(df_anime_rating[, -1])
P = matrix(runif(num_user*k), nrow = num_user)
Q = matrix(runif(num_item*k), nrow = num_item)

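for (e in seq(epoch)) {
  for (u in seq(num_user)) {
    for (i in seq(num_item)) {
      # Update only on observed ratings
      if (!is.na(R[u, i])) {
        # Prediction error for this (user, item) pair
        e_ui = R[u, i] - sum(Q[i, ] * P[u, ])
        
        # SGD step with L2 regularization; Q uses the already-updated P[u, ]
        P[u, ] = P[u, ]+learning_rate*(e_ui*Q[i, ]-lambda*P[u, ])
        Q[i, ] = Q[i, ]+learning_rate*(e_ui*P[u, ]-lambda*Q[i, ])
      }
    }
  }
  
  # Report the MAE over observed ratings every 100 epochs
  if (e%%1e2 == 0) {
    cat('Epoch:', e, 'MAE:', mean(abs(R-P%*%t(Q)), na.rm = TRUE), '\n')
  }
}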

 

Out:

Epoch: 100 MAE: 0.330485 
Epoch: 200 MAE: 0.2766949 
Epoch: 300 MAE: 0.2508011 
Epoch: 400 MAE: 0.2476656 
Epoch: 500 MAE: 0.2472082 
Epoch: 600 MAE: 0.247104 
Epoch: 700 MAE: 0.2470732 
Epoch: 800 MAE: 0.2470628 
Epoch: 900 MAE: 0.247059 
Epoch: 1000 MAE: 0.2470575 

 

▷ The training loop above optimizes the formula below. For the notation used in the formula, see paper [2].
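The equation image from the original post did not survive. The regularized squared-error objective given in [2], which the loop above appears to follow, is reproduced here together with the corresponding stochastic gradient descent updates. In terms of the code, p_u is P[u, ], q_i is Q[i, ], \gamma is learning_rate, \lambda is lambda, and \kappa is the set of (u, i) pairs with an observed rating.

```latex
\min_{q^{*},\, p^{*}} \sum_{(u,i) \in \kappa}
  \left( r_{ui} - q_i^{T} p_u \right)^2
  + \lambda \left( \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)

e_{ui} = r_{ui} - q_i^{T} p_u

q_i \leftarrow q_i + \gamma \left( e_{ui}\, p_u - \lambda\, q_i \right)

p_u \leftarrow p_u + \gamma \left( e_{ui}\, q_i - \lambda\, p_u \right)
```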

 

 

▷ This training process estimates the latent factors of users and items, which gives a user-latent-factor matrix and an item-latent-factor matrix. Ratings can therefore be predicted without computing a user-user or item-item matrix.
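To see the latent-factor idea in isolation, here is a minimal sketch on a made-up 3x3 rating matrix. The names R_toy, lr, and mae are assumptions for illustration, and unlike the loop above this version keeps the pre-update user factors so that both updates use the same error and factors.

```r
set.seed(1)  # fixed seed so the toy run is repeatable

# Toy user-item ratings; NA marks an unobserved rating (values are made up).
R_toy = matrix(c(5,  3, NA,
                 4, NA,  1,
                 1,  1,  5), nrow = 3, byrow = TRUE)

k = 2          # number of latent factors
lambda = 0.1   # regularization strength
lr = 0.01      # learning rate

P = matrix(runif(nrow(R_toy) * k), nrow = nrow(R_toy))  # user factors
Q = matrix(runif(ncol(R_toy) * k), nrow = ncol(R_toy))  # item factors

# Mean absolute error over the observed cells only.
mae = function(R, P, Q) mean(abs(R - P %*% t(Q)), na.rm = TRUE)
mae_before = mae(R_toy, P, Q)

for (e in seq_len(500)) {
  for (u in seq_len(nrow(R_toy))) {
    for (i in seq_len(ncol(R_toy))) {
      if (!is.na(R_toy[u, i])) {
        e_ui = R_toy[u, i] - sum(P[u, ] * Q[i, ])
        p_old = P[u, ]  # keep the old user factors for the item update
        P[u, ] = P[u, ] + lr * (e_ui * Q[i, ] - lambda * P[u, ])
        Q[i, ] = Q[i, ] + lr * (e_ui * p_old - lambda * Q[i, ])
      }
    }
  }
}

mae_after = mae(R_toy, P, Q)
cat('MAE before:', mae_before, 'after:', mae_after, '\n')

# The NA cells of R_toy now have predicted values in P %*% t(Q).
print(round(P %*% t(Q), 2))
```

Only the two small matrices P and Q are learned; every cell of the rating matrix, observed or not, is then filled by their product.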

 

In:

print(P%*%t(Q) %>% 
        round(2))

 

Out:

      [,1]  [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,] 5.94  9.64 6.97 8.51 8.63 6.16 8.39 5.22 4.29  6.21
 [2,] 4.85  6.33 6.48 6.82 6.90 6.74 6.62 5.83 6.74  6.41
 [3,] 7.60 10.08 9.90 9.93 9.86 9.41 9.91 7.88 9.11  9.05
 [4,] 6.73  8.30 7.13 7.21 6.22 6.14 8.07 6.51 6.92  6.82
 [5,] 7.70  8.34 8.95 9.72 9.81 7.07 8.95 8.37 8.31  8.77
 [6,] 6.67  8.56 8.85 9.05 9.35 7.99 8.53 6.93 7.72  8.01
 [7,] 7.41  7.18 8.83 7.81 7.40 6.50 7.76 7.23 8.44  7.86
 [8,] 6.39  9.88 8.45 9.83 9.89 9.83 9.79 7.95 8.70  8.73
 [9,] 7.21  9.67 9.51 9.12 9.01 8.77 9.25 6.95 8.30  8.27
[10,] 7.72  9.99 9.68 8.81 8.20 8.70 9.50 7.11 8.72  8.34

 

▷ The matrix above contains the ratings predicted as the product of the user-latent-factor matrix P and the transposed item-latent-factor matrix Q.

 

 

3. Performance comparison

 

In:

# MAE over the observed ratings; '\n' keeps each result on its own line
cat('MAE of User-Based CF:', mean(abs(mtx_pred_user-as.matrix(df_anime_rating)[, -1]), na.rm = TRUE), '\n')
cat('MAE of Item-Based CF:', mean(abs(mtx_pred_item-as.matrix(df_anime_rating)[, -1]), na.rm = TRUE), '\n')
cat('MAE of Matrix Factorization CF: ', mean(abs(R-P%*%t(Q)), na.rm = TRUE), '\n')

 

Out:

MAE of User-Based CF: 0.4385365
MAE of Item-Based CF: 0.452148
MAE of Matrix Factorization CF:  0.2470575

 

▷ MAE, computed over the observed ratings, was used as the metric for the gap between each method's predictions and the actual ratings. Matrix factorization came closest to the actual ratings, with an MAE of about 0.247 against 0.439 for user-based and 0.452 for item-based filtering.

 

4. Recommendation results

 

Based on the predictions, let us decide which of anime 20 and anime 5114 to recommend to user 53698.

 

In the prediction matrices from 2-1, 2-2, and 2-3, compare the value at row 7 (user 53698), column 1 (anime 20) with the value at row 7, column 5 (anime 5114).

 

(1) User-based collaborative filtering: anime 20 (8.13) > anime 5114 (7.90)

(2) Item-based collaborative filtering: anime 20 (7.82) > anime 5114 (7.74)

(3) Matrix factorization collaborative filtering: anime 20 (7.41) > anime 5114 (7.40)

 

All three methods favor anime 20, although the matrix factorization margin is only 0.01.
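The comparison above can also be done programmatically. The sketch below copies the six predicted values for user 53698 from the prediction matrices in this post; the names pred, candidates, and recommended are hypothetical.

```r
# Predicted ratings for user 53698, copied from the prediction matrices above.
pred = data.frame(
  method     = c('user_based', 'item_based', 'matrix_factorization'),
  anime_20   = c(8.13, 7.82, 7.41),
  anime_5114 = c(7.90, 7.74, 7.40)
)

candidates = c('anime_20', 'anime_5114')

# For each method, pick the candidate with the highest predicted rating.
best = max.col(as.matrix(pred[, candidates]), ties.method = 'first')
pred$recommended = candidates[best]

print(pred)
```

With more than two unseen anime, the same pattern applies: list the candidate columns and take the row-wise maximum.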

 

Let us see what anime 20 and anime 5114 are.

 

In:

print(data_anime$name[data_anime$anime_id == 20])
print(data_anime$name[data_anime$anime_id == 5114])

 

Out:

[1] Naruto
[1] Fullmetal Alchemist: Brotherhood

 

▷ Anime 20 is "Naruto" and anime 5114 is "Fullmetal Alchemist: Brotherhood".

 

Personally I would rather recommend Fullmetal Alchemist, but so be it: the data and the algorithms recommend Naruto... Still, user 53698, if you happen to be reading this, I hope you watch Fullmetal Alchemist too.

 

I will close with a final image of Naruto flushed with victory and Edward Elric screaming.

 

 


 

Reference:

[1] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Item-Based Collaborative Filtering Recommendation Algorithms", 2001

[2] Y. Koren, R. Bell, and C. Volinsky, "Matrix Factorization Techniques for Recommender Systems", 2009