How to Use dplyr

Let's learn how to use dplyr's most common functions (select, filter, mutate, summarise, group_by, arrange, sample_n, and sample_frac) and apply them.

 

In:

library(dplyr)

df_iris = iris

# summary() prints the per-column statistics shown in the output below
summary(df_iris)

 

Out:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  

 

▷ We will use the iris dataset built into R.

 

□ select

 

In:

df_iris %>% 
  select(Sepal.Length) %>% 
  head()

df_iris %>% 
  select(Sepal.Length, Sepal.Width) %>% 
  head()

df_iris %>% 
  select(1:3) %>% 
  head()

df_iris %>% 
  select(starts_with('Sepal')) %>% 
  head()

df_iris %>% 
  select(ends_with('Length')) %>% 
  head()

df_iris %>% 
  select(contains('.')) %>% 
  head()

 

Out:

  Sepal.Length
1          5.1
2          4.9
3          4.7
4          4.6
5          5.0
6          5.4

  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9

  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
4          4.6         3.1          1.5
5          5.0         3.6          1.4
6          5.4         3.9          1.7

  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9

  Sepal.Length Petal.Length
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
4          4.6          1.5
5          5.0          1.4
6          5.4          1.7

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

 

▷ The first result passes a column name to select, so only that column is returned. The second passes two column names and returns both columns. The third selects columns by position, which is useful when you do not know their names. The fourth uses starts_with to return the columns whose names begin with the given string. The fifth uses ends_with to return the columns whose names end with the given string. The last uses contains to return the columns whose names include the given string; here '.' matches the four measurement columns but not Species.
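▷ As a quick check of the helpers described above, the following sketch compares the column names that each style of selection returns. It reads the built-in iris data directly so that it runs on its own; the variable names (by_prefix, by_suffix, by_position, no_species) are illustrative only, and the minus-sign form is an extra that the examples above do not use.

```r
library(dplyr)

# Column names picked by each selection style
by_prefix   <- iris %>% select(starts_with("Sepal")) %>% names()
by_suffix   <- iris %>% select(ends_with("Length")) %>% names()
by_position <- iris %>% select(1:2) %>% names()

# Check whether the prefix helper and positions 1:2 pick the same columns
identical(by_prefix, by_position)

# A minus sign drops a column instead of keeping it
no_species <- iris %>% select(-Species) %>% names()
no_species
```

Comparing names() rather than whole data frames keeps the check small and makes it clear that the helpers only decide which columns are kept.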

 

□ filter

 

In:

df_iris %>% 
  filter(Species == 'virginica') %>% 
  head()

df_iris %>% 
  filter(Species == 'virginica' & Sepal.Width <= 3) %>% 
  head()

 

Out:

  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          6.3         3.3          6.0         2.5 virginica
2          5.8         2.7          5.1         1.9 virginica
3          7.1         3.0          5.9         2.1 virginica
4          6.3         2.9          5.6         1.8 virginica
5          6.5         3.0          5.8         2.2 virginica
6          7.6         3.0          6.6         2.1 virginica

  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          5.8         2.7          5.1         1.9 virginica
2          7.1         3.0          5.9         2.1 virginica
3          6.3         2.9          5.6         1.8 virginica
4          6.5         3.0          5.8         2.2 virginica
5          7.6         3.0          6.6         2.1 virginica
6          4.9         2.5          4.5         1.7 virginica

 

▷ The first result passes a row condition to filter and returns only the rows that satisfy it. The second combines several conditions with &, so only the rows that meet all of them are returned.
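▷ filter also accepts several conditions separated by commas, which are combined with AND just like &. The sketch below checks that both spellings keep the same rows and adds an OR example with |. The variable names and the 7.5 threshold are made up for illustration.

```r
library(dplyr)

# Conditions joined by & and conditions separated by commas
# are both treated as AND.
with_amp   <- iris %>% filter(Species == "virginica" & Sepal.Width <= 3)
with_comma <- iris %>% filter(Species == "virginica", Sepal.Width <= 3)
all.equal(with_amp, with_comma)

# | keeps rows that satisfy either condition
either <- iris %>% filter(Species == "setosa" | Sepal.Length > 7.5)
nrow(either)
```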

 

□ mutate

 

In:

df_iris %>% 
  mutate(is_long_Sepal.Length = if_else(Sepal.Length >= 5, 'High', 'Low')) %>% 
  head()

 

Out:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
  is_long_Sepal.Length
1                 High
2                  Low
3                  Low
4                  Low
5                 High
6                 High

 

▷ mutate creates a new column from the information in existing columns. Here mutate and if_else add the column is_long_Sepal.Length, which is 'High' when Sepal.Length is at least 5 and 'Low' otherwise.
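▷ To confirm that the new column behaves as intended, one option is to store the result and tally the labels with count. This is only a sketch; the name labelled is hypothetical, and the threshold of 5 is the one used above.

```r
library(dplyr)

# Add the label column and keep the result for inspection
labelled <- iris %>%
  mutate(is_long_Sepal.Length = if_else(Sepal.Length >= 5, "High", "Low"))

# Count how many rows received each label
labelled %>% count(is_long_Sepal.Length)
```

Because mutate returns a new data frame rather than modifying its input, iris itself is unchanged after this runs.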

 

□ summarise

 

In:

df_iris %>% 
  summarise(count = n(), 
            n_species = n_distinct(Species), 
            max_Sepal.Length = max(Sepal.Length))

 

Out:

  count n_species max_Sepal.Length
1   150         3              7.9

 

▷ summarise collapses a data frame into summary values. The result above reports the number of rows in df_iris, the number of distinct values in the Species column, and the maximum of the Sepal.Length column.
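▷ For comparison, the same three numbers can be computed with base R. The sketch below puts both versions side by side; overall and base_values are illustrative names, not part of the original example.

```r
library(dplyr)

# dplyr version: one row, one column per summary
overall <- iris %>%
  summarise(count = n(),
            n_species = n_distinct(Species),
            max_Sepal.Length = max(Sepal.Length))

# Base R version of the same three summaries
base_values <- c(nrow(iris),
                 length(unique(iris$Species)),
                 max(iris$Sepal.Length))

overall
base_values
```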

 

□ group_by

 

In:

df_iris %>% 
  group_by(Species) %>% 
  summarise(count = n(),
            max_Sepal.Length = max(Sepal.Length))

 

Out:

# A tibble: 3 x 3
  Species    count max_Sepal.Length
  <fct>      <int>            <dbl>
1 setosa        50              5.8
2 versicolor    50              7  
3 virginica     50              7.9

 

▷ Combining group_by with summarise is especially useful. The result above groups the rows by Species and reports, for each species, the row count and the largest Sepal.Length.
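▷ One way to see what the grouped summary does is to compare it with base R's tapply, which applies a function to each group of a vector. This is a rough sketch with illustrative variable names.

```r
library(dplyr)

# One output row per species
by_species <- iris %>%
  group_by(Species) %>%
  summarise(count = n(),
            max_Sepal.Length = max(Sepal.Length))

# Base R: the maximum Sepal.Length within each species
base_max <- tapply(iris$Sepal.Length, iris$Species, max)

by_species
base_max
```

The dplyr version returns a tibble, which is why the printed output above starts with a "# A tibble" header and shows column types.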

 

□ arrange

 

In:

df_iris %>% 
  arrange(Sepal.Length) %>% 
  head()

 

Out:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.3         3.0          1.1         0.1  setosa
2          4.4         2.9          1.4         0.2  setosa
3          4.4         3.0          1.3         0.2  setosa
4          4.4         3.2          1.3         0.2  setosa
5          4.5         2.3          1.3         0.3  setosa
6          4.6         3.1          1.5         0.2  setosa

 

▷ arrange sorts rows by the values of a given column. The result above sorts the rows in ascending order of Sepal.Length.
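▷ arrange sorts in ascending order by default. The sketch below shows two common variations: wrapping a column in desc() for descending order, and passing a second column to break ties. The variable names are illustrative.

```r
library(dplyr)

# desc() reverses the order: largest Sepal.Length first
longest <- iris %>%
  arrange(desc(Sepal.Length)) %>%
  head()

# A second column breaks ties in the first one
ordered <- iris %>%
  arrange(Sepal.Length, Sepal.Width) %>%
  head()

longest
ordered
```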

 

□ sample_n & sample_frac

 

In:

df_iris %>% 
  sample_n(10)

df_iris %>% 
  sample_n(10, replace = TRUE)

df_iris %>% 
  sample_frac(0.05)

df_iris %>% 
  sample_frac(0.05, replace = TRUE)

 

Out:

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           6.3         2.9          5.6         1.8  virginica
2           6.3         3.3          4.7         1.6 versicolor
3           5.4         3.7          1.5         0.2     setosa
4           5.7         3.8          1.7         0.3     setosa
5           5.1         3.5          1.4         0.2     setosa
6           6.7         3.0          5.0         1.7 versicolor
7           4.6         3.4          1.4         0.3     setosa
8           6.7         3.1          5.6         2.4  virginica
9           5.5         2.6          4.4         1.2 versicolor
10          5.8         2.8          5.1         2.4  virginica

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           4.9         3.0          1.4         0.2     setosa
2           6.4         2.7          5.3         1.9  virginica
3           6.0         2.9          4.5         1.5 versicolor
4           6.7         3.1          4.4         1.4 versicolor
5           4.8         3.0          1.4         0.1     setosa
6           4.8         3.4          1.6         0.2     setosa
7           6.1         2.8          4.0         1.3 versicolor
8           4.9         3.0          1.4         0.2     setosa
9           5.4         3.0          4.5         1.5 versicolor
10          5.7         2.6          3.5         1.0 versicolor

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          6.0         2.9          4.5         1.5 versicolor
2          7.0         3.2          4.7         1.4 versicolor
3          4.9         3.1          1.5         0.1     setosa
4          6.2         2.2          4.5         1.5 versicolor
5          4.5         2.3          1.3         0.3     setosa
6          6.4         3.2          4.5         1.5 versicolor
7          4.4         3.2          1.3         0.2     setosa
8          6.1         2.9          4.7         1.4 versicolor

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          6.9         3.1          4.9         1.5 versicolor
2          6.9         3.1          5.4         2.1  virginica
3          5.5         2.6          4.4         1.2 versicolor
4          5.6         2.9          3.6         1.3 versicolor
5          7.9         3.8          6.4         2.0  virginica
6          5.6         2.9          3.6         1.3 versicolor
7          5.1         3.8          1.6         0.2     setosa
8          4.9         3.6          1.4         0.1     setosa

 

▷ sample_n randomly draws the given number of rows. Setting replace = TRUE samples with replacement, so the same row can be drawn more than once. sample_frac randomly draws the given fraction of rows; as with sample_n, replace = TRUE samples with replacement.
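▷ Because the rows are drawn at random, the output changes on every run. A minimal sketch of making a draw repeatable with set.seed follows; the seed value 1 is arbitrary. It also shows slice_sample, which the dplyr documentation (version 1.0.0 and later) recommends in place of sample_n and sample_frac; if you are on an older dplyr, that part may not be available.

```r
library(dplyr)

# Fixing the seed before each draw makes the sample repeatable
set.seed(1)
first_draw <- iris %>% sample_n(10)
set.seed(1)
second_draw <- iris %>% sample_n(10)
all.equal(first_draw, second_draw)

# slice_sample() covers both cases:
# n for a number of rows, prop for a fraction of rows
by_count <- iris %>% slice_sample(n = 10)
by_prop  <- iris %>% slice_sample(prop = 0.05, replace = TRUE)
```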