dplyr: A Grammar of Data Manipulation

データ操作のためのパッケージ

> library(dplyr)

バージョン: 0.6.0


関数名 概略
all_equal Flexible equality comparison for data frames
all_vars Apply predicate to all variables
arrange Arrange rows by variables
arrange_all Arrange rows by a selection of variables
as.table.tbl_cube Coerce a 'tbl_cube' to other data structures
as.tbl_cube Coerce an existing data structure into a 'tbl_cube'
auto_copy Copy tables to same source, if necessary
band_members Band membership
between Do values in a numeric vector fall in specified range?
bind Efficiently bind multiple data frames by row and column
case_when A general vectorised if
coalesce Find first non-missing element
compute Force computation of a database query
copy_to Copy a local data frame to a remote src
cumall Cumulativate versions of any, all, and mean
desc Descending order
distinct Select distinct/unique rows
do Do anything
dplyr-package dplyr: a grammar of data manipulation
explain Explain details of a tbl
filter Return rows with matching conditions
filter_all Filter within a selection of variables
funs Create a list of functions calls.
group_by Group by one or more variables
group_by_all Group by a selection of variables
groups Return grouping variables
ident Flag a character vector as SQL identifiers
if_else Vectorised if
join Join two tbls together
join.tbl_df Join data frame tbls
lead-lag Lead and lag.
mutate Add new variables
n The number of observations in the current group.
n_distinct Efficiently count the number of unique values in a set of vector
na_if Convert values to NA
nasa NASA spatio-temporal data
near Compare two numeric vectors
nth Extract the first, last or nth value from a vector
order_by A helper function for ordering window function output
pull Pull out a single variable
ranking Windowed rank functions.
recode Recode values
rowwise Group input by rows
sample Sample n rows from a table
scoped Operate on a selection of variables
select Select/rename variables by name
select_all Select and rename a selection of variables
select_helpers Select helpers
setops Set operations
slice Select rows by position
sql SQL escaping.
src_dbi Source for database backends
starwars Starwars characters
storms Storm tracks data
summarise Reduces multiple values down to a single value
summarise_all Summarise and mutate multiple columns.
tally Count/tally observations by group
tbl Create a table from a data source
tbl_cube A data cube tbl
top_n Select top (or bottom) n rows (by value)
vars Select variables

all_equal

データフレームの差分を確認する

Arguments

  • target, current... 比較対象のデータフレーム
  • ignore_col_order
  • ignore_row_order
  • convert
  • ...
> # データフレーム内のデータの並びを変更する関数を定義
> scramble <- function(x) x[sample(nrow(x)), sample(ncol(x))]
> 
> # 既定の挙動では行や列の並びが異なっていてもデータが等しければTRUEを返す
> all_equal(mtcars, scramble(mtcars))
[1] TRUE
> # 行の並びを考慮して差分を確認する
> all_equal(mtcars, scramble(mtcars), ignore_col_order = FALSE)
[1] "Same column names, but different order"

all_vars / any_vars

arrange

変数の値による並び替え

> arrange(mtcars, cyl, disp) %>% head()
   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
2 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
3 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
4 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
5 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
6 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
> arrange(mtcars, desc(disp)) %>% head()
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
2 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
3 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
4 19.2   8  400 175 3.08 3.845 17.05  0  0    3    2
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4

as_data_frame

リストをデータフレームに変換

> list(x = 1:500, y = runif(500), z = 500:1) %>% as_data_frame()
# A tibble: 500 x 3
       x               y     z
   <int>           <dbl> <int>
 1     1 0.7483994697686   500
 2     2 0.3320873326156   499
 3     3 0.9686443330720   498
 4     4 0.7768150032498   497
 5     5 0.9165990713518   496
 6     6 0.5212027428206   495
 7     7 0.0969684200827   494
 8     8 0.0219207596965   493
 9     9 0.1633499194868   492
10    10 0.5003534886055   491
# ... with 490 more rows

between

指定した範囲内に数値があるか評価する

> x <- seq(1:10)
> dplyr::between(x, 2, 4)
 [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> x[dplyr::between(x, 2, 4)]
[1] 2 3 4
> iris %>% dplyr::filter(., between(Sepal.Width, 3.4, 3.8)) %>% head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          5.0         3.6          1.4         0.2  setosa
3          4.6         3.4          1.4         0.3  setosa
4          5.0         3.4          1.5         0.2  setosa
5          5.4         3.7          1.5         0.2  setosa
6          4.8         3.4          1.6         0.2  setosa

bind

行と列の結合(共通の変数を持っていない場合でも結合する)

> X <- data_frame(x = 1:5)
> Y <- data_frame(y = 10:6)
> bind_rows(X, Y)
# A tibble: 10 x 2
       x     y
   <int> <int>
 1     1    NA
 2     2    NA
 3     3    NA
 4     4    NA
 5     5    NA
 6    NA    10
 7    NA     9
 8    NA     8
 9    NA     7
10    NA     6
> bind_cols(X, Y)
# A tibble: 5 x 2
      x     y
  <int> <int>
1     1    10
2     2     9
3     3     8
4     4     7
5     5     6
> combine(X, Y)
Error in combine_all(args): Columns of class data.frame not supported

copy_to

ローカル環境のデータフレームをリモートのデータソースにコピーする

Arguments

  • dest
  • df
  • name
  • ...
> copy_to(dest, df, name = deparse(substitute(df)), ...)

data_frame

データフレームの構築(base::data.frameの強化版)

> a <- 1:5
> data_frame(a, b = a * 2)
# A tibble: 5 x 2
      a     b
  <int> <dbl>
1     1     2
2     2     4
3     3     6
4     4     8
5     5    10
> data_frame(a, b = a * 2, c = 1)
# A tibble: 5 x 3
      a     b     c
  <int> <dbl> <dbl>
1     1     2     1
2     2     4     1
3     3     6     1
4     4     8     1
5     5    10     1
> data_frame(x = runif(10), y = x * 2)
# A tibble: 10 x 2
                x              y
            <dbl>          <dbl>
 1 0.980605010875 1.961210021749
 2 0.236343062948 0.472686125897
 3 0.977456531487 1.954913062975
 4 0.350839339662 0.701678679325
 5 0.779177646386 1.558355292771
 6 0.331191585399 0.662383170798
 7 0.113374165725 0.226748331450
 8 0.994318356970 1.988636713941
 9 0.760566237383 1.521132474765
10 0.613929431653 1.227858863305

desc

降順並び替え

> desc(1:10)
 [1]  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
> factor(letters) %>% desc()
 [1]  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10 -11 -12 -13 -14 -15 -16 -17
[18] -18 -19 -20 -21 -22 -23 -24 -25 -26

do

> by_cyl <- group_by(mtcars, cyl)
> do(by_cyl, head(., 2))
# A tibble: 6 x 11
# Groups:   cyl [3]
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
2  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
3  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
4  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
> models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
> models %>% {
+   print(.)
+   class(.)
+ }
Source: local data frame [3 x 2]
Groups: <by row>

# A tibble: 3 x 2
    cyl      mod
* <dbl>   <list>
1     4 <S3: lm>
2     6 <S3: lm>
3     8 <S3: lm>
[1] "rowwise_df" "tbl_df"     "tbl"        "data.frame"
> summarise(models, rsq = summary(mod)$r.squared)
# A tibble: 3 x 1
              rsq
            <dbl>
1 0.6484051395593
2 0.0106260400812
3 0.2701577716900
> models %>% do(data.frame(coef = coef(.$mod)))
Source: local data frame [6 x 1]
Groups: <by row>

# A tibble: 6 x 1
               coef
*             <dbl>
1 40.87195532165298
2 -0.13514181455960
3 19.08198741874837
4  0.00360511850745
5 22.03279891365333
6 -0.01963409491264
> models %>% do(data.frame(
+   var = names(coef(.$mod)),
+   coef(summary(.$mod)))
+ )
Source: local data frame [6 x 5]
Groups: <by row>

# A tibble: 6 x 5
          var          Estimate       Std..Error         t.value
*      <fctr>             <dbl>            <dbl>           <dbl>
1 (Intercept) 40.87195532165298 3.58960539961242 11.386197303488
2        disp -0.13514181455960 0.03317160810475 -4.074020594143
3 (Intercept) 19.08198741874837 2.91399289197608  6.548398752547
4        disp  0.00360511850745 0.01555711478988  0.231734390094
5 (Intercept) 22.03279891365333 3.34524111451680  6.586311168436
6        disp -0.01963409491264 0.00931592587348 -2.107583849346
# ... with 1 more variables: Pr...t.. <dbl>
> models <- by_cyl %>% do(mod_linear = lm(mpg ~ disp, data = .),
+                         mod_quad = lm(mpg ~ poly(disp, 2), data = .))
> compare <- models %>% do(aov = anova(.$mod_linear, .$mod_quad))
> 
> Orange %>% group_by(Tree) %>% do(broom::tidy(lm(age ~ circumference, data = .)))
# A tibble: 10 x 6
# Groups:   Tree [5]
    Tree          term         estimate       std.error       statistic
   <ord>         <chr>            <dbl>           <dbl>           <dbl>
 1     3   (Intercept) -209.51232149301 85.268290402704 -2.457095369258
 2     3 circumference   12.03888487911  0.835344475434 14.411880647031
 3     1   (Intercept) -264.67343750000 98.620556898818 -2.683755251672
 4     1 circumference   11.91924542683  0.918802910620 12.972581267502
 5     5   (Intercept)  -54.48409709432 76.886278786109 -0.708632254734
 6     5 circumference    8.78713197900  0.621136519012 14.146860971848
 7     2   (Intercept) -132.43972525629 83.131414589342 -1.593136913530
 8     2 circumference    7.79522500189  0.559547938180 13.931290725948
 9     4   (Intercept)  -76.51367061555 88.294375729161 -0.866574682517
10     4 circumference    7.16984173775  0.571951632031 12.535748367901
# ... with 1 more variables: p.value <dbl>

group_by

水準をグループごとにまとめる

> group_by(mtcars, cyl, vs)
# A tibble: 32 x 11
# Groups:   cyl, vs [5]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
 * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows

group_indices

水準のグループに対して新たなIDを付与する

> group_indices(mtcars, cyl)
 [1] 2 2 1 2 3 2 3 1 1 2 2 3 3 3 3 3 3 1 1 1 1 3 3 3 3 1 1 1 3 2 3 1

group_size

グループのサイズを得る

> group_size(mtcars)
[1] 32
> mtcars$cyl %>% factor()
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
Levels: 4 6 8
> group_by(mtcars, cyl) %>% group_size()
[1] 11  7 14

join / inner* / left / right_ / full* / semi / anti_

結合の方法

  • inner_join... 内部結合
  • left_join... 左外部結合
  • right_join... 右外部結合
  • full_join... 完全外部結合
  • semi_join... 準結合
  • anti_join... 逆結合

lead-lag / lead / lag

与えた値より前や後ろの値を出力する

Arguments

  • x... ベクター
  • n
  • default
  • order_by
> lead(1:10, n = 1)
 [1]  2  3  4  5  6  7  8  9 10 NA
> lag(1:10, n = 4)
 [1] NA NA NA NA  1  2  3  4  5  6
> set.seed(71)
> x <- runif(5)
> cbind(ahead = lead(x), x, behind = lag(x))
              ahead              x         behind
[1,] 0.555103868479 0.332928059390             NA
[2,] 0.327369962120 0.555103868479 0.332928059390
[3,] 0.211666960036 0.327369962120 0.555103868479
[4,] 0.316121358424 0.211666960036 0.327369962120
[5,]             NA 0.316121358424 0.211666960036
> # http://stackoverflow.com/questions/33147714/return-value-of-next-occurrence-based-on-multiple-columns
> readr::read_delim("Firm Time Date
+ A 08:00 1/1
+ B 09:00 1/1
+ A 09:30 1/1
+ B 10:00 1/1", 
+ delim = " ") %>% 
+   group_by(Firm, Date) %>% 
+   mutate(NextTime = lead(as.character(Time), default = "16:30"))
# A tibble: 4 x 4
# Groups:   Firm, Date [2]
   Firm     Time  Date NextTime
  <chr>   <time> <chr>    <chr>
1     A 08:00:00   1/1 09:30:00
2     B 09:00:00   1/1 10:00:00
3     A 09:30:00   1/1    16:30
4     B 10:00:00   1/1    16:30

n

グループ内での観測値を数える

> data("flights", package = "nycflights13")
Error in find.package(package, lib.loc, verbose = verbose): there is no package called 'nycflights13'
> flights %>% 
+   group_by(carrier) %>% 
+   summarise(carriers = n()) %>% 
+   sample_n(5)
Error in eval(lhs, parent, parent): object 'flights' not found

near

> sqrt(2) ^ 2 == 2
[1] FALSE
> near(sqrt(2) ^ 2, 2)
[1] TRUE

nth / first / last

> x <- 1:10
> nth(x, 4)
[1] 4
> first(x)
[1] 1
> last(x)
[1] 10

sample / sample_n / sample_frac

テーブルのデータから幾つかをサンプリングする

Arguments

  • tbl
  • size... サンプル
  • replace
  • weight
> sample_n(mtcars, 5)
               mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Merc 280      19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 450SL    17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3

slice

任意の行を抽出する

> slice(mtcars, 1L)
# A tibble: 1 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    21     6   160   110   3.9  2.62 16.46     0     1     4     4
> dplyr::slice(iris, 2:4)
# A tibble: 3 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1          4.9         3.0          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
3          4.6         3.1          1.5         0.2  setosa

src_sqlite

sqliteデータベースとの接続

Arguments

  • path
  • create... 真偽値。FALSEの場合、パスは既存のものである必要がある。
  • src
  • from
  • ...
> src_sqlite(path, create = FALSE)

tally / count

Arguments

  • x
  • wt
  • sort
  • ..., vars
> data("Batting", package = "Lahman")
Error in find.package(package, lib.loc, verbose = verbose): there is no package called 'Lahman'
> batting_tbl <- tbl_df(Batting)
Error in as_data_frame(data): object 'Batting' not found
> tally(group_by(batting_tbl, yearID))
Error in group_by(batting_tbl, yearID): object 'batting_tbl' not found
> tally(group_by(batting_tbl, yearID), sort = TRUE)
Error in group_by(batting_tbl, yearID): object 'batting_tbl' not found
> # Multiple tallys progressively roll up the groups
> plays_by_year <- tally(group_by(batting_tbl, playerID, stint), sort = TRUE)
Error in group_by(batting_tbl, playerID, stint): object 'batting_tbl' not found
> tally(plays_by_year, sort = TRUE)
Error in "n" %in% names(x): object 'plays_by_year' not found
> tally(tally(plays_by_year))
Error in "n" %in% names(x): object 'plays_by_year' not found
> # This looks a little nicer if you use the infix %>% operator
> batting_tbl %>% group_by(playerID) %>% tally(sort = TRUE)
Error in eval(lhs, parent, parent): object 'batting_tbl' not found
> # count is even more succinct - it also does the grouping for you
> batting_tbl %>% count(playerID)
Error in eval(lhs, parent, parent): object 'batting_tbl' not found
> batting_tbl %>% count(playerID, wt = G)
Error in eval(lhs, parent, parent): object 'batting_tbl' not found
> batting_tbl %>% count(playerID, wt = G, sort = TRUE)
Error in eval(lhs, parent, parent): object 'batting_tbl' not found

tbl / is.tbl / as.tbl

データからテーブルを作成する

> as.tbl(mtcars)
# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
 * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows

tbl_dt

{data.table}のdata.tableオブジェクトの作成

> tbl_dt(mtcars) %>% {
+   print(class(.))
+   data.table::as.data.table(.) %>% class() %>% print()
+ }
Error in tbl_dt(mtcars): could not find function "tbl_dt"

tbl_df

データフレームテーブルの生成

> tbl_df(mtcars) %>% {
+   print(class(.))
+   as.data.frame(.) %>% class()
+ }
[1] "tbl_df"     "tbl"        "data.frame"
[1] "data.frame"

ranking

順位に関する関数群

> x <- c(5, 1, 3, 2, 2, NA)
> row_number(x) # 同位値を考慮しない

[1] 5 1 4 2 3 NA

> min_rank(x) # 同位値を考慮する

[1] 5 1 4 2 2 NA

> dense_rank(x) # 欠損値を無視

[1] 4 1 3 2 2 NA

> percent_rank(x) # 0から1の割合

[1] 1.00 0.00 0.75 0.25 0.25 NA

> cume_dist(x) # 順位割合の積み上げ

[1] 1.0 0.2 0.8 0.6 0.6 NA

> ntile(x, 2) # 指定した群数に分割

[1] 2 1 2 1 1 NA

> data_frame(size = c(100, 10, 1, 0.1, 0.01,
+                     100, 75, 50, 25, 0.1,
+                     100, 1, 1, 1, 1,
+                     20, 20, 20, 20, 20),
+            type = rep(c("log", "uniform", "dominant", "equality"), 
+                      each = 5)) -> DF.dummy
> DF.dummy %>% dplyr::group_by(type) %>% 
+   mutate(row.num = row_number(size),
+          rank.min = min_rank(size),
+          rank.dens = dense_rank(size),
+          rank.pct  = percent_rank(size),
+          cum.dst = cume_dist(size),
+          ntile = ntile(size, 50)) %>% kable()
size type row.num rank.min rank.dens rank.pct cum.dst ntile
100.00 log 3 5 5 1.00 1.0 41
10.00 log 4 4 4 0.75 0.8 31
1.00 log 1 3 3 0.50 0.6 21
0.10 log 5 2 2 0.25 0.4 11
0.01 log 2 1 1 0.00 0.2 1
100.00 uniform 2 5 5 1.00 1.0 41
75.00 uniform 1 4 4 0.75 0.8 31
50.00 uniform 3 3 3 0.50 0.6 21
25.00 uniform 5 2 2 0.25 0.4 11
0.10 uniform 4 1 1 0.00 0.2 1
100.00 dominant 4 5 2 1.00 1.0 41
1.00 dominant 5 1 1 0.00 0.8 1
1.00 dominant 2 1 1 0.00 0.8 11
1.00 dominant 1 1 1 0.00 0.8 21
1.00 dominant 3 1 1 0.00 0.8 31
20.00 equality 5 1 1 0.00 1.0 1
20.00 equality 1 1 1 0.00 1.0 11
20.00 equality 4 1 1 0.00 1.0 21
20.00 equality 3 1 1 0.00 1.0 31
20.00 equality 2 1 1 0.00 1.0 41

src_mysql

MySQLデータベースに接続する

> src_mysql(dbname, host = NULL, port = 0L, user = "root", password = "",
+   ...)

src_postgres

PostgreSQLに接続する

src_sqlite

SQLiteデータベースに接続する

sql

top_n

上位・下位からの行数を指定した抽出

> df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
> df %>% top_n(2)
Selecting by x
   x
1 10
2  6

メモ

NSEのvignettesより

{dplyr}の主要な関数(filter(), mutate(), summarise(), arrange(), select()そしてgroup_by()