Tidy evaluation in R

Introduction

This article is a personal memo summarizing what I learned about Tidy evaluation. The main topic is the curly-curly operator, available since rlang 0.4.0.

rlang 0.4.0 is out! Meet curly-curly, a new operator to make it easier to create #rstats functions around #tidyverse pipelines. Blog post at https://t.co/BHJZJqeWO7 pic.twitter.com/zkpBWjmJQi
— lionel (@_lionelhenry) June 28, 2019

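The curly-curly operator

The curly-curly operator quotes and unquotes (quote-and-unquote) in a single step. Until now you had to call enquo() and then unquote with !! (bang-bang). First, here is how you count missing values the ordinary way. Writing bare Month and Ozone inside the pipeline works thanks to data masking.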

library(rlang)
library(tidyverse)

airquality %>%
  dplyr::group_by(Month) %>% 
  dplyr::filter(is.na(Ozone)) %>%
  dplyr::summarise(Missing_Ozone = n())

# A tibble: 5 x 2
  Month Missing_Ozone
  <int>         <int>
1     5             5
2     6            21
3     7             5
4     8             5
5     9             1
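
To make "data masking" a little more concrete, here is a minimal sketch using rlang::eval_tidy(). It is only an illustration of the idea, not how dplyr is implemented internally: the data frame is supplied as a mask, so the bare name Ozone is looked up among its columns.

```r
library(rlang)

# Evaluate quoted code with airquality acting as the data mask:
# `Ozone` resolves to the column airquality$Ozone.
masked <- eval_tidy(quote(mean(Ozone, na.rm = TRUE)), data = airquality)

# The same computation without masking needs explicit `$` access.
unmasked <- mean(airquality$Ozone, na.rm = TRUE)

# Both approaches compute the same value.
all.equal(masked, unmasked)
```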

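For example, a function cnt_na() that counts missing values per group looks like this. enquo() quotes the code the caller passed in, delaying its evaluation while it is carried down the pipeline, and !! (bang-bang) unquotes it at the point where it should be evaluated, so that data masking applies there.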

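cnt_na <- function(data, by, col_nm) {
  # Quote the caller's arguments instead of evaluating them
  col_nm <- enquo(col_nm)
  by <- enquo(by)
  # Build the output column name as a string, e.g. "Missing_Ozone"
  missing_name <- paste0("Missing_", quo_name(col_nm))

  # Unquote with !! where the columns should be looked up in the data
  data %>%
    dplyr::group_by(!!by) %>% 
    dplyr::filter(is.na(!!col_nm)) %>%
    dplyr::summarise(!!(missing_name) := n())
}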

airquality %>% 
  cnt_na(., by = Month, col_nm = Ozone)

# A tibble: 5 x 2
  Month Missing_Ozone
  <int>         <int>
1     5             5
2     6            21
3     7             5
4     8             5
5     9             1
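
The two steps can also be looked at in isolation. The following is a small sketch; capture() is a hypothetical helper made up for this illustration and is not part of rlang. enquo() returns the caller's code unevaluated, and !! splices it into a larger expression that is evaluated later against the data.

```r
library(rlang)

# Hypothetical helper: quote the caller's argument instead of evaluating it.
capture <- function(x) {
  enquo(x)
}

q <- capture(is.na(Ozone))

# The captured, still unevaluated code
captured_expr <- quo_get_expr(q)

# Unquote the captured code into a larger expression ...
e <- expr(sum(!!q))

# ... and evaluate it only now, with airquality as the data mask.
n_missing <- eval_tidy(e, data = airquality)
n_missing
```

Nothing is computed at the point where capture() is called; the lookup of Ozone happens only inside eval_tidy().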

As the rlang 0.4.0 release post points out, this enquo() and !! (bang-bang) pattern is not ideal because it makes code more complex.

We have come to realise that this pattern is difficult to teach and to learn because it involves a new, unfamiliar syntax, and because it introduces two new programming concepts (quote and unquote) that are hard to understand intuitively. This complexity is not really justified because this pattern is overly flexible for basic programming needs.

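The curly-curly operator was introduced to address these problems. Because it quotes and unquotes in one step, the enquo() calls are no longer needed; you simply wrap the argument in {{ }} wherever you previously wrote !!. It resembles string interpolation in {glue}.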

cnt_na <- function(data, by, col_nm) {
  data %>%
    dplyr::group_by( {{by}} ) %>% 
    dplyr::filter(is.na( {{col_nm}} )) %>%
    dplyr::summarise( {{col_nm}} := n() )
}

airquality %>% 
  cnt_na(., by = Month, col_nm = Ozone)

# A tibble: 5 x 2
  Month Ozone
  <int> <int>
1     5     5
2     6    21
3     7     5
4     8     5
5     9     1

I could not work out how to change the name of the output column, so I will leave that for next time.
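
One possible way to control the output name, sketched here under an assumption: rlang releases newer than the 0.4.0 discussed in this article (0.4.3 and later, as far as I know) accept glue-style strings on the left-hand side of :=. Treat this as an untested suggestion for those versions rather than something that works with rlang 0.4.0 itself.

```r
library(dplyr)

cnt_na <- function(data, by, col_nm) {
  data %>%
    group_by({{ by }}) %>%
    filter(is.na({{ col_nm }})) %>%
    # Assumes a newer rlang: {{ col_nm }} inside the string is replaced
    # by the label of the captured column, giving "Missing_Ozone" here.
    summarise("Missing_{{ col_nm }}" := n())
}

res <- cnt_na(airquality, by = Month, col_nm = Ozone)
names(res)
```

With rlang 0.4.0 itself, the name built with paste0() and quo_name(enquo(col_nm)) in the earlier version of cnt_na() can still be used on the left-hand side via !!, while {{ }} handles the other two arguments.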

References

rlang 0.4.0
