tidytext: Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools

> library(tidytext)
> data("sentiments")

Version: 0.1.1


Function  Description
bind_tf_idf Bind the term frequency and inverse document frequency of a tidy text dataset to the dataset
cast_sparse Create a sparse matrix from row names, column names, and values in a table.
cast_sparse_ Standard-evaluation version of cast_sparse
cast_tdm_ Casting a data frame to a DocumentTermMatrix, TermDocumentMatrix, or dfm
corpus_tidiers Tidiers for a corpus object from the quanteda package
dictionary_tidiers Tidy dictionary objects from the quanteda package
lda_tidiers Tidiers for LDA objects from the topicmodels package
pair_count Count pairs of items that cooccur within a group
parts_of_speech Parts of speech for English words from the Moby Project
sentiments Sentiment lexicons from three sources
stop_words Various lexicons for English stop words
tdm_tidiers Tidy DocumentTermMatrix, TermDocumentMatrix, and related objects from the tm package
tidy.Corpus Tidy a Corpus object from the tm package
tidy_triplet Utility function to tidy a simple triplet matrix
unnest_tokens Split a column into tokens using the tokenizers package
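Of the functions above, bind_tf_idf is a common entry point once you have per-document word counts. A minimal sketch, assuming a toy word_counts table (not package data):

```r
library(dplyr)
library(tidytext)

# Hypothetical per-document word counts
word_counts <- data.frame(
  document = c("a", "a", "b"),
  word     = c("apple", "pen", "apple"),
  n        = c(2L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Appends tf, idf, and tf_idf columns to the input table
word_counts %>% bind_tf_idf(word, document, n)
```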

sentiments

A dataset of sentiment polarities drawn from three lexicons.

> sentiments %>% dplyr::glimpse()
Observations: 23,165
Variables: 4
$ word      <chr> "abacus", "abandon", "abandon", "abandon", "abandone...
$ sentiment <chr> "trust", "fear", "negative", "sadness", "anger", "fe...
$ lexicon   <chr> "nrc", "nrc", "nrc", "nrc", "nrc", "nrc", "nrc", "nr...
$ score     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
> sentiments %>% dplyr::filter(lexicon == "AFINN") %$% range(score)
[1] -5  5
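As a sketch of how the sentiments table is typically used: tokenize some text with unnest_tokens, filter the lexicon column to a single dictionary, and inner-join on word. The toy data frame below is an assumption for illustration:

```r
library(dplyr)
library(tidytext)

df <- data.frame(
  line = 1:2,
  text = c("I love this happy story",
           "a sad and terrible ending"),
  stringsAsFactors = FALSE
)

df %>%
  unnest_tokens(word, text) %>%                      # one word per row
  inner_join(sentiments %>% filter(lexicon == "bing"),
             by = "word") %>%                        # keep only lexicon hits
  count(sentiment)                                   # tally by polarity
```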

unnest_tokens

Splits a column of text in a data frame into tokens, using the {tokenizers} package.

Arguments

  • tbl... a data frame
  • output, output_col... name of the new column to hold the tokens
  • input, input_col... name of the input column containing the text
  • token... tokenizing function from tokenizers:::basic-tokenizers (the default "words" splits on word boundaries; "characters", "sentences", and others can also be specified)
  • to_lower... whether to convert the tokens in the new column to lowercase
  • drop... whether to drop the original input column
  • collapse... whether to combine the text before tokenizing (for tokens that span multiple words, such as sentences or n-grams)
  • ...... additional arguments passed on to the tokenizer
> unnest_tokens
function (tbl, output, input, token = "words", to_lower = TRUE, 
    drop = TRUE, collapse = NULL, ...) 
{
    output_col <- col_name(substitute(output))
    input_col <- col_name(substitute(input))
    unnest_tokens_(tbl, output_col, input_col, token = token, 
        to_lower = to_lower, drop = drop, collapse = collapse, 
        ...)
}
<environment: namespace:tidytext>
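A minimal usage sketch of the arguments above, assuming a toy one-row data frame:

```r
library(tidytext)

df <- data.frame(
  id  = 1L,
  txt = "Tidy Text. Mining Is Fun.",
  stringsAsFactors = FALSE
)

# Default: one lowercased word per row; the txt column is dropped
unnest_tokens(df, word, txt)

# Sentence tokens, keeping the original column and case
unnest_tokens(df, sentence, txt, token = "sentences",
              to_lower = FALSE, drop = FALSE)
```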