tidytext: Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools

> library(tidytext)
> data("sentiments")

Version: 0.1.1


Function  Description
bind_tf_idf Bind the term frequency and inverse document frequency of a tidy text dataset to the dataset
cast_sparse Create a sparse matrix from row names, column names, and values in a table.
cast_sparse_ Standard-evaluation version of cast_sparse
cast_tdm_ Casting a data frame to a DocumentTermMatrix, TermDocumentMatrix, or dfm
corpus_tidiers Tidiers for a corpus object from the quanteda package
dictionary_tidiers Tidy dictionary objects from the quanteda package
lda_tidiers Tidiers for LDA objects from the topicmodels package
pair_count Count pairs of items that cooccur within a group
parts_of_speech Parts of speech for English words from the Moby Project
sentiments Sentiment lexicons from three sources
stop_words Various lexicons for English stop words
tdm_tidiers Tidy DocumentTermMatrix, TermDocumentMatrix, and related objects from the tm package
tidy.Corpus Tidy a Corpus object from the tm package
tidy_triplet Utility function to tidy a simple triplet matrix
unnest_tokens Split a column into tokens using the tokenizers package
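Of the functions above, bind_tf_idf is a common entry point once you have per-document word counts. A minimal sketch, assuming a toy word_counts table (not package data):

```r
library(dplyr)
library(tidytext)

# Hypothetical per-document word counts
word_counts <- data.frame(
  document = c("a", "a", "b"),
  word     = c("apple", "pen", "apple"),
  n        = c(2L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Appends tf, idf, and tf_idf columns to the input table
word_counts %>% bind_tf_idf(word, document, n)
```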

sentiments

A dataset of sentiment polarities drawn from three lexicons.

> sentiments %>% dplyr::glimpse()
Observations: 23,165
Variables: 4
$ word      <chr> "abacus", "abandon", "abandon", "abandon", "abandone...
$ sentiment <chr> "trust", "fear", "negative", "sadness", "anger", "fe...
$ lexicon   <chr> "nrc", "nrc", "nrc", "nrc", "nrc", "nrc", "nrc", "nr...
$ score     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
> sentiments %>% dplyr::filter(lexicon == "AFINN") %$% range(score)
[1] -5  5
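As a sketch of how the sentiments table is typically used: tokenize some text with unnest_tokens, filter the lexicon column to a single dictionary, and inner-join on word. The toy data frame below is an assumption for illustration:

```r
library(dplyr)
library(tidytext)

df <- data.frame(
  line = 1:2,
  text = c("I love this happy story",
           "a sad and terrible ending"),
  stringsAsFactors = FALSE
)

df %>%
  unnest_tokens(word, text) %>%                      # one word per row
  inner_join(sentiments %>% filter(lexicon == "bing"),
             by = "word") %>%                        # keep only lexicon hits
  count(sentiment)                                   # tally by polarity
```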

unnest_tokens

Splits a column of text in a data frame into tokens, using the {tokenizers} package.

Arguments

  • tbl... a data frame
  • output, output_col... name of the new column to hold the tokens
  • input, input_col... name of the input column containing the text
  • token... tokenizing function from tokenizers:::basic-tokenizers (the default "words" splits on word boundaries; "characters", "sentences", and others can also be specified)
  • to_lower... whether to convert the tokens in the new column to lowercase
  • drop... whether to drop the original input column
  • collapse... whether to combine the text before tokenizing (for tokens that span multiple words, such as sentences or n-grams)
  • ...... additional arguments passed on to the tokenizer
> unnest_tokens
function (tbl, output, input, token = "words", to_lower = TRUE, 
    drop = TRUE, collapse = NULL, ...) 
{
    output_col <- col_name(substitute(output))
    input_col <- col_name(substitute(input))
    unnest_tokens_(tbl, output_col, input_col, token = token, 
        to_lower = to_lower, drop = drop, collapse = collapse, 
        ...)
}
<environment: namespace:tidytext>
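A minimal usage sketch of the arguments above, assuming a toy one-row data frame:

```r
library(tidytext)

df <- data.frame(
  id  = 1L,
  txt = "Tidy Text. Mining Is Fun.",
  stringsAsFactors = FALSE
)

# Default: one lowercased word per row; the txt column is dropped
unnest_tokens(df, word, txt)

# Sentence tokens, keeping the original column and case
unnest_tokens(df, sentence, txt, token = "sentences",
              to_lower = FALSE, drop = FALSE)
```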