textmineR: Functions for Text Mining and Topic Modeling

An R package of functions for text mining and topic modeling.

> library(textmineR)
Loading required package: Matrix
Warning: package 'Matrix' was built under R version 3.2.4

Attaching package: 'Matrix'
The following object is masked from 'package:tidyr':

    expand
> data("acq2")

Version: 1.6.0


Function            Description
CalcLikelihood      Calculate the log likelihood of a document term matrix given a topic model
CalcLikelihoodC     Internal helper functions for 'textmineR'
CalcTopicModelR2    Function to calculate R-squared of a topic model.
CorrectS            Function to remove some forms of pluralization.
DepluralizeDtm      Run the CorrectS function on columns of a document term matrix.
Dtm2Docs            Convert a DTM to a Character Vector of documents
Files2Vec           Function for reading text files into R
FitLdaModel         Fit a topic model using Latent Dirichlet Allocation
FormatRawLdaOutput  Format Raw Output from lda::lda.collapsed.gibbs.sampler()
GetPhiPrime         Calculate a matrix whose rows represent P(topic_i|tokens)
GetProbableTerms    Get cluster labels using a "more probable" method of terms
GetTopTerms         Get Top Terms for each topic from a topic model
HellDist            Hellinger Distance
JSD                 Jensen-Shannon Divergence
LabelTopics         Get some topic labels using a "more probable" method of terms
MakeSparseDTM       Convert a sparse simple triplet document term matrix to a sparse Matrix
NgramTokenizer      Get n-grams when creating a document term matrix
ProbCoherence       Probabilistic coherence of topics
RecursiveRbind      Recursively call rBind from the Matrix package.
TermDocFreq         Get term frequencies and document frequencies from a document term matrix.
TmParallelApply     An OS-independent parallel version of 'lapply'
Vec2Dtm             Convert a character vector to a document term matrix of class Matrix.
acq2                50 Exemplary News Articles from the Reuters-21578 Data Set of Topic acq

Files2Vec

Arguments

  • directory
  • ...
> Files2Vec(directory, ...)
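
To make the idea of reading a directory of text files into a character vector concrete, here is a minimal base R sketch. It does not call Files2Vec and is not the package's implementation; the helper `read_dir_sketch`, the temporary directory, the `.txt` filter, and the choice to name elements by file name are all assumptions made for illustration.

```r
# Create a throwaway directory with two small text files.
dir_toy <- file.path(tempdir(), "files2vec_sketch")
dir.create(dir_toy, showWarnings = FALSE)
writeLines(c("Shares rose.", "Analysts were surprised."),
           file.path(dir_toy, "a.txt"))
writeLines("The offer was raised.", file.path(dir_toy, "b.txt"))

# Hypothetical helper (not textmineR code): returns one string per file.
read_dir_sketch <- function(directory) {
  paths <- list.files(directory, pattern = "\\.txt$", full.names = TRUE)
  docs <- vapply(
    paths,
    function(p) paste(readLines(p, warn = FALSE), collapse = " "),
    character(1)
  )
  names(docs) <- basename(paths)  # assumption: name each document by its file
  docs
}

docs_toy <- read_dir_sketch(dir_toy)
print(docs_toy)
```

The result is a named character vector with one element per file, which is the kind of input that Vec2Dtm takes as `vec`.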

GetTopTerms

> data("acq2")
> 
> (top_terms <- GetTopTerms(phi = model$phi, M = 5))
     t.1        t.2      t.3      t.4       t.5      
[1,] "shearson" "reuter" "dlrs"   "pct"     "offer"  
[2,] "american" "corp"   "mln"    "company" "dlrs"   
[3,] "express"  "dlrs"   "shares" "rmj"     "shares" 
[4,] "analysts" "multi"  "stock"  "stake"   "company"
[5,] "market"   "step"   "group"  "holding" "share"
> str(top_terms)
 chr [1:5, 1:5] "shearson" "american" "express" "analysts" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:5] "t.1" "t.2" "t.3" "t.4" ...
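
The shape of the result above (M rows, one column per topic) can be reproduced with a few lines of base R. The following is only a sketch of the idea, assuming `phi` has topics in rows and terms in columns as in `model$phi`; `top_terms_sketch` and the toy matrix are hypothetical and not part of textmineR.

```r
# Toy phi: 2 topics (rows) x 4 terms (columns); each row sums to 1.
phi_toy <- matrix(
  c(0.50, 0.30, 0.15, 0.05,
    0.10, 0.20, 0.30, 0.40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t.1", "t.2"), c("dlrs", "shares", "offer", "stake"))
)

# Hypothetical helper (not textmineR code): the M most probable terms per topic.
top_terms_sketch <- function(phi, M = 5) {
  M <- min(M, ncol(phi))
  # For each topic (row), sort term probabilities and keep the top M names.
  apply(phi, 1, function(p) names(sort(p, decreasing = TRUE))[seq_len(M)])
}

top_toy <- top_terms_sketch(phi_toy, M = 3)
print(top_toy)
```

As in the GetTopTerms output, each column holds one topic's terms in decreasing order of probability.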

TermDocFreq

> data("acq2")
> TermDocFreq(dtm = dtm) %>% 
+   dplyr::arrange(-doc.freq) %>% 
+   head()
     term term.freq doc.freq       idf
1  reuter        50       50 0.0000000
2    dlrs       100       32 0.4462871
3     pct        70       30 0.5108256
4     mln        65       29 0.5447272
5 company        70       28 0.5798185
6  shares        52       22 0.8209806
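
The `idf` column above is consistent with the natural log of the number of documents divided by `doc.freq` (for example, log(50 / 32) for "dlrs"). The base R sketch below recomputes the three columns on a toy matrix under that assumption; it does not call TermDocFreq, and the toy data and variable names are made up for illustration.

```r
# Toy document-term matrix: 4 documents (rows) x 3 terms (columns).
dtm_toy <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    3, 0, 0,
    1, 2, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("dlrs", "pct", "mln"))
)

n_docs    <- nrow(dtm_toy)
term_freq <- colSums(dtm_toy)        # total occurrences of each term
doc_freq  <- colSums(dtm_toy > 0)    # number of documents containing each term
idf       <- log(n_docs / doc_freq)  # assumption: natural log of N / doc.freq

freq_sketch <- data.frame(
  term = colnames(dtm_toy),
  term.freq = term_freq,
  doc.freq = doc_freq,
  idf = idf,
  stringsAsFactors = FALSE
)

# Most widespread terms first, mirroring the arrange(-doc.freq) call above.
freq_sketch <- freq_sketch[order(-freq_sketch$doc.freq), ]
print(freq_sketch)

# Check against the acq2 output: 50 documents, "dlrs" occurs in 32 of them.
idf_dlrs <- log(50 / 32)
print(idf_dlrs)
```

A term that occurs in every document, such as "reuter" in acq2, gets an idf of 0 under this formula.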

Vec2Dtm

Arguments

  • vec
  • min.n.gram
  • max.n.gram
  • remove.stopwords
  • custom.stopwords
  • lower
  • remove.punctuation
  • remove.numbers
  • stem.document

> data("acq2")
> dtm <- Vec2Dtm(documents, min.n.gram = 1, max.n.gram = 2)
> dtm %>% {
+   dim(.) %>% print()
+   head(.)
+ }
[1]   50 4594
[1] 0 0 0 0 0 0
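
With `min.n.gram = 1` and `max.n.gram = 2`, the columns of the matrix cover both single words and two-word sequences. The base R sketch below illustrates that idea on two toy sentences. It is not how Vec2Dtm is implemented: it builds a dense matrix rather than a sparse one, skips stopword removal and stemming, and the `_` separator and the helpers `tokenize` and `ngrams` are assumptions.

```r
docs <- c(d1 = "Shares rose on the offer.",
          d2 = "The offer for shares was raised.")

# Hypothetical tokenizer: lower-case, strip punctuation, split on whitespace.
tokenize <- function(x) {
  x <- gsub("[[:punct:]]+", " ", tolower(x))
  strsplit(trimws(x), "[[:space:]]+")[[1]]
}

# Hypothetical n-gram builder: joins n consecutive tokens with "_".
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  vapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# Unigrams and bigrams for each document.
terms_by_doc <- lapply(docs, function(d) {
  tok <- tokenize(d)
  c(ngrams(tok, 1), ngrams(tok, 2))
})

# Count each vocabulary term per document; rows are documents, columns terms.
vocab <- sort(unique(unlist(terms_by_doc)))
dtm_sketch <- t(vapply(
  terms_by_doc,
  function(tm) as.numeric(table(factor(tm, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm_sketch) <- vocab

print(dim(dtm_sketch))
print(dtm_sketch[, c("shares", "the_offer")])
```

Adding bigrams makes the vocabulary grow quickly even on two sentences.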

acq2

> data("acq2")
> acq %>% class()
[1] "VCorpus" "Corpus"
> documents %>% class()
[1] "character"
> dtm %>% str(max.level = 2)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:6145] 18 18 18 18 18 18 18 44 6 6 ...
  ..@ p       : int [1:4595] 0 1 2 3 4 5 6 7 8 9 ...
  ..@ Dim     : int [1:2] 50 4594
  ..@ Dimnames:List of 2
  ..@ x       : num [1:6145] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ factors : list()
> model %>% str(max.level = 2)
List of 2
 $ theta: num [1:50, 1:5] 0.08411269 0.50847196 0.07500156 0.21621617 0.00000323 ...
  ..- attr(*, "dimnames")=List of 2
 $ phi  : num [1:5, 1:1344] 0.0013512411 0.0000001976 0.0000000934 0.0000001618 0.0000001076 ...
  ..- attr(*, "dimnames")=List of 2
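
The dimensions above show `theta` as documents by topics (50 x 5) and `phi` as topics by terms (5 x 1344). Assuming the usual LDA reading, where each row of `theta` is a distribution over topics and each row of `phi` a distribution over terms, the two can be combined as in this base R sketch. The toy matrices are invented; nothing here is taken from `model` itself.

```r
# Toy matrices with the same layout as model$theta and model$phi.
theta_toy <- matrix(
  c(0.9, 0.1,
    0.2, 0.8),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("t.1", "t.2"))
)
phi_toy <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t.1", "t.2"), c("dlrs", "shares", "offer"))
)

# Assumption: every row is a probability distribution, so rows sum to 1.
print(rowSums(theta_toy))
print(rowSums(phi_toy))

# Implied term distribution per document:
# P(term | doc) = sum over topics k of theta[doc, k] * phi[k, term]
p_term_doc <- theta_toy %*% phi_toy
print(round(p_term_doc, 3))
```

Running the same `rowSums()` checks on `model$theta` and `model$phi` is a quick way to confirm whether this reading holds for the acq2 model.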