textmineR: Functions for Text Mining and Topic Modeling

A wrapper package for the textmineR API

> library(textmineR)

Loading required package: Matrix

Warning: package 'Matrix' was built under R version 3.2.4


Attaching package: 'Matrix'

The following object is masked from 'package:tidyr':

    expand

> data("acq2")

Version: 1.6.0

Function	Description
`CalcLikelihood`	Calculate the log likelihood of a document term matrix given a topic model
`CalcLikelihoodC`	Internal helper functions for 'textmineR'
`CalcTopicModelR2`	Function to calculate R-squared of a topic model.
`CorrectS`	Function to remove some forms of pluralization.
`DepluralizeDtm`	Run the CorrectS function on columns of a document term matrix.
`Dtm2Docs`	Convert a DTM to a Character Vector of documents
`Files2Vec`	Function for reading text files into R
`FitLdaModel`	Fit a topic model using Latent Dirichlet Allocation
`FormatRawLdaOutput`	Format Raw Output from lda::lda.collapsed.gibbs.sampler()
`GetPhiPrime`	Calculate a matrix whose rows represent P(topic_i|tokens)
`GetProbableTerms`	Get cluster labels using a "more probable" method of terms
`GetTopTerms`	Get Top Terms for each topic from a topic model
`HellDist`	Hellinger Distance
`JSD`	Jensen-Shannon Divergence
`LabelTopics`	Get some topic labels using a "more probable" method of terms
`MakeSparseDTM`	Convert a sparse simple triplet document term matrix to a sparse Matrix
`NgramTokenizer`	Get n-grams when creating a document term matrix
`ProbCoherence`	Probabilistic coherence of topics
`RecursiveRbind`	Recursively call rBind from the Matrix package.
`TermDocFreq`	Get term frequencies and document frequencies from a document term matrix.
`TmParallelApply`	An OS-independent parallel version of 'lapply'
`Vec2Dtm`	Convert a character vector to a document term matrix of class Matrix.
`acq2`	50 Exemplary News Articles from the Reuters-21578 Data Set of Topic acq

Files2Vec

Arguments

directory
...

> Files2Vec(directory, ...)

GetTopTerms

> data("acq2")
> 
> (top_terms <- GetTopTerms(phi = model$phi, M = 5))

     t.1        t.2      t.3      t.4       t.5      
[1,] "shearson" "reuter" "dlrs"   "pct"     "offer"  
[2,] "american" "corp"   "mln"    "company" "dlrs"   
[3,] "express"  "dlrs"   "shares" "rmj"     "shares" 
[4,] "analysts" "multi"  "stock"  "stake"   "company"
[5,] "market"   "step"   "group"  "holding" "share"

> str(top_terms)

 chr [1:5, 1:5] "shearson" "american" "express" "analysts" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:5] "t.1" "t.2" "t.3" "t.4" ...
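
The result above is a character matrix with one column per topic and M rows. As a minimal sketch of the idea (not necessarily how GetTopTerms is implemented), the same shape can be produced in base R by sorting each row of a topic-term matrix by decreasing probability. The matrix `toy_phi`, its values, and the name `toy_top` are hypothetical and exist only for illustration.

```r
# Toy topic-term matrix (hypothetical values): 2 topics x 4 terms, rows sum to 1
toy_phi <- matrix(
  c(0.50, 0.30, 0.15, 0.05,
    0.10, 0.20, 0.30, 0.40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t.1", "t.2"), c("offer", "shares", "stock", "stake"))
)

M <- 2  # number of top terms to keep per topic

# For each topic (row), sort terms by decreasing probability and keep the first M.
# apply() returns one column per topic, which matches the layout printed above.
toy_top <- apply(
  toy_phi, 1,
  function(p) names(sort(p, decreasing = TRUE))[seq_len(M)]
)
print(toy_top)
```

Because apply() binds the per-row results as columns, the topic names end up as column names and the row names stay NULL, as in the str() output above.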

TermDocFreq

> data("acq2")
> TermDocFreq(dtm = dtm) %>% 
+   dplyr::arrange(-doc.freq) %>% 
+   head()

     term term.freq doc.freq       idf
1  reuter        50       50 0.0000000
2    dlrs       100       32 0.4462871
3     pct        70       30 0.5108256
4     mln        65       29 0.5447272
5 company        70       28 0.5798185
6  shares        52       22 0.8209806
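
The idf values printed above are consistent with the natural log of (number of documents / doc.freq) for the 50 documents in the data set, for example log(50 / 32) = 0.4462871 for "dlrs". The following base-R sketch recomputes the three columns on a small dense matrix under that assumption. The data in `toy_dtm` is hypothetical, and the code is an illustration of the arithmetic rather than the package's own implementation.

```r
# Toy document-term matrix (hypothetical data): 4 documents x 3 terms
toy_dtm <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    0, 0, 3,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("dlrs", "pct", "mln"))
)

term_freq <- colSums(toy_dtm)               # total occurrences of each term
doc_freq  <- colSums(toy_dtm > 0)           # number of documents containing each term
idf       <- log(nrow(toy_dtm) / doc_freq)  # assumed formula: natural log of N / doc.freq

toy_tdf <- data.frame(
  term      = colnames(toy_dtm),
  term.freq = term_freq,
  doc.freq  = doc_freq,
  idf       = idf,
  stringsAsFactors = FALSE
)
print(toy_tdf)

# Check the assumed formula against the printed row for "dlrs" (50 documents, 32 hits)
dlrs_idf <- log(50 / 32)
print(dlrs_idf)
```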

Vec2Dtm

Arguments

vec
min.n.gram
max.n.gram
remove.stopwords
custom.stopwords
lower
remove.punctuation
remove.numbers
stem.document

> data("acq2")
> dtm <- Vec2Dtm(documents, min.n.gram = 1, max.n.gram = 2)
> dtm %>% {
+   dim(.) %>% print()
+   head(.)
+ }

[1]   50 4594

[1] 0 0 0 0 0 0
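
With min.n.gram = 1 and max.n.gram = 2, the columns of the matrix cover both single words and two-word sequences. The base-R sketch below shows what that range of n-grams looks like for one short string. The helper `toy_ngrams`, its tokenization rule, and the underscore used to join words are assumptions made for illustration, not a description of what Vec2Dtm does internally.

```r
# Build all n-grams from min_n to max_n words for one document (illustrative helper)
toy_ngrams <- function(text, min_n = 1, max_n = 2) {
  # Lower-case, split on anything that is not a letter, and drop empty pieces
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]

  out <- character(0)
  for (n in min_n:max_n) {
    if (length(words) < n) next  # document too short for this n
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(
      starts,
      function(s) paste(words[s:(s + n - 1)], collapse = "_"),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

toy_result <- toy_ngrams("Shearson offer accepted")
print(toy_result)
```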

acq2

> data("acq2")
> acq %>% class()

[1] "VCorpus" "Corpus"

> documents %>% class()

[1] "character"

> dtm %>% str(max.level = 2)

Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:6145] 18 18 18 18 18 18 18 44 6 6 ...
  ..@ p       : int [1:4595] 0 1 2 3 4 5 6 7 8 9 ...
  ..@ Dim     : int [1:2] 50 4594
  ..@ Dimnames:List of 2
  ..@ x       : num [1:6145] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ factors : list()
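
The slots shown above follow the compressed sparse column layout of the Matrix package: `x` holds the non-zero values, `i` their zero-based row indices, and `p` one pointer per column plus a final entry, which is why `p` has 4595 elements for 4594 columns. A small sketch on a hypothetical 3 x 3 matrix makes the layout easier to read. The object `toy` and its values are made up for illustration.

```r
library(Matrix)

# Toy 3 x 3 sparse matrix (hypothetical data) built from (row, column, value) triplets
toy <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(5, 7, 9), dims = c(3, 3))

print(class(toy))  # a dgCMatrix, like dtm above

print(toy@i)  # zero-based row index of each stored value
print(toy@p)  # column pointers: column k uses positions p[k] + 1 .. p[k + 1] of x
print(toy@x)  # the stored non-zero values, column by column

# Number of non-zero entries per column, recovered from the pointers
nnz_per_col <- diff(toy@p)
print(nnz_per_col)
```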

> model %>% str(max.level = 2)

List of 2
 $ theta: num [1:50, 1:5] 0.08411269 0.50847196 0.07500156 0.21621617 0.00000323 ...
  ..- attr(*, "dimnames")=List of 2
 $ phi  : num [1:5, 1:1344] 0.0013512411 0.0000001976 0.0000000934 0.0000001618 0.0000001076 ...
  ..- attr(*, "dimnames")=List of 2