textmineR: Functions for Text Mining and Topic Modeling
A wrapper package for the textmineR API
- CRAN: http://cran.r-project.org/web/packages/textmineR/index.html
- GitHub: https://github.com/TommyJones/textmineR
> library(textmineR)
Loading required package: Matrix
Warning: package 'Matrix' was built under R version 3.2.4
Attaching package: 'Matrix'
The following object is masked from 'package:tidyr':
expand
> data("acq2")
Version: 1.6.0
Function | Summary |
---|---|
CalcLikelihood | Calculate the log likelihood of a document term matrix given a topic model |
CalcLikelihoodC | Internal helper functions for 'textmineR' |
CalcTopicModelR2 | Function to calculate R-squared of a topic model. |
CorrectS | Function to remove some forms of pluralization. |
DepluralizeDtm | Run the CorrectS function on columns of a document term matrix. |
Dtm2Docs | Convert a DTM to a Character Vector of documents |
Files2Vec | Function for reading text files into R |
FitLdaModel | Fit a topic model using Latent Dirichlet Allocation |
FormatRawLdaOutput | Format Raw Output from lda::lda.collapsed.gibbs.sampler() |
GetPhiPrime | Calculate a matrix whose rows represent P(topic_i\|tokens) |
GetProbableTerms | Get cluster labels using a "more probable" method of terms |
GetTopTerms | Get Top Terms for each topic from a topic model |
HellDist | Hellinger Distance |
JSD | Jensen-Shannon Divergence |
LabelTopics | Get some topic labels using a "more probable" method of terms |
MakeSparseDTM | Convert a sparse simple triplet document term matrix to a sparse Matrix |
NgramTokenizer | Get n-grams when creating a document term matrix |
ProbCoherence | Probabilistic coherence of topics |
RecursiveRbind | Recursively call rBind from the Matrix package. |
TermDocFreq | Get term frequencies and document frequencies from a document term matrix. |
TmParallelApply | An OS-independent parallel version of 'lapply' |
Vec2Dtm | Convert a character vector to a document term matrix of class Matrix. |
acq2 | 50 Exemplary News Articles from the Reuters-21578 Data Set of Topic acq |
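
The table lists HellDist and JSD only by name. As a rough illustration of what those two measures compute, here is a minimal base R sketch of the textbook definitions for two discrete probability vectors. The helper names (`hellinger_sketch`, `kl_sketch`, `jsd_sketch`) are hypothetical and are not part of textmineR; the package's own functions may differ in normalisation, log base, or input handling.

```r
# Textbook Hellinger distance between two probability vectors p and q.
# Ranges from 0 (identical) to 1 (no overlapping support).
hellinger_sketch <- function(p, q) {
  stopifnot(length(p) == length(q))
  sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)
}

# Kullback-Leibler divergence with natural log; terms where a == 0 contribute 0.
kl_sketch <- function(a, b) {
  keep <- a > 0
  sum(a[keep] * log(a[keep] / b[keep]))
}

# Textbook Jensen-Shannon divergence: average KL divergence to the midpoint m.
jsd_sketch <- function(p, q) {
  stopifnot(length(p) == length(q))
  m <- (p + q) / 2
  0.5 * kl_sketch(p, m) + 0.5 * kl_sketch(q, m)
}

# Two made-up distributions with disjoint support.
p <- c(1, 0)
q <- c(0, 1)

hellinger_sketch(p, q)
jsd_sketch(p, q)
```

With natural logs the Jensen-Shannon divergence is bounded above by `log(2)`, which is reached when the two distributions share no support, as in the toy vectors here.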
Files2Vec
Arguments
- directory
- ...
> Files2Vec(directory, ...)
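
The summary table describes Files2Vec as a function for reading text files into R. The sketch below shows one way to do that kind of thing in base R, so the idea can be tried without the package. `read_dir_to_vec` is a hypothetical helper, not the package function, and the `.txt` filter and one-element-per-file layout are assumptions made for illustration.

```r
# Hypothetical helper: read every .txt file in a directory into a named
# character vector, one element per file, lines joined with newlines.
read_dir_to_vec <- function(directory) {
  paths <- list.files(directory, pattern = "\\.txt$", full.names = TRUE)
  docs <- vapply(
    paths,
    function(p) paste(readLines(p, warn = FALSE), collapse = "\n"),
    character(1)
  )
  names(docs) <- basename(paths)
  docs
}

# Usage with two throwaway files in a temporary directory.
tmp <- file.path(tempdir(), "files2vec_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(tmp, "a.txt"))
writeLines("only line", file.path(tmp, "b.txt"))

docs_demo <- read_dir_to_vec(tmp)
names(docs_demo)
```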
GetTopTerms
> data("acq2")
>
> (top_terms <- GetTopTerms(phi = model$phi, M = 5))
t.1 t.2 t.3 t.4 t.5
[1,] "shearson" "reuter" "dlrs" "pct" "offer"
[2,] "american" "corp" "mln" "company" "dlrs"
[3,] "express" "dlrs" "shares" "rmj" "shares"
[4,] "analysts" "multi" "stock" "stake" "company"
[5,] "market" "step" "group" "holding" "share"
> str(top_terms)
chr [1:5, 1:5] "shearson" "american" "express" "analysts" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5] "t.1" "t.2" "t.3" "t.4" ...
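
The output above is an M-by-topics character matrix: each column holds the M highest-probability terms of one topic. A minimal base R sketch of that selection on a made-up `phi` is shown here. `top_terms_sketch` and `phi_toy` are hypothetical names, and this is not claimed to be how GetTopTerms is implemented.

```r
# Toy topic-by-term matrix; rows are topics, columns are terms (made-up values).
phi_toy <- matrix(
  c(0.5, 0.3, 0.1, 0.1,
    0.1, 0.1, 0.2, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t.1", "t.2"), c("dlrs", "pct", "shares", "stock"))
)

# Hypothetical helper: for each topic (row), sort terms by probability and
# keep the names of the first M. With M >= 2, apply() returns an M x topics
# character matrix, which matches the shape of the output shown above.
top_terms_sketch <- function(phi, M) {
  apply(phi, 1, function(row) names(sort(row, decreasing = TRUE))[seq_len(M)])
}

top_toy <- top_terms_sketch(phi_toy, M = 2)
top_toy
```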
TermDocFreq
> data("acq2")
> TermDocFreq(dtm = dtm) %>%
+ dplyr::arrange(-doc.freq) %>%
+ head()
term term.freq doc.freq idf
1 reuter 50 50 0.0000000
2 dlrs 100 32 0.4462871
3 pct 70 30 0.5108256
4 mln 65 29 0.5447272
5 company 70 28 0.5798185
6 shares 52 22 0.8209806
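
The `idf` column in the output above is consistent with `log(number of documents / doc.freq)` using the natural log: with 50 documents, `log(50 / 32)` is about 0.4462871, which matches the `dlrs` row. The following base R sketch reproduces the three statistics on a small made-up matrix. `term_doc_freq_sketch` and `dtm_toy` are hypothetical names, and the formula is inferred from the printed output rather than taken from the package source.

```r
# Toy document-term matrix: 4 documents (rows) by 3 terms (columns).
dtm_toy <- matrix(
  c(1, 2, 0,
    1, 0, 0,
    1, 1, 0,
    1, 0, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("reuter", "dlrs", "pct"))
)

# Hypothetical helper: term frequency, document frequency, and an idf
# computed as log(n_docs / doc.freq) (assumption inferred from the output).
term_doc_freq_sketch <- function(dtm) {
  term_freq <- colSums(dtm)      # total count of each term
  doc_freq <- colSums(dtm > 0)   # number of documents containing each term
  data.frame(
    term = colnames(dtm),
    term.freq = term_freq,
    doc.freq = doc_freq,
    idf = log(nrow(dtm) / doc_freq),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}

freq_toy <- term_doc_freq_sketch(dtm_toy)
freq_toy
```

A term that appears in every document, like `reuter` here and in the real output, gets an idf of 0.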
Vec2Dtm
Arguments
- vec
- min.n.gram
- max.n.gram
- remove.stopwords
- custom.stopwords
- lower
- remove.punctuation
- remove.numbers
- stem.document
> data("acq2")
> dtm <- Vec2Dtm(documents, min.n.gram = 1, max.n.gram = 2)
> dtm %>% {
+ dim(.) %>% print()
+ head(.)
+ }
[1] 50 4594
[1] 0 0 0 0 0 0
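
To make the `min.n.gram` / `max.n.gram` idea concrete, here is a small base R sketch that counts unigrams and bigrams per document and returns a dense matrix. It is only an illustration: `ngram_dtm_sketch` is a hypothetical helper, the underscore used to join n-grams is my choice, and the real Vec2Dtm returns a sparse matrix and also supports stopword removal and stemming, which this sketch leaves out.

```r
# Hypothetical helper: lowercase, replace punctuation with spaces, split on
# whitespace, then count every n-gram with min_n <= n <= max_n per document.
ngram_dtm_sketch <- function(vec, min_n = 1, max_n = 2) {
  tokens <- strsplit(gsub("[[:punct:]]+", " ", tolower(vec)), "\\s+")
  grams <- lapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]
    unlist(lapply(min_n:max_n, function(n) {
      if (length(tok) < n) return(character(0))
      starts <- seq_len(length(tok) - n + 1)
      vapply(starts, function(i) paste(tok[i:(i + n - 1)], collapse = "_"),
             character(1))
    }))
  })
  vocab <- sort(unique(unlist(grams)))
  # One row of counts per document, one column per n-gram in the vocabulary.
  counts <- vapply(
    grams,
    function(g) as.numeric(table(factor(g, levels = vocab))),
    numeric(length(vocab))
  )
  dtm <- matrix(counts, nrow = length(vec), ncol = length(vocab), byrow = TRUE)
  dimnames(dtm) <- list(names(vec), vocab)
  dtm
}

docs_toy <- c(d1 = "Offer accepted.", d2 = "Offer rejected, offer withdrawn")
dtm_toy2 <- ngram_dtm_sketch(docs_toy)
dim(dtm_toy2)
```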
acq2
> data("acq2")
> acq %>% class()
[1] "VCorpus" "Corpus"
> documents %>% class()
[1] "character"
> dtm %>% str(max.level = 2)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:6145] 18 18 18 18 18 18 18 44 6 6 ...
..@ p : int [1:4595] 0 1 2 3 4 5 6 7 8 9 ...
..@ Dim : int [1:2] 50 4594
..@ Dimnames:List of 2
..@ x : num [1:6145] 1 1 1 1 1 1 1 1 1 1 ...
..@ factors : list()
> model %>% str(max.level = 2)
List of 2
$ theta: num [1:50, 1:5] 0.08411269 0.50847196 0.07500156 0.21621617 0.00000323 ...
..- attr(*, "dimnames")=List of 2
$ phi : num [1:5, 1:1344] 0.0013512411 0.0000001976 0.0000000934 0.0000001618 0.0000001076 ...
..- attr(*, "dimnames")=List of 2
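
In the `model` object above, `theta` has one row per document and one column per topic, and `phi` has one row per topic and one column per term. Assuming the usual LDA reading (rows of `theta` are P(topic | document), rows of `phi` are P(term | topic), each row summing to 1), the two matrices can be combined as in this sketch. All values and names (`theta_toy`, `phi_toy2`) are made up for illustration and are not taken from `acq2`.

```r
# Toy document-by-topic matrix (assumed to hold P(topic | document)).
theta_toy <- matrix(
  c(0.7, 0.3,
    0.2, 0.8,
    0.4, 0.6),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("t.1", "t.2"))
)

# Toy topic-by-term matrix (assumed to hold P(term | topic)).
phi_toy2 <- matrix(
  c(0.6, 0.3, 0.1,
    0.1, 0.2, 0.7),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t.1", "t.2"), c("dlrs", "pct", "shares"))
)

# Most probable topic for each document.
dominant_topic <- colnames(theta_toy)[apply(theta_toy, 1, which.max)]
dominant_topic

# Implied P(term | document): a document-by-term matrix whose rows sum to 1.
p_term_given_doc <- theta_toy %*% phi_toy2
rowSums(p_term_given_doc)
```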