tm: Text Mining Package
A package for text mining
> library(tm)
> data("crude")
Version: 0.6.2
Function | Summary |
---|---|
Corpus | Corpora |
DataframeSource | Data Frame Source |
DirSource | Directory Source |
Docs | Access Document IDs and Terms |
MC_tokenizer | Tokenizers |
PCorpus | Permanent Corpora |
PlainTextDocument | Plain Text Documents |
Reader | Readers |
Source | Sources |
TermDocumentMatrix | Term-Document Matrix |
TextDocument | Text Documents |
URISource | Uniform Resource Identifier Source |
VCorpus | Volatile Corpora |
VectorSource | Vector Source |
WeightFunction | Weighting Function |
XMLSource | XML Source |
XMLTextDocument | XML Text Documents |
Zipf_plot | Explore Corpus Term Frequency Characteristics |
acq | 50 Exemplary News Articles from the Reuters-21578 Data Set of Topic acq |
c.VCorpus | Combine Corpora, Documents, Term-Document Matrices, and Term Frequency Vectors |
content_transformer | Content Transformers |
crude | 20 Exemplary News Articles from the Reuters-21578 Data Set of Topic crude |
findAssocs | Find Associations in a Term-Document Matrix |
findFreqTerms | Find Frequent Terms |
getTokenizers | Tokenizers |
getTransformations | Transformations |
inspect | Inspect Objects |
meta | Metadata Management |
plot.TermDocumentMatrix | Visualize a Term-Document Matrix |
readDOC | Read In a MS Word Document |
readPDF | Read In a PDF Document |
readPlain | Read In a Text Document |
readRCV1 | Read In a Reuters Corpus Volume 1 Document |
readReut21578XML | Read In a Reuters-21578 XML Document |
readTabular | Read In a Text Document |
readXML | Read In an XML Document |
read_dtm_Blei_et_al | Read Document-Term Matrices |
removeNumbers | Remove Numbers from a Text Document |
removePunctuation | Remove Punctuation Marks from a Text Document |
removeSparseTerms | Remove Sparse Terms from a Term-Document Matrix |
removeWords | Remove Words from a Text Document |
stemCompletion | Complete Stems |
stemDocument | Stem Words |
stopwords | Stopwords |
stripWhitespace | Strip Whitespace from a Text Document |
termFreq | Term Frequency Vector |
tm_filter | Filter and Index Functions on Corpora |
tm_map | Transformations on Corpora |
tm_reduce | Combine Transformations |
tm_term_score | Compute Score for Matching Terms |
weightBin | Weight Binary |
weightSMART | SMART Weightings |
weightTf | Weight by Term Frequency |
weightTfIdf | Weight by Term Frequency - Inverse Document Frequency |
writeCorpus | Write a Corpus to Disk |
VCorpus
> VCorpus(VectorSource(c("Hello world!")))
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1
VectorSource
Creates a source from a character vector; each element is treated as one document.
> VectorSource(c("This is a text.", "This another one."))
$encoding
[1] ""
$length
[1] 2
$position
[1] 0
$reader
function (elem, language, id)
{
    if (!is.null(elem$uri))
        id <- basename(elem$uri)
    PlainTextDocument(elem$content, id = id, language = language)
}
<environment: namespace:tm>
$content
[1] "This is a text." "This another one."
attr(,"class")
[1] "VectorSource" "SimpleSource" "Source"
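A source on its own is only a description of where the documents come from. As a minimal sketch of how it is consumed, the following wraps the same vector in `VectorSource`, builds a `VCorpus` from it, and reads the documents back. The variable names `texts`, `corpus`, `n_docs`, and `doc_text` are illustrative and not part of tm.

```r
library(tm)

# Each element of the character vector becomes one document.
texts <- c("This is a text.", "This another one.")
corpus <- VCorpus(VectorSource(texts))

# Number of documents held by the corpus.
n_docs <- length(corpus)

# Recover the text of each document as a plain character vector.
doc_text <- vapply(
  corpus,
  function(d) paste(as.character(d), collapse = "\n"),
  character(1)
)
```

`doc_text` is named by document ID, so `unname(doc_text)` gives back the bare strings.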
crude
> crude
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 20
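The `crude` corpus is convenient for trying out several functions from the table together. The sketch below is one possible preprocessing pipeline, not a recommended one: it normalises the text with `tm_map`, builds a `TermDocumentMatrix`, and lists frequent terms with `findFreqTerms`. The threshold of 10 and the variable names are arbitrary assumptions.

```r
library(tm)
data("crude")

# Normalise the text: lower-case, strip punctuation and English stopwords,
# then collapse the runs of whitespace left behind.
cleaned <- tm_map(crude, content_transformer(tolower))
cleaned <- tm_map(cleaned, removePunctuation)
cleaned <- tm_map(cleaned, removeWords, stopwords("english"))
cleaned <- tm_map(cleaned, stripWhitespace)

# Rows are terms, columns are documents.
tdm <- TermDocumentMatrix(cleaned)

# Terms occurring at least 10 times across the corpus (arbitrary threshold).
frequent <- findFreqTerms(tdm, lowfreq = 10)
```

`tolower` is wrapped in `content_transformer` so that each document keeps its class and metadata rather than being reduced to a plain character vector.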
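readPDF
Returns a reader function for PDF documents; the `text` entry of `control` holds options for the text-extraction engine.
> readPDF(control = list(text = "-layout"))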
stopwords
Arguments
- kind... french, german, hungarian, italian, norwegian, portuguese, russian, spanish, and swedish
> stopwords(kind = "en")
[1] "i" "me" "my" "myself" "we"
[6] "our" "ours" "ourselves" "you" "your"
[11] "yours" "yourself" "yourselves" "he" "him"
[16] "his" "himself" "she" "her" "hers"
[21] "herself" "it" "its" "itself" "they"
[26] "them" "their" "theirs" "themselves" "what"
[31] "which" "who" "whom" "this" "that"
[36] "these" "those" "am" "is" "are"
[41] "was" "were" "be" "been" "being"
[46] "have" "has" "had" "having" "do"
[51] "does" "did" "doing" "would" "should"
[56] "could" "ought" "i'm" "you're" "he's"
[61] "she's" "it's" "we're" "they're" "i've"
[66] "you've" "we've" "they've" "i'd" "you'd"
[71] "he'd" "she'd" "we'd" "they'd" "i'll"
[76] "you'll" "he'll" "she'll" "we'll" "they'll"
[81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
[86] "haven't" "hadn't" "doesn't" "don't" "didn't"
[91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
[96] "cannot" "couldn't" "mustn't" "let's" "that's"
[101] "who's" "what's" "here's" "there's" "when's"
[106] "where's" "why's" "how's" "a" "an"
[111] "the" "and" "but" "if" "or"
[116] "because" "as" "until" "while" "of"
[121] "at" "by" "for" "with" "about"
[126] "against" "between" "into" "through" "during"
[131] "before" "after" "above" "below" "to"
[136] "from" "up" "down" "in" "out"
[141] "on" "off" "over" "under" "again"
[146] "further" "then" "once" "here" "there"
[151] "when" "where" "why" "how" "all"
[156] "any" "both" "each" "few" "more"
[161] "most" "other" "some" "such" "no"
[166] "nor" "not" "only" "own" "same"
[171] "so" "than" "too" "very"
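The list is typically passed to `removeWords`. A minimal sketch on a single string follows; the input sentence and the variable names are hypothetical, and the whitespace cleanup uses base R rather than a tm function.

```r
library(tm)

# Hypothetical input, lower-cased so that it matches the stopword list.
sentence <- tolower("This is a short note about the price of crude oil")

# removeWords() deletes whole-word matches of the given words.
without_stopwords <- removeWords(sentence, stopwords(kind = "en"))

# Collapse the gaps left by the removed words and trim both ends (base R).
filtered <- trimws(gsub("\\s+", " ", without_stopwords))
```

Lower-casing first matters because the stopword list is entirely lower-case, as the output above shows.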