tm: Text Mining Package


> library(tm)
> data("crude")

Version: 0.6.2

Function Description
Corpus Corpora
DataframeSource Data Frame Source
DirSource Directory Source
Docs Access Document IDs and Terms
MC_tokenizer Tokenizers
PCorpus Permanent Corpora
PlainTextDocument Plain Text Documents
Reader Readers
Source Sources
TermDocumentMatrix Term-Document Matrix
TextDocument Text Documents
URISource Uniform Resource Identifier Source
VCorpus Volatile Corpora
VectorSource Vector Source
WeightFunction Weighting Function
XMLSource XML Source
XMLTextDocument XML Text Documents
Zipf_plot Explore Corpus Term Frequency Characteristics
acq 50 Exemplary News Articles from the Reuters-21578 Data Set of Topic acq
c.VCorpus Combine Corpora, Documents, Term-Document Matrices, and Term Frequency Vectors
content_transformer Content Transformers
crude 20 Exemplary News Articles from the Reuters-21578 Data Set of Topic crude
findAssocs Find Associations in a Term-Document Matrix
findFreqTerms Find Frequent Terms
getTokenizers Tokenizers
getTransformations Transformations
inspect Inspect Objects
meta Metadata Management
plot.TermDocumentMatrix Visualize a Term-Document Matrix
readDOC Read In a MS Word Document
readPDF Read In a PDF Document
readPlain Read In a Text Document
readRCV1 Read In a Reuters Corpus Volume 1 Document
readReut21578XML Read In a Reuters-21578 XML Document
readTabular Read In a Text Document
readXML Read In an XML Document
read_dtm_Blei_et_al Read Document-Term Matrices
removeNumbers Remove Numbers from a Text Document
removePunctuation Remove Punctuation Marks from a Text Document
removeSparseTerms Remove Sparse Terms from a Term-Document Matrix
removeWords Remove Words from a Text Document
stemCompletion Complete Stems
stemDocument Stem Words
stopwords Stopwords
stripWhitespace Strip Whitespace from a Text Document
termFreq Term Frequency Vector
tm_filter Filter and Index Functions on Corpora
tm_map Transformations on Corpora
tm_reduce Combine Transformations
tm_term_score Compute Score for Matching Terms
weightBin Weight Binary
weightSMART SMART Weightings
weightTf Weight by Term Frequency
weightTfIdf Weight by Term Frequency - Inverse Document Frequency
writeCorpus Write a Corpus to Disk


> VCorpus(VectorSource(c("Hello world!")))
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1
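
The printed summary above only reports counts. As a minimal sketch of how the documents themselves can be reached, the following builds a two-document corpus and reads back the first text. The variable names (`texts`, `corp`, `n_docs`, `first_text`) are illustrative and not part of the package.

```r
library(tm)

# Each element of the character vector becomes one document
# in a volatile (in-memory) corpus.
texts <- c("Hello world!", "This is a text.")
corp <- VCorpus(VectorSource(texts))

# Number of documents held by the corpus
n_docs <- length(corp)

# Double brackets select a single document; as.character()
# returns its text content for a plain text document.
first_text <- as.character(corp[[1]])

print(n_docs)
print(first_text)
```

Single brackets (`corp[1]`) would return a corpus containing one document rather than the document itself.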



> VectorSource(c("This is a text.", "This another one."))
$encoding
[1] ""

$length
[1] 2

$position
[1] 0

$reader
function (elem, language, id) 
{
    if (!is.null(elem$uri)) 
        id <- basename(elem$uri)
    PlainTextDocument(elem$content, id = id, language = language)
}
<environment: namespace:tm>

$content
[1] "This is a text."   "This another one."

attr(,"class")
[1] "VectorSource" "SimpleSource" "Source"


> crude
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20
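
As a rough sketch of what can be done with this sample corpus, the following builds a term-document matrix from `crude` and lists terms above a frequency threshold. The threshold of 10 and the variable names (`tdm`, `frequent`) are arbitrary choices for illustration.

```r
library(tm)
data("crude")

# Rows are terms, columns are documents
tdm <- TermDocumentMatrix(crude)

# Document and term counts of the matrix
n_documents <- nDocs(tdm)
n_terms <- nTerms(tdm)

# Terms that occur at least 10 times across the corpus
frequent <- findFreqTerms(tdm, lowfreq = 10)

print(n_documents)
print(head(frequent))
```

The exact terms returned depend on the tokenization and control options in effect, so no expected output is shown here.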


> readPDF(control = list(text = "-layout"))



  • kind... french, german, hungarian, italian, norwegian, portuguese, russian, spanish, and swedish
> stopwords(kind = "en")
  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our"        "ours"       "ourselves"  "you"        "your"      
 [11] "yours"      "yourself"   "yourselves" "he"         "him"       
 [16] "his"        "himself"    "she"        "her"        "hers"      
 [21] "herself"    "it"         "its"        "itself"     "they"      
 [26] "them"       "their"      "theirs"     "themselves" "what"      
 [31] "which"      "who"        "whom"       "this"       "that"      
 [36] "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"     
 [46] "have"       "has"        "had"        "having"     "do"        
 [51] "does"       "did"        "doing"      "would"      "should"    
 [56] "could"      "ought"      "i'm"        "you're"     "he's"      
 [61] "she's"      "it's"       "we're"      "they're"    "i've"      
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
[101] "who's"      "what's"     "here's"     "there's"    "when's"    
[106] "where's"    "why's"      "how's"      "a"          "an"        
[111] "the"        "and"        "but"        "if"         "or"        
[116] "because"    "as"         "until"      "while"      "of"        
[121] "at"         "by"         "for"        "with"       "about"     
[126] "against"    "between"    "into"       "through"    "during"    
[131] "before"     "after"      "above"      "below"      "to"        
[136] "from"       "up"         "down"       "in"         "out"       
[141] "on"         "off"        "over"       "under"      "again"     
[146] "further"    "then"       "once"       "here"       "there"     
[151] "when"       "where"      "why"        "how"        "all"       
[156] "any"        "both"       "each"       "few"        "more"      
[161] "most"       "other"      "some"       "such"       "no"        
[166] "nor"        "not"        "only"       "own"        "same"      
[171] "so"         "than"       "too"        "very"
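
A stopword list like the one above is typically passed to `removeWords`. The following is a small sketch, assuming a one-document corpus built from a made-up sentence; `sentence`, `corp`, and `result` are illustrative names.

```r
library(tm)

# A made-up sentence used only for illustration
sentence <- "this is a text about the price of crude oil"
corp <- VCorpus(VectorSource(sentence))

# Remove English stopwords, then collapse the runs of
# whitespace left behind by the removed words.
corp <- tm_map(corp, removeWords, stopwords(kind = "en"))
corp <- tm_map(corp, stripWhitespace)

result <- as.character(corp[[1]])
print(result)
```

Because the stopword list is lower case, mixed-case text would usually be lower-cased first, for example with `tm_map(corp, content_transformer(tolower))`.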