tm: Text Mining Package

A package for text mining.

> library(tm)
> data("crude")

Version: 0.6.2

Function	Summary
`Corpus`	Corpora
`DataframeSource`	Data Frame Source
`DirSource`	Directory Source
`Docs`	Access Document IDs and Terms
`MC_tokenizer`	Tokenizers
`PCorpus`	Permanent Corpora
`PlainTextDocument`	Plain Text Documents
`Reader`	Readers
`Source`	Sources
`TermDocumentMatrix`	Term-Document Matrix
`TextDocument`	Text Documents
`URISource`	Uniform Resource Identifier Source
`VCorpus`	Volatile Corpora
`VectorSource`	Vector Source
`WeightFunction`	Weighting Function
`XMLSource`	XML Source
`XMLTextDocument`	XML Text Documents
`Zipf_plot`	Explore Corpus Term Frequency Characteristics
`acq`	50 Exemplary News Articles from the Reuters-21578 Data Set of Topic acq
`c.VCorpus`	Combine Corpora, Documents, Term-Document Matrices, and Term Frequency Vectors
`content_transformer`	Content Transformers
`crude`	20 Exemplary News Articles from the Reuters-21578 Data Set of Topic crude
`findAssocs`	Find Associations in a Term-Document Matrix
`findFreqTerms`	Find Frequent Terms
`getTokenizers`	Tokenizers
`getTransformations`	Transformations
`inspect`	Inspect Objects
`meta`	Metadata Management
`plot.TermDocumentMatrix`	Visualize a Term-Document Matrix
`readDOC`	Read In a MS Word Document
`readPDF`	Read In a PDF Document
`readPlain`	Read In a Text Document
`readRCV1`	Read In a Reuters Corpus Volume 1 Document
`readReut21578XML`	Read In a Reuters-21578 XML Document
`readTabular`	Read In a Text Document
`readXML`	Read In an XML Document
`read_dtm_Blei_et_al`	Read Document-Term Matrices
`removeNumbers`	Remove Numbers from a Text Document
`removePunctuation`	Remove Punctuation Marks from a Text Document
`removeSparseTerms`	Remove Sparse Terms from a Term-Document Matrix
`removeWords`	Remove Words from a Text Document
`stemCompletion`	Complete Stems
`stemDocument`	Stem Words
`stopwords`	Stopwords
`stripWhitespace`	Strip Whitespace from a Text Document
`termFreq`	Term Frequency Vector
`tm_filter`	Filter and Index Functions on Corpora
`tm_map`	Transformations on Corpora
`tm_reduce`	Combine Transformations
`tm_term_score`	Compute Score for Matching Terms
`weightBin`	Weight Binary
`weightSMART`	SMART Weightings
`weightTf`	Weight by Term Frequency
`weightTfIdf`	Weight by Term Frequency - Inverse Document Frequency
`writeCorpus`	Write a Corpus to Disk

VCorpus

> VCorpus(VectorSource(c("Hello world!")))

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1
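
The call above can be extended into a minimal sketch that builds a volatile corpus from several strings and reads one document back. The variable names (`texts`, `corpus`, `n_docs`, `first_text`) and the second input string are illustrative assumptions, not part of the package.

```r
library(tm)

# Illustrative input: each element of the vector becomes one document.
texts <- c("Hello world!", "Text mining with tm.")
corpus <- VCorpus(VectorSource(texts))

# A VCorpus can be indexed like a list of documents.
n_docs <- length(corpus)

# content() returns the text stored in a single document.
first_text <- content(corpus[[1]])

print(n_docs)
print(first_text)
```

Because a `VCorpus` is held in memory only, it disappears when the R session ends; `PCorpus` in the table above is the permanent counterpart.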

VectorSource

Creates a source from a character vector; each element is treated as one document.

> VectorSource(c("This is a text.", "This another one."))

$encoding
[1] ""

$length
[1] 2

$position
[1] 0

$reader
function (elem, language, id) 
{
    if (!is.null(elem$uri)) 
        id <- basename(elem$uri)
    PlainTextDocument(elem$content, id = id, language = language)
}
<environment: namespace:tm>

$content
[1] "This is a text."   "This another one."

attr(,"class")
[1] "VectorSource" "SimpleSource" "Source"
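
The printed structure suggests that the source is a plain list with a class attribute. The following sketch reads the `length` and `content` fields shown in that output. Accessing the fields directly is an assumption based on the printout and may not hold in other versions of the package; `src` is an illustrative name.

```r
library(tm)

src <- VectorSource(c("This is a text.", "This another one."))

# Fields as they appear in the printed structure above.
n_elements <- src$length
elements <- src$content

# The class vector shows that a VectorSource is also a generic Source.
is_source <- inherits(src, "Source")

print(n_elements)
print(elements)
print(is_source)
```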

crude

> crude

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20
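
As a hedged sketch of what can be done with this data set, the code below builds a term-document matrix from `crude` and lists frequent terms, using `TermDocumentMatrix` and `findFreqTerms` from the function table. The threshold of 10 is an arbitrary choice for illustration.

```r
library(tm)
data("crude")

# Rows are terms, columns are the 20 documents.
tdm <- TermDocumentMatrix(crude)

# Terms occurring at least 10 times across the corpus (threshold is illustrative).
frequent <- findFreqTerms(tdm, lowfreq = 10)

print(dim(tdm))
print(head(frequent))
```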

readPDF

> readPDF(control = list(text = "-layout"))

stopwords

Arguments

kind... french, german, hungarian, italian, norwegian, portuguese, russian, spanish, and swedish

> stopwords(kind = "en")

  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our"        "ours"       "ourselves"  "you"        "your"      
 [11] "yours"      "yourself"   "yourselves" "he"         "him"       
 [16] "his"        "himself"    "she"        "her"        "hers"      
 [21] "herself"    "it"         "its"        "itself"     "they"      
 [26] "them"       "their"      "theirs"     "themselves" "what"      
 [31] "which"      "who"        "whom"       "this"       "that"      
 [36] "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"     
 [46] "have"       "has"        "had"        "having"     "do"        
 [51] "does"       "did"        "doing"      "would"      "should"    
 [56] "could"      "ought"      "i'm"        "you're"     "he's"      
 [61] "she's"      "it's"       "we're"      "they're"    "i've"      
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
[101] "who's"      "what's"     "here's"     "there's"    "when's"    
[106] "where's"    "why's"      "how's"      "a"          "an"        
[111] "the"        "and"        "but"        "if"         "or"        
[116] "because"    "as"         "until"      "while"      "of"        
[121] "at"         "by"         "for"        "with"       "about"     
[126] "against"    "between"    "into"       "through"    "during"    
[131] "before"     "after"      "above"      "below"      "to"        
[136] "from"       "up"         "down"       "in"         "out"       
[141] "on"         "off"        "over"       "under"      "again"     
[146] "further"    "then"       "once"       "here"       "there"     
[151] "when"       "where"      "why"        "how"        "all"       
[156] "any"        "both"       "each"       "few"        "more"      
[161] "most"       "other"      "some"       "such"       "no"        
[166] "nor"        "not"        "only"       "own"        "same"      
[171] "so"         "than"       "too"        "very"
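
A typical use of this list is to filter stopwords out of a corpus. The sketch below combines `tm_map`, `removeWords`, and `stripWhitespace` from the function table; the sample sentence and the variable names are illustrative assumptions.

```r
library(tm)

# Illustrative one-document corpus.
corpus <- VCorpus(VectorSource("this is a test of the stopword filter"))

# Remove English stopwords, then collapse the whitespace left behind.
cleaned <- tm_map(corpus, removeWords, stopwords("en"))
cleaned <- tm_map(cleaned, stripWhitespace)

result <- content(cleaned[[1]])
print(result)
```

`removeWords` matches whole words only, so a word such as "stopword" is kept even though it contains the stopword "or" as a substring.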