tm: Text Mining Package
A package for text mining.
> library(tm)
> data("crude")
Version: 0.6.2
| Function | Summary |
|---|---|
| Corpus | Corpora |
| DataframeSource | Data Frame Source |
| DirSource | Directory Source |
| Docs | Access Document IDs and Terms |
| MC_tokenizer | Tokenizers |
| PCorpus | Permanent Corpora |
| PlainTextDocument | Plain Text Documents |
| Reader | Readers |
| Source | Sources |
| TermDocumentMatrix | Term-Document Matrix |
| TextDocument | Text Documents |
| URISource | Uniform Resource Identifier Source |
| VCorpus | Volatile Corpora |
| VectorSource | Vector Source |
| WeightFunction | Weighting Function |
| XMLSource | XML Source |
| XMLTextDocument | XML Text Documents |
| Zipf_plot | Explore Corpus Term Frequency Characteristics |
| acq | 50 Exemplary News Articles from the Reuters-21578 Data Set of Topic acq |
| c.VCorpus | Combine Corpora, Documents, Term-Document Matrices, and Term Frequency Vectors |
| content_transformer | Content Transformers |
| crude | 20 Exemplary News Articles from the Reuters-21578 Data Set of Topic crude |
| findAssocs | Find Associations in a Term-Document Matrix |
| findFreqTerms | Find Frequent Terms |
| getTokenizers | Tokenizers |
| getTransformations | Transformations |
| inspect | Inspect Objects |
| meta | Metadata Management |
| plot.TermDocumentMatrix | Visualize a Term-Document Matrix |
| readDOC | Read In a MS Word Document |
| readPDF | Read In a PDF Document |
| readPlain | Read In a Text Document |
| readRCV1 | Read In a Reuters Corpus Volume 1 Document |
| readReut21578XML | Read In a Reuters-21578 XML Document |
| readTabular | Read In a Text Document |
| readXML | Read In an XML Document |
| read_dtm_Blei_et_al | Read Document-Term Matrices |
| removeNumbers | Remove Numbers from a Text Document |
| removePunctuation | Remove Punctuation Marks from a Text Document |
| removeSparseTerms | Remove Sparse Terms from a Term-Document Matrix |
| removeWords | Remove Words from a Text Document |
| stemCompletion | Complete Stems |
| stemDocument | Stem Words |
| stopwords | Stopwords |
| stripWhitespace | Strip Whitespace from a Text Document |
| termFreq | Term Frequency Vector |
| tm_filter | Filter and Index Functions on Corpora |
| tm_map | Transformations on Corpora |
| tm_reduce | Combine Transformations |
| tm_term_score | Compute Score for Matching Terms |
| weightBin | Weight Binary |
| weightSMART | SMART Weightings |
| weightTf | Weight by Term Frequency |
| weightTfIdf | Weight by Term Frequency - Inverse Document Frequency |
| writeCorpus | Write a Corpus to Disk |
VCorpus
> VCorpus(VectorSource(c("Hello world!")))
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1
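The printed summary shows only counts, so the following minimal sketch shows one way to look inside a volatile corpus. The variable names `texts` and `corpus` and the second sample sentence are illustrative assumptions, not part of the original example.

```r
library(tm)

# Illustrative input: each element of the vector becomes one document.
texts <- c("Hello world!", "Text mining with tm.")

# Build a volatile (in-memory) corpus from the character vector.
corpus <- VCorpus(VectorSource(texts))

length(corpus)           # number of documents in the corpus
content(corpus[[1]])     # text of the first document
meta(corpus[[1]], "id")  # ID assigned to the first document
```

Double brackets (`corpus[[1]]`) return a single document, while single brackets (`corpus[1]`) return a corpus containing that document.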
VectorSource
Creates a source from a vector of texts; each element of the vector is treated as one document.
> VectorSource(c("This is a text.", "This another one."))
$encoding
[1] ""
$length
[1] 2
$position
[1] 0
$reader
function (elem, language, id) 
{
    if (!is.null(elem$uri)) 
        id <- basename(elem$uri)
    PlainTextDocument(elem$content, id = id, language = language)
}
<environment: namespace:tm>
$content
[1] "This is a text."   "This another one."
attr(,"class")
[1] "VectorSource" "SimpleSource" "Source"
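A source is normally not used on its own but passed to a corpus constructor. As a rough sketch, reusing the two sample sentences above (the names `docs`, `src`, `corpus`, and `tdm` are illustrative assumptions), the source can be turned into a corpus and then into a term-document matrix:

```r
library(tm)

# The same two sample texts as in the console example.
docs <- c("This is a text.", "This another one.")

# Wrap the vector in a source, then read it into a corpus.
src <- VectorSource(docs)
corpus <- VCorpus(src)

# Count term occurrences per document with default settings.
tdm <- TermDocumentMatrix(corpus)

nDocs(tdm)    # number of documents (columns)
nTerms(tdm)   # number of distinct terms (rows)
inspect(tdm)  # print the matrix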
crude
> crude
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20
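As a hedged illustration of what can be done with the bundled `crude` corpus, the sketch below builds a term-document matrix and queries it. The control options, the frequency threshold of 10, the term "oil", and the correlation limit of 0.7 are arbitrary choices for demonstration.

```r
library(tm)
data("crude")

# Document-level metadata of the first article.
meta(crude[[1]])

# Term-document matrix with punctuation and English stopwords removed.
tdm <- TermDocumentMatrix(
  crude,
  control = list(removePunctuation = TRUE, stopwords = TRUE)
)

# Terms that occur at least 10 times across the corpus.
findFreqTerms(tdm, lowfreq = 10)

# Terms whose correlation with "oil" is at least 0.7.
findAssocs(tdm, "oil", 0.7)
```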
readPDF
> readPDF(control = list(text = "-layout"))
stopwords
Arguments
- kind: the language of the stopword list, for example french, german, hungarian, italian, norwegian, portuguese, russian, spanish, or swedish. The example below uses "en" (English).
 
> stopwords(kind = "en")
  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our"        "ours"       "ourselves"  "you"        "your"      
 [11] "yours"      "yourself"   "yourselves" "he"         "him"       
 [16] "his"        "himself"    "she"        "her"        "hers"      
 [21] "herself"    "it"         "its"        "itself"     "they"      
 [26] "them"       "their"      "theirs"     "themselves" "what"      
 [31] "which"      "who"        "whom"       "this"       "that"      
 [36] "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"     
 [46] "have"       "has"        "had"        "having"     "do"        
 [51] "does"       "did"        "doing"      "would"      "should"    
 [56] "could"      "ought"      "i'm"        "you're"     "he's"      
 [61] "she's"      "it's"       "we're"      "they're"    "i've"      
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
[101] "who's"      "what's"     "here's"     "there's"    "when's"    
[106] "where's"    "why's"      "how's"      "a"          "an"        
[111] "the"        "and"        "but"        "if"         "or"        
[116] "because"    "as"         "until"      "while"      "of"        
[121] "at"         "by"         "for"        "with"       "about"     
[126] "against"    "between"    "into"       "through"    "during"    
[131] "before"     "after"      "above"      "below"      "to"        
[136] "from"       "up"         "down"       "in"         "out"       
[141] "on"         "off"        "over"       "under"      "again"     
[146] "further"    "then"       "once"       "here"       "there"     
[151] "when"       "where"      "why"        "how"        "all"       
[156] "any"        "both"       "each"       "few"        "more"      
[161] "most"       "other"      "some"       "such"       "no"        
[166] "nor"        "not"        "only"       "own"        "same"      
[171] "so"         "than"       "too"        "very"
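The stopword list is typically passed to `removeWords`. The following is a minimal sketch of that usage, applied both to a plain string and to the `crude` corpus; the sample sentence and the name `cleaned` are illustrative assumptions, and the order of the transformations is one reasonable choice rather than a prescribed pipeline.

```r
library(tm)
data("crude")

# On a plain character string: matching words are deleted,
# so the surrounding whitespace is left behind.
removeWords("this is a test of the package", stopwords("en"))

# On a corpus: lower-case first so capitalized stopwords also match,
# then drop the stopwords and collapse the leftover whitespace.
cleaned <- tm_map(crude, content_transformer(tolower))
cleaned <- tm_map(cleaned, removeWords, stopwords("en"))
cleaned <- tm_map(cleaned, stripWhitespace)

content(cleaned[[1]])  # text of the first cleaned article
```

`tolower` is wrapped in `content_transformer` because it is a plain base R function rather than a tm transformation; the wrapper makes it act on the document content while keeping the document structure and metadata.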