rvest: Easily Harvest (Scrape) Web Pages
Web scraping with R
- CRAN: http://cran.r-project.org/web/packages/rvest/index.html
- GitHub: https://github.com/hadley/rvest
> library(rvest)
Loading required package: xml2
Error: package 'xml2' could not be loaded
Version: 0.2.0.9000
Function | Description |
---|---|
encoding | Guess and repair faulty character encoding. |
google_form | Make a link to a Google form given its id. |
html | Parse an HTML page. |
html_form | Parse forms in a page. |
html_nodes | Select nodes from an HTML document. |
html_session | Simulate a session in an HTML browser. |
html_table | Parse an HTML table into a data frame. |
html_text | Extract attributes, text and tag name from HTML. |
jump_to | Navigate to a new URL. |
pluck | Extract elements of a list by position. |
session_history | History navigation tools. |
set_values | Set values in a form. |
submit_form | Submit a form back to the server. |
html
Parse the page source at a given URL.
> html(x = "http://google.com")
Error in as.matrix(object): argument "object" is missing, with no default
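The parsing step can also be tried without a network connection. The sketch below is a minimal illustration, assuming a recent rvest in which read_html() (re-exported from xml2) takes the place of html(); the HTML string is made up for the example and stands in for a URL.

```r
library(rvest)

# A literal HTML string stands in for a URL so the example runs offline.
# (Hypothetical markup, not taken from any real page.)
page <- read_html(
  "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
)

# The parsed result is an xml2 document that other rvest functions accept.
is_document <- inherits(page, "xml_document")

# Pull the <title> text out of the parsed document.
page_title <- html_text(html_node(page, "title"))
print(page_title)
```

Passing a URL instead of the string fetches and parses the remote page in the same way.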
html_form
Extract information about the input forms on a page.
> html("https://hadley.wufoo.com/forms/r-journal-submission/") %>% html_form()
Error in object[, i]: incorrect number of dimensions
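To show what html_form() returns, here is a small offline sketch. The form markup is invented for illustration, and read_html() is assumed to be available (it replaces html() in later rvest releases).

```r
library(rvest)

# Hypothetical page containing a single search form.
page <- read_html('
  <html><body>
    <form action="/search" method="get">
      <input type="text" name="q" value="">
      <input type="text" name="hl" value="en">
    </form>
  </body></html>')

# html_form() on a document returns a list with one entry per <form>.
forms <- html_form(page)
form <- forms[[1]]

# Each form records its method, action and input fields.
field_names <- names(form$fields)
print(form$method)
print(field_names)
```

The individual form object, not the list, is what set_values and submit_form expect.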
html_node, html_nodes
Scrape a page by selecting nodes.
Specify exactly one of the css or xpath arguments.
> html("http://www.boxofficemojo.com/movies/?id=ateam.htm") %>% html_nodes(., "center")
Error in object[, i]: incorrect number of dimensions
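The css and xpath arguments are two ways to express the same selection. The sketch below uses a made-up HTML fragment (an assumption for illustration, parsed with read_html()) and selects the same nodes both ways.

```r
library(rvest)

# Hypothetical fragment with two <center> elements and one <p>.
page <- read_html(
  "<html><body><center><b>A</b></center><center><b>B</b></center><p>C</p></body></html>"
)

# Select every <center> node with a CSS selector ...
by_css <- html_nodes(page, css = "center")

# ... or with the equivalent XPath expression.
by_xpath <- html_nodes(page, xpath = "//center")

# html_node() (singular) returns only the first match.
first <- html_node(page, "center")

print(html_text(by_css))
print(html_text(first))
```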
html_session
Start a browsing session and retrieve its session information.
> s <- html_session("http://had.co.nz")
Error in eval(expr, envir, enclos): could not find function "html_session"
> s %>% jump_to("thesis") %>% jump_to("/") %>% session_history()
Error in function_list[[1L]](value): could not find function "session_history"
> s %>% jump_to("thesis") %>% back() %>% session_history()
Error in function_list[[1L]](value): could not find function "session_history"
> s %>% follow_link(css = "p a")
Error in function_list[[1L]](value): could not find function "follow_link"
html_table
Get data from a table element.
> births <- html("http://www.ssa.gov/oact/babynames/numberUSbirths.html")
Error in object[, i]: incorrect number of dimensions
> html_nodes(births, "table")[[2]] %>% html_table() %>% head()
Error in eval(expr, envir, enclos): could not find function "html_nodes"
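The same select-then-convert pattern can be sketched offline. The table below contains made-up placeholder values (not real birth statistics) and is parsed with read_html(), which is assumed to be available.

```r
library(rvest)

# Hypothetical table; the numbers are placeholders, not real data.
page <- read_html("
  <table>
    <tr><th>year</th><th>births</th></tr>
    <tr><td>2001</td><td>10</td></tr>
    <tr><td>2002</td><td>12</td></tr>
  </table>")

# Pick one <table> node, then convert it to a data frame.
table_node <- html_nodes(page, "table")[[1]]
tbl <- html_table(table_node)

print(head(tbl))
```

The header row is taken from the th cells, so the columns of the resulting data frame are named year and births.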
html_text
Extract text from elements.
> movie <- html("http://www.imdb.com/title/tt1490017/")
Error in object[, i]: incorrect number of dimensions
> cast <- html_nodes(movie, "#titleCast span.itemprop")
Error in eval(expr, envir, enclos): could not find function "html_nodes"
> html_text(cast)
Error in eval(expr, envir, enclos): could not find function "html_text"
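As an offline illustration of text extraction, the sketch below reuses the selector from the transcript on an invented fragment. The cast names and the link are placeholders, and read_html() is assumed to be available.

```r
library(rvest)

# Hypothetical fragment shaped like a cast list; names are placeholders.
page <- read_html('
  <div id="titleCast">
    <span class="itemprop">Actor One</span>
    <span class="itemprop">Actor Two</span>
    <a href="/more">more</a>
  </div>')

# Select the cast entries, then extract their text content.
cast <- html_nodes(page, "#titleCast span.itemprop")
cast_names <- html_text(cast)

# Attributes are read with html_attr() rather than html_text().
link_target <- html_attr(html_node(page, "a"), "href")

print(cast_names)
print(link_target)
```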
jump_to / follow_link
Jump from the current URL to a new URL.
> s <- html_session("http://had.co.nz")
Error in eval(expr, envir, enclos): could not find function "html_session"
> s %>% jump_to("thesis/")
Error in function_list[[1L]](value): could not find function "jump_to"
> s %>% follow_link("vita")
Error in function_list[[1L]](value): could not find function "follow_link"
> s %>% follow_link(3)
Error in function_list[[1L]](value): could not find function "follow_link"
> # s %>% jump_to("thesis/") %$% browseURL(url = url)
> # Open the page at the new URL in a browser
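The relative paths passed to jump_to ("thesis/", "/") are interpreted against the session's current URL. Assuming that resolution follows the usual relative-URL rules, it can be previewed offline with xml2::url_absolute(); this is a sketch of the URL arithmetic only, not of the HTTP request that jump_to performs.

```r
library(xml2)

# Current location of the (hypothetical) session.
current_url <- "http://had.co.nz/"

# A relative path is appended to the current location.
thesis_url <- url_absolute("thesis/", current_url)

# A leading slash goes back to the root of the same host.
root_url <- url_absolute("/", thesis_url)

print(thesis_url)
print(root_url)
```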
session_history
set_values
Define the values to submit with a form.
> search <- html_form(read_html("http://www.google.com"))[[1]]
Error in eval(expr, envir, enclos): could not find function "html_form"
> set_values(search, q = "My little pony")
Error in eval(expr, envir, enclos): could not find function "set_values"
> set_values(search, hl = "fr")
Error in eval(expr, envir, enclos): could not find function "set_values"
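Filling in a form can be tried offline on an invented form. One caveat, stated as an assumption: later rvest releases rename set_values() to html_form_set(), so the sketch picks whichever of the two the installed version provides. The form markup is hypothetical.

```r
library(rvest)

# Hypothetical search form with one text field named "q".
page <- read_html('
  <form action="/search" method="get">
    <input type="text" name="q" value="">
  </form>')
search <- html_form(page)[[1]]

# Use html_form_set() where it exists, otherwise fall back to set_values().
has_new_name <- exists("html_form_set", envir = asNamespace("rvest"), inherits = FALSE)
set_fun <- if (has_new_name) rvest::html_form_set else rvest::set_values

# The setter returns a modified copy; `search` itself is left unchanged.
filled <- set_fun(search, q = "My little pony")

print(filled$fields$q$value)
```

The filled form is what would then be passed on to submit_form together with a session.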