rvest: Easily Harvest (Scrape) Web Pages

Web scraping with R

> library(rvest)
Loading required package: xml2

Version: 0.2.0.9000


Function         Description
encoding         Guess and repair faulty character encoding.
google_form      Make a link to a Google form given its id.
html             Parse an HTML page.
html_form        Parse forms in a page.
html_nodes       Select nodes from an HTML document.
html_session     Simulate a session in an HTML browser.
html_table       Parse an HTML table into a data frame.
html_text        Extract attributes, text and tag name from HTML.
jump_to          Navigate to a new URL.
pluck            Extract elements of a list by position.
session_history  History navigation tools.
set_values       Set values in a form.
submit_form      Submit a form back to the server.

html

Parse the source of the page at a given URL.

> html(x = "http://google.com")
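
As a minimal offline sketch of the parsing step, the snippet below parses an inline HTML string instead of a live URL. It uses read_html(), which the set_values example in this document also calls; the markup and the variable names page and title are assumptions made for illustration.

```r
library(rvest)

# Inline HTML stands in for a live page (assumption: no network access).
page <- read_html(
  "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
)

# The result is a parsed document that the other functions take as input.
print(class(page))

# Pull the <title> text back out to confirm that parsing worked.
title <- html_text(html_nodes(page, "title"))
print(title)
```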

html_form

Extract information about the input forms on a page.

> html("https://hadley.wufoo.com/forms/r-journal-submission/") %>% html_form()
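
The following sketch shows what form extraction returns, assuming a small hypothetical form defined inline rather than the Wufoo page above. The field names q, hl and btn are invented for the example.

```r
library(rvest)

# Hypothetical page with one form (not the Wufoo form used above).
form_page <- read_html('<html><body>
<form action="/search" method="get">
  <input type="text" name="q" value="">
  <input type="hidden" name="hl" value="en">
  <input type="submit" name="btn" value="Search">
</form>
</body></html>')

# html_form() on a whole document returns a list with one entry per <form>.
forms <- html_form(form_page)
print(length(forms))

# Each form carries its fields, named after the "name" attribute.
field_names <- names(forms[[1]]$fields)
print(field_names)
```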

html_node, html_nodes

Scrape a page by selecting specific nodes.

Specify either the css or the xpath argument, not both.

> html("http://www.boxofficemojo.com/movies/?id=ateam.htm") %>% html_nodes(., "center")
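
To illustrate the choice between the two arguments, this sketch selects the same nodes once with css and once with xpath. The inline markup is a hypothetical stand-in for the Box Office Mojo page.

```r
library(rvest)

# Hypothetical markup with two <center> elements at different depths.
doc <- read_html(
  "<html><body><center><b>A</b></center><div><center>B</center></div></body></html>"
)

# The same selection expressed as a CSS selector and as an XPath expression.
by_css <- html_nodes(doc, css = "center")
by_xpath <- html_nodes(doc, xpath = "//center")
print(length(by_css))
print(length(by_xpath))

# html_node() (singular) keeps only the first match.
first_text <- html_text(html_node(doc, "center"))
print(first_text)
```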

html_session

Start a browser-like session and inspect its information.

> s <- html_session("http://had.co.nz")
> s %>% jump_to("thesis") %>% jump_to("/") %>% session_history()
> s %>% jump_to("thesis") %>% back() %>% session_history()
> s %>% follow_link(css = "p a")
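
The session examples above need network access, so the sketch below models only the idea behind back() and session_history() with plain lists. It is a hypothetical illustration, not rvest's implementation; new_history, visit and go_back are invented names.

```r
# Hypothetical history model: `back` holds earlier URLs, most recent first,
# and `forward` holds URLs that were left by going back.
new_history <- function(url) {
  list(url = url, back = character(), forward = character())
}

# Visiting a new URL pushes the current one onto `back` and clears `forward`.
visit <- function(h, url) {
  list(url = url, back = c(h$url, h$back), forward = character())
}

# Going back pops the latest entry from `back` and remembers where we were.
go_back <- function(h) {
  stopifnot(length(h$back) > 0)
  list(url = h$back[1], back = h$back[-1], forward = c(h$url, h$forward))
}

h <- new_history("http://had.co.nz/")
h <- visit(h, "http://had.co.nz/thesis/")
h <- go_back(h)
print(h$url)
print(h$forward)
```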

html_table

Extract data from a table element.

> births <- html("http://www.ssa.gov/oact/babynames/numberUSbirths.html")
> html_nodes(births, "table")[[2]] %>% html_table() %>% head()
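
A self-contained sketch of the same table-to-data-frame step follows. The table is defined inline and its numbers are invented placeholders, not figures from the SSA page.

```r
library(rvest)

# Hypothetical table; the values are placeholders, not real birth counts.
tbl_page <- read_html("<table>
<tr><th>year</th><th>births</th></tr>
<tr><td>2000</td><td>10</td></tr>
<tr><td>2001</td><td>12</td></tr>
</table>")

# Select the <table> node, then convert it; <th> cells become column names.
births <- html_table(html_nodes(tbl_page, "table")[[1]])
print(births)
```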

html_text

Extract text from elements.

> movie <- html("http://www.imdb.com/title/tt1490017/")
> cast <- html_nodes(movie, "#titleCast span.itemprop")
> html_text(cast)
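
The sketch below applies the same selector to a small inline fragment so that it runs offline. The markup and the names in it are hypothetical, and html_attr() is included to show attribute extraction alongside text extraction.

```r
library(rvest)

# Hypothetical fragment mirroring the selector used above.
cast_page <- read_html('<div id="titleCast">
<span class="itemprop">Actor One</span>
<span class="itemprop">Actor Two</span>
<span class="other">skip me</span>
</div>')

cast <- html_nodes(cast_page, "#titleCast span.itemprop")

# Text content of each matched node, as a character vector.
cast_names <- html_text(cast)
print(cast_names)

# Attribute values are read with html_attr().
cast_classes <- html_attr(cast, "class")
print(cast_classes)
```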

jump_to

Jump from the current URL to a new URL.

> s <- html_session("http://had.co.nz")
> s %>% jump_to("thesis/")
> s %>% follow_link("vita")
> s %>% follow_link(3)
> # s %>% jump_to("thesis/") %$% browseURL(url = url)
> # Open the page at the new URL in a browser
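
Relative targets such as "thesis/" are interpreted against the session's current URL. As a sketch of that resolution step alone, assuming standard relative-URL rules and using url_absolute() from xml2 (no request is sent), one could write:

```r
library(xml2)

# Assumed current location of the session.
current <- "http://had.co.nz/"

# A relative path is appended to the current directory.
thesis_url <- url_absolute("thesis/", current)
print(thesis_url)

# A leading slash jumps back to the site root.
root_url <- url_absolute("/", thesis_url)
print(root_url)
```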

session_history

set_values

Define the values to submit with a form.

> search <- html_form(read_html("http://www.google.com"))[[1]]
> set_values(search, q = "My little pony")
> set_values(search, hl = "fr")
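
The same calls can be tried offline against a hypothetical inline form, as in the sketch below. The form and its two text fields are assumptions and do not reproduce the real Google form. set_values() returns a modified copy, so the result must be assigned to keep the change.

```r
library(rvest)

# Hypothetical search form with two text fields named q and hl.
search_page <- read_html('<form action="/search" method="get">
<input type="text" name="q" value="">
<input type="text" name="hl" value="en">
</form>')

search <- html_form(search_page)[[1]]

# Fill in both fields; the original `search` object is left untouched.
filled <- set_values(search, q = "My little pony", hl = "fr")
print(filled$fields$q$value)
print(filled$fields$hl$value)
```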