XML: Tools for Parsing and Generating XML Within R and S-Plus

A package for working with XML documents.

> library(XML)
> library(magrittr) # provides the %>% pipe used in the examples below
> doc <- xmlTreeParse(system.file("exampleData", "mtcars.xml", package = "XML"))

Version: 3.98.1.4


Function Description
Doctype Constructor for DTD reference
Doctype-class Class to describe a reference to an XML DTD
ExternalReference-class Classes for working with XML Schema
SAXState-class A virtual base class defining methods for SAX parsing
XMLAttributes-class Class '"XMLAttributes"'
XMLCodeFile-class Simple classes for identifying an XML document containing R code
XMLInternalDocument-class Class to represent reference to C-level data structure for an XML document
XMLNode-class Classes to describe an XML node object.
[.XMLNode Convenience accessors for the children of XMLNode objects.
[<-.XMLNode Assign sub-nodes to an XML node
addChildren Add child nodes to an XML node
addNode Add a node to a tree
append.xmlNode Add children to an XML node
asXMLNode Converts non-XML node objects to XMLTextNode objects
asXMLTreeNode Convert a regular XML node to one for use in a "flat" tree
catalogLoad Manipulate XML catalog contents
catalogResolve Look up an element via the XML catalog mechanism
compareXMLDocs Indicate differences between two XML documents
docName Accessors for name of XML document
dtdElement Gets the definition of an element or entity from a DTD.
dtdElementValidEntry Determines whether an XML element allows a particular type of sub-element.
dtdIsAttribute Query if a name is a valid attribute of a DTD element.
dtdValidElement Determines whether an XML tag is valid within another.
ensureNamespace Ensure that the node has a definition for particular XML namespaces
findXInclude Find the XInclude node associated with an XML node
free Release the specified object and clean up its memory usage
genericSAXHandlers SAX generic callback handler list
getChildrenStrings Get the individual
getEncoding Determines the encoding for an XML document or node
getHTMLLinks Get links or names of external files in HTML document
getLineNumber Determine the location - file & line number of an (internal) XML node
getNodeSet Find matching nodes in an internal XML tree/DOM
getRelativeURL Compute name of URL relative to a base URL
getSibling Manipulate sibling XML nodes
getXIncludes Find the documents that are XInclude'd in an XML document
getXMLErrors Get XML/HTML document parse errors
isXMLString Facilities for working with XML strings
length.XMLNode Determine the number of children in an XMLNode object.
libxmlVersion Query the version and available features of the libxml library.
makeClassTemplate Create S4 class definition based on XML node(s)
names.XMLNode Get the names of an XML node's children.
newXMLDoc Create internal XML node or document object
newXMLNamespace Add a namespace definition to an XML node
parseDTD Read a Document Type Definition (DTD)
parseURI Parse a URI string into its elements
parseXMLAndAdd Parse XML content and add it to a node
print.XMLAttributeDef Methods for displaying XML objects
processXInclude Perform the XInclude substitutions
readHTMLList Read data in an HTML list or all lists in a document
readHTMLTable Read data from one or more HTML tables
readKeyValueDB Read an XML property-list style document
readSolrDoc Read the data from a Solr document
removeXMLNamespaces Remove namespace definitions from a XML node or document
saveXML Output internal XML Tree
setXMLNamespace Set the name space on a node
startElement.SAX Generic Methods for SAX callbacks
supportsExpat Determines which native XML parsers are being used.
toHTML Create an HTML representation of the given R object, using internal C-level nodes
toString.XMLNode Creates string representation of XML node
xmlApply Applies a function to each of the children of an XMLNode
xmlAttributeType The type of an XML attribute for element from the DTD
xmlAttrs Get the list of attributes of an XML node.
xmlChildren Gets the sub-nodes within an XMLNode object.
xmlCleanNamespaces Remove redundant namespaces on an XML document
xmlClone Create a copy of an internal XML document or node
xmlContainsEntity Checks if an entity is defined within a DTD.
xmlDOMApply Apply function to nodes in an XML tree/DOM.
xmlElementSummary Frequency table of names of elements and attributes in XML content
xmlElementsByTagName Retrieve the children of an XML node with a specific tag name
xmlEventHandler Default handlers for the SAX-style event XML parser
xmlEventParse XML Event/Callback element-wise Parser
xmlFlatListTree Constructors for trees stored as flat list of nodes with information about parents and children.
xmlGetAttr Get the value of an attribute in an XML node
xmlHandler Example XML Event Parser Handler Functions
xmlName Extracts the tag name of an XMLNode object.
xmlNamespace Retrieve the namespace value of an XML node.
xmlNamespaceDefinitions Get definitions of any namespaces defined in this XML node
xmlNode Create an XML node
xmlOutputBuffer XML output streams
xmlParent Get parent node of XMLInternalNode or ancestor nodes
xmlParseDoc Parse an XML document with options controlling the parser.
xmlParserContextFunction Identifies function as expecting an xmlParserContext argument
xmlRoot Get the top-level XML node.
xmlSchemaValidate Validate an XML document relative to an XML schema
xmlSearchNs Find a namespace definition object by searching ancestor nodes
xmlSerializeHook Functions that help serialize and deserialize XML internal objects
xmlSize The number of sub-elements within an XML node.
xmlSource Source the R code, examples, etc. from an XML document
xmlStopParser Terminate an XML parser
xmlStructuredStop Condition/error handler functions for XML parsing
xmlToDataFrame Extract data from a simple XML document
xmlToList Convert an XML node/document to a more R-like list
xmlToS4 General mechanism for mapping an XML node to an S4 object
xmlTree An internal, updatable DOM object for building XML trees
xmlTreeParse XML Parser
xmlValue Extract or set the contents of a leaf XML node

getNodeSet / xpathApply / xpathSApply / matchNamespaces

> doc <- xmlParse(system.file("exampleData", "tagnames.xml", package = "XML"))
> getNodeSet(doc, "/doc//a[@status]")
[[1]]
<a status="xyz"/> 

[[2]]
<a status="1"/> 

attr(,"class")
[1] "XMLNodeSet"
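
The heading above also lists xpathSApply(), which the transcript does not exercise. The following is a minimal sketch of how it relates to getNodeSet(), assuming a small inline document (`txt` is hypothetical and is not one of the package's example files).

```r
library(XML)

# Hypothetical inline document, parsed into internal (C-level) nodes
txt <- '<doc><a status="xyz"/><b><a status="1"/><a/></b></doc>'
doc <- xmlParse(txt, asText = TRUE)

# getNodeSet() returns a list of the nodes that match the XPath expression
nodes <- getNodeSet(doc, "/doc//a[@status]")
length(nodes)

# xpathSApply() applies a function to each match and simplifies the result;
# extra arguments ("status" here) are passed on to the function
status <- xpathSApply(doc, "/doc//a[@status]", xmlGetAttr, "status")
status
```

The `<a/>` element without a status attribute is excluded by the `[@status]` predicate in both calls.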

names.XMLNode

> xmlRoot(doc) %>% names()
  comment         a 
"comment"       "a"
> xmlRoot(doc) %>% .[names(.) == "variables"]
named list()
attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"

newXMLDoc

Creates an XML node or document.
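
As a rough illustration of building a document from scratch, the sketch below combines newXMLDoc() with newXMLNode(). The element names and attribute values are invented for the example.

```r
library(XML)

# Create an empty internal document, then attach a root node to it
doc  <- newXMLDoc()
root <- newXMLNode("dataset", attrs = c(name = "demo"), doc = doc)

# Child nodes are attached through the parent argument;
# an unnamed character argument becomes the node's text content
newXMLNode("variable", "mpg", parent = root)
newXMLNode("variable", "cyl", parent = root)

# Serialize the result to inspect it
cat(saveXML(doc))
```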

readHTMLList

Reads list elements from an HTML document.

Arguments

  • doc
  • trim
  • elFun
  • which
  • ...
> readHTMLList("http://suryu.me/rpkg_showcase/dataset/index.html", which = 14)
 [1] "13.1.\n                        \n                        agricolae"  
 [2] "13.2.\n                        \n                        biotools"   
 [3] "13.3.\n                        \n                        changepoint"
 [4] "13.4.\n                        \n                        describer"  
 [5] "13.5.\n                        \n                        DescTools"  
 [6] "13.6.\n                        \n                        gam"        
 [7] "13.7.\n                        \n                        Kendall"    
 [8] "13.8.\n                        \n                        lmtest"     
 [9] "13.9.\n                        \n                        mgcv"       
[10] "13.10.\n                        \n                        MuMIn"     
[11] "13.11.\n                        \n                        outliers"  
[12] "13.12.\n                        \n                        smatr"     
[13] "13.13.\n                        \n                        statcheck" 
[14] "13.14.\n                        \n                        stats"
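
The items above keep the newlines and indentation that sit inside each list item, because trimming only affects the ends of a string. A minimal offline sketch of the same call pattern, assuming a hypothetical inline HTML snippet in place of the remote page, with the interior whitespace collapsed afterwards:

```r
library(XML)

# Hypothetical HTML with two lists; the second mimics the embedded line breaks
html <- paste0(
  "<html><body>",
  "<ul><li>first</li><li>second</li></ul>",
  "<ol><li>13.1.\n      agricolae</li><li>13.2.\n      biotools</li></ol>",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# Without which, every list in the document is returned (as a list of vectors)
lists <- readHTMLList(doc)

# which selects a single list by position
second <- readHTMLList(doc, which = 2)

# Collapse runs of whitespace that remain inside each item
clean <- gsub("[[:space:]]+", " ", second)
clean
```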

readHTMLTable

Reads table elements from an HTML document.

Arguments

  • doc
  • header
  • colClasses
  • skip.rows
  • trim
  • elFun
  • as.data.frame
  • which
  • ...
> tables <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population" %>%
+   RCurl::getURL() %>% 
+   readHTMLTable()
> str(tables)
List of 4
 $ NULL: NULL
 $ NULL: NULL
 $ NULL:'data.frame':    244 obs. of  6 variables:
  ..$ Rank                            : Factor w/ 196 levels "–","1","10","100",..: 2 109 120 131 142 153 164 175 186 3 ...
  ..$ Country (or dependent territory): Factor w/ 244 levels "Abkhazia[Note 19]",..: 44 95 233 96 30 160 152 18 175 134 ...
  ..$ Population                      : Factor w/ 244 levels "1,132,657","1,167,242",..: 8 6 129 102 92 76 72 61 53 48 ...
  ..$ Date                            : Factor w/ 59 levels "April 1, 2016",..: 54 54 54 26 54 54 26 54 4 26 ...
  ..$ % of world
population          : Factor w/ 181 levels "0.00000075%",..: 175 174 181 180 179 178 177 176 173 172 ...
  ..$ Source                          : Factor w/ 28 levels "2008 census result",..: 13 28 13 14 13 13 26 13 12 14 ...
 $ NULL:'data.frame':    24 obs. of  2 variables:
  ..$ V1: Factor w/ 13 levels "","Cities","Continental",..: 1 13 1 3 1 12 1 2 1 10 ...
  ..$ V2: Factor w/ 11 levels "Age at first marriage\nDivorce rate\nDomestic citizens\nEthnic and cultural diversity level\nForeign-born population\nImmigrant"| __truncated__,..: NA 7 NA 4 NA 2 NA 9 NA 10 ...
> tables[[1]]
NULL
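
In the output above every column is a factor and the population figures contain thousands separators, so they cannot be used as numbers directly. The sketch below shows one way to get a usable data frame, assuming a hypothetical inline table rather than the live Wikipedia page. Passing stringsAsFactors through `...` is a commonly used idiom here rather than a listed argument.

```r
library(XML)

# Hypothetical table with a header row and comma-formatted numbers
html <- paste0(
  "<html><body><table>",
  "<tr><th>Country</th><th>Population</th></tr>",
  "<tr><td>A</td><td>1,000</td></tr>",
  "<tr><td>B</td><td>2,500</td></tr>",
  "</table></body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# which = 1 returns the first table itself instead of a list of tables
tbl <- readHTMLTable(doc, header = TRUE, which = 1, stringsAsFactors = FALSE)

# Strip the thousands separators before converting to numeric;
# as.character() keeps this safe even if the column arrived as a factor
tbl$Population <- as.numeric(gsub(",", "", as.character(tbl$Population)))
str(tbl)
```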

saveXML
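
No example is given for saveXML() here, so the following is only a minimal sketch of its two common uses: returning the serialized document as a string, and writing it to a file. The document content and the temporary path are made up for the illustration.

```r
library(XML)

doc <- xmlParse('<doc><a status="xyz"/></doc>', asText = TRUE)

# With no file argument, saveXML() returns the document as a character string
txt <- saveXML(doc)
cat(txt)

# With a file argument, the document is written to disk
path <- tempfile(fileext = ".xml")
saveXML(doc, file = path)

# Parse the file again to confirm the round trip
doc2 <- xmlParse(path)
xmlName(xmlRoot(doc2))
```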

xmlChildren

Gets the sub-nodes within an XML node object.

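> # doc$doc$children is only available on the R-level tree returned by xmlTreeParse();
> # a document from xmlParse() is an external pointer and cannot be subset with $
> doc <- xmlTreeParse(system.file("exampleData", "mtcars.xml", package = "XML"))
> xmlChildren(doc$doc$children[["dataset"]]) %>% names()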
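
Documents parsed with xmlParse() hold a C-level pointer, so they cannot be subset with `$`. Going through xmlRoot() avoids depending on the representation. A minimal sketch, assuming a hypothetical inline document shaped like mtcars.xml:

```r
library(XML)

# Hypothetical document with one <variables> node and two <record> nodes
txt <- paste0(
  "<dataset><variables><variable>mpg</variable></variables>",
  '<record id="r1">1</record><record id="r2">2</record></dataset>'
)

# Internal (C-level) document from xmlParse()
internal <- xmlParse(txt, asText = TRUE)
kids_internal <- names(xmlChildren(xmlRoot(internal)))

# R-level tree from xmlTreeParse()
tree <- xmlTreeParse(txt, asText = TRUE)
kids_tree <- names(xmlChildren(xmlRoot(tree)))

kids_internal
kids_tree
```

Both calls list the child element names in document order.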

xmlGetAttr

> doc <- xmlParse(system.file("exampleData", "tagnames.xml", package = "XML"))
> els <- getNodeSet(doc, "/doc//a[@status]")
> sapply(els, function(el) xmlGetAttr(el, "status"))
[1] "xyz" "1"
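
xmlGetAttr() also accepts `default` and `converter` arguments, which help when an attribute may be missing or should not remain a string. A short sketch, using a hypothetical inline document in which one node lacks the attribute:

```r
library(XML)

doc <- xmlParse('<doc><a status="1"/><a/></doc>', asText = TRUE)
els <- getNodeSet(doc, "//a")

# default is returned for nodes that do not carry the attribute
status <- sapply(els, xmlGetAttr, "status", default = NA_character_)
status

# converter is applied to the raw string when the attribute is present
first <- xmlGetAttr(els[[1]], "status", converter = as.integer)
first
```

Without `default`, a missing attribute yields NULL, which makes sapply() return a list rather than a character vector.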

xmlRoot

> xmlTreeParse(system.file("exampleData", "mtcars.xml", package="XML")) %>% xmlRoot()
<dataset name="mtcars" numRecords="32" source="R Project">
 <variables count="11">
  <variable unit="Miles/gallon">mpg</variable>
  <variable>cyl</variable>
  <variable>disp</variable>
  <variable>hp</variable>
  <variable>drat</variable>
  <variable>wt</variable>
  <variable>qsec</variable>
  <variable>vs</variable>
  <variable type="FactorVariable" levels="automatic,manual">am</variable>
  <variable>gear</variable>
  <variable>carb</variable>
 </variables>
 <record id="Mazda RX4">21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4</record>
 <record id="Mazda RX4 Wag">21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4</record>
 <record id="Datsun 710">22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1</record>
 <record id="Hornet 4 Drive">21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1</record>
 <record id="Hornet Sportabout">18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2</record>
 <record id="Valiant">18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1</record>
 <record id="Duster 360">14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4</record>
 <record id="Merc 240D">24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2</record>
 <record id="Merc 230">22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2</record>
 <record id="Merc 280">19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4</record>
 <record id="Merc 280C">17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4</record>
 <record id="Merc 450SE">16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3</record>
 <record id="Merc 450SL">17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3</record>
 <record id="Merc 450SLC">15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3</record>
 <record id="Cadillac Fleetwood">10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4</record>
 <record id="Lincoln Continental">10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4</record>
 <record id="Chrysler Imperial">14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4</record>
 <record id="Fiat 128">32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1</record>
 <record id="Honda Civic">30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2</record>
 <record id="Toyota Corolla">33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1</record>
 <record id="Toyota Corona">21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1</record>
 <record id="Dodge Challenger">15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2</record>
 <record id="AMC Javelin">15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2</record>
 <record id="Camaro Z28">13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4</record>
 <record id="Pontiac Firebird">19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2</record>
 <record id="Fiat X1-9">27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1</record>
 <record id="Porsche 914-2">26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2</record>
 <record id="Lotus Europa">30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2</record>
 <record id="Ford Pantera L">15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4</record>
 <record id="Ferrari Dino">19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6</record>
 <record id="Maserati Bora">15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8</record>
 <record id="Volvo 142E">21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2</record>
</dataset>
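
Once the root node is available, the accessor functions from the table at the top can be used to inspect it instead of printing the whole tree. A minimal sketch on the same mtcars.xml file. Splitting the record text on whitespace is an assumption about the file layout, based on the output printed above.

```r
library(XML)

path <- system.file("exampleData", "mtcars.xml", package = "XML")
root <- xmlRoot(xmlTreeParse(path))

xmlName(root)    # tag name of the top-level node
xmlAttrs(root)   # its attributes as a named character vector
xmlSize(root)    # number of child nodes

# Count children by tag name
child_names <- names(xmlChildren(root))
table(child_names)

# Text of the first <record>, split into numeric fields
first_record <- xmlValue(root[["record"]])
fields <- as.numeric(strsplit(trimws(first_record), "[[:space:]]+")[[1]])
fields
```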

xmlToDataFrame

Creates a data frame from an XML document.

Arguments

  • doc
  • colClasses
  • homogeneous
  • collectNames
  • nodes
  • stringsAsFactors
> path <- system.file("exampleData", "size.xml", package = "XML")
> xmlToDataFrame(doc = path, c("integer", "integer", "numeric")) %>% {
+   class(.) %>% print() # the result has class data.frame
+   print(.)
+ }
[1] "data.frame"
  age sex number
1   0   0    500
2   0   1    300
3   1   0    200
4   1   1    400
5  10   0     NA
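
The `nodes` argument allows only a subset of the records to be converted, typically one selected with getNodeSet(). The sketch below assumes a hypothetical inline document shaped like size.xml and shows both the whole-document call and the `nodes` variant.

```r
library(XML)

# Hypothetical records: one element per row, one child element per column
txt <- paste0(
  "<rows>",
  "<row><age>0</age><sex>0</sex><number>500</number></row>",
  "<row><age>0</age><sex>1</sex><number>300</number></row>",
  "<row><age>1</age><sex>1</sex><number>400</number></row>",
  "</rows>"
)
doc <- xmlParse(txt, asText = TRUE)

# Whole document, with explicit column types
df <- xmlToDataFrame(doc, colClasses = c("integer", "integer", "numeric"),
                     stringsAsFactors = FALSE)
str(df)

# Only the rows matched by an XPath expression
sub <- xmlToDataFrame(nodes = getNodeSet(doc, "//row[sex='1']"),
                      stringsAsFactors = FALSE)
sub
```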

xmlTreeParse / xmlInternalTreeParse / xmlNativeTreeParse / htmlTreeParse / htmlParse

XML parsers.

xmlParse() and htmlParse() behave the same as xmlTreeParse() and htmlTreeParse(), respectively, called with the useInternalNodes argument set to TRUE.
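
The difference between the two representations shows up in the class of the returned object. A minimal sketch, assuming small inline strings (passed with asText = TRUE) rather than files:

```r
library(XML)

txt <- "<foo x='1'><a>text</a></foo>"

# Default: an R-level tree
tree <- xmlTreeParse(txt, asText = TRUE)
class(tree)

# Internal (C-level) document, obtained in two ways
internal1 <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
internal2 <- xmlParse(txt, asText = TRUE)
class(internal1)
class(internal2)

# XPath queries are typically run against the internal representation
a_text <- xpathSApply(internal2, "//a", xmlValue)
a_text

# The HTML counterpart returns an internal HTML document
html <- htmlParse("<html><body><p>hi</p></body></html>", asText = TRUE)
class(html)
```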

Arguments

  • file
  • ignoreBlanks
  • handlers
  • replaceEntities
  • asText... TRUE when the input is given as a string
  • trim
  • validate
  • getDTD
  • isURL
  • asTree
  • addAttributeNamespaces
  • useInternalNodes
  • isSchema
  • fullNamespaceInfo
  • encoding
  • useDotNames
  • xinclude
  • addFinalizer
  • error
  • isHTML
  • options
  • parentFirst
> system.file("exampleData", "test.xml", package = "XML") %>% 
+   xmlTreeParse() %>% 
+   xmlRoot()
<foo x="1">
 <element attrib1="my value"/>
  test entity bar <?R sum(rnorm(100))?>
 <a>
  <!--A comment-->
  <b>%extEnt;</b>
 </a>
 <![CDATA[
 This is escaped data
 containing < and &. ]]>
 Note that this caused a segmentation fault if replaceEntities was 
not TRUE.
That is,
 <code>xmlTreeParse(&quot;test.xml&quot;, replaceEntities = TRUE)</code>
 works, but
 <code>xmlTreeParse(&quot;test.xml&quot;)</code>
 does not if this is called before the one above.
This is now fixed and was caused by
treating an xmlNodePtr in the C code 
that had type XML_ELEMENT_DECL
and so was in fact an xmlElementPtr.
Aaah, C and casting!
</foo>