Parsing XML Content

  • Deborah Nolan
  • Duncan Temple Lang
Chapter
Part of the Use R! book series (USE R)

Abstract

In this chapter, we explore approaches to parsing XML content within R and extracting content from the various types of elements in an XML document. The primary approach is to parse an XML document into a hierarchical tree object. We show how the tree representation of an XML document (described in Chapter 2) can be treated as a list in R, which makes it easy to navigate nodes and branches in the document. In addition, we demonstrate how to use functions in the XML package that are designed to work with different elements of the tree, such as functions for accessing node names, text content, attribute values, and namespaces. Subsequent chapters introduce XPath (Chapter 4), a powerful XML technology for locating content in an XML document, and describe more complex strategies for extracting XML content (Chapter 5).
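As a rough illustration of the list-style navigation and accessor functions mentioned above, the following sketch parses a small document with the XML package. The catalog document, its element names, and its attribute values are invented for this example and do not come from the chapter; the sketch assumes the XML package is installed.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical document, invented for illustration only.
# It is written without whitespace between elements so that
# no blank text nodes appear among the children.
txt <- paste0(
  '<catalog xmlns:dc="http://purl.org/dc/elements/1.1/">',
  '<book id="b1"><dc:title>First title</dc:title><price>12.50</price></book>',
  '<book id="b2"><dc:title>Second title</dc:title><price>8.00</price></book>',
  '</catalog>'
)

# Parse the text into a tree and fetch its root node
doc  <- xmlParse(txt, asText = TRUE)
root <- xmlRoot(doc)

xmlName(root)   # name of the root element
xmlSize(root)   # number of child nodes

# Treat the tree as a list: index children by position or by name
first <- root[[1]]
price <- xmlValue(first[["price"]])   # text content of <price>

# Attribute values of a single node
id <- xmlGetAttr(first, "id")

# Apply an accessor over all children, much like sapply() over a list
ids    <- unname(sapply(xmlChildren(root), xmlGetAttr, "id"))
prices <- as.numeric(unname(xmlSApply(root, function(b) xmlValue(b[["price"]]))))

# Node names with and without the namespace prefix
title_node <- first[[1]]
xmlName(title_node)               # local name
xmlName(title_node, full = TRUE)  # prefixed name
xmlNamespace(title_node)          # namespace information for the node
```

Indexing with `[[` by position or by name mirrors how ordinary R lists are accessed, which is why the list analogy is convenient for small, regular documents.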

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Deborah Nolan (1)
  • Duncan Temple Lang (2)
  1. Department of Statistics, University of California, Berkeley, USA
  2. Department of Statistics, University of California, Davis, USA
