Text Quality, Text Variety, and Parsing XML

Text Analysis with R for Students of Literature

Abstract

This chapter introduces readers to parsing XML in R, with an emphasis on TEI-encoded XML.

Notes

  1. See Eder, Maciej. “Mind your corpus: systematic errors in authorship attribution.” In Conference Abstracts of the 2012 Digital Humanities Conference, Hamburg, Germany. http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/mind-your-corpus-systematic-errors-in-authorship-attribution/.

  2. While it is possible to download all of the available packages, doing so would take a long time and would clutter your installation with many irrelevant features. R is a multipurpose platform used in a huge range of disciplines, including biostatistics, network analysis, economics, data mining, geography, and hundreds of other fields and subfields. This diversity in the user community is one of the great advantages of R and of open-source software more generally. The diversity of options, however, can be daunting to the novice user. To make matters more unnerving, the online R user community is notoriously specialized and siloed, and it can appear rather impatient with newcomers asking simple questions. Having said that, the online community is also an incredible resource that you must not ignore. Because R packages are typically written by programmers with some ad hoc motivation behind their coding, the packages are frequently weak on documentation and generally assume some, if not extensive, familiarity with the programmer's academic discipline (even when the package has applications that cross disciplinary boundaries).

  3. Notice the different path here. The XML version of Moby Dick is located in a different subdirectory of the main directory.

  4. A node nested inside another node is often referred to as a “child” node.

  5. Notice that the chap.title object is itself a list, which is why further bracketed subsetting is required to get at the text contents.

  6. They won’t be exactly the same because they come from slightly different sources.
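
The points made in notes 4 and 5 can be illustrated with a minimal sketch. It assumes the XML package is installed and uses a tiny, made-up TEI-style fragment in place of the chapter's actual Moby Dick file; the object names tei.text, first.title, and all.titles are hypothetical, and the chapter's own code may differ.

```r
# Assumption: the XML package is installed (install.packages("XML")).
library(XML)

# A small TEI-like fragment for illustration only; it is not the
# file used in the chapter.
tei.text <- '<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div1 type="chapter" n="1">
      <head>Loomings</head>
      <p>Call me Ishmael.</p>
    </div1>
    <div1 type="chapter" n="2">
      <head>The Carpet-Bag</head>
      <p>I stuffed a shirt or two into my old carpet-bag.</p>
    </div1>
  </body></text>
</TEI>'

# Parse the string into an internal document so XPath can be used.
doc <- xmlTreeParse(tei.text, asText = TRUE, useInternalNodes = TRUE)

# TEI elements live in a namespace, so the XPath query needs a prefix
# mapped to the TEI namespace URI.
chapters <- getNodeSet(doc, "//tei:div1[@type='chapter']",
                       namespaces = c(tei = "http://www.tei-c.org/ns/1.0"))

# <head> is a child node of each <div1> chapter node (note 4).
chap.title <- xmlElementsByTagName(chapters[[1]], "head")

# chap.title is a list of nodes, so [[1]] is needed before the text
# can be extracted with xmlValue() (note 5).
first.title <- xmlValue(chap.title[[1]])

# The same two steps applied to every chapter node.
all.titles <- sapply(chapters, function(ch) {
  xmlValue(xmlElementsByTagName(ch, "head")[[1]])
})
```

Calling xmlValue() on chap.title directly, without the [[1]] subsetting, would hand the function a list rather than a single node, which is the situation note 5 warns about.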

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Jockers, M.L. (2014). Text Quality, Text Variety, and Parsing XML. In: Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-03164-4_10
