Text Quality, Text Variety, and Parsing XML

Jockers, Matthew L.

doi:10.1007/978-3-319-03164-4_10

Part of the book series: Quantitative Methods in the Humanities and Social Sciences (QMHSS)

Abstract

This chapter introduces readers to parsing XML in R with an emphasis on TEI encoded XML.

Notes

1. See Eder, Maciej. “Mind your corpus: systematic errors in authorship attribution.” In Conference Abstracts of the 2012 Digital Humanities Conference, Hamburg, Germany. http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/mind-your-corpus-systematic-errors-in-authorship-attribution/.
2. While it is possible to download all of the available packages, doing so would take a long time and would clog up your installation with many irrelevant features. R is a multipurpose platform used in a huge range of disciplines, including biostatistics, network analysis, economics, data mining, geography, and hundreds of others. This diversity in the user community is one of the great advantages of R and of open-source software more generally. The diversity of options, however, can be daunting to the novice user, and, to make matters more unnerving, the online R user community is notoriously specialized and siloed and can appear rather impatient with newbies asking simple questions. Having said that, the online community is also an incredible resource that you must not ignore. Because R packages are often written by programmers to meet their own immediate needs, they are frequently weak on documentation and generally assume some, if not extensive, familiarity with the programmer's academic discipline (even if the package has applications that cross disciplinary boundaries).
3. Notice the different path here. The XML version of Moby Dick is located in a different subdirectory of the main directory.
4. A node inside another node is often referred to as a “child” node.
5. Notice that the chap.title object is another type of list, which is why further bracketed sub-setting is required to get at the text contents.
6. They won’t be exactly the same because they come from slightly different sources.
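
Notes 4 and 5 are easier to follow with a small example. The sketch below is not the chapter's own code. It assumes the XML package is installed, and it uses a tiny made-up fragment with no TEI namespace in place of the Moby Dick file. The names tei, doc, chapters and title.text are illustrative; chap.title is borrowed from note 5.

```r
library(XML)  # assumption: installed beforehand with install.packages("XML")

# A tiny, made-up TEI-like fragment (hypothetical; not the book's Moby Dick file).
tei <- '<body>
  <div1 type="chapter">
    <head>Chapter 1. Loomings</head>
    <p>Call me Ishmael.</p>
  </div1>
</body>'

# Parse the string into an internal document tree.
doc <- xmlTreeParse(tei, asText = TRUE, useInternalNodes = TRUE)

# <head> and <p> are "child" nodes of <div1>, which is itself a child of <body>.
# getNodeSet returns a list of matching nodes, one per chapter.
chapters <- getNodeSet(doc, "//div1[@type='chapter']")

# xmlElementsByTagName also returns a list, even when only one <head> matches.
chap.title <- xmlElementsByTagName(chapters[[1]], "head")

# Double brackets pull the node out of the list; xmlValue reads its text.
title.text <- xmlValue(chap.title[[1]])
print(title.text)
```

The key step is chap.title[[1]]: xmlElementsByTagName returns a list of nodes, so the node has to be pulled out of that list before its text can be read. A real TEI file would typically also need a namespace mapping in the XPath query.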

Author information

Authors and Affiliations

Department of English, University of Nebraska, Lincoln, Nebraska, USA
Matthew L. Jockers

About this chapter

Cite this chapter

Jockers, M.L. (2014). Text Quality, Text Variety, and Parsing XML. In: Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-03164-4_10

DOI: https://doi.org/10.1007/978-3-319-03164-4_10
Published: 04 April 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-03163-7
Online ISBN: 978-3-319-03164-4
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
