Clustering

Jockers, Matthew L.

doi:10.1007/978-3-319-03164-4_11

Part of the book series: Quantitative Methods in the Humanities and Social Sciences (QMHSS)

Abstract

This chapter moves readers from the analysis of one or two texts to a larger corpus. Machine clustering is introduced in the context of an authorship attribution problem.

Notes

1. If you have been working on other sections of this book or on R projects of your own, it might be a good idea either to restart R or to clear the R workspace. To do the latter, click on the Session menu of the RStudio GUI and select Clear Workspace. This will remove all R objects and functions that you may have been using, wiping the R slate clean, as it were.
2. Patrick Burns has written a 125-page book documenting many of R's unusual behaviors. The book is informative and entertaining to read. You can find it online at http://www.burns-stat.com.
3. Enter ?regex at the prompt to learn more about regex in R.
4. You can learn more about the useInternalNodes argument in the documentation for the xmlTreeParse function. Basically, setting it to TRUE avoids converting the contents into R objects, which saves a bit of processing time.
5. See Chapter 10, section 10.5 for an explanation of the namespace argument.
6. For a brief overview of how this work is conducted, see Jockers, Matthew L. Macroanalysis: Digital Methods and Literary History. University of Illinois Press, 2013. Pages 63–67.
7. seq_along is a simple R function for generating a sequence of numbers. Check the R-help documentation for details. In this example, I could have just as easily used 1:43 or 1:length(book.freqs.l).
8. Factors are explained in a later section.
9. Other options include using reshape and expressions that leverage the apply family of functions.
10. Remember that the getTEIWordTableList function that we built multiplies all the relative frequencies by 100.
11. For details, consult the documentation for the dist and hclust functions.
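
As a rough illustration of how the dist and hclust functions mentioned in the last note fit together, here is a minimal sketch in base R. The matrix freqs.m, its row and column names, and every number in it are invented for this example and do not come from the chapter's corpus; the values are simply written as percentages to echo the note about relative frequencies being multiplied by 100. It should be read as one possible arrangement under those assumptions, not as the chapter's own code.

```r
# Hypothetical toy data: rows are texts, columns are word types.
# All values are made up for illustration (relative frequencies x 100).
freqs.m <- matrix(
  c(5.1, 3.0, 1.2,
    5.0, 3.1, 1.1,
    2.0, 4.5, 3.3,
    2.1, 4.4, 3.5),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("textA1", "textA2", "textB1", "textB2"),
                  c("the", "of", "and"))
)

# dist() computes pairwise distances between the rows
# (Euclidean distance is the default method).
dm <- dist(freqs.m)

# hclust() builds a hierarchical clustering from the distance object
# (complete linkage is the default method).
cluster <- hclust(dm)

# cutree() cuts the resulting tree into a requested number of groups.
groups <- cutree(cluster, k = 2)
print(groups)

# In an interactive session, plot(cluster) draws the dendrogram.
```

With these invented values, the two "A" rows are much closer to each other than to the "B" rows, so cutting the tree into two groups separates them. Real word-frequency data will rarely be this tidy, and the choice of distance measure and linkage method can change the result.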

Author information

Authors and Affiliations

Department of English, University of Nebraska, Lincoln, Nebraska, USA
Matthew L. Jockers

About this chapter

Cite this chapter

Jockers, M.L. (2014). Clustering. In: Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-03164-4_11

DOI: https://doi.org/10.1007/978-3-319-03164-4_11
Published: 04 April 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-03163-7
Online ISBN: 978-3-319-03164-4
eBook Packages: Mathematics and Statistics (R0)