Clustering

Text Analysis with R for Students of Literature

Abstract

This chapter moves readers from the analysis of one or two texts to a larger corpus. Machine clustering is introduced in the context of an authorship attribution problem.

Notes

  1.

    If you have been working on other sections of this book or on R projects of your own, it might be a good idea to either restart R or to clear the R workspace. To do the latter, just click on the Session menu of the RStudio GUI and select Clear Workspace. This will remove all R objects and functions that you may have been using, wiping the R slate clean, as it were.

  2.

    Patrick Burns has written a 125-page book documenting many of R’s unusual behaviors. The book is informative and entertaining to read. You can find it online at http://www.burns-stat.com.

  3.

    Enter ?regex at the prompt to learn more about regex in R.

  4.

    You can learn more about the useInternalNodes argument in the documentation for the xmlTreeParse function. Basically, setting it to TRUE avoids converting the contents into R objects, which saves a bit of processing time.

  5.

    See Chapter 10, section 10.5 for an explanation of the namespace argument.

  6.

    For a brief overview of how this work is conducted, see Jockers, Matthew L. Macroanalysis: Digital Methods and Literary History. University of Illinois Press, 2013. Pages 63–67.

  7.

    seq_along is a simple R function for generating a sequence of numbers. Check the R-help documentation for details. In this example, I could have just as easily used 1:43 or 1:length(book.freqs.l).

  8.

    Factors are explained in a later section.

  9.

    Other options include using reshape and expressions that leverage the apply family of functions.

  10.

    Remember that the getTEIWordTableList function that we built multiplies all the relative frequencies by 100.

  11.

    For details, consult the documentation for the dist and hclust functions.
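
As a rough illustration of the dist and hclust functions mentioned in the last note, here is a minimal sketch in base R. It is not the chapter's own code: the matrix name freqs.m, the text labels, and all frequency values are invented for the example, and the defaults of both functions (Euclidean distance, complete linkage) are used without any claim that they match the chapter's choices.

```r
# Toy relative-frequency matrix: rows are texts, columns are word features.
# Names and values are hypothetical, chosen only to make two obvious groups.
freqs.m <- matrix(
  c(5.1, 3.0, 1.2,
    5.0, 3.1, 1.1,
    2.0, 4.5, 3.3,
    2.1, 4.4, 3.4),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("textA1", "textA2", "textB1", "textB2"),
                  c("the", "of", "and"))
)

# Pairwise distances between rows (dist defaults to Euclidean distance)
dm <- dist(freqs.m)

# Agglomerative hierarchical clustering (hclust defaults to complete linkage)
cluster <- hclust(dm)

# Cut the tree into two groups and show which text landed in which group
groups <- cutree(cluster, k = 2)
print(groups)
```

Calling plot(cluster) on the result would draw the dendrogram. In an authorship question, texts that merge early in the tree are the ones whose word-frequency profiles are most alike.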

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Jockers, M.L. (2014). Clustering. In: Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-03164-4_11
