Text Analysis

Abstract

In this chapter, several methods for extracting meaning from a collection of parsed textual documents are presented. Examples include information retrieval, topic modeling, and stylometrics. Particular focus is placed on how to use these methods for constructing visualizations of textual corpora and a high-level categorization of some narrative trends.

Notes

  1. For the row names of the matrix tf, we have used the last ten characters of the Wikipedia filename so that the output fits within the page width of this book. On your own machine, you may wish to use the entire filename.

  2. We have rounded to three decimal places for readability.

  3. http://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html.

  4. The precise mathematical formulation of LDA is fairly involved and we will not give a full specification here. For a more detailed description, see the original LDA paper [1].

  5. Note that the output of each topic model will be slightly different, even when using the same text and parameters. The supplementary materials include a copy of the object tm, which exactly replicates the results in this text.

  6. The code snippet here will produce a single plot with all of the documents in one column. The actual figure shown in the text was produced by running it separately over three ranges of rows of the matrix mat (1:60, 61:120, and 121:179).

  7. In general, the term bigram can refer to a sequence of two objects of any type, such as words, letters, or morphemes. More broadly, the term N-gram refers to a sequence of N objects.

  8. We normalize these counts by the total number of bigrams so that the counts are not inflated in the longer texts.

  9. The final, incomplete block is ignored.

  10. It may seem that these counts would incorrectly include other uses of these marks, such as in abbreviations like “Dr.” and “Mr.” However, in these cases the tokenizer will not separate the period from the remainder of the word. These symbols are only defined as separate lemmas when they mark sentence boundaries (the one caveat being that this would include sentences found within long embedded quotes).

  11. http://tslp.acm.org/.
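
The ideas in notes 7 to 9 can be illustrated with a minimal sketch. It is written in Python rather than the R used in the book, and it is not the authors' code: the function names (ngrams, normalized_bigram_freq, full_blocks), the whitespace tokenization, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple (n=2 gives bigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def normalized_bigram_freq(tokens):
    """Bigram counts divided by the total number of bigrams in the text,
    so that longer texts do not receive inflated values."""
    counts = Counter(ngrams(tokens, 2))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: count / total for bigram, count in counts.items()}


def full_blocks(tokens, size):
    """Split tokens into consecutive blocks of `size` tokens,
    ignoring a final incomplete block."""
    n_full = len(tokens) // size
    return [tokens[i * size:(i + 1) * size] for i in range(n_full)]


# Hypothetical sample text, split on whitespace for simplicity.
tokens = "the cat sat on the mat and the cat slept".split()
freq = normalized_bigram_freq(tokens)   # relative frequency of each word bigram
blocks = full_blocks(tokens, 4)         # two blocks of four; the last two tokens are dropped
```

Because the frequencies are relative, they sum to one for any non-empty text, which makes values from texts of different lengths comparable. The same ngrams function works on a list of letters or morphemes, since it only assumes a sequence of objects.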

References

  1. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

  2. Kurt Hornik and Bettina Grün. topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13):1–30, 2011.

  3. Dan Knights, Michael C Mozer, and Nicolas Nicolov. Detecting topic drift with compound topic models. In ICWSM, 2009.

  4. Maheshkumar H Kolekar, Kannappan Palaniappan, Somnath Sengupta, and Gunasekaran Seetharaman. Semantic concept mining based on hierarchical event detection for soccer video indexing. Journal of Multimedia, 4(5):298–312, 2009.

  5. David Mimno. mallet: A wrapper around the Java machine learning tool MALLET, 2013. URL http://CRAN.R-project.org/package=mallet. R package version 1.0.

  6. Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.

  7. Roger D Peng and Nicolas W Hengartner. Quantitative analysis of literary styles. The American Statistician, 56(3):175–185, 2002.

  8. Kevin Dela Rosa, Rushin Shah, Bo Lin, Anatole Gershman, and Robert Frederking. Topical clustering of tweets. Proceedings of the ACM SIGIR: SWSM, 2011.

  9. Michael Steinbach, George Karypis, Vipin Kumar, et al. A comparison of document clustering techniques. In KDD workshop on text mining, volume 400, pages 525–526. Boston, MA, 2000.

  10. Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.

  11. Sholom M Weiss, Nitin Indurkhya, and Tong Zhang. Fundamentals of predictive text mining. Springer Science & Business Media, 2010.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Arnold, T., Tilton, L. (2015). Text Analysis. In: Humanities Data in R. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-20702-5_10
