Abstract
In this chapter, several methods for extracting meaning from a collection of parsed textual documents are presented. Examples include information retrieval, topic modeling, and stylometrics. Particular focus is placed on how to use these methods for constructing visualizations of textual corpora and a high-level categorization of some narrative trends.
Notes
- 1.
For the row names of the matrix tf, we have used the last ten characters of the Wikipedia filename in order to make the output print better in the width of this book. On your own machine, you may wish to use the entire filename.
- 2.
We have rounded to three decimal places for readability.
- 3.
- 4.
The precise mathematical formulation of LDA is fairly involved, and we will not give a full specification here. For a more detailed description, see the original LDA paper [1].
- 5.
Note that the output of each topic model will be slightly different, even when using the same text and parameters. The supplementary materials include a copy of the object tm, which exactly replicates the results in this text.
- 6.
The code snippet here will produce a single plot with all of the documents in one column. The actual figure shown in the text was produced by running it separately over subsets of the rows of the matrix mat (i.e., 1:60, 61:120, and 121:179).
- 7.
Generally, the term bigram can refer to a sequence of any two objects of the same type, such as words, letters, or morphemes. More generally still, the term N-gram refers to a sequence of N objects.
- 8.
We normalize these counts by the total number of bigrams so that the counts are not inflated for the longer texts.
- 9.
The final, incomplete block is ignored.
- 10.
It may seem that these counts would incorrectly include other uses of these marks, such as abbreviations like “Dr.” and “Mr.”. However, in these cases the tokenizer will not separate the period from the remainder of the word. These symbols are only defined as separate lemmas when they define sentence boundaries (the one caveat being that this would include sentences found within long embedded quotes).
- 11.
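Notes 7 and 8 above describe N-grams and the normalization of bigram counts by the total number of bigrams. The following is a minimal Python sketch of that idea, not the R code used in the chapter; the function names `ngrams` and `normalized_ngram_counts` and the toy sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def normalized_ngram_counts(tokens, n=2):
    """Map each n-gram to its count divided by the total number of n-grams."""
    grams = ngrams(tokens, n)
    total = len(grams)
    if total == 0:
        # Fewer than n tokens: there are no n-grams to count.
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.items()}


# Toy documents (assumed data): the long text repeats the short one ten times.
short_text = "the cat sat on the mat".split()
long_text = short_text * 10

short_freq = normalized_ngram_counts(short_text)
long_freq = normalized_ngram_counts(long_text)

# Raw counts of ("the", "cat") differ by a factor of ten between the two
# documents, but the normalized values are on a comparable scale.
print(short_freq[("the", "cat")], long_freq[("the", "cat")])
```

Because each document's values sum to one, a bigram's normalized frequency can be compared across texts of very different lengths, which raw counts do not allow.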
References
David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
Kurt Hornik and Bettina Grün. topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13):1–30, 2011.
Dan Knights, Michael C Mozer, and Nicolas Nicolov. Detecting topic drift with compound topic models. In ICWSM, 2009.
Maheshkumar H Kolekar, Kannappan Palaniappan, Somnath Sengupta, and Gunasekaran Seetharaman. Semantic concept mining based on hierarchical event detection for soccer video indexing. Journal of Multimedia, 4(5):298–312, 2009.
David Mimno. mallet: A wrapper around the Java machine learning tool MALLET, 2013. URL http://CRAN.R-project.org/package=mallet. R package version 1.0.
Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.
Roger D Peng and Nicolas W Hengartner. Quantitative analysis of literary styles. The American Statistician, 56(3):175–185, 2002.
Kevin Dela Rosa, Rushin Shah, Bo Lin, Anatole Gershman, and Robert Frederking. Topical clustering of tweets. Proceedings of the ACM SIGIR: SWSM, 2011.
Michael Steinbach, George Karypis, Vipin Kumar, et al. A comparison of document clustering techniques. In KDD Workshop on Text Mining, volume 400, pages 525–526. Boston, MA, 2000.
Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
Sholom M Weiss, Nitin Indurkhya, and Tong Zhang. Fundamentals of Predictive Text Mining. Springer Science & Business Media, 2010.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Arnold, T., Tilton, L. (2015). Text Analysis. In: Humanities Data in R. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-20702-5_10
DOI: https://doi.org/10.1007/978-3-319-20702-5_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20701-8
Online ISBN: 978-3-319-20702-5
eBook Packages: Mathematics and Statistics (R0)