Text Analysis

Arnold, Taylor; Tilton, Lauren

doi:10.1007/978-3-319-20702-5_10

Part of the book series: Quantitative Methods in the Humanities and Social Sciences (QMHSS)

Abstract

In this chapter, several methods for extracting meaning from a collection of parsed textual documents are presented. Examples include information retrieval, topic modeling, and stylometrics. Particular focus is placed on how to use these methods for constructing visualizations of textual corpora and a high-level categorization of some narrative trends.

Notes

1. For the row names of the matrix tf, we have used the last ten characters of the Wikipedia filename so that the output fits within the page width of this book. On your own machine, you may wish to use the entire filename.
2. We have rounded to three decimal places for readability.
3. http://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html.
4. The precise mathematical formulation of LDA is fairly involved and we will not give a full specification here. For a more detailed description, see the original LDA paper [1].
5. Note that the output of each topic model will be slightly different, even when using the same text and parameters. The supplementary materials include a copy of the object tm, which exactly replicates the results in this text.
6. The code snippet here will produce a single plot with all of the documents in one column. The actual figure shown in the text was produced by running it separately over subsets of the rows of the matrix mat (1:60, 61:120, and 121:179).
7. In general, the term bigram can refer to a sequence of any two objects, such as words, letters, or morphemes. More broadly, the term N-gram refers to a sequence of N objects.
8. We normalize these counts by the total number of bigrams so that the counts are not inflated in the longer texts.
9. The final, incomplete block is ignored.
10. It may seem that these counts would incorrectly include other uses of these marks, such as abbreviations like “Dr.” and “Mr.”. However, in these cases the tokenizer will not separate the period from the remainder of the word. These symbols are only defined as separate lemmas when they mark sentence boundaries (the one caveat being that this would include sentences found within long embedded quotes).
11. http://tslp.acm.org/.
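
Notes 7 to 9 describe three small operations: forming N-grams, normalizing bigram counts by the total number of bigrams, and dropping a final incomplete block. The following is a minimal sketch of those ideas only. The chapter's own code is written in R and is not reproduced here; this Python version is an independent illustration, and the function names (ngrams, relative_bigram_freq, complete_blocks) and the toy sentence are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_bigram_freq(tokens):
    """Bigram counts divided by the total number of bigrams in the text.

    Dividing by the total keeps long texts from showing larger values
    simply because they contain more bigrams.
    """
    bigrams = ngrams(tokens, 2)
    total = len(bigrams)
    if total == 0:
        return {}
    counts = Counter(bigrams)
    return {bigram: count / total for bigram, count in counts.items()}


def complete_blocks(tokens, size):
    """Split tokens into consecutive blocks of `size` tokens.

    A final block shorter than `size` is ignored.
    """
    n_blocks = len(tokens) // size
    return [tokens[i * size:(i + 1) * size] for i in range(n_blocks)]


# Hypothetical toy data: a short text and a longer one built from it.
short_text = "the cat sat on the mat".split()
long_text = short_text * 10

short_freq = relative_bigram_freq(short_text)
long_freq = relative_bigram_freq(long_text)
blocks = complete_blocks(long_text, 25)
```

Raw counts of the bigram ("the", "cat") differ by a factor of ten between the two toy texts, while the normalized values in short_freq and long_freq are on the same scale and can be compared directly. The tokens here are words, but the same ngrams function works on any sequence, such as a list of letters.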

References

1. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
2. Kurt Hornik and Bettina Grün. topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13):1–30, 2011.
3. Dan Knights, Michael C Mozer, and Nicolas Nicolov. Detecting topic drift with compound topic models. In ICWSM, 2009.
4. Maheshkumar H Kolekar, Kannappan Palaniappan, Somnath Sengupta, and Gunasekaran Seetharaman. Semantic concept mining based on hierarchical event detection for soccer video indexing. Journal of Multimedia, 4(5):298–312, 2009.
5. David Mimno. mallet: A wrapper around the Java machine learning tool MALLET, 2013. URL http://CRAN.R-project.org/package=mallet. R package version 1.0.
6. Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.
7. Roger D Peng and Nicolas W Hengartner. Quantitative analysis of literary styles. The American Statistician, 56(3):175–185, 2002.
8. Kevin Dela Rosa, Rushin Shah, Bo Lin, Anatole Gershman, and Robert Frederking. Topical clustering of tweets. Proceedings of the ACM SIGIR: SWSM, 2011.
9. Michael Steinbach, George Karypis, Vipin Kumar, et al. A comparison of document clustering techniques. In KDD Workshop on Text Mining, volume 400, pages 525–526. Boston, MA, 2000.
10. Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
11. Sholom M Weiss, Nitin Indurkhya, and Tong Zhang. Fundamentals of Predictive Text Mining. Springer Science & Business Media, 2010.

Author information

Authors and Affiliations

Yale University, New Haven, CT, USA
Taylor Arnold & Lauren Tilton

About this chapter

Cite this chapter

Arnold, T., Tilton, L. (2015). Text Analysis. In: Humanities Data in R. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-20702-5_10

DOI: https://doi.org/10.1007/978-3-319-20702-5_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20701-8
Online ISBN: 978-3-319-20702-5
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
