In this chapter, several methods for extracting meaning from a collection of parsed textual documents are presented. Examples include information retrieval, topic modeling, and stylometrics. Particular focus is placed on how to use these methods for constructing visualizations of textual corpora and a high-level categorization of some narrative trends.
Keywords: Topic Model, Latent Dirichlet Allocation, Stop Word, Code Snippet, Word Model