
Abstract

This chapter introduces topic modeling using the mallet package, part-of-speech tagging with openNLP, and topic-based word cloud visualization using the wordcloud package. (In this chapter I assume that readers are already familiar with the basic idea behind topic modeling. Readers who are not may consult Appendix B for a general overview and some suggestions for further reading.)


Notes

  1.

    Newman, Smyth, and Steyvers (2006). “Scalable Parallel Topic Models.” Journal of Intelligence Community Research and Development; and Newman and Block (2006). “Probabilistic Topic Decomposition of an Eighteenth Century Newspaper.” JASIST, March 2006.

  2.

    Readers seeking a user-friendly introduction to how topic modeling actually works should consult Appendix B.

  3.

    The topicmodels package provides an implementation of (or interface to) the C code developed by LDA pioneer David Blei. See Blei, David M., Ng, Andrew Y., and Jordan, Michael I. “Latent Dirichlet Allocation.” Journal of Machine Learning Research, 3 (2003) 993–1022.

    Jonathan Chang is a researcher at Facebook who has worked with Blei and coauthored several papers with him, including the influential topic modeling paper: Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., and Blei, D. M. “Reading Tea Leaves: How Humans Interpret Topic Models.” Advances in Neural Information Processing Systems, 2009. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2009_0125.pdf.

    David Mimno, a professor at Cornell, is the developer and maintainer of the Java implementation of LDA in the popular MAchine Learning for LanguagE Toolkit (MALLET), developed at the University of Massachusetts under the direction of Andrew McCallum: McCallum, Andrew Kachites. “MALLET: A Machine Learning for Language Toolkit.” 2002. See http://mallet.cs.umass.edu.

  4.

    Mimno released his R “wrapper” for the MALLET topic modeling package to CRAN on August 9, 2013.

  5.

    I have used all three of these packages to good effect and prior to the release of the mallet package I taught workshops using both topicmodels and lda. Each one has its advantages and disadvantages in terms of ease of use, but functionally they are all comparable.

  6.

    There appears to be no conventional wisdom regarding ideal text-segmentation parameters. David Mimno reports in email correspondence that he frequently chunks texts down to the level of individual paragraphs. Until new research provides an algorithmic alternative, trial and experimentation augmented by domain expertise appear to be the best guides in setting segmentation parameters.
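
    As a concrete illustration, here is a minimal sketch of segment-level chunking, assuming word.v is a character vector holding the words of one novel (as built in earlier chapters) and using a hypothetical chunk.size of 1000:

        # Split word.v into consecutive chunk.size-word segments.
        chunk.size <- 1000
        starts.v <- seq(1, length(word.v), by = chunk.size)
        chunks.l <- lapply(starts.v, function(i) {
          word.v[i:min(i + chunk.size - 1, length(word.v))]
        })
        # chunks.l now holds one chunk.size-word segment per list element.

    Raising or lowering chunk.size is precisely the segmentation parameter under discussion: smaller values approach paragraph-level granularity.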

  7.

    Another problem involves hyphens. Hyphens can appear at the end of lines as a printing convention but also in compound adjectives. We’ll not deal with that trickier problem here.

  8.

    The regular expression used here may appear complicated compared to the simple \\W that has been used thus far. In this case, the expression simply says: “replace anything that is not an alphanumeric character, a whitespace character, or an apostrophe with a blank space.” [:alnum:] matches any alphabetic or numeric character, [:space:] matches any whitespace character, and the ' (apostrophe character) matches apostrophes. The ^ (caret character) at the beginning of the expression serves as a negation operator, in essence indicating that the engine should match anything that is not a letter, digit, space, or apostrophe: i.e., match all other characters!
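
    Applied with gsub, the expression behaves as follows (a small made-up example):

        # Replace everything except letters, digits, whitespace, and
        # apostrophes with a blank space.
        text.v <- "Dorian's smile--half-hidden, wholly strange!"
        gsub("[^[:alnum:][:space:]']", " ", text.v)
        # [1] "Dorian's smile  half hidden  wholly strange "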

  9.

    You did this in Chap. 12 as well.

  10.

    Paul Johnson pointed out that there is a more computationally efficient method for achieving this same result. Instead of building a matrix object inside the loop, build a list and then use do.call to rbind the list elements after the loop is completed.
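
    A sketch of the pattern Johnson describes, with made-up row values for illustration:

        # Collect rows in a pre-allocated list rather than growing a
        # matrix with rbind on every iteration of the loop.
        result.l <- vector("list", 10)
        for (i in 1:10) {
          result.l[[i]] <- c(topic = i, weight = runif(1))
        }
        # Bind all of the rows at once after the loop completes.
        result.m <- do.call(rbind, result.l)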

  11.

    Methods in Java are more or less synonymous with functions in R.

  12.

    mallet’s default is to convert to lowercase.

  13.

    How to set the number of topics is a matter of serious discussion in the topic modeling literature, and there is no obvious way of knowing in advance exactly where this number should be set. In the documentation for the MALLET program, Mimno writes: “The best number depends on what you are looking for in the model. The default (10) will provide a broad overview of the contents of the corpus. The number of topics should depend to some degree on the size of the collection, but 200 to 400 will produce reasonably fine-grained results.” Readers interested in more nuanced solutions may wish to consult Chap. 8 of Jockers, Matthew L. Macroanalysis: Digital Methods and Literary History. University of Illinois Press, 2013, or visit http://www.matthewjockers.net/2013/04/12/secret-recipe-for-topic-modeling-themes/ for my “Secret” Recipe for Topic Modeling Themes.
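
    In the mallet package, the number of topics is set when the model object is created; 43 is the value used in this chapter, not a general recommendation:

        library(mallet)
        topic.model <- MalletLDA(num.topics = 43)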

  14.

    The MALLET program is not terribly difficult to run outside of R and there are now many good tutorials available online. A few of these are specifically written with humanities applications of topic modeling in mind. Perhaps the best place to start is with Shawn Graham, Scott Weingart, and Ian Milligan’s online tutorial titled “Getting Started with Topic Modeling and MALLET.” See http://programminghistorian.org/lessons/topic-modeling-and-mallet.

  15.

    Do not forget that prior to modeling you have chunked each novel from the example corpus into 1,000-word segments.

  16.

    The ramifications of resetting these values are beyond the scope of this chapter, but interested readers may wish to consult Hanna Wallach, David Mimno, and Andrew McCallum. “Rethinking LDA: Why Priors Matter.” In Proceedings of Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 2009.
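
    For readers who do wish to experiment, MalletLDA exposes these hyperparameters as arguments; the values below are purely illustrative, not recommendations:

        # alpha.sum controls the document-topic prior and beta the
        # topic-word prior; values shown are for illustration only.
        topic.model <- MalletLDA(num.topics = 43, alpha.sum = 1, beta = 0.1)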

  17.

    My anecdotal experience seems consistent with more scientific studies, and interested readers may wish to consult Griffiths, T. L., & Steyvers, M. (2004). “Finding scientific topics.” Proceedings of the National Academy of Science, 101, 5228–5235.

  18.

    David Mimno’s “Topic Modeling Bibliography” provides a comprehensive list of resources for those wishing to go beyond this text. See http://www.cs.princeton.edu/%7Emimno/topics.html.

  19.

    Note: if you are copying and executing this code as you read along, your row values and weights are likely to be different because the topic model employs a process that begins with a random distribution of words across topics. Though the topics you generate from this corpus will be generally similar, they may not be exactly the same as those that appear in this text.

  20.

    It must be noted here that in the MALLET Java program, topics are indexed starting at zero. Java, like many programming languages, begins indexing at 0; R, however, begins at 1. Were we to run this same topic modeling exercise in the Java application, the topics would be labeled with the numbers 0 to 42. In R they are labeled 1 to 43.

  21.

    For those who may not have intuited as much, the corpus of texts used in this book is composed of novels written entirely by Irish and Irish-American authors.

  22.

    See Chap. 10, Sect. 10.4 for package installation instructions.

  23.

    To see how to control the look of the visualization, consult the help documentation for the wordcloud function using ?wordcloud.
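
    For instance, a minimal call with made-up words and weights, using arguments described in that documentation:

        library(wordcloud)
        words.v <- c("sea", "ship", "captain", "sailor", "storm")
        freqs.v <- c(100, 80, 60, 40, 20)
        # scale sets the size range; random.order = FALSE plots the
        # most frequent words in the center.
        wordcloud(words.v, freqs.v, scale = c(4, 0.5), random.order = FALSE)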

  24.

    Your plots will not look the same since each run of the model is slightly different.

  25.

    A full discussion of factors is beyond the scope of this chapter, but for simplicity think of factors as a type of variable that can hold some limited set of values. These are often referred to as categorical variables. See also Chap. 12, Sect. 12.9, footnote 4.
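
    A quick illustration of the idea:

        # A factor stores a vector as a fixed set of category levels.
        pos.f <- factor(c("noun", "verb", "noun", "adjective"))
        levels(pos.f)  # [1] "adjective" "noun" "verb"
        table(pos.f)   # counts of each category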

  26.

    All 500 of them can be viewed at http://www.matthewjockers.net/macroanalysisbook/macro-themes/.

  27.

    Note that both the NLP and openNLP packages must be installed first.

  28.

    The openNLP tagger implements the Penn Treebank tag set. See http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
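
    A compact sketch of tagging with these packages, following the pattern in the openNLP documentation (the openNLPmodels.en model package must also be installed):

        library(NLP)
        library(openNLP)
        s <- as.String("The old sea captain walked slowly.")
        # Sentence and word annotations must precede POS tagging, so the
        # annotators are applied in order.
        anns <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                                 Maxent_Word_Token_Annotator(),
                                 Maxent_POS_Tag_Annotator()))
        word.anns <- subset(anns, type == "word")
        # Extract the Penn Treebank tags (e.g. DT, JJ, NN, VBD, RB).
        sapply(word.anns$features, `[[`, "POS")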

  29.

    In previous functions you have written, you have explicitly called the return function to send the results of the function back to the main script. This is not always necessary, as R's default behavior is to return the value of the last expression evaluated in the function. This function is simple enough that I have chosen to leave off an explicit call to return.
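
    The two forms below behave identically; the second relies on R returning the last evaluated value:

        word.count <- function(word.v) {
          return(length(word.v))  # explicit return
        }
        word.count <- function(word.v) {
          length(word.v)          # implicit return of the last value
        }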

  30.

    In the topicClouds sub-directory of the data directory, you will find two .pdf files showing 43 word clouds each. The file titled “fromUnTagged.pdf” contains the clouds produced without POS-based pre-processing; “fromTagged.pdf” contains the clouds produced using only words tagged as nouns.


Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Jockers, M.L. (2014). Topic Modeling. In: Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-03164-4_13
