
Economic history goes digital: topic modeling the Journal of Economic History

  • Original Paper, Cliometrica

Abstract

Digitization and computer science have established a completely new set of methods with which to analyze large collections of texts. One of these methods is particularly promising for economic historians: topic models, i.e., statistical algorithms that automatically infer the content of large collections of texts. In this article, I present an introduction to topic modeling and give an initial review of the research using topic models. I illustrate their capacity by applying them to 2675 articles published in the Journal of Economic History between 1941 and 2016. By comparing the results to traditional research on the JEH and to recent studies on the cliometric revolution, I aim to demonstrate how topic models can enrich economic historians’ methodological toolboxes.


Figs. 1–8 (figures not reproduced here)

Source: Reproduced with permission from Blei (2012a)

Sources: Author’s own computations, Diebolt and Haupert (2018)

Notes

  1. Questions on the future of economic history were discussed on a special panel at the 75th anniversary of the Economic History Association. See Journal of Economic History, Volume 75 Issue 4.

  2. For an assessment of the status quo in digital history, see the white paper “Digital History and Argument” by the Arguing with Digital History working group, Roy Rosenzweig Center for History and New Media.

  3. Jockers (2013, p. 123) calls them the “mother of all collocation tools”.

  4. As the literature review shows, there are several economists who have recently become aware of them.

  5. Scott Weingart’s blog gives a helpful overview of blogs on topic modeling. See http://www.scottbot.net/HIAL/index.html@p=19113.html. Although this overview may be somewhat outdated, it is still a good starting point for scholars who are unfamiliar with topic modeling.

  6. See Blei et al. (2003), Blei and Lafferty (2009), Griffiths and Steyvers (2004), and Steyvers and Griffiths (2007) for formal descriptions.

  7. Describing the origins of topic modeling in the context of digital humanities is relatively challenging as this touches on several disciplines which all have different histories. For example, the ‘history of humanities computing’ can be traced back to Father Roberto Busa, who indexed the work of Thomas Aquinas in the late 1940s. See Hockey (2004) and Jockers (2013). For a brief description of the recent development of topic models, see Lüdering and Winker (2016).

  8. There have been several extensions of the original model covering different assumptions of LDA (Blei 2012a, pp. 82–84). In the following, the terms LDA and topic model will be used synonymously.

  9. Abstracts and keywords pose their own problems: with large collections, even the reading of abstracts can become too time-consuming, while keywords may be too vague to be useful.

  10. In this example, the word table could be accompanied by words like chair, tablecloth, or leg in the first topic, while in the second it could be words like column, row, or cell.

  11. This is why topic models are also called generative models. See Steyvers and Griffiths (2007, p. 427).

  12. More precisely, this nonzero probability follows from estimating the topic shares using Gibbs sampling, also used in this paper (see below). The share of topic k in document d is approximated by \( \hat{\theta}_{d,k} = \frac{n_{d,k} + \alpha}{N_{d} + K\alpha} \), with \( n_{d,k} \) being the number of times document d uses topic \( k \) and \( N_{d} \) equaling the total number of words in document d. Including the Dirichlet parameter \( \alpha \) results in \( \hat{\theta}_{d,k} \) always being nonzero. See Boyd-Graber et al. (2017, pp. 15–16) and Griffiths and Steyvers (2004, pp. 5229–30).

  13. The following description is inspired by a lecture given by Mimno (2012b) and Ted Underwood’s description of topic modeling on his blog (available at https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/). For a technical description, see Griffiths and Steyvers (2004) and Steyvers and Griffiths (2007).
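The sampling procedure sketched in these descriptions can be illustrated with a minimal collapsed Gibbs sampler. This is a toy illustration in Python, not the MALLET or R code used in the paper; documents are assumed to be given as lists of integer word IDs, and the returned topic shares use the smoothed estimator from footnote 12.

```python
import random

def gibbs_lda(docs, K, V, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word IDs in 0..V-1."""
    rng = random.Random(seed)
    D = len(docs)
    n_dk = [[0] * K for _ in range(D)]  # topic counts per document
    n_kw = [[0] * V for _ in range(K)]  # word counts per topic
    n_k = [0] * K                       # total word count per topic
    z = [[rng.randrange(K) for _ in doc] for doc in docs]  # random initial assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove the current assignment from the counts
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                # resample: p(topic) ∝ (how much doc d uses it) × (how much it uses word w)
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + V * beta)
                           for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    # smoothed topic shares: theta_dk = (n_dk + alpha) / (N_d + K*alpha), always nonzero
    return [[(n_dk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
            for d in range(D)]
```

Real implementations add many refinements (burn-in, multiple chains, hyperparameter optimization), but the core loop is no more than this repeated reassignment of each word to a topic.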

  14. There is a trade-off between topic coherence and the time it takes to train the model. Finding many topics in large corpora can keep the computer busy for hours.

  15. This parameter is sometimes called a “hyperparameter”.

  16. One step in topic modeling frequently consists of removing capitalization (see below). In the following, words are kept in lower case if they constitute a topic.

  17. There are also attempts to automatically assign labels to topics. See Lau et al. (2011).

  18. For example, in a German newspaper corpus analyzed in a different project, one topic consisted mainly of modal verbs.

  19. Some algorithms operate with a term-document matrix, which is a transposed DTM.

  20. In general, the word vectors as well as the DTM can contain absolute word counts (occurrences) or relative word counts (term frequency or tf–idf; see below). The conversion of text files into word vectors or a DTM can be carried out within the different topic model applications (see below) or independently of topic modeling. Available tools include, e.g., the tm (text mining) package for R developed by Feinerer (2017) or programs like RapidMiner.
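The conversion from tokenized documents to a DTM is mechanical enough to sketch in a few lines. A minimal Python illustration, assuming `docs` is a list of already-tokenized documents (the paper itself relies on R tools such as the tm package):

```python
from collections import Counter

def build_dtm(docs):
    """Build a document-term matrix of absolute word counts from tokenized documents.
    Cells could alternatively hold relative frequencies or tf-idf weights."""
    vocab = sorted({w for doc in docs for w in doc})   # one column per distinct word
    index = {w: j for j, w in enumerate(vocab)}
    dtm = [[0] * len(vocab) for _ in docs]             # one row per document
    for i, doc in enumerate(docs):
        for word, count in Counter(doc).items():
            dtm[i][index[word]] = count
    return vocab, dtm

vocab, dtm = build_dtm([["trade", "growth", "trade"], ["growth", "capital"]])
# vocab → ['capital', 'growth', 'trade']; dtm → [[0, 1, 2], [1, 1, 0]]
```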

  21. The dimension of this DTM will be reduced by removing certain words. See below.

  22. Essentially, a tokenizer is a computer program which cuts sentences into pieces called “tokens”, based on predefined rules. For example, the sentence “Mr. Smith’s mother is seventy-nine years old, but she doesn’t look her age.” could be tokenized most simply by cutting at each whitespace and punctuation mark: “Mr|Smith|s|mother|is|seventy|nine|years|old|but|she|doesn|t|look|her|age”. This simple rule could be modified, for example, so that it does not split numbers, deletes every “s” following an apostrophe, or leaves common expressions like doesn’t intact.
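The naive whitespace-and-punctuation rule from this example can be reproduced with a single regular expression; a hypothetical Python sketch, not the tokenizer actually used in the paper:

```python
import re

def simple_tokenize(text):
    # Cut at everything that is not a letter, mimicking the naive
    # whitespace-and-punctuation rule described in the footnote.
    return re.findall(r"[A-Za-z]+", text)

sentence = "Mr. Smith's mother is seventy-nine years old, but she doesn't look her age."
print("|".join(simple_tokenize(sentence)))
# → Mr|Smith|s|mother|is|seventy|nine|years|old|but|she|doesn|t|look|her|age
```

The refinements mentioned above correspond to changes in the pattern, e.g., a pattern like `r"[A-Za-z]+(?:'[A-Za-z]+)?"` would leave contractions such as doesn’t intact.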

  23. For an extension relaxing the bag of words assumption, see Wallach (2006).

  24. For tools to carry out these steps, see Graham et al. (2016). For a discussion of the effect of stopword removal, see Schofield et al. (2017).

  25. For a general discussion of texts as data for economic research, see Gentzkow et al. (2017).

  26. OCR mistakes can form a topic of their own; see Jockers (2013).

  27. See, for example, McFarland et al. (2013).

  28. To name just two examples, Miller (2013) uses a Chinese corpus and Heiberger and Koss (2018) use documents written in German.

  29. This statement is based only on impressions from work conducted in a different project and has not been verified empirically.

  30. The authors also show how their model can be used for machine translations by generating bilingual lexica. For further applications of topic models in machine translation, see Eidelman et al. (2012) and Zhao and Xing (2007).

  31. I thank one of the anonymous referees for this intriguing question.

  32. For example, translations of scientific publications are similar to the parliament proceedings used in Mimno et al. (2009) in that they are rather direct. However, there can be quite considerable differences in translations of novels. Furthermore, some aspects of a text may become “lost in translation”, as, for example, the German differentiation between the formal form of address Sie and the informal Du is lost in the English translation you. It appears unlikely, though, that issues of this kind pose serious problems for topic modeling.

  33. JEL codes are used by Abramitzky (2015), McCloskey (1976), and Whaples (2002). Another example of a comparable way of text coding can be found in the financial literature on sentiment following Tetlock (2007), which basically measures the tone of texts by counting negative and positive words based on predefined dictionaries.

  34. This should not create the impression that manual coding is regarded as being futile. On the contrary, there are many cases in which it is completely appropriate, and studies like Whaples (2002) as well as professionals’ reliance on human coding, e.g., in media analysis, build a strong case for manual coding. Still, the advantages of one method are best illustrated when compared with the shortcomings of another, and in times of almost unlimited availability of textual sources, automatic methods like topic modeling will probably prevail.

  35. Although there may be some overlaps, this should not be confused with priming as it is understood in psychology.

  36. Of course, even the decision to use any type of quantitative representation of texts is based on the conviction that this can contribute something to our research. It could be that, for example, economic historians affiliated with history departments find this a less useful approach. Naturally, this argument holds true for the use of topic models as well.

  37. For further explanation of the confirmation bias, see, e.g., Oswald and Grosjean (2004).

  38. Of course, the preprocessing steps applied in topic modeling, such as the choice to remove certain stop words, can be regarded as being a priori decisions by the researcher that influence—and thus potentially bias—the output.

  39. For instance, applications of topic models for the use of databases are explored by JSTOR. One example is the “text analyzer”, an online tool which identifies documents in the JSTOR database that are similar to a search document in terms of topics. See http://www.jstor.org/analyze/.

  40. Running one model on the 2675 documents of this paper took approximately 35 min using an ordinary computer.

  41. There is no exact lower limit to the number of documents, but experience shows that it takes at least about two hundred scientific-paper-sized documents to produce meaningful topics. If the corpus consists of a few long documents, such as books, these documents can be split, e.g., into single chapters. See Jockers (2013). The documents must not be too short either. For example, the length of a single tweet would not be sufficient for finding any meaningful topics, so in this case, several tweets can be aggregated, e.g., on a daily basis. See Lüdering and Tillmann (2016).

  42. The analysis of textual sentiment, i.e., the tone of documents, is a second major approach in text mining, which is used particularly in finance and financial economics. For example, pessimism expressed in financial newspapers has been found to influence stock returns and trading volume; see, e.g., Tetlock (2007) and García (2013). An example of a combination of topic modeling and sentiment analysis is given by Nguyen and Shirai (2015).

  43. The Kullback–Leibler divergence (or distance) between two probability distributions p and q is defined as \( \text{KLD}(p,q) = \frac{1}{2}\left[ D(p,q) + D(q,p) \right] \) with \( D(p,q) = \sum_{j=1}^{T} p_{j} \log_{2} \frac{p_{j}}{q_{j}} \). The Jensen–Shannon divergence is defined as \( \text{JSD}(p,q) = \frac{1}{2}\left[ D\left(p,\frac{p+q}{2}\right) + D\left(q,\frac{p+q}{2}\right) \right] \). Both measures equal zero when p and q are completely identical; using the base-2 logarithm, the JSD is bounded above by one, a value it approaches with higher divergence. See Steyvers and Griffiths (2007).
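Both divergence measures translate directly into code. A minimal Python sketch, using the standard convention that terms with \( p_j = 0 \) contribute zero:

```python
from math import log2

def d(p, q):
    """D(p, q) = sum_j p_j * log2(p_j / q_j), with the convention 0 * log 0 = 0."""
    return sum(pj * log2(pj / qj) for pj, qj in zip(p, q) if pj > 0)

def kld(p, q):
    """Symmetrised Kullback-Leibler divergence: 0.5 * [D(p, q) + D(q, p)]."""
    return 0.5 * (d(p, q) + d(q, p))

def jsd(p, q):
    """Jensen-Shannon divergence: average divergence to the midpoint distribution."""
    m = [(pj + qj) / 2 for pj, qj in zip(p, q)]
    return 0.5 * (d(p, m) + d(q, m))
```

For two completely disjoint distributions such as `[1.0, 0.0]` and `[0.0, 1.0]`, `jsd` returns its maximum of 1; for identical distributions, both measures return 0.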

  44. Available at http://dsl.richmond.edu/dispatch/pages/home.

  45. In finance, there appears to be an affinity toward text as data, which can be traced back to Tetlock (2007), who was the first to use textual analysis in order to measure market sentiment.

  46. That is, all articles published in regular and Task issues except regular book reviews and dissertation summaries, see Whaples (1991).

  47. In the first trials, almost all topics contained the word Cambridge. Other cities occurring in the final topics were not found to appear regularly in bibliographical references except in combination with “university press”.

  48. As ‘york’ and ‘cent’ occurred in several early topics, it became clear that in fact New York and per cent were meant. Thus, this step was taken for reasons of clarity and esthetics.

  49. The stopword list is available upon request. For sample one, the database consists of 1728 documents.

  50. If a stemmer had been used, these words would have been collapsed into japan.

  51. Depicting every topic as a word cloud would exceed the available space of this article.

  52. Footnote 14 in Walters and Walters (1944) may serve as an example: “[…] Parish to John Craig, March 1, 1806, to Villaneuva, March 18, 1806, to Robert and John Oliver, October 29, 1806, in Parish LB, I, 239, 290, 291; II, 5.”

  53. These subjects are based on JEL codes. See Whaples (1991, pp. 289–90).

  54. Of course, this assignment is somewhat subjective, but it is no more subjective than assigning pages to subjects by hand.

  55. Except for Canada, which shares a topic with other countries (Topic 20), every country analyzed in Whaples (1991) is represented by a separate topic. These countries are Britain (33), France (35), Italy (16), Germany (26), Japan (1), Russia/Soviet Union (29), and the United States (9).

  56. These words most probably stem from bibliographical references, which often remained untranslated.

  57. A direct comparison with the results in Whaples (2002), as was carried out for sample one, could have been conducted as well but was omitted due to space restrictions.

  58. The 25th subject in Whaples (2002) is the residual “Other”.

  59. The peak of 1960 can be attributed mainly to the Task issue and contains a nice punchline: the article with the highest share of topic 20 is Goodrich (1960), which discusses how the use of quantitative methods affects economic history.

  60. The use of annual means is of course prone to outliers. If one is interested in the long-term development, a moving average would probably be more appropriate. However, the outliers could be what we are looking for if we are interested in identifying special events.

  61. The same continuity holds true at the 5 and 20% thresholds.

  62. The correlation matrix is available upon request.

  63. See Arun et al. (2010), Cao et al. (2009), Deveaud et al. (2014), and Griffiths and Steyvers (2004) for detailed explanations of each metric. It is important to note that these metrics only deliver the optimal number of topics from a technical point of view. In the end, the optimal K depends on the research question.

  64. Computing these metrics takes a considerable amount of time. For the articles in sample 2, it took 4 days per metric on a standard computer. It is therefore necessary to keep the span of Ks at a manageable level, resulting in a coarse grid of candidate topic numbers.
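In practice, such a search over candidate values of K amounts to fitting one model per candidate and comparing a fit score. The sketch below uses scikit-learn’s LDA implementation and its perplexity score (lower is better) rather than the R ldatuning metrics used in the paper; the toy corpus is purely illustrative, and, as noted above, the technically best K must still be weighed against the research question.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "railroad capital growth investment railroad",
    "bank money credit bank panic",
    "railroad freight rates growth",
    "bank credit money supply",
] * 10  # tiny toy corpus; real applications need hundreds of documents

X = CountVectorizer().fit_transform(corpus)  # document-term matrix

# Fit one model per candidate K and record its perplexity (lower is better).
scores = {}
for k in (2, 4, 8):
    lda = LatentDirichletAllocation(n_components=k, random_state=0, max_iter=20)
    lda.fit(X)
    scores[k] = lda.perplexity(X)

best_k = min(scores, key=scores.get)
```

Scoring the training data itself, as here, tends to favor larger K; held-out documents or dedicated coherence metrics give a more honest comparison.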

  65. See https://cran.r-project.org/web/packages/ldatuning/ldatuning.pdf.

  66. Due to space restrictions, the topics are not presented here; they are available upon request.

  67. For a comprehensive history of cliometrics see Haupert (2016) and the cited literature.

  68. The econometric language topics exhibit almost identical shares in both samples, indicating that they are relatively congruent. The descriptive topics exhibit a degree of difference because, in sample 2, this topic appears to be less coherent than it does in sample 1.

  69. Diebolt and Haupert (2018) count equations, tables, and graphs per page. See Fig. 8.

  70. Until 1996, papers presented at the annual meetings of the Economic History Association were published in a fourth issue, which was devoted to the “Tasks of Economic History”, see Diebolt and Haupert (2018, p. 22) and Margo (2018, p. 12).

  71. The journals analyzed in his study are: the American Economic Review, Explorations in Economic History, the Journal of Economic History, the Industrial and Labor Relations Review, and the Journal of Human Resources.

  72. Margo (2018) uses an index based on the terms regression, logit, probit, maximum likelihood, coefficient, and standard error.

  73. Another source of words and expressions that characterize econometric language is the indices and glossaries of econometric textbooks, which we use in an ongoing project. These provide the advantage that levels of methodological advancement can be differentiated by using indices from introductory and advanced textbooks.

  74. This touches on the issue of changes in the use of language discussed earlier.

  75. To cite just one example: with a share of 32%, Kuznets' (1952) study on US national income before 1870 is among the papers with the highest share of the descriptive topic 6.

  76. I thank one anonymous referee for pointing out that this distinction between quantitative (in the sense of mere counting) and econometric approaches was already made by early cliometricians. See, e.g., McCloskey (1978, 1987).

References

  • Abramitzky R (2015) Economics and the modern economic historian. J Econ Hist 75(4):1240–1251

  • Andorfer P (2017) Turing Test für das Topic Modeling. Von Menschen und Maschinen erstellte inhaltliche Analysen der Korrespondenz von Leo von Thun-Hohenstein im Vergleich. Zeitschrift für digitale Geisteswissenschaften. https://doi.org/10.17175/2017_002

  • Arguing with Digital History working group (2017) Digital History and Argument. White paper, Roy Rosenzweig Center for History and New Media, 13 Nov 2017. https://rrchnm.org/argument-white-paper/

  • Arun R, Suresh V, Veni Madhavan CE, Narasimha Murthy MN (2010) On finding the natural number of topics with latent Dirichlet allocation: some observations. In: Zaki MJ, Yu JX, Ravindran B, Pudi V (eds) Advances in knowledge discovery and data mining, vol 6118. Springer, Berlin

  • Bellstam G, Bhagat S, Cookson JA (2017) A text-based analysis of corporate innovation. SSRN working paper no. 2803232, May 2017

  • Blei DM (2012a) Probabilistic topic models. Commun ACM 55(4):77–84

  • Blei DM (2012b) Topic modeling and digital humanities. J Digit Human 2(1):8–11

  • Blei DM, Lafferty JD (2006) Dynamic topic models. In: Proceedings of the 23rd international conference on machine learning, pp 113–120

  • Blei DM, Lafferty JD (2007) A correlated topic model of science. Ann Appl Statist 1(1):17–35

  • Blei DM, Lafferty JD (2009) Topic models. In: Srivastava AN, Sahami M (eds) Text mining: classification, clustering, and applications. CRC Press, Boca Raton

  • Blei D, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  • Bonilla T, Grimmer J (2013) Elevated threat levels and decreased expectations: how democracy handles terrorist threats. Poetics 41(6):650–669

  • Boyd-Graber J, Blei D (2009) Multilingual topic models for unaligned text. In: UAI ‘09 Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pp 75–82

  • Boyd-Graber J, Mimno D, Newman DJ (2015) Care and feeding of topic models. In: Blei DM, Erosheva EA, Fienberg SE, Airoldi EM (eds) Handbook of mixed membership models and their applications. Taylor and Francis, Boca Raton

  • Boyd-Graber J, Hu Y, Mimno D (2017) Applications of topic models. Foundations and Trends in Information Retrieval, Boston

  • Burguière A (2009) The Annales school: an intellectual history. Cornell University Press, Ithaca

  • Cao J, Xia T, Li J, Zhang Y, Tang S (2009) A density-based method for adaptive LDA model selection. Neurocomputing 72:1775–1781. https://doi.org/10.1016/j.neucom.2008.06.011

  • Chang J, Boyd-Graber J, Wang C, Gerrish S, Blei DM (2009) Reading tea leaves: how humans interpret topic models. Adv Neural Inf Process Syst 2009:288–296

  • Collins WJ (2015) Looking forward: positive and normative views of economic history’s future. J Econ Hist 75(4):1228–1233

  • Daniel V, Neubert M, Orban A (2018) Fictional expectations and the global media in the Greek debt crisis: a topic modeling approach. Working papers of the Priority Programme 1859 “Experience and Expectation. Historical Foundations of Economic Behaviour” No 4, Mar 2018

  • Deveaud R, Sanjuan E, Bellot P (2014) Accurate and effective latent concept modeling for ad hoc information retrieval. Doc Numérique 17:61–84. https://doi.org/10.3166/dn.17.1.61-84

  • Diebolt C, Haupert M (2018) A cliometric counterfactual: what if there had been neither Fogel nor North? Cliometrica. https://doi.org/10.1007/s11698-017-0167-8

  • DiMaggio P, Nag M, Blei D (2013) Exploiting affinities between topic modeling and the sociological perspective on culture: application to newspaper coverage of U.S. Government arts funding. Poetics 41(6):570–606

  • Eidelman V, Boyd-Graber J, Resnik P (2012) Topic models for dynamic translation model adaptation. In: ACL ‘12 proceedings of the 50th annual meeting of the association for computational linguistics

  • Feinerer I (2017) Introduction to the tm Package: Text Mining in R. https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf. Accessed 27 Mar 2018

  • Fligstein N, Brundage JS, Schultz M (2017) Seeing like the fed: culture, cognition, and framing in the failure to anticipate the financial crisis of 2008. Am Sociol Rev 82(5):879–909

  • Fogel R (1962) A quantitative approach to the study of railroads in American economic growth: a report of some preliminary findings. J Econ Hist 22(2):163–197

  • Freeman Smith R (1963) The formation and development of the International Bankers Committee on Mexico. J Econ Hist 23(4):574–586

  • García D (2013) Sentiment during recessions. J Finance 68(3):1267–1300

  • Gentzkow M, Kelly BT, Taddy M (2017) Text as data. NBER working paper no. 23276, Cambridge, MA, Mar 2017

  • Goodrich C (1960) Economic history: one field or two? J Econ Hist 20(4):531–538

  • Graham S, Milligan I, Weingart SB (2016) Exploring big historical data: the Historian’s macroscope. Imperial College Press, London

  • Grajzl P, Murrell P (2017) A structural topic model of the features and the cultural origins of Bacon’s ideas. CESifo working paper no. 6643, Oct 2017

  • Griffiths TL, Steyvers M (2004) Finding scientific topics. PNAS 101(1):5228–5235

  • Grimmer J (2010) A Bayesian hierarchical topic model for political texts: measuring expressed agendas in senate press releases. Polit Anal 18(1):1–35. https://doi.org/10.1093/pan/mpp034

  • Grimmer J, Stewart BM (2013) Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Polit Anal 21(3):267–297. https://doi.org/10.1093/pan/mps028

  • Grün B, Hornik K (2011) Topicmodels: an R package for fitting topic models. J Stat Softw 40(13):1–30

  • Hall D, Jurafsky D, Manning CD (2008) Studying the history of ideas using topic models. In: Proceedings of the conference on empirical methods in natural language processing, pp 363–371

  • Hansen S, McMahon M (2016) Shocking language: understanding the macroeconomic effects of central bank communication. J Int Econ 99(1):S114–S133

  • Hansen S, McMahon M, Prat A (2018) Transparency and deliberation within the FOMC: a computational linguistics approach. Q J Econ 133:801–870

  • Haupert M (2016) History of cliometrics. In: Diebolt C, Haupert M (eds) Handbook of cliometrics. Springer, Berlin

  • Heiberger RH, Koss C (2018) Computerlinguistische Textanalyse und Debatten im Parlament: Themen und Trends im Deutschen Bundestag seit 1990. In: Brichzin J, Krichewsky D, Ringel L, Schank J (eds) Soziologie der Parlamente: Neue Wege der politischen Institutionenforschung. Springer VS, Wiesbaden

  • Hockey S (2004) The history of humanities computing. In: Schreibman S, Siemens R, Unsworth J (eds) A companion to digital humanities. Blackwell, Malden

  • Jacobi C, van Atteveldt W, Welbers K (2015) Quantitative analysis of large amounts of journalistic texts using topic modelling. Digit J 4(1):89–106

  • Jockers ML (2013) Macroanalysis: digital methods and literary history. University of Illinois Press, Urbana

  • Jockers ML (2014) Text analysis with R for students of literature. Quantitative methods in the humanities and social sciences. Springer, Cham

  • JSTOR Text analyzer. http://www.jstor.org/analyze/. Accessed 29 Mar 2018

  • Kuznets S (1952) National income estimates for the United States prior to 1870. J Econ Hist 12(2):115–130

  • Lamoreaux N (2015) The future of economic history must be interdisciplinary. J Econ Hist 75(4):1251–1257

  • Larsen VH, Thorsrud LA (2015) The value of news. CAMP working paper no. 6/2015, Oslo, Oct 2015

  • Larsen VH, Thorsrud LA (2017) Asset returns, news topics, and media effects. CAMP working paper no. 5/2017, Oslo, Sept 2017

  • Lau JH, Grieser K, Newman DJ, Baldwin T (2011) Automatic labelling of topic models. In: ACL ‘11 Proceedings of the 49th annual meeting of the association for computational linguistics, pp 1536–1545

  • Lüdering J, Tillmann P (2016) Monetary policy on Twitter and its effect on asset prices: evidence from computational text analysis. Joint discussion paper series in economics no. 12-2016, Marburg, Mar 2016

  • Lüdering J, Winker P (2016) Forward or backward looking? The economic discourse and the observed reality. J Econ Stat 236(4):483–515

  • Margo RA (2018) The integration of economic history into economics. Cliometrica. https://doi.org/10.1007/s11698-018-0170-8

  • McCallum A (2002) MALLET: a machine learning for language toolkit. http://mallet.cs.umass.edu/index.php. Accessed 19 Mar 2018

  • McCloskey D (1976) Does the past have useful economics? J Econ Lit 14(2):434–461

  • McCloskey D (1978) The achievements of the cliometrics school. J Econ Hist 38(1):13–28

  • McCloskey D (1987) Econometric history. Studies in economic and social history. Palgrave, Basingstoke

  • McFarland DA, Ramage D, Chuang J, Heer J, Manning CD, Jurafsky D (2013) Differentiating language usage through topic models. Poetics 41(6):607–625

  • Meeks E, Weingart SB (2012) The digital humanities contribution to topic modeling. J Digit Human 2(1):2–6

  • Miller IM (2013) Rebellion, crime and violence in Qing China, 1722–1911: a topic modeling approach. Poetics 41(6):626–649

  • Mimno D (2012a) Computational historiography: data mining in a century of classics journals. ACM J Comput Cult Herit 5(1):1–19

  • Mimno D (2012b) Lecture held at the Maryland Institute for technology in the humanities (topic modeling workshop). https://vimeo.com/53080123. Accessed 19 Mar 2018

  • Mimno D, Wallach HM, Naradowsky J, Smith DA, McCallum A (2009) Polylingual topic models. EMNLP 2009:880–889

  • Miner G (2012) Practical text mining and statistical analysis for non-structured text data applications. Elsevier/Academic Press, Amsterdam

  • Mitchener KJ (2015) The 4D future of economic history: digitally-driven data design. J Econ Hist 75(4):1234–1239

  • Mohr JW, Bogdanov P (2013) Introduction—topic models: what they are and why they matter. Poetics 41(6):545–569

  • Moretti F (2013) Distant reading. Verso, London, New York

  • Nelson RK Mining the Dispatch. Digital Scholarship Lab, University of Richmond. http://dsl.richmond.edu/dispatch/pages/home. Accessed 19 Mar 2018

  • Newman DJ, Block S (2006) Probabilistic topic decomposition of an eighteenth-century American newspaper. J Am Soc Inform Sci Technol 57(6):753–767

  • Nguyen TH, Shirai K (2015) Topic modeling based sentiment analysis on social media for stock market prediction. In: Proceedings of the 53rd annual meeting of the association for computational linguistics, pp 1354–1364

  • Nikita M (2016) ldatuning (R package). https://cran.r-project.org/web/packages/ldatuning/ldatuning.pdf. Accessed 19 Mar 2018

  • Oswald ME, Grosjean S (2004) Confirmation bias. In: Pohl R (ed) Cognitive illusions: a handbook on fallacies and biases in thinking, judgement and memory, 1st edn. Psychology Press, Hove

  • Quinn KM, Monroe BL, Colaresi M, Crespin MH, Radev DR (2010) How to analyze political attention with minimal assumptions and costs. Am J Polit Sci 54(1):209–228

  • Riddell AB (2014) How to read 22,198 Journal Articles: studying the history of German studies with topic models. In: Erlin M, Tatlock L (eds) Distant readings: topologies of German culture in the long nineteenth century. Boydell & Brewer, Suffolk

  • Schofield A, Magnusson M, Mimno D (2017) Pulling out the stops: rethinking stopword removal for topic models. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics, pp 432–436

  • Shirota Y, Hashimoto T, Sakura T (2015) Topic extraction analysis for monetary policy minutes of Japan in 2014: effects of the consumption tax hike in April. In: Perner P (ed) Advances in data mining: applications and theoretical aspects. Springer, Cham

  • Steyvers M, Griffiths T (2007) Probabilistic topic models. In: Landauer TK, McNamara DS, Dennis S, Kintsch W (eds) Handbook of latent semantic analysis. Taylor and Francis, Hoboken

  • Tetlock PC (2007) Giving content to investor sentiment: the role of media in the stock market. J Finance 62(3):1139–1168

  • Thorsrud LA (2016a) Nowcasting using news topics. Big data versus big bank. Norges Bank working paper 20/2016, Oslo, Dec 2016

  • Thorsrud LA (2016b) Words are the new numbers: a newsy coincident index of business cycles. Norges Bank working paper 21/2016, Oslo, Dec 2016

  • Underwood T (2018) The stone and the shell (blog). https://tedunderwood.com/. Accessed 19 Mar 2018

  • Walker DD, Lund WB (2010) Evaluating models of latent document semantics in the presence of OCR errors. In: Proceedings of the 2010 conference on empirical methods in natural language processing, pp 240–250

  • Wallach HM (2006) Topic modeling: beyond bag of words. In: Proceedings of the 23rd international conference on machine learning, pp 977–987

  • Wallach HM, Mimno D, McCallum A (2009) Rethinking LDA: why priors matter. Adv Neural Inf Process Syst 22:1973–1981

  • Walters PG, Walters R (1944) The American career of David Parish. J Econ Hist 2(2):149–166

  • Weingart SB (2018) The scottbot irregular (blog). http://www.scottbot.net/HIAL/index.html@p=19113.html. Accessed 19 Mar 2018

  • Whaples R (1991) A quantitative history of the journal of economic history and the cliometric revolution. J Econ Hist 51(2):289–301

  • Whaples R (2002) The supply and demand of economic history: recent trends in the journal of economic history. J Econ Hist 62(2):524–532

  • Yang T-I, Torget AJ, Mihalcea R (2011) Topic modeling on historical newspapers. In: Proceedings of the 5th ACL-HLT workshop on language technology for cultural heritage, social sciences, and humanities, pp 96–104

  • Zhao B, Xing EP (2007) HM-BiTAM: bilingual topic exploration, word alignment, and translation. In: NIPS’07 Proceedings of the 20th international conference on neural information processing systems, pp 1689–1696

Acknowledgements

I am grateful to Claude Diebolt and Michael Haupert for generously sharing their data, Robert Whaples and Ann Carlos for insights concerning the JEH, and two anonymous referees for invaluable comments and suggestions on the manuscript. I thank Manuel Burghardt for patiently answering my technical questions on topic modeling, and Mark Spoerer, Tobias Jopp, and Katrin Kandlbinder for their continued support. Finally, I am very grateful to the participants in the research seminar in economic history as well as the lecture series on Digital Humanities at Universität Regensburg.

Author information

Correspondence to Lino Wehrheim.

Appendices

Appendix 1

See Fig. 9.

Fig. 9

Topic development of sample 2. Asterisks mark labels used in Whaples (2002); annual means. Source: See text

Appendix 2

See Fig. 10.

Fig. 10

Optimal number of topics. Measures are normalized, with 0 (1) referring to the series’ minimum (maximum). For the measures proposed by Arun et al. (2010) and Cao et al. (2009), the optimal number of topics is found at the minimum; for Griffiths and Steyvers (2004), it is the maximum. The measures of Arun et al. (2010) and Cao et al. (2009) indicate that the optimal number of topics lies between 60 and 80, while Griffiths and Steyvers (2004) is somewhat ambiguous. Still, as the line of Griffiths and Steyvers (2004) levels off between 60 and 80, this range seems to be a plausible compromise. Deveaud et al. (2014) is not computed due to computational limitations. Source: See text

Appendix 3: Excluding task issues

A topic model with 25 topics is applied to all articles published between 1941 and 2016, excluding Task issues as identified by Diebolt and Haupert (2018), which reduces the corpus from 2675 to 1885 documents. Again, the topic model identifies two topics which can be interpreted as representing quantitative methods. The 15 most probable words of these quantitative topics are shown in Table 4.

Table 4 Quantitative topics without Task issues


About this article


Cite this article

Wehrheim, L. Economic history goes digital: topic modeling the Journal of Economic History. Cliometrica 13, 83–125 (2019). https://doi.org/10.1007/s11698-018-0171-7

