1 Introduction

The power of history rests on its capability to interpret and contextualize past phenomena to explain continuities, differences, and specificities. To a large extent, this process depends on the work and abilities of the historian. With the introduction of computational analysis methods, historians can now use auxiliary means to enhance this process. Topic modeling is a highly useful computational analysis method for understanding and identifying the content and patterns of text corpora, and it can also guide additional close reading. Based on complex calculations of word co-occurrences, the method summarizes the text by identifying its topics, that is, the constellations of words that tend to come up in a discussion (Mohr and Bogdanov 2013, 547), which can then be used to analyze, categorize, and compare text corpora. It can be used to study the content of a large number of texts, as well as serve as a “microscope” that extracts patterns that are otherwise difficult to detect in a small number of texts.

One can view topic modeling as a machine that condenses the studied text into topics that consist of a collection of words connected in a statistically coherent way. For example, if researchers wanted to extract ten topics from past issues of Pravda newspaper over the course of a century, they would most likely get one topic consisting of the terms “the Party,” “meeting,” and “resolution”; another topic containing terms such as “team,” “gymnastics,” and “skiing”; and a third topic with terms such as “plan,” “development,” and “production.” If the researcher then extracted more detailed topics from the texts, they would come across terms related to topics such as “Stakhanovite,” “de-Stalinization,” and “Perestroika.”

The “machine” of topic modeling is based on a statistical calculation that reveals patterns of co-occurring words in a text. Using probability distribution, topic modeling categorizes individual words in the studied text and analyzes which words are statistically used most often in connection with each other, and these words then form a topic (Brett 2012; Mohr and Bogdanov 2013, 546; Nelson n.d.). In this way, topic modeling allows for the consistent detection and examination of patterns in a large text without the need for sampling.

“Topic modeling” is the term commonly used in text mining to signify a large group of computational algorithms that aim to detect patterns in an unstructured collection of documents. Topic modeling algorithms often use unsupervised machine learning, which allows researchers to identify patterns in the text without prescribing in advance what should be looked for. However, this does not apply to all forms of topic modeling (Isoaho et al. 2019; Boumans and Trilling 2016; Goldstone and Underwood 2014). There are several statistical methods used to compute the topics in a text, the most frequently used being latent Dirichlet allocation (LDA). In addition, there are other types, such as structural topic modeling, dynamic topic modeling, and sub-corpus topic modeling, that produce more nuanced results (Hakkarainen and Iftikhar in press; Isoaho et al. 2019; Roberts et al. n.d.; Blei and Lafferty 2006; Tangherlini and Leonard 2013). This chapter highlights the possibilities of MALLET, a basic LDA topic modeling tool. MALLET is a Java-based, open source, and free text analysis program developed by Andrew McCallum (2002).

Topic modeling provides exciting ways to analyze larger corpora: it encapsulates their content and allows researchers to “see” the text computationally. It is therefore not surprising that it has become one of the most popular methods of text analysis in the humanities and social sciences. It has been successfully used to reveal the themes that studied texts consist of, the temporal variations of said themes, and the differences between texts (see Gritsenko 2016; Goldstone and Underwood 2014; Tangherlini and Leonard 2013). At the same time, numerous scholars in the humanities and social sciences now treat topic modeling as a ubiquitous “digital humanities method that solves all the problems” without problematizing or examining which research purposes it can be used for.

The aim of this chapter is to show how topic modeling can be applied to research in Russian and East European studies, with an emphasis on historical research and the choices a researcher will face when using topic modeling. First, the chapter charts the steps that need to be taken when preparing a data set for topic modeling and describes how different choices can affect the results of the analysis. Second, the chapter discusses how the results of topic modeling can be interpreted. Third, the chapter explores the uses of topic modeling in Russian history sources, as well as the associated challenges and opportunities in this context.

2 Preparing a Text for Topic Modeling

The results of topic modeling respond to the following questions: What kinds of topics are present in the text? How prevalent are they? Where do these topics appear? The algorithm produces two types of outputs—namely, word–topic and topic–document proportions of the text. It thus creates groups of words that form a topic and identifies how frequently each topic appears in the text. Unlike a human reader, however, the topic modeling program does not understand the text: It only calculates the statistical co-occurrence of words and produces results based on this calculation that offer a statistical perspective of the text. Topic modeling often provides predictable results that are in accordance with the impressions of human readers, but it can also produce nonsensical results or reveal unexpected aspects of the text. It is thus up to the human to decide, based on their knowledge of the data and method, which results should be relied on and which should be discarded.

When beginning topic modeling, and similar to most natural language processing (NLP) analyses, the researcher needs to arrange the data to correspond with the research question, name the documents systematically, preprocess the text, prepare an adequate stop-word list, and determine the specificity of the results by selecting the number of topics. This chapter explains the steps that need to be taken when using MALLET, but it is important to note that different topic modeling algorithms require different approaches regarding the arrangement or naming of data. The choices made at this stage affect the results of the analysis and comprise a crucial element of the process. This stage is also the most time-consuming.

2.1 Arranging the Data

Arranging the data to correspond with the research question is the first step in preparing the text for topic modeling. The arrangement of the data, whether in one large document or in several smaller documents, and according to certain categories, determines what the topic modeling analysis will reveal. Combining all the studied texts into one large document provides a general view of the data set, whereas separating the text according to preset categories allows for the detection of their differences and similarities. For example, if a researcher is simply interested in what kinds of topics exist in studied texts, the texts can be merged into a large text document. If, in contrast, the researcher wishes to study how the topics of a newspaper have evolved over the years, it is useful to arrange the texts so that all the issues of one year or one month are in one document, another year or month in the second document, and so on. This arrangement would then provide data on topic changes on an annual or monthly basis. In a project that studied the reception in Soviet media of French singer and actor Yves Montand when on a tour of the Soviet Union in December 1956, we arranged the data so that each individual article was downloaded as a separate document and saved under a name that indicated the publication date and newspaper it was published in (Johnson et al. 2019). This then allowed us to detect how the depiction of Montand varied between publications and over time.
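As a concrete illustration, the arrangement by year described above can be scripted. The sketch below assumes article files named with an ISO date prefix (e.g. 1956-12-25_Pravda.txt, a hypothetical convention) and merges them into one document per year:

```python
from pathlib import Path
from collections import defaultdict

def merge_by_year(src_dir, out_dir):
    """Merge article files into one document per year, relying on a
    filename convention like '1956-12-25_Pravda.txt' (an assumption)."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(exist_ok=True)
    by_year = defaultdict(list)
    for f in sorted(src.glob("*.txt")):
        year = f.name[:4]  # first four characters of the name = year
        by_year[year].append(f.read_text(encoding="utf-8"))
    for year, texts in by_year.items():
        (out / f"{year}.txt").write_text("\n".join(texts), encoding="utf-8")
    return sorted(by_year)
```

Grouping by month instead would only require extending the key (e.g. `f.name[:7]`); the same pattern supports any arrangement that is encoded in the document names.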

2.2 Systematic Naming

The second step, naming the documents in a systematic, expressive, and concise manner, assists the later stages of interpretation of the output. For example, using a document name such as “1953_F_Pravda.txt” for an article written by a female journalist and “1953_M_Pravda.txt” for an article written by a male journalist condenses the essential information of the document in a compact way and does not confuse the computer program. Document names that are too long or that have spaces between words often cause problems when running the program.
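A small helper can enforce such a naming convention across a corpus. The sketch below follows the underscore convention of the example above; the metadata fields are illustrative, and the replacement of spaces is a precaution against the problems just mentioned:

```python
def make_name(year, gender, newspaper):
    """Compose a compact, program-safe document name such as
    '1953_F_Pravda.txt' (the convention used in the example above)."""
    paper = newspaper.replace(" ", "-")  # spaces in file names often break tools
    return f"{year}_{gender}_{paper}.txt"
```

Because the metadata is encoded in the name itself, it survives into the topic modeling output and can later be parsed back out when comparing topics across publications or years.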

2.3 Preprocessing the Text

Once the data have been arranged and named accordingly, the third step is to preprocess the text. Preprocessing is not mandatory, but it does make the final results clearer due to the simplistic assumptions inherent to the topic modeling algorithm. This step simplifies and standardizes the text for the computational analysis so that certain elements of the text can be revealed. Preprocessing involves various stages, including lemmatization, stemming, the removal of punctuation, and the conversion of uppercase letters into lowercase letters. Lemmatization refers to the process that converts the words in the text into their basic form (e.g. “studying” becomes “to study”). Stemming changes words into their root forms (e.g. “studying” becomes “stud”) (Arnold and Tilton 2015). In the context of Russian-language texts, these processes are highly useful, as Russian is a highly inflected language, and the same words can appear in different forms in a text. Although a human reader recognizes different forms of the same word, the computer program sees them as different words. Because the program does not recognize that the words “studying” and “studied” are different forms of the verb “to study,” the results of the analysis do not attribute the correct weight to the words, meaning that the results are distorted. Lemmatization produces more nuanced results than stemming, which simplifies the text to a greater extent (Sharoff et al. 2012; Jabeen 2018). In topic modeling, which explores the statistical relations between words, identifying the correct dictionary-based basic form of the word using lemmatization is often useful. However, simplifying the text does not improve the final results if the research aims to explore how different word cases or tenses appear in the text. In such cases, the scholar should not stem or lemmatize the text, as doing so will decrease the quality of the final results.

The MALLET program removes punctuation and converts all the letters into lowercase but does not lemmatize or stem the text. Thus, researchers who wish to lemmatize their Russian-language texts can use programs such as MyStem, TreeTagger, the Language Analysis Service LAS, or Python and R packages for natural language processing (see, for example, MyStem n.d.; TreeTagger n.d.; Language Analysis Service LAS n.d.).
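As a minimal illustration of the normalization steps that MALLET itself performs (lowercasing and punctuation removal), the Python sketch below tokenizes a text while keeping Cyrillic intact; lemmatization is deliberately left to the external tools listed above:

```python
import re

def normalize(text):
    """Lowercase the text and keep only runs of letters, dropping
    punctuation and digits. Python's re module is Unicode-aware, so
    [^\\W\\d_] matches Cyrillic as well as Latin letters. This mirrors
    MALLET's basic cleanup; it does not lemmatize or stem."""
    return re.findall(r"[^\W\d_]+", text.lower())
```

Running such a normalizer before lemmatization makes it easy to verify that the tool chain treats Cyrillic correctly, which is exactly where encoding problems tend to surface.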

2.4 Composing the Stop-Word List

The fourth stage of preparing for topic modeling comprises the composition of a stop-word list. In this stage, researchers often remove the most frequent words that do not carry content meaning, including “the,” “and,” “or,” and “but.” These are referred to as stop words. While they appear in the texts frequently, they are irrelevant when analyzing the content of texts using statistical means. The MALLET program does not contain a ready-made stop-word list in Russian, meaning that users need to supply their own. Luckily, there are stop-word lists available online that can be easily applied to MALLET. When downloading a stop-word list in Russian, it should be saved in the eight-bit Unicode Transformation Format (UTF-8) to ensure that the Cyrillic appears correctly.

Although ready-made stop-word lists are available, for serious analysis, it is important to customize the stop-word list for the purposes of the study. Ready-made stop-word lists can contain words that are important in the text, while the specifics of the study might also require the removal of words that appear too frequently in the text. For example, a digitized collection of newspaper articles can contain repeated names of the days of the week (indicating the day of publication of the issue) or the authors of the articles. The researcher might want to add the names of the days of the week and the journalists’ names to the stop-word list to avoid these words being overemphasized and distorting the analysis of the articles. However, having these words in the stop-word list removes them completely from the texts, and this affects the topic modeling results.
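In practice, a customized stop-word list can be loaded and applied as sketched below. The list is read as UTF-8 so that Cyrillic entries decode correctly; the example words and the extra list of weekday names or surnames are hypothetical:

```python
def load_stopwords(path):
    """Read a stop-word list saved as UTF-8 (one word per line),
    so that Cyrillic entries are decoded correctly."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def remove_stopwords(tokens, stopwords, extra=()):
    """Drop stop words plus study-specific additions, e.g. weekday
    names or journalists' surnames (illustrative examples)."""
    banned = stopwords | set(extra)
    return [t for t in tokens if t not in banned]
```

Keeping the study-specific additions in a separate `extra` list, rather than editing the downloaded file, makes it easier to document and justify each removal later.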

2.5 Selecting the Number of Topics

The fifth stage in the topic modeling process comprises the selection of the number of topics, the k-value. The difficulty in determining the “correct” k-value is considered one of the greatest weaknesses of the method. There is no one way to determine the correct number of topics, and although there are computational means to determine the optimal number (Isoaho et al. 2019; Oiva et al. 2019), the researcher ultimately chooses which k-value to use. The researcher determines the k-value depending on how detailed an outcome the study requires. The optimal number of topics depends on the size of the data, the nature of the research question, and the content of the text.

For example, in studies exploring large long-term datasets, scholars have worked with one hundred topics (see Underwood 2012), while in the Yves Montand project, we found ten topics to be meaningful due to the small volume of the data set (Johnson et al. 2019). The data analyzed in the Montand project was preselected and contained only texts that discussed Montand’s tour during a short period of time. This meant that the expected variation of the topics was small. If the content of the data is not a preselected sample but covers a wide variety of different texts, the number of expected topics will obviously be higher. Similarly, if the researcher’s aim is to understand the general variation of the topics in the text, a smaller k-value may be useful; if the aim of the research is to extract nuances from the text, the k-value should be higher.

When determining the number of topics, the usability of the results guides the selection of a reasonable k-value, as a very large number of topics does not summarize the text in a way that is understandable for humans. For example, many scholars find three hundred topics too large a number to be analyzed. Tangherlini and Leonard argue that the risk of using too many topics is lower than the risk of using too few (2013, 732). Often, if the k-value is set to be larger than the “actual” number of topics in the text, the researcher can easily understand which topics belong to the same family of topics. For example, in a project that studied the contexts in which Finland was discussed in the Russian news and Russia in the Finnish news, the vast coverage of sports news was divided into different types of sports news segments that were easy for researchers to identify (Gritsenko et al. 2018). As Tangherlini and Leonard state, perhaps the best—and most informal—advice is that given by Doyle and Elkan, according to whom a useful way to determine the number of topics is to look at whether the proposed topics are plausible (Tangherlini and Leonard 2013, 731).

While there is no singular and clear-cut way to determine the correct k-value, running the topic modeling algorithm with different numbers of topics is not difficult. This allows the researcher to determine the best k-value through exploration. Thus, a good way to explore the optimal number of topics is to run topic modeling with different k-values and decide, based on the results, which one to focus on. When analyzing the results, it is useful to consider the results of other k-values and discuss the reasons for selecting the studied number of topics for the study.

For example, when analyzing the articles for our Montand project, we eventually extracted ten topics after testing different k-values. We ran the algorithm with five, ten, and twenty topics. The analysis with five topics produced overly general topics that did not appear to provide any additional insights. Twenty topics provided more detailed results, but these were so fine-grained that the topics did not accurately summarize the texts. Ten topics provided sufficiently detailed results that also successfully summarized the studied articles. It is important to note that the topics obtained with the smaller k-values seemed to subsume the topics obtained with the larger k-values. This finding confirmed that the results were not random—rather, they followed a certain logic—and that we just needed to select the resolution we wanted to operate with.
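The impression that topics obtained with a smaller k-value subsume those obtained with a larger one can be checked roughly by measuring the overlap of their top words. The sketch below uses toy word lists, not the actual project data:

```python
def topic_overlap(coarse_topic, fine_topic):
    """Share of a fine-grained topic's top words that also appear
    among a coarse-grained topic's top words (a rough nesting check)."""
    return len(set(coarse_topic) & set(fine_topic)) / len(set(fine_topic))

def best_parent(fine_topic, coarse_topics):
    """Index of the coarse topic that best subsumes a fine topic."""
    return max(range(len(coarse_topics)),
               key=lambda i: topic_overlap(coarse_topics[i], fine_topic))
```

If most fine-grained topics map cleanly onto one coarse topic each, the runs with different k-values are telling a consistent story at different resolutions, as we observed in the Montand project.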

Although the launching of a topic modeling project has been depicted here as a straightforward process, in reality the process often develops through repeated testing and alterations, as shown in the previous paragraphs. This comprises a normal way of developing a research project, and to ensure that the process does not become confusing, it is advisable to keep track of the steps taken as well as the reasons behind them. Reviewing these steps at the end of the process will help to explain and justify the choices made in the research.

After arranging and naming the data, preprocessing the text, customizing the stop words, and selecting the number of topics to be studied, the researcher can then run the data through the topic modeling algorithm. The Programming Historian journal offers lessons on how to conduct topic modeling with the MALLET program (Graham et al. 2012).
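For reference, the two MALLET invocations covered in that lesson (importing a directory of documents, then training the model) can be assembled programmatically. The sketch below only builds the command-line argument lists; the file paths are hypothetical, and the option names follow MALLET's documented command-line interface. The commands would be executed with, for example, Python's subprocess module:

```python
def mallet_import_cmd(data_dir, corpus_file, stoplist):
    """Build the MALLET import command: read a directory of .txt files
    into MALLET's binary format, applying a custom stop-word list."""
    return ["bin/mallet", "import-dir",
            "--input", data_dir,
            "--output", corpus_file,
            "--keep-sequence",
            "--stoplist-file", stoplist]

def mallet_train_cmd(corpus_file, k, keys_file, doctopics_file):
    """Build the MALLET training command: fit k topics and write the
    word-topic and topic-document outputs discussed in Section 3."""
    return ["bin/mallet", "train-topics",
            "--input", corpus_file,
            "--num-topics", str(k),
            "--output-topic-keys", keys_file,
            "--output-doc-topics", doctopics_file]

# To actually run a command:
# import subprocess
# subprocess.run(mallet_train_cmd("corpus.mallet", 10,
#                                 "keys.txt", "composition.txt"), check=True)
```

Wrapping the commands this way also makes it trivial to loop over several k-values (five, ten, twenty) when exploring the resolution of the model, as described above.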

3 Interpreting the Results of Topic Modeling

After the topic modeling algorithm has been run, the program produces lists of words that together form a topic. The percentage coverage of each topic in the analyzed documents is also produced. Although the researcher has been working with the text for a while at this stage, the results of the topic modeling only launch the actual analytical process.

Since the results of topic modeling can easily be misleading, it is important to validate the choices made and assess the output. At this stage, the researcher should evaluate how the preprocessing choices and modeling parameters affect the results, how well the topics model the phenomenon under investigation, and how interpretable and plausible the outcomes are (Isoaho et al. 2019). The overall assessment of the quality and robustness of the topic model results is crucial, as it forms the basis for the whole analysis. Several scholars have suggested metrics and solutions for computational quality assessment, both concerning the overall and topic-level quality (see Chap. 23; Chuang et al. 2015; Mimno et al. 2011).

After assessing the results, the analysis can begin by naming the topics, as the algorithm only produces groups of words and does not indicate what meaning the co-occurring words convey. However, it is not always necessary to name the topics. For example, if the researcher is interested in studying the appearance of certain keywords or in creating a relevant sample out of a vast corpus for close reading, it is reasonable to keep the “names” as Topic 1, Topic 2, et cetera. When naming the topics, at first glance, the lists of words may seem nonsensical. However, after some close reading, the common themes become clear. Word lists can be long and probabilistic, and the same word can belong to several topics to a varying extent. Luckily, the order of the words is meaningful, as it reflects each word’s weight within the topic. The words that are more central to the topic come first in the list, thus helping the researcher to identify what is crucial to the topic. Researchers sometimes name the topics according to the first word of the word list, especially when they are operating with large numbers of topics. When operating with a smaller number of topics, it is useful to ascribe them more precise titles.
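The topic-keys output just described (one line per topic: topic number, weight, then the top words in descending order of importance) can be parsed and provisionally named as sketched below; the tab-separated sample follows the format MALLET writes with its topic-keys output option:

```python
def parse_topic_keys(text):
    """Parse MALLET-style topic-keys output into a mapping of
    topic number -> list of top words (most central word first)."""
    topics = {}
    for line in text.strip().splitlines():
        num, _weight, words = line.split("\t")
        topics[int(num)] = words.split()
    return topics

def provisional_name(topics, n):
    """Name a topic after its first, i.e. most central, word --
    the quick convention mentioned above for large numbers of topics."""
    return topics[n][0]
```

For a small number of topics, these provisional names would then be replaced by interpretive titles such as “Reception of Montand,” as in the examples below.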

To concretize the naming of topics, below are two example topics from the Montand project. Using TreeTagger, the words have been lemmatized into their basic forms so that different declensions of one word appear as the same word.

  • Topic 1: montan iv moskva sin’ore simona sssr franciâ pariž pevec vestibûl’ dekabr’ svâzi večer press gazeta zal dežurstvo šurupova

  • Topic 2: montan iv pet’ lûdej iskusstvo slov pariž pesnâ lûbov’ pevec lûbit’ čajkovskogo serdce vystuplenie koncert

Topic 1 contains the names of Montand and his spouse, Simone Signoret, and words including “the USSR,” “France,” “singer,” “lobby,” “December,” “evening,” “press,” “newspaper,” “hall,” “shift,” and the surname “Šurupova.” This combination of words comprises a topic that discusses Yves Montand’s arrival in the Union of Soviet Socialist Republics (USSR) from France in December, Soviet newspapers writing about him, and Soviet people eagerly waiting to buy tickets to his concerts. The words “lobby,” “hall,” “shift,” and “Šurupova” refer to a specific article that discussed the queues of Soviet people waiting to buy tickets to Montand’s concerts. Topic 1 could be titled the “Reception of Montand.”

Topic 2 contains, in addition to Montand’s name, words including “to sing,” “people,” “art,” “word,” “Paris,” “singer,” “love,” “heart,” “performance,” “Tchaikovsky” (a famous concert hall in Moscow), and “a concert.” This topic describes the songs Montand sang in his concerts, their lyrics, and the positive emotions the articles reported them evoking in the Soviet audiences. This topic could be titled “Montand’s emotional songs.” As these examples demonstrate, the naming of the topics depends, in addition to the words emerging in the group of words that represent the topic, on the researcher’s interpretation and knowledge of the context.

In addition to the word lists, the output shows the percentage coverage of the topics in the studied texts. This part is important for the interpretation of the topic modeling output. A good way to grasp the variation of the topics is to visualize the proportions that the program suggests.
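One lightweight way to get such an overview, before building proper charts, is a text bar chart of the topic proportions. The sketch below assumes the proportions have already been read into a dictionary; the topic names shown in the usage are illustrative:

```python
def text_bars(proportions, width=40):
    """Render topic proportions (a dict of name -> share, summing to
    roughly 1) as a quick text bar chart, largest topic first."""
    lines = []
    for name, share in sorted(proportions.items(), key=lambda kv: -kv[1]):
        bar = "#" * round(share * width)
        lines.append(f"{name:<30} {bar} {share:.0%}")
    return "\n".join(lines)
```

Printing such a chart per document (or per publication) gives a first impression of which topics dominate where, which can then be refined into a stacked bar chart like the one in the figure below.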

The ability of topic modeling to provide new insights into the text becomes especially visible when analyzing the scope and content of the topics. In the Montand project, this was exemplified in an interesting way: We had already read the articles before conducting the topic modeling, meaning that we had a good sense of the topics that we expected to emerge. However, despite our established knowledge, based on a close reading of the articles, the results of the topic modeling provided a new angle. The results highlighted that the topic we had labeled “French–Soviet art connection” consistently prevailed in the articles (see Fig. 24.1). While this result made sense, it is likely that with human reading alone we would not have identified the topic as being so prevalent. It was clear that all the articles discussed the French–Soviet artistic exchange, exemplified by Yves Montand’s tour of the Soviet Union. However, the more abstract level of understanding, whereby the transnational interaction addressed the cultural, diplomatic, and political spheres more generally, only became evident when the program had displayed the results.

Fig. 24.1 The ten topics covered in newspaper articles that discussed Yves Montand’s 1956 tour of the Soviet Union (a stacked bar chart of topic percentages per publication)

Topic modeling is considered a particularly useful method for analyzing large data sets: it provides an efficient shortcut to understanding amounts of text that a human reader could not analyze within a reasonable amount of time. However, the Montand project demonstrates that it is also possible to use topic modeling as a type of “microscope” that provides a statistical overview of a smaller studied text. When topic modeling a smaller data set, however, one needs to remember that the statistical reliability of the results decreases with the amount of data. Nevertheless, the algorithmic reading of texts provided by this quantitative approach complements qualitative research by adding another layer of analysis to human reading.

Another example demonstrates how topic modeling can allow for the detection of patterns that are not self-evident to a human reader. This example comprises a project in which I analyzed the annual reports of the Polish Chamber of Foreign Trade between 1950 and 1980. Again, in this project, I had read through the documents before beginning the topic modeling analysis. Based on my reading, I assumed that one topic would dominate all the studied texts throughout the years. However, the result of the topic modeling showed that said topic was dominant in just one document (it was also present in the other documents to a lesser extent). It appears that when reading through the documents, I had encountered the text in which the topic was dominant at an early stage of my close reading, and as I continued to read, I paid special attention to it. In this way, the topic became important in my mind. This exemplifies how a human reader reads selectively: While they pay attention to one element, they may overlook other issues that seem irrelevant but are not in actuality. A human reads texts by immediately interpreting the meaning and paying attention to issues of interest. Thus, although one cannot say that computer reading leads to truer results (because the results of computer-assisted analyses can sometimes be nonsensical), this example shows how topic modeling provides other useful and statistically based interpretations of the text.

Topic modeling provides new insights into the studied text because its results are based on the systematic categorization of text without an understanding of its meaning. In contrast, a human reader interprets the text immediately and pays attention to the similarities or differences that are significant to their understanding of the topic. However, the results of a computational analysis are also dependent on human interpretations in terms of preprocessing, arranging, customizing the stop words, and selecting the number of topics. Moreover, the algorithms behind the program are themselves human constructions and selections, and this affects the outcomes of the analysis. The strength of computer reading lies in its inability to understand, combined with its extraordinary capacity to process the text in a systematic way. The combination of the algorithms’ analysis and human interpretative skills leads to new findings. Thus, the union of the interpretative power of a human reader and the systematic reading of a computer renders topic modeling a powerful tool.

Alongside small data sets, topic modeling is highly useful for determining general patterns in larger text corpora. The results of topic modeling of large text corpora can help identify interesting sub-corpora, guide further analysis, and even give rise to new research questions (Nelson n.d.). For example, conducting a topic model of all the issues of Życie Gospodarcze, a Polish economics newspaper, between 1950 and 1980, led to the creation of new research questions, as it revealed radical temporal alterations of the most frequent topics (see Fig. 24.2).

Fig. 24.2 The top eight topics in the Polish Życie Gospodarcze newspaper, 1950–1980 (a stacked bar graph of topic shares per year)

Upon closer inspection, the results showed that the topics of production planning and the need to increase production prevailed in the newspaper during the entire period. But from 1953 to 1963, the tone of the discussion was different from the tone used from 1964 onward. The change in tone was so pronounced that the topic modeling program identified the production planning discussion before and after 1964 as separate topics.

Sometimes the program classifies what appears to a human reader as a single topic as two separate topics. This occurs if the researcher sets the number of topics relatively high. Such results are often important indicators of radical shifts in the way in which issues are discussed. Conducting topic modeling with more topics can thus lead to the identification of more subtle topic alterations. This topic modeling result is significant and calls for further exploration. The result also evidences the power of topic modeling in leading to new findings, as such results are extremely difficult to arrive at without the help of statistical computing. A human reader, reading through thirty years of newspaper articles, would most likely have sensed the change of tone but would have been unable to show it in a systematic way.

According to Guldi (2018), 90 per cent of topic modeling results reveal information that we already know. In a sense, this makes researchers confident that the method works and that the countless hours of work done by the preceding generations of scholars have not been in vain. The remaining 10 per cent reveals insights that have not been identified by preceding research. In order to identify which results belong to the 90 per cent category and which to the 10 per cent category, one needs to understand the context and preceding research. The shift of topics in Życie Gospodarcze is unsurprising for a scholar researching Polish economic history, but the comprehensive nature of the change in tone is something that nobody has been able to demonstrate in this way before.

When analyzing the results, one should remember that topic modeling is a probabilistic method, meaning that it provides probabilities rather than fixed end-results. The program calculates the probability of the topics several times when processing the text, and the results it provides comprise the average of these calculations. This is visible in practice, as the results are not absolutely fixed: topic modeling the same data set several times produces results that are not exactly the same. Thus, the results of topic modeling can give an approximate sense of the topics but not the exact truth. It is useful to view topic models as lenses that allow researchers to view a textual corpus in a different light and scale, where well-informed hermeneutic work is also needed in order to interpret the meaning of the results (Mohr and Bogdanov 2013, 560). For a historian, this variation is usually not problematic because we are accustomed to using interpretative data. When reading an eye-witness description of an event, for example, we do not expect the document to reveal on a word-by-word basis what was said. Rather, we expect an interpretation of the tone of the discussions at the event. Similarly, topic modeling provides the approximate form of the studied text.

In addition to paying attention to the most prevalent topics, it is also worth considering the topics that do not appear in the results to get an overall understanding of the phenomenon. If a topic is prevalent, it means that it has been discussed extensively, but if a topic does not appear in the results, it does not mean that the topic is unimportant. Often, issues that are taken for granted are not discussed, whereas issues that arouse controversy are. When interpreting the results of topic modeling, one should reflect on the limitations embedded within the chosen data set. For example, in the annual reports of the Polish Chamber of Foreign Trade, the issue of press advertising was discussed extensively in the mid-1950s, when the chamber promoted the increased use of advertising in Polish foreign trade. The use of print advertisements increased, but as it then became a normal aspect of foreign trade activities, it no longer needed to be discussed. In addition, as the reports were written for the Polish Ministry for Foreign Trade, among others, the topics tackled in the reports were issues that the chamber wanted the ministry to be aware of, while the topics it did not want the ministry to know about were most likely not discussed.

Thus, one cannot emphasize enough the importance of understanding the context when interpreting topic modeling outcomes. To avoid being blindly guided by them, one needs to understand the nature of the source, how the topic modeling process influences the results, and the relationship between the emerging topics and the issues studied. As Mohr and Bogdanov incisively state, one might think that running any text through a topic modeling program like MALLET would produce brilliant research; however, it is still the quality of knowledge about the case and the clarity of thinking about the phenomena that determine the utility and richness of the analysis (2013, 559). When using topic modeling in Russian and East European studies, one therefore needs to understand the context of the research.

4 Topic Modeling: Russian and East European Studies

Although the method of topic modeling itself is universal, its application to studies of Russian and East European history comes with certain requirements and challenges. The languages of the region and their “special” characters (from the perspective of English) are obvious specificities that need to be considered. The program used in the examples above recognizes numerous scripts, including Cyrillic and the Polish alphabet, once the text is encoded in UTF-8.
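In practice, preparing Cyrillic or Polish-language text for topic modeling mainly requires making sure that the encoding and the tokenizer are Unicode-aware. The sketch below is a minimal, hypothetical illustration in Python (the sample sentences and file name are invented): it tokenizes with a pattern that keeps Cyrillic letters and Polish diacritics intact and shows how to declare UTF-8 explicitly when reading files.

```python
import re

# Invented sample sentences for illustration.
russian = "Партия провела собрание и приняла резолюцию."
polish = "Izba promowała reklamę prasową w handlu zagranicznym."

def tokenize(text):
    # In Python 3, \w is Unicode-aware by default, so Cyrillic letters
    # and Polish diacritics survive tokenization; tokens are lowercased.
    return re.findall(r"\w+", text.lower())

print(tokenize(russian))
print(tokenize(polish))

# When reading files, declare the encoding explicitly so that
# non-Latin scripts are decoded correctly, e.g.:
#   with open("report.txt", encoding="utf-8") as f:
#       text = f.read()
```

A naive ASCII-only pattern such as `[a-z]+` would silently discard every Cyrillic word and mangle Polish diacritics, which is why the encoding step matters before any topic modeling is run.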

The greatest obstacle to the use of topic modeling in studies of Russian history is the dispersed nature of digitized sources and the lack of systematic computer-readable text collections with adequate metadata. Russian state archives, museums, libraries, scholarly projects, private companies, and initiatives of nongovernmental organizations, together with private individuals, are digitizing historical sources in increasing amounts. However, the problem lies in the fact that text collections seldom form the systematic series covering long periods that would be needed for systematic computational analysis. Furthermore, too often “digitization” means uploading a scanned image of a text to the internet, with no computer-readable text and no possibility to download the text to one’s own computer for further research. Fortunately, Russian digital history scholars have taken the initiative to gather the available sources into link collections that make them easier to find (see Perm University Digital Humanities Center’s project n.d.; Perm Province Newspapers 1914–1922; Historical Materials and Oral History n.d.) (for more, see Chap. 20; Kizhner et al. 2019).

Currently, the digitized text collections that can be used for research on Russian history form a kaleidoscopic landscape. For example, the Russian National Library has not produced systematic digital collections with computer-readable text and metadata comparable to the digital newspaper collections in many other countries. As a general rule, the memory organizations’ digitization efforts produce openly accessible samples of scanned images on nationally interesting topics (see, for example, Artistic Legacies of Anna Akhmatova). Digitization initiatives are important openings for the popularization of history, but their focus on random samples and general lack of access to computer-readable texts mean that they do not serve the purposes of a big data approach to digital text analysis. Furthermore, private companies have systematically digitized Soviet-era newspapers, and their collections form long time series. However, access to these sources lies behind a paywall, and it is currently impossible to download complete data sets onto one’s own computer, which rules out big data approaches. Lastly, private initiatives and nongovernmental organizations also produce digital collections (e.g. Prozhito n.d.). These collections comprise an important addition to the digitization efforts of other parties, but they are often small data sets and are not easily downloadable.

The lack of voluminous collections of digitized historical texts with adequate metadata means that the big data approach to Russian historical studies has to be reconsidered. If one wants to apply this approach to studies of Russian history, it is necessary to shift the focus from volume to the other two Vs of big data, namely variety and velocity (Schöch 2013). Instead of seeking a uniform analysis of one vast collection, it is necessary to develop intelligent ways of exploring vast collections of heterogeneous data and of linking the results of smaller data sets together to form a meaningful whole. In this way, digital text analyses in Russian historical studies could contribute to the overall development of big data studies.

5 Conclusion

As this chapter has demonstrated, using topic modeling in studies of Russian and East European history is useful and can provide new ways to understand the past. For this, it is extremely important to understand the context, the specifics of the data, how the algorithm works, and the current state of the research literature. Without these basic components, the researcher is unable to adequately explain the results of topic modeling and their wider meaning. The low number of usable digitized collections restricts opportunities to topic model large text corpora in studies of Russian history. Fortunately, as this chapter has demonstrated, topic modeling can also be useful in analyzing smaller data sets. However, if we want to understand the complex problems of the present that are studied with the help of big data, we should make an effort to understand the historical roots of these developments. For that, we should develop ways to combine the kaleidoscopic multitude of digitized historical sources into long time series of data that would correlate with the big data produced today.