1 Introduction

Due to the rise in processing power, advancements in machine learning (Grimmer et al. 2021), and the availability of large text corpora online, the use of computational methods, including automated content analysis (van Atteveldt and Peng 2018), has rapidly increased. Automated content analysis is applied and developed across disciplines such as computer science, linguistics, political science, economics, and – increasingly – communication science (Hase et al. 2022). Recent pieces offer theoretical introductions to the method (Benoit 2020; Boumans and Trilling 2016; DiMaggio 2015; Grimmer and Stewart 2013; Günther and Quandt 2016; Manning and Schütze 1999; Quinn et al. 2010; Scharkow 2012; van Atteveldt et al. 2019; Wettstein 2016; Wilkerson and Casas 2017). Similarly, tutorials on how to conduct such analyses are readily available online (Puschmann 2019; Silge and Robinson 2022; Watanabe and Müller 2021; Welbers et al. 2017; Wiedemann and Niekler 2017).

Automated content analysis or “text as data” methods describe an approach in which the analysis of text is, to some extent, conducted automatically by machines. While automated analyses for other types of content, for example images (Webb Williams et al. 2020), have also been proposed more recently, this study focuses on text. In contrast to manual coding, text is not read and understood as one unit but automatically broken down into its “features”, for example single words such as “she” or “say”. The complexity of texts is then reduced further by converting text to numbers: Texts are often represented by how often different features, for example unique words, occur. Computers use these feature occurrences as manifest indicators to infer latent properties of texts (Benoit 2020), for example negativity or emotions. Importantly, manual coding is still part of most automated analyses: Humans may construct dictionaries to automatically look up features expressing sentiment, code sentiment in texts as training data on which algorithms are trained, or create a gold standard of manually annotated texts against which the results of automated analyses are compared (Song et al. 2020; van Atteveldt et al. 2019).

When using text as data approaches, readers should bear in mind important caveats and limitations. Human decisions lie at the core of “automated” content analyses and thus necessarily introduce certain degrees of freedom to these approaches. For example, researchers have to decide how to prepare text for analyses (Denny and Spirling 2018) or choose a method to infer latent concepts of interest (Nelson et al. 2021), which can heavily impact results. Also, text as data approaches are costly: Not only does it take considerable effort to decide which steps of the analysis to conduct and how, and to write code to execute them; studies also often rely on large sets of manually annotated texts for the training or validation of algorithms, which require time and money for manual coders. As automated content analyses aim to infer latent concepts, researchers should also note that the method necessarily includes uncertainty and error: Similar to manual coding, it cannot grasp texts in their full complexity (Grimmer and Stewart 2013). As Grimmer and Stewart (2013, p. 269, capitalization by the authors) put it: “All Quantitative Models of Language Are Wrong – But Some Are Useful”.

Related to this, there is an ongoing debate about which variables can and should be measured automatically instead of relying on human coding (DiMaggio 2015). It seems that the more complex the latent construct that should be inferred, the less suitable automated approaches become. For example, formal features such as the use of hyperlinks in text (Günther and Scharkow 2014) or an article’s publication date (Buhl et al. 2019) are easily detected automatically. Text as data approaches can also identify events that are being reported on across articles (Trilling and van Hoof 2020) and, as such, news chains (Nicholls and Bright 2019). However, recent studies have cast doubt on the performance of automated analyses for grasping more complex variables at the core of communication studies: When measuring evaluations or sentiment, human coding clearly outperforms machines (van Atteveldt et al. 2021). Similarly, studies on automated measurements of frames (Nicholls and Culpepper 2021) or media bias (Spinde et al. 2021) do not warrant optimism that text as data approaches are applicable to any kind of text or even better than human coding. Thus, automated approaches do not replace human abilities to understand text. Rather, they amplify them (Grimmer and Stewart 2013; Nelson et al. 2021), as do computational methods in general (van Atteveldt and Peng 2018).

Emerging trends in the field include approaches that try to better model syntactic relationships in texts, e.g., evaluations concerning a specific actor (Fogel-Dror et al. 2019). Others aim to more accurately grasp the semantic meanings of features through word embeddings (Mikolov et al. 2013; Pennington et al. 2014, but for a discussion of potential biases see Bolukbasi et al. 2016). Studies also propose mixed methods approaches where computational methods and manual coding support each other, often in an iterative process (Lewis et al. 2013; Nelson 2020). Recently, semi-automated methods in which manual input is used as a starting point have emerged (Watanabe 2021). Studies have also introduced new ways of resourceful and cheap data collection such as crowdsourcing (Lind et al. 2017).

2 Common steps of analysis and research designs

Automated content analysis typically consists of the following four steps (Wilkerson and Casas 2017): (1) data collection, (2) data preprocessing, (3) data analysis, and (4) data validation.

(1) Data collection. First, large text corpora need to be obtained through structured databases such as Nexis Uni or other third-party providers, Application Programming Interfaces (APIs) for data from social networks or newspapers, or by scraping websites (Possler et al. 2019; van Atteveldt et al. 2019). The collection of large amounts of textual data often involves legal problems due to copyright issues (Fuchsloch et al. 2019).
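To illustrate the scraping route, the following minimal Python sketch downloads a single page and extracts its paragraph text using the requests and BeautifulSoup libraries. The URL and the assumption that article text sits in paragraph tags are purely illustrative; real collection pipelines must respect copyright, terms of service, and robots.txt.

```python
# Minimal scraping sketch (hypothetical URL; real sites require site-specific
# selectors and a check of legal and ethical constraints before collecting data).
import requests
from bs4 import BeautifulSoup

def fetch_paragraphs(url: str) -> list[str]:
    """Download a page and return the text of all paragraph elements."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]

if __name__ == "__main__":
    paragraphs = fetch_paragraphs("https://example.org/news/some-article")
    print(f"Collected {len(paragraphs)} paragraphs")
```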

(2) Data preprocessing. In what is called preprocessing, texts are then prepared for automated analysis. Potential units of analysis might be whole articles or social media messages, but also single paragraphs or sentences. Preprocessing reduces text units to those features that are informative for detecting differences or similarities between text units and dismisses features that are not. In every study, researchers have to decide which parts of text are informative and hence which of the following steps matter for their analysis. Not only are there no standard preprocessing steps (Benoit 2020), but the choice of preprocessing steps also influences results (Denny and Spirling 2018; Scharkow 2012). Common steps include (1) the removal of boilerplate, for example URLs included in texts obtained via scraping. Next, (2) tokenization, where text is broken down into its features, is important. Oftentimes, these features are unigrams, i.e., single words such as “he” or “and”, in what is called a “bag-of-words” approach: The order or context of words is not taken into account. In “bag-of-words” approaches, the occurrence of a feature is what counts, independent of where in a given text the feature occurs or which features occur in close proximity to it (van Atteveldt et al. 2019). However, there are more informative ways of feature extraction than unigrams: Stoll, Ziegele and Quiring (2020), for example, include n-grams. These may be bigrams, i.e., a sequence of two words, such as “he walks”, or trigrams, i.e., a sequence of three words, such as “and then he”. More meaningful n-grams are collocations, i.e., specific words that often co-occur and, in conjunction, have a different meaning. Statistically checking for words that frequently co-occur or using Named Entity Recognition (NER), where names of persons, organizations, or locations are automatically detected, would for example lead to the unigrams “United” and “States” being included as one feature, namely the collocation “United States”. Some analyses also distinguish between the several meanings a feature may have through Part-of-Speech (PoS) tagging. For example, “novel” as a noun and “novel” as an adjective describe two very different things (Manning and Schütze 1999). Further preprocessing steps might include discarding (3) punctuation and (4) capitalization. In addition, (5) features with little informative value are often deleted. Depending on the research question, these might include numbers, so-called “stop words” (often based on ready-made lists including, for example, “and” or “the”), or features occurring in almost every or almost no text, in what is called relative pruning. Lastly, many analyses try to reduce complexity through (6) stemming or lemmatizing (the feature “analyzed”, for example, becomes “analyz” with stemming and “analyze” with lemmatizing). In “bag-of-words” approaches, texts are finally (7) represented in a document-feature matrix where rows identify the unit of analysis (e.g., an article, a paragraph, a sentence) and columns identify how often a feature occurs in this unit (e.g., how often the unigram “terrorist” occurs in the first unit, the second unit, and so forth).
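As an illustration of several of these steps, the following Python sketch uses scikit-learn’s CountVectorizer to lowercase texts, remove English stop words, extract unigram and bigram features, and build a document-feature matrix. The example corpus and parameter values are illustrative only; which steps are appropriate depends on the research question at hand.

```python
# Minimal preprocessing sketch: tokenization, lowercasing, stop-word removal,
# unigram/bigram extraction, and a document-feature matrix (illustrative corpus).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The United States announced new climate measures.",
    "He walks to work and then he reads the news.",
]

vectorizer = CountVectorizer(
    lowercase=True,        # discard capitalization
    stop_words="english",  # remove common stop words
    ngram_range=(1, 2),    # unigrams and bigrams as features
)
dfm = vectorizer.fit_transform(docs)       # rows = text units, columns = feature counts

print(vectorizer.get_feature_names_out())  # the extracted features
print(dfm.toarray())                       # the document-feature matrix
```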

(3) Data analysis. While recent overviews have used various systematizations for different methods in the field of automated content analysis, many distinguish between (1) dictionary and rule-based approaches, (2) supervised machine learning, and (3) unsupervised machine learning. While (1) and (2) include deductive approaches where known categories are assigned to texts, (3) is more inductive as it explores unknown categories (Boumans and Trilling 2016; Grimmer and Stewart 2013; Günther and Quandt 2016).

Deductive Approaches: Assigning known categories to text

(a) Dictionary and rule-based approaches often simply count the occurrence of features. Studies for example analyze whether news coverage of Islam mentions the feature “terrorism” (Hoewe and Bowe 2021). More complex studies use feature lists, also called dictionaries, to look up uncivil expressions (Muddiman et al. 2019) or topics in texts (Guo et al. 2016). Two kinds of dictionaries need to be differentiated: “Off-the-shelf” dictionaries such as the General Inquirer (Stone et al. 1966) or the Linguistic Inquiry and Word Count (LIWC; Tausczik and Pennebaker 2010) are ready-made dictionaries developed to be applied across text genres or topics. As Taboada (2016) cautions, many “off-the-shelf” dictionaries were developed based on specific genres and topics, namely user reviews of consumer products. Research shows that different “off-the-shelf” dictionaries often disagree with each other and that their results differ from manual coding (Boukes et al. 2020; van Atteveldt et al. 2021). For sentiment analysis, Boukes et al. (2020, p. 98) therefore stress that “scholars should be conscious of the weak performance of the off-the-shelf sentiment analysis tools”. In contrast, “organic” dictionaries are inductively developed feature lists used to deductively assign known categories such as sentiment or topics to text units. As they are developed in relation to the research question and the corpus at hand, they are tailored to a specific genre (e.g., social media texts or news articles), topic (e.g., texts concerning climate change or economic development), and concept of interest (e.g., negative sentiment or incivility). Although the construction of “organic” dictionaries is quite demanding, they oftentimes offer better results and should be preferred over “off-the-shelf” dictionaries (Boukes et al. 2020; Muddiman et al. 2019). However, both types of dictionaries share general pitfalls: They cannot easily handle negation, irony, or polysemy, meaning that the same feature might have a completely different meaning depending on its context (Benoit 2020). They are also often tailored to English-language texts only (Lind et al. 2019).
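The logic of a dictionary-based analysis can be sketched in a few lines of Python. The tiny negativity dictionary and the texts below are hypothetical; an actual “organic” dictionary would be developed and validated against the corpus at hand.

```python
# Minimal dictionary sketch: count how often features from a small, hypothetical
# negativity dictionary occur in each text and compute their relative frequency.
negativity_dictionary = {"bad", "catastrophe", "crisis", "terrible"}

texts = [
    "The economic crisis turned into a catastrophe.",
    "The summit ended on a hopeful note.",
]

for text in texts:
    tokens = text.lower().replace(".", "").split()  # very simple tokenization
    hits = sum(token in negativity_dictionary for token in tokens)
    print(f"{hits / len(tokens):.2f} negativity score for: {text}")
```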

(b) Supervised machine learning uses manually annotated training data from which classifiers learn how to categorize previously unknown data. The method is for example applied to classify texts concerning their topics (Scharkow 2012) or whether or not they contain incivility (Stoll et al. 2020). First, variables are coded by human coders to create a training data set. Next, classifiers use this training data to learn which independent variables (for example, the frequency of features such as “bad” and “catastrophe”) predict the dependent variable (for example, negative sentiment). They then predict sentiment classifications for a previously unknown set of test data, i.e., texts researchers want to classify automatically (for a detailed overview of analysis steps see Barberá et al. 2021; Mirończuk and Protasiewicz 2018; Pilny et al. 2019). There is a plethora of classifiers that can be used, for example the Naive Bayes classifier or Support Vector Machines (Scharkow 2012). Different classifiers can also be combined into ensembles. Supervised machine learning is not without limitations: Not only does the training data need to be of sufficient size, which often means that a considerable number of texts have to be coded manually; researchers should also be cautious of a strong dependency of the classifier on the training data set (i.e., overfitting), meaning the classifier works well for the training data but poorly for the test data. To avoid this, researchers often apply k-fold cross-validation where the corpus is split into k groups. Each group is then used as the test data once while the remaining groups are used as training data, without any overlap between training and test data sets (Manning and Schütze 1999). Researchers should also test how generalizable their classifier is across contexts, meaning whether it can accurately predict categories for new data with slightly different topics or text genres (Burscher et al. 2015).
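A minimal sketch of this workflow in Python, using scikit-learn, is shown below: bag-of-words features feed a Naive Bayes classifier that is evaluated with k-fold cross-validation. The handful of texts and labels are invented for illustration; real applications require far larger, manually annotated training sets.

```python
# Minimal supervised-learning sketch: Naive Bayes on bag-of-words features with
# 3-fold cross-validation (toy training data for illustration only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "What a catastrophe, this is bad news.",
    "A terrible decision with bad consequences.",
    "Bad management caused this crisis.",
    "A wonderful result, great news for everyone.",
    "The new policy is a great success.",
    "Great performance and a hopeful outlook.",
]
labels = ["negative", "negative", "negative", "positive", "positive", "positive"]

classifier = make_pipeline(CountVectorizer(), MultinomialNB())
print("Accuracy per fold:", cross_val_score(classifier, texts, labels, cv=3))

# Train on all annotated data and classify previously unseen texts.
classifier.fit(texts, labels)
print(classifier.predict(["This is a bad and terrible outcome."]))
```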

Inductive Approaches: Exploring unknown categories in text

(c) Unsupervised machine learning takes a more inductive “bottom-up” approach as, in contrast to the previous approaches, categories are not previously known or fed to the model as training data. Instead, they are induced from the corpus (Boumans and Trilling 2016). If one is interested in categorizing texts concerning their main topics, for example, and has no assumptions as to which topics exist, unsupervised machine learning would be suitable.

The most prominent unsupervised machine learning approach is topic modeling (Blei et al. 2003). As a method to identify topics (Maier et al. 2018) and, as some argue, in combination with other methods even frames (Walter and Ophir 2019, but see Nicholls and Culpepper 2021), the method has attracted increasing interest. Topic modeling identifies the relative prevalence of topics in texts based on word co-occurrences. It assumes that documents can be represented as a mixture of different latent topics that are themselves characterized by a distribution over words (Blei et al. 2003; Maier et al. 2018). In contrast to single-membership models such as k-means clustering (Grimmer and Stewart 2013), topic modeling therefore allows for multiple topics to occur in a text. Recent extensions such as structural topic modeling also enable researchers to analyze how covariates – for example the year a text was published or its author – influence topic prevalence or topic content (Roberts et al. 2014). While some settings, such as the number of topics to be estimated, need to be specified before running the model, topics themselves are generated without human supervision. While fewer resources are needed to run the model, testing the reliability and validity of results produced by unsupervised machine learning can be quite demanding. In the case of topic modeling, researchers should, for example, check how results vary when estimating different numbers of topics, whether topics are robust and reproducible across model runs, and whether they are coherent and meaningful (Maier et al. 2018; Roberts et al. 2016; Wilkerson and Casas 2017). In particular, choosing the number of topics the model should identify is a highly subjective process that will likely influence results.
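The following Python sketch runs a small latent Dirichlet allocation (LDA) topic model as implemented in scikit-learn. The corpus, the choice of two topics, and the random seed are illustrative only; in practice, the number of topics and the robustness of the solution need careful checking, as described above.

```python
# Minimal topic-modeling sketch: LDA on a tiny illustrative corpus.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Parliament debated the new climate bill and emission targets.",
    "Rising temperatures and emissions threaten coastal cities.",
    "The central bank raised interest rates to fight inflation.",
    "Inflation and interest rates dominate the economic news.",
]

vectorizer = CountVectorizer(stop_words="english")
dfm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)  # number of topics set in advance
doc_topics = lda.fit_transform(dfm)  # rows = documents, columns = topic proportions

# Print the most characteristic features per topic.
features = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_features = [features[i] for i in topic.argsort()[-4:][::-1]]
    print(f"Topic {k}: {', '.join(top_features)}")
print(doc_topics.round(2))
```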

(4) Data validation. One should not blindly trust the results of any automated method. Therefore, validation is a necessary step (Grimmer and Stewart 2013). For more deductive approaches such as dictionaries and supervised machine learning, validation is relatively straightforward: Researchers already know which categories of interest, for example negative sentiment, might be found. Hence, validity is assessed by comparing automated results, i.e., which texts were assigned which sentiment, to a benchmark. Oftentimes, this benchmark is manually annotated data serving as a gold standard, here describing which sentiment humans would assign. While this gold standard does not necessarily capture the “true” value, as human coding is itself error-prone (DiMaggio 2015) even when intercoder reliability is ensured, it indicates what humans would agree on as the “true” sentiment of a text.

The most frequently reported indices for the validity of automated analyses are precision and recall (Song et al. 2020). Precision indicates how many of the articles predicted to contain negative sentiment by the automated analysis actually contain negative sentiment according to the manual benchmark: How good is the model at not creating too many false positives? For example, a value of .8 implies that 80 % of all articles classified as containing negative sentiment by the automated analysis actually contain negative sentiment according to the manual benchmark. However, 20 % were misclassified as containing negative sentiment when they, in fact, do not. Recall indicates how many of the articles that actually contain negative sentiment were found: How good is the model at not creating too many false negatives? For example, a value of .8 implies that 80 % of all articles with negative sentiment were found by the automated approach. However, 20 % were not because they had been misclassified as not containing negative sentiment when they in fact did (Manning and Schütze 1999). However, many studies do not yet report such validity tests (Song et al. 2020). Clear thresholds for what constitutes satisfactory values for these indices have not yet been agreed upon either – in contrast to intercoder reliability values for manual content analysis. Validity tests are also not very informative if results are unbalanced, meaning some categories – such as negative sentiment – have few true positives or true negatives. Given the uncertainty of quality thresholds, the question of “how good is good enough” (van Atteveldt 2008, p. 208) is still up for discussion.
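As a brief illustration, the following Python sketch computes precision and recall for an automated negativity classification against a manual gold standard using scikit-learn’s metrics; the five labels are invented for illustration.

```python
# Minimal validation sketch: precision and recall of an automated classification
# against a manually coded gold standard (toy labels for illustration).
from sklearn.metrics import precision_score, recall_score

gold_standard = ["negative", "negative", "positive", "negative", "positive"]  # manual coding
automated = ["negative", "positive", "positive", "negative", "negative"]      # automated results

precision = precision_score(gold_standard, automated, pos_label="negative")
recall = recall_score(gold_standard, automated, pos_label="negative")

# Precision: share of texts classified as negative that are negative in the gold standard.
# Recall: share of texts that are negative in the gold standard and were also found.
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```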

The validation of unsupervised models is less direct. While studies argue that topic models, for example, can be validated by manually checking whether topics are coherent (Quinn et al. 2010) and can be differentiated from other topics (Chang et al. 2009; see Grimmer and Stewart 2013 for other approaches), there are no clear thresholds for what constitutes a valid model. Also, validity tests are reported even less often. Another issue is the reliability of these models. As Wilkerson and Casas (2017) summarize, unsupervised approaches are often unstable, meaning that repeated estimations or different starting values lead to different results.

3 Analytical constructs employed in automated content analysis

Due to the interdisciplinarity of the method, automated content analysis has been used to measure a variety of constructs. For the field of communication science, studies often focus on four constructs of interest (see similarly Boczek and Hase 2020):

  1. Actors: Many studies in the field of communication science use manual analysis to analyze how often actors, e.g., politicians or parties, are mentioned in texts (Vos and van Aelst 2018). Automated content analysis can be of great help in this context. The recognition of so-called “named entities” (NER), including persons, organizations, or locations, has a long tradition in computer science. While different approaches have been discussed and the correct recognition of named entities is not yet a solved problem (Marrero et al. 2013), studies have introduced potential approaches to our field. Recent analyses for example use rule-based approaches and dictionaries (Lind and Meltzer 2021; van Atteveldt 2008), machine learning (Burggraaff and Trilling 2020), or combinations of these methods (Fogel-Dror et al. 2019) to automatically classify named entities and often entity-related sentiment in text (see the sketch after this list). This already indicates why these approaches might be of interest: Not only can we automatically count names of entities mentioned in text; we can also measure how different entities relate to each other, e.g., who talks about whom (van Atteveldt 2008), and sentiment concerning specific actors, e.g., how an entity is evaluated (Fogel-Dror et al. 2019).

  2. Sentiment or Tone: Many studies are interested less in entity-related sentiment and more in the general sentiment or tone of news, for example of economic (Boukes et al. 2020) or political coverage (Young and Soroka 2012). A plethora of overview articles deliver introductions to such approaches, which are often discussed in the context of sentiment analysis (Stine 2019; Taboada 2016). Sentiment analysis has developed from relying on dictionaries to using machine learning to applying deep learning and neural networks. Stine (2019) shows that the method delivered better performance with each of these methodological shifts: While off-the-shelf dictionaries deliver insufficient results (Boukes et al. 2020) and organic dictionaries tailored to the genre, topic, and concept of interest in one’s study are recommended instead (Muddiman et al. 2019), supervised approaches seem to offer better results than at least off-the-shelf dictionaries (Barberá et al. 2021; González-Bailón and Paltoglou 2015). Artificial neural networks can also be a suitable approach, especially for unbalanced data (Moraes et al. 2013; Stine 2019), and have already been applied in communication science (Rudkowsky et al. 2018). In sum, machine learning approaches might generally be better suited to analyze sentiment than dictionaries (Barberá et al. 2021). However, almost all of these methods still fall short of human coding (van Atteveldt et al. 2021).

  3. Topics: Many analyses are interested in topics, i.e., what is being talked about in texts. A plethora of methods has been applied to analyze topics: Many studies use unsupervised machine learning in the form of topic modeling (Blei et al. 2003; Maier et al. 2018; Quinn et al. 2010), while others have applied supervised machine learning (Burscher et al. 2015; Scharkow 2012) or dictionaries (Guo et al. 2016). Related to these studies, Trilling and van Hoof (2020) have proposed and compared different methods to detect events in text. While dictionaries seem to perform slightly worse than unsupervised machine learning (Guo et al. 2016), choosing a suitable method depends more on whether researchers already know which topics may appear (Grimmer and Stewart 2013). Supervised learning or dictionaries are more appropriate if a study is interested in identifying a set of predetermined topics. If these are unknown, (structural) topic modeling may be a better fit (Roberts et al. 2014).

  4. Frames: Lastly, many communication scholars are interested not only in what is being talked about in texts but also in how issues are being talked about, in particular framing as the selection and salience of specific aspects (Entman 1993). Recent studies have tried to detect frames based on computational methods, mostly by analyzing topics using unsupervised machine learning. In a second step, they then map similar topics to overarching frames using network analysis and community detection algorithms (Walter and Ophir 2019) or cluster analysis (van der Meer et al. 2019). Others have applied supervised machine learning (Burscher et al. 2014) or compared a range of methods (Nicholls and Culpepper 2021). However, researchers should refrain from presuming that constructs identified through computational methods can (always) be called frames, especially based on unsupervised approaches (Nicholls and Culpepper 2021; van Atteveldt et al. 2014).
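As referenced in the first item above, named entity recognition can be sketched in a few lines of Python using the spaCy library; the example sentence is invented, and the small English model is assumed to be installed (python -m spacy download en_core_web_sm).

```python
# Minimal NER sketch with spaCy: detect persons, organizations, and locations.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("Angela Merkel met Emmanuel Macron in Brussels to discuss EU policy.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g., PERSON, GPE (locations), ORG
```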

4 Research desiderata

Automated content analysis has gained in importance across disciplines, including communication science. In step with rising computational power, it has transformed the ways in which we think about and approach analyses of text. However, standards for how to conduct these analyses are still evolving. Moreover, which methods and analysis steps are most suitable for a specific study depends on the data and research question at hand (Grimmer and Stewart 2013). In reality, the availability of computational power, of manually annotated data, or of a researcher’s coding and statistical knowledge often influences such choices. Many departments in the field of communication science do not yet offer the courses on statistics or programming that are necessary for communication scientists to fully understand and apply these methods (Boczek and Hase 2020).

Furthermore, the lack of available methods beyond bag-of-words approaches represents a research desideratum. Especially when dealing with more complex questions above and beyond how often a certain word or actor is mentioned in a given text, for example relationships between actors, studies need to more strongly consider syntactic relationships. Approaches for this have already been proposed (Fogel-Dror et al. 2019; van Atteveldt 2008), but most analyses still rely on the quite unrealistic “bag-of-words” assumption.

Another ongoing issue is the concern about the reliability and validity of computational methods (Nelson 2019), which are often neither tested nor reported. Uncertainty and error are an inherent part of automated analyses, similar to manual content analysis, where intercoder reliability reflects disagreement between individual coders. Given that studies using manual content analysis almost always need to report intercoder reliability values for publication, similar thresholds for what constitutes a reliable and valid automated content analysis should be developed and made mandatory for the publication of automated analyses. Also, when deciding between manual and automated approaches, innovativeness should not outweigh the reliability and validity of results. While computational methods are often seen as a (methodological) advancement, they still have to satisfy essential validity and reliability thresholds for scholars to trust their results. In conclusion: Researchers should not choose computational methods over existing approaches simply because they seem more innovative.

The biggest question, however, is the following: Even if we measure latent constructs such as topics, frames, or sentiment through automated content analysis – do we actually capture things that are relevant for theories and frameworks within communication science? Take topic modeling: There is an ongoing discussion about what topics mean (Maier et al. 2018). Are topics simply issues discussed in the news (van Atteveldt et al. 2014) or, if clustered, may they be interpreted as frames (Walter and Ophir 2019)? In other words, what do we gain by measuring topics in the news? Among other things, the shift to computational social science brings forward rigorous demands not only for statistical analysis and research designs but also for theory building (Peng et al. 2019). And while computational methods may inspire such theory building (Waldherr et al. 2021), the status quo leaves further food for thought not only for methodological advances, but also for how computational methods might change existing and push new theories in communication science.

Relevant Variables in DOCA – Database of Variables for Content Analysis