1 Introduction

In recent years, scholars have combined manual and automated content analysis with a broad range of other methods of data collection to analyze communication effects on public opinion and to validate findings obtained through other means and sources. While mixed approaches based on content analysis and another method are not new (Lazarsfeld et al. 1968; Miller et al. 1979), their importance in communication studies is growing as new data sources become available and new methods are developed. In this chapter, we present a synthesis of different types of research designs applying content analysis in mixed methods approaches. This review is based on a strategic sample of cases and studies that we use to provide an articulated view of disparate and unconnected research areas. Overall, this chapter should serve as a starting point for researchers who seek to optimize the use of content analysis by combining it with other methods of data collection and analysis to account for a broad repertoire of socially relevant phenomena.

Citizens nowadays use and combine information from different sources (TV, newspapers, radio, Internet) and genres (e.g., news, infotainment) more or less simultaneously and with varying degrees of attention. In many cases, the same content is conveyed through different platforms and may even come mediated through friends, acquaintances, and extended networks of unknown contacts. This complex and diversified information environment, which some have defined as a hybrid media system (Chadwick 2013), has not gone unnoticed by communication scholars.

For researchers interested in the impact of messages and media consumption on behavior, the new information ecosystem poses challenges, since effects depend on these very diverse and ever-changing patterns of exposure to information. This also has practical implications: to address many pressing research questions, scholars require vast datasets of individual usage patterns. Collecting and analyzing those datasets requires considering users’ privacy expectations and obtaining their consent. Oftentimes, researchers also need to analyze access to information via social media referrals (e.g., Cardenal et al. 2019) or shared links using platforms’ Application Programming Interfaces (APIs). This can be challenging, as APIs’ terms of use change frequently and are plagued with data access restrictions, which can only be overcome by agreements between academia and social media platforms (Boeschoten et al. 2020; King and Persily 2019). Finally, wrangling such big datasets calls for cross-disciplinary collaborations with much larger teams than most scholars were used to working with just a few years ago (King 2014).

There is no doubt, however, that the opportunities offered by the current information ecosystem outweigh the challenges. We can now observe how individuals at the receiving end use and interact with different information sources with unprecedented levels of granularity and precision. Mobile apps and Internet browsers allow scrutinizing usage habits and detecting levels of individual attention in unprecedented ways. The possibilities provided by the internet to collect different types of content and messages (conveyed through tweets, posts, news stories, or political statements) and link them to individual reads, shares, retweets, posts, and comments allow researchers, for instance, to explore immediate behavioral reactions to information among big-N population samples (Barberá et al. 2019). Also, a growing number of open online repositories and archives of different data types (media content, expert surveys, codes, or even pre-registered research designs) allows researchers to harmonize and cross-validate communication texts with data collected by other means (see, for instance, Harvard Dataverse; see also Crosas 2013). At the same time, social scientists leverage these new data sources by increasingly borrowing tools and techniques well established in other scientific fields, such as network analysis and text mining, which in turn offer new opportunities to make sense of online content data and connect it to other sources of information.

Ultimately, (old and new) multi-method approaches can help researchers explain relevant social phenomena from the ever-expanding range of communication texts and modes. In what follows, we offer an overview of how mixed method approaches can optimize the use of content analysis. We present frequently used mixed method designs that embed content analysis along three research aims (see the summary in Table 1). We first outline a set of strategies to link content data with survey data to analyze media audiences’ usage patterns and their effects. We then dive into a second strand of studies that have taken advantage of the strengths of several data collection methods to add robustness to, cross-validate, and generalize outcomes from content data on media or politicians. Third, and last, we show how network analysis can offer modeling strategies to communication scholars who use content data to map relationships between subjects of theoretical interest and explain their behavior over time. In the final section of this chapter, limitations of the aforementioned approaches and directions for future research are discussed.

Table 1 Types of mixed approaches using content analysis (own representation)

2 Linkage analyses

Studies linking content data with survey data are often called linkage analyses or linkage studies (de Vreese et al. 2017; Scharkow and Bachl 2017). These studies have been particularly frequent in political communication research and relate media content to media use and its effects among large population samples. Linkage studies allow for comprehensively investigating what kind of content people use, with which frequency and intensity, and how it drives a particular reaction, attitude change, or behavior. Overall, linkage analyses help put “flesh on the bone” (Schuck et al. 2016, p. 206) of studies aiming at understanding how media usage patterns cause a particular outcome, over time and/or at multiple units of analysis (message, but also source or country).

To date, studies using linkage approaches have followed two main strategies. A first set of studies infers people’s exposure to particular content and the effect of such content by linking aggregated content data to aggregated survey or behavioral data. This approach has also been called second-order linkage (Neuendorf 2002; Schulz 2008). Studies of this kind typically use a message code or category (e.g., the salience of a policy issue), aggregate relevant units of analysis accordingly (e.g., the proportion of news coverage reporting on the issue) and, finally, correlate those aggregates with aggregated measures of use, attitudes, or behaviors (e.g., audiences’ perceptions of that policy issue). This allows researchers to obtain a glimpse of which portion of the public may have been exposed to, and affected by, a given message.
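To make the mechanics concrete, the following minimal sketch (in Python, with entirely hypothetical data) aggregates story-level codes by month and correlates the resulting salience series with an aggregated survey series; all names and values are illustrative assumptions rather than data from the studies cited above.

```python
# A minimal sketch of a second-order linkage under assumed, hypothetical data.
import pandas as pd

# One row per coded news story: month of publication and a binary issue code.
stories = pd.DataFrame({
    "month": ["2020-01", "2020-01", "2020-02", "2020-02", "2020-03"],
    "economy_frame": [1, 0, 1, 1, 0],
})

# Aggregate the content data: share of stories carrying the frame per month.
salience = stories.groupby("month")["economy_frame"].mean()

# Aggregated survey data: mean evaluation of economic conditions per month.
survey = pd.Series({"2020-01": 3.1, "2020-02": 2.7, "2020-03": 3.4},
                   name="econ_evaluation")

# Second-order linkage: correlate the two aggregate series.
linked = pd.concat([salience, survey], axis=1)
print(linked.corr())
```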

Cases in point are studies from the agenda setting literature (e.g., McCombs and Shaw 1972; Soroka 2002), but studies of U.S. media framing and of the tone of EU-related news coverage have also made use of second-order linkage. Hester and Gibson (2003) content-analyzed a series of stories on economic news in the US to identify negative and positive frames on the economy and linked them to monthly aggregated consumer evaluations of economic conditions over a 48-month period. As yet another example, Vliegenthart et al. (2008) used content analytical indicators of conflict and benefit framing in news coverage and compared their values with aggregated Eurobarometer survey data on individuals’ support for EU integration across 17 years, using time variation to account for how media contexts explain “public opinion dynamics” (Vliegenthart et al. 2008, p. 415). Boomgaarden and Vliegenthart (2007) used computer-assisted content analysis of Dutch national newspapers to assess the impact of immigration-related media coverage on the vote share of anti-immigrant parties and, more recently, Brosius et al. (2019a) employed sentiment analysis to assess the impact of the tone of news coverage toward the EU on citizens’ trust in the EU (see also Brosius et al. 2019b).

The abovementioned studies are longitudinal in nature and use time-series analyses to compensate for the reduced number of observations resulting from aggregating data. However, all of these studies are limited by their inability to establish a one-to-one correspondence between a particular message and its recipient. That is, even though some of them distinguish how much news coverage a particular media outlet or program devotes to a relevant dimension, the use of aggregated audiences makes it difficult to determine individual behavioral effects or, where general news or media use is not measured at all, to grasp which messages people may actually have been exposed to.

A second set of studies connecting content data and media use at the individual level – also called first-order linkage (Neuendorf 2002) – is better suited to establish such a one-to-one correspondence between messages (or sources) and the individuals exposed to them, and to engage in causal analysis. These studies tend to be more precise and focus on the impact of content analytical variables at the message, program, or medium level, rather than at the media-system or national level. As with second-order linkage studies, researchers using first-order linkage designs code or categorize a variable – issue, actor or event visibility or prominence, tone, frames, exemplars, message types – and then aggregate the results of such content analysis at relevant coding units, whether headlines, news stories, articles, or TV programs (de Vreese and Semetko 2004; Donsbach 1991; Schuck et al. 2014). More sophisticated approaches also account for how prominently a coded variable is placed in a story, or weigh content variables’ code occurrences by the length of the news story, medium circulation numbers, or audience shares (de Vreese et al. 2017). In a second step, first-order analyses use individuals’ responses to a survey questionnaire to determine the frequency and amount of usage of different outlets, platforms, or messages. The actual linkage is then done by weighting the proportion or frequency with which a given media message variable appears in each medium by the frequency of individuals’ exposure to that medium (e.g., de Vreese and Semetko 2004; de Vreese et al. 2017; Schuck et al. 2014). Some more complex designs weigh exposure to media messages by publication recency, that is, by how close in time the publication of a news story was to an individual’s exposure to the medium (e.g., de Vreese et al. 2017).
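The core weighting step can be sketched as follows. This is a minimal illustration under assumed data – outlet names, exposure scales, and content shares are hypothetical – of multiplying each respondent’s self-reported outlet use by the share of that outlet’s coverage carrying the message of interest and summing across outlets.

```python
# A minimal sketch of the first-order linkage weighting step; all values
# are hypothetical and do not come from any of the cited studies.
import pandas as pd

# Content analysis aggregated per outlet: share of stories with a conflict frame.
content = pd.Series({"OutletA": 0.40, "OutletB": 0.10, "OutletC": 0.25},
                    name="conflict_share")

# Survey self-reports: days per week each respondent uses each outlet (0-7).
exposure = pd.DataFrame(
    {"OutletA": [7, 0, 3], "OutletB": [0, 5, 2], "OutletC": [1, 2, 7]},
    index=["resp_1", "resp_2", "resp_3"],
)

# Weighted exposure per respondent: outlet use times the outlet's message
# share, summed over outlets; this score then enters the effects model.
weighted_exposure = exposure.mul(content, axis=1).sum(axis=1)
print(weighted_exposure)
```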

Seminal studies in the so-called first-order linkage strand of literature combining content data and media use at the individual level are Erbring, Goldenberg and Miller (1980), Kepplinger et al. (1991), Lazarsfeld et al. (1968), and Miller et al. (1979). These studies were among the first to supplement public opinion studies with content analyses of campaign messages in newspapers, magazines, radio speeches, and newscasts (Lazarsfeld et al. 1968; Schulz 2008). They were mostly concerned with agenda setting and issue coverage effects on individuals’ perceptions of the most important problem facing the country (Erbring et al. 1980 for the US case) or familiarity with and position toward an issue (Kepplinger et al. 1991 for the German case). Lazarsfeld et al. (1968) were also among the first to weight measures of self-reported exposure to campaign information by media outlets’ Republican and Democratic leanings to analyze selective exposure patterns and their implications.

More recently, studies weighting self-reported measures of news or media use by aggregated values of content analytical variables have investigated the effects of media coverage (issue or actor visibility, prominence, and attention) on the public image of political leaders (Bos et al. 2011), of media’s EU evaluations on EU skepticism (Peter 2004) and EU vote (van Spanje and de Vreese 2014), of news media tone on party or vote choice in referendums (de Vreese and Semetko 2004; Hopmann et al. 2010), and of exposure to non-like-minded or counter-attitudinal views through the media on vote decisiveness (Matthes 2012) or EU turnout (Castro Herrero and Hopmann 2017; see also Castro et al. 2018 for a similar approach applied to a study using cross-cutting media exposure as an outcome variable). Framing studies and studies focusing on journalistic reporting styles have also made extensive use of linkage analysis, as attested by a strand of research investigating how exposure to strategic or conflict framing, or to a populist style, affects political cynicism, learning, political polarization, or turnout (Jebril et al. 2013; Müller et al. 2017; Schuck et al. 2013, 2014). Some of the abovementioned studies, including Takens et al. (2015), use panel data to investigate how media coverage of political leaders affects electoral behavior. Takens et al.’s (2015) longitudinal study is particularly notable, since they used 11 waves of a survey to discern media priming effects on people’s evaluations of party leaders and their use in voting decisions. As de Vreese et al. (2017) posit, the use of panel data allows researchers to capture information use across time and represents a more reliable first-order linkage of content and behavioral data, since it is able to identify within-individual changes in attitudes and cognitions.

First-order as well as second-order linkage studies face a number of problems. Approaches that combine content data with individual-level data are no exception to the presence of measurement error, which can bias effect estimates. With regard to media content data, Scharkow and Bachl (2017) posit that between-outlet variance might generally be underestimated as a result of random misclassifications in content analyses scrutinizing the proportion of news stories with a particular message, which in turn may lead to underestimating the effects of using such outlets on particular dimensions. De Vreese et al. (2017) further point to the difficulty of building equivalent measures of visibility or prominence across different types of outlets (e.g., is a news item featured on the front page of a newspaper as prominent as one placed among the first stories of a TV news bulletin?). Potential remedies for the resulting measurement errors are to consider alternative specifications of the chosen operationalizations or to assess the predictive validity of different measurement strategies (Dilliplane et al. 2013). For instance, coming back to the prominence example, researchers may assess whether a story coded as prominent in a TV news program is equally influential (e.g., mobilizing) as a prominent story in a newspaper (de Vreese et al. 2017). De Vreese et al. (2017) also suggest discounting content features that appear simultaneously with the content feature of theoretical interest but have the opposite effect (e.g., by subtracting one content score from another, p. 230) as a way to refine effect measurement.

Self-reported measures of media use are another important source of measurement error. Individuals’ difficulty recalling how often they consult information sources, as well as satisficing strategies, may increase random measurement error (Scharkow and Bachl 2017), while social desirability bias might yield more systematic errors. In this vein, self-reports have often been criticized for overestimating media consumption (Prior 2009; de Vreese et al. 2017), and in particular the consumption of news and political articles (Vraga and Tully 2018). These overestimations of news consumption pose additional challenges since, as Scharkow and Bachl (2017) show, overreports on bounded measures of news exposure may “decrease its variance as compared to the true media use” (Scharkow and Bachl 2017, p. 327). Previous research has tried to find frequent correlates of systematic over-reporting, such as political interest or strength of partisanship, to identify usual sources of bias, with mixed results (Guess et al. 2018; Prior 2013). Other approaches have suggested alternatives that may increase the reliability and validity of self-reported measures of information usage, such as providing survey respondents with specific lists of outlets, programs, and frequency scales (Andersen et al. 2016; Dilliplane 2011; Moehler and Allen 2016), or analyzing additional measures of message attention, motivation to seek information (Chaffee and Schleuder 1986), information processing (Ball-Rokeach and DeFleur 1976), or need for cognition, among others (see Althaus and Tewksbury 2007, for an overview). Additionally, Takens et al. (2015) account for potential decays in media attention across time in their multi-wave study by applying a decay rate based on the predicted probability of a decrease in retrieving campaign information. More generally, Scharkow and Bachl (2017) put forth a simulation framework that can be used to gauge potential bias in linkage analyses for different combinations of error in content and survey data, which, as the authors put it, “can be a useful tool for a priori power analyses or post hoc sensitivity analyses” (p. 336).
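In the spirit of such a simulation framework – though without reproducing Scharkow and Bachl’s (2017) actual design – the following sketch illustrates how noisy self-reports and random coding error attenuate a known linkage effect; all distributions and parameter values are arbitrary assumptions chosen for illustration.

```python
# A minimal measurement-error simulation sketch with arbitrary parameters.
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

true_exposure = rng.poisson(3, n).astype(float)      # true weekly message contacts
outcome = 0.5 * true_exposure + rng.normal(0, 1, n)  # data-generating effect of 0.5

# Self-reports: overreporting plus recall noise, capped at a bounded 0-7 scale.
reported = np.clip(true_exposure + 1 + rng.normal(0, 1.5, n), 0, 7)

# Content coding error: 20% of true contacts randomly misclassified as absent.
coded = true_exposure * rng.binomial(1, 0.8, n)

# Compare bivariate slope estimates; error in the predictor attenuates them.
for name, x in [("true", true_exposure), ("self-report", reported), ("miscoded", coded)]:
    slope = np.polyfit(x, outcome, 1)[0]
    print(f"{name:12s} slope estimate: {slope:.2f}")
```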

3 Cross-validating content data

A second strand of mixed method designs for content analysis is concerned with the validation of content data, often with the purpose of generalization. This typically involves comparing results from content analysis of a primary data source with results on the same dimensions yielded by data collected through other means. These other datasets serve as a benchmark against which the so-called convergent or relative validity (Marks et al. 2007) of measures can be assessed.

In our discussion of examples of such validation efforts, we focus on the measurement of party positions, an area with extensive research on the trade-offs between different measurements. This is because such latent positions can be measured with a variety of approaches, whether based on expert surveys, elite surveys, or content analysis of the variety of texts published by parties and politicians. That is, unlike in studies of media effects, researchers need to justify both their measurement tools and the texts they apply these tools to.

Many studies justify their choice of text with a comparison to results from other texts (e.g., Hutter and Gessler 2019). The different texts produced by parties differ in their audience and purpose as well as in the frequency of their publication. Consequently, researchers have to make trade-offs between choosing sources that explicitly communicate positions (e.g., party manifestos), sources that are read by a wider audience (e.g., news reports on party positions), and sources that are published with a higher frequency (e.g., party press releases or tweets). Parties can strategically choose to emphasize different issues across these platforms. Hence, while establishing the comparability of results across sources is crucial for research, the goal can only be to establish convergence, since each platform comes with its own logic, frequently termed ‘affordances’ (Bucher and Helmond 2017). While party positions will differ across platforms due to differences in the audience and purpose of these platforms, measures for a single party should converge across multiple platforms.

Another approach has been to validate the results of content analysis through comparison with measures based on surveys (e.g., Marks et al. 2007; Bakker and Hobolt 2013; Helbling and Tresch 2011). In the absence of a gold standard or criterion measure of party positioning on European integration, Marks et al. (2007) compared the most frequently used content data for identifying party positions (i.e., electoral manifesto data) with measures of party positioning on European integration obtained from different surveys (expert surveys, elite surveys, and election surveys). Analyzing the error structure of each dataset (i.e., for which types of parties the prediction of one data source deviates from that based on other data sources), they show that in some cases (i.e., when parties are internally divided), combining manifesto content with other datasets yields more valid measures of positioning on European integration. Hence, this type of cross-validation is particularly suited for establishing the limits of content analysis (but see Benoit and Laver 2007).
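A minimal sketch of such an error-structure comparison might look as follows: it computes the convergent validity of two hypothetical position measures and checks whether disagreement concentrates among internally divided parties. All values and labels are invented for illustration.

```python
# A minimal sketch of an error-structure check with hypothetical party data.
import pandas as pd

positions = pd.DataFrame({
    "manifesto": [2.1, 5.5, 8.0, 4.9],          # manifesto-based position
    "expert":    [2.4, 5.2, 7.6, 8.1],          # expert-survey position
    "divided":   [False, False, False, True],   # internally divided party?
}, index=["P1", "P2", "P3", "P4"])

# Convergent validity of the two measures across parties.
print(positions["manifesto"].corr(positions["expert"]))

# Error structure: where do the measures diverge most?
positions["gap"] = (positions["manifesto"] - positions["expert"]).abs()
print(positions.groupby("divided")["gap"].mean())
```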

Such cross-validation with existing measures is particularly important when establishing the validity of new measures. This also concerns studies with digital data that draw on content analysis to varying extents: for example, Mellon (2014) compared interviewees’ responses to Gallup survey questions on America’s most important problems with Google search trends for those same policy issues to determine to what extent the latter can be used as a proxy for issue salience in public opinion. Other studies contain a larger content-analytical component, as they apply dictionary-based content analysis to text snippets collected from the internet (e.g., O’Connor et al. 2010) and validate the resulting sentiment or salience measures with survey data. As with linkage analyses (e.g., de Vreese et al. 2017), these studies frequently make use of time series analyses in order to account for the dynamic aspects of the data and increase the validity of their analysis. This approach of cross-validating data retrieved from public opinion polls and surveys with online content analysis is also similar to studies of agenda setting media effects using second-order or correlational linkage analysis (e.g., Conway et al. 2015; Soroka 2002). Hence, combining different data sources for validation faces similar challenges in establishing correspondence between the measures.
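The dynamic aspect can be illustrated with a simple lagged-correlation check between two hypothetical monthly series, say search interest and survey-measured issue salience; the simulated series and their lead-lag structure are assumptions made purely for demonstration.

```python
# A minimal sketch of a lagged cross-validation between two simulated series.
import numpy as np

rng = np.random.default_rng(1)
months = 48
survey_salience = np.cumsum(rng.normal(0, 1, months))  # e.g., % naming the issue
# Simulated search interest that anticipates the survey series by one month.
search_interest = np.roll(survey_salience, -1) + rng.normal(0, 0.5, months)

# Correlate search interest at t with survey salience at t + lag.
for lag in range(4):
    r = np.corrcoef(search_interest[: months - lag], survey_salience[lag:])[0, 1]
    print(f"lag {lag}: r = {r:.2f}")
```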

In some cases, combining content data with other data for validation purposes can also provide substantive insights. One such application of automated and semi-automated content analysis techniques is provided by Guess et al. (2019). Guess and colleagues compared self-reported questions on individuals’ frequency of posting “about politics” with their actual political posting activity. To categorize individuals’ posts as political, they hand-coded a set of Facebook and Twitter posts and used the resulting labels to train a machine learning model and classify the rest of the data. Content analysis was therefore used to measure the level of discrepancy between actual political online activity (posting activity) and individuals’ subjective, reported behavior in survey responses. They found that the high variance in how people tweet about politics (with some tweeting multiple times a day and others only twice a month) and the use of a bounded 6-point scale question for self-reporting could explain the discrepancies between actual posting and self-reports among heavy users.
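The hand-code-then-classify workflow can be sketched as follows; the tiny corpus, labels, and model choice are stand-in assumptions and do not reproduce the actual pipeline of Guess et al. (2019).

```python
# A minimal sketch of scaling up hand-coded labels with a text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-coded posts serve as training labels (1 = political, 0 = not political).
posts = ["the senate passed the budget bill", "lovely brunch with friends",
         "register to vote before the deadline", "new season of my favorite show"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(posts, labels)

# Classify the remaining, uncoded posts at scale.
print(clf.predict(["party leaders debate immigration policy tonight"]))
```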

4 Semantic network analysis

The combination of content analysis and tools from network science results in the third and final analytical framework that we present in this chapter. This approach is known more broadly as semantic network or discourse network analysis and offers a set of modeling strategies to assess media, political, or other communication content as a relational structure that can then be used to explain processes and outcomes such as agenda setting or voting (Doerfel and Connaughton 2009; Leifeld 2017; Yang and González-Bailón 2017). Semantic network analysis is a well-established analytical framework in network science but remains uncommon in political and communication research. Network science itself was only recognized as a new scientific discipline as recently as 2005 (Barabási 2015), so it is no surprise that its subfields, such as semantic network analysis, have not yet been broadly adopted by researchers.

In general, network science allows researchers to shed light on the processes that underpin social relations. More specifically, networks are simple representations of those processes and capture the basic relations among those who take part in them. Nodes and ties are the two basic elements of any network, and what they represent depends on the specific question to be addressed. For example, nodes in semantic networks can represent either words or concepts (e.g., topics or frames of certain policies or debates) or subjects or actors (e.g., media outlets, politicians, or social media users). In those examples, ties can represent the number of actors using certain words or concepts, or they can depict the number of co-occurring words or concepts commonly used by actors. Overall, as Yang and González-Bailón (2017) put it, ties in semantic networks proxy social relations by depicting associations between concepts or between those who use them. Figure 1 illustrates three different types of semantic networks. Independently of what nodes and ties measure in any specific study (e.g., relations between concepts or between actors who use certain frames), the underlying idea is that shared understandings of issues and concepts can help us explain the functioning of institutions or the building of public opinion. Through semantic networks, those shared understandings can be mapped and reduced to structures of interdependencies that can then be analyzed using a wide set of tools from network science.

Fig. 1 Illustration of three toy semantic networks as graphs (own representation)

Note: The toy network in panel A is a two-mode or bipartite network based on affiliation data and therefore has two types of nodes. Red nodes represent semantic concepts and green nodes represent actors. Ties connect actors with the semantic concepts they have used. This is a weighted network, as illustrated by the width of the ties, which measures the strength of the connection between actors and concepts. Notably, relations in semantic networks can be captured as a binary variable (whether they exist or not) or on a continuous scale, as in the example provided. As a real-world example, ties in this toy network could represent the number of times a political candidate has used a certain concept in her public speeches. The toy networks in panels B and C represent one-mode projections of different two-mode networks; hence, there is only one type of node in each network. In the toy network in panel B, nodes can represent semantic concepts and ties, for instance, the number of actors that used a pair of concepts. Finally, in panel C, nodes represent actors and ties would measure the strength of their relation based on shared concepts. For illustrative purposes, nodes have been sized according to their degree centrality, measured as the strength of their relation with the other nodes in the network.
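The structures described in the note to Fig. 1 can be built with standard network tools. The following sketch uses the Python library networkx to construct a weighted two-mode actor-concept network and derive both one-mode projections; actor and concept labels are toy values.

```python
# A minimal sketch of the three network types in Fig. 1, with toy labels.
import networkx as nx
from networkx.algorithms import bipartite

# Two-mode (actor-concept) network with weighted ties (panel A).
B = nx.Graph()
B.add_nodes_from(["pol_1", "pol_2", "pol_3"], kind="actor")
B.add_nodes_from(["austerity", "pensions"], kind="concept")
B.add_weighted_edges_from([("pol_1", "austerity", 3), ("pol_2", "austerity", 1),
                           ("pol_2", "pensions", 2), ("pol_3", "pensions", 4)])

# One-mode projections: concept-concept (panel B) and actor-actor (panel C).
concepts = bipartite.weighted_projected_graph(B, ["austerity", "pensions"])
actors = bipartite.weighted_projected_graph(B, ["pol_1", "pol_2", "pol_3"])

# Weighted degree centrality, as used to size the nodes in panel C.
print(dict(actors.degree(weight="weight")))
```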

According to Shumate et al. (2013), semantic networks can be examined at different levels. At the basic level, one can assess the characteristics of the network structure and identify patterns of word usage in a corpus of text. This approach can result in a network mapping the co-occurrence of concepts, in which clusters or other relevant structural features are identified. Along these lines, Farrell (2016) shows how network analysis enables empirical testing of discursive field theory in his study of the role of corporate funding in the polarization of the climate change debate. He applies automated text analysis to identify frames in a corpus of text produced by 164 organizations in the counter climate change movement during a time window spanning twenty years. He then shows, by means of network analytical tools, that privately funded organizations published significantly more content aimed at polarizing the climate change debate.

Secondly, going beyond this descriptive effort, one can also use semantic networks to explain outcomes by examining relations through shared meanings. An illustration of the latter is Leifeld (2013), where the author analyzes issue positions in the German pension debate. His approach illustrates the potential of network analysis to identify mechanisms explaining the relationships between political actors through their shared understanding of a political process. Leifeld (2013) applies network analysis techniques to assess how different coalitions interact, across time, to frame the debate around pensions and, eventually, adopt a new regulation scheme. To this end, he gathers politicians’ statements in the media and in parliament, uses manual content analysis to annotate the corpus of text, and reduces it to shared frames. He then creates networks of actors and shared frames across time to understand how policy debates evolve. Finally, he applies inferential network techniques to determine the micro-level processes governing political discourses, such as popularity, coalition formation dynamics, and clustering among different political actors (Leifeld 2016).

How semantic networks can be applied to understand political outcomes is also illustrated by Doerfel and Connaughton (2009). Although the authors do not use inferential network techniques and hence explicitly avoid making causal claims, their work shows that semantic structures help explain electoral outcomes. In their study, they assess the structural semantic similarity between presidential winners and losers in the US over 44 years. Their results show that presidential winners systematically use more semantically cohesive and central discourses. They obtain the semantic network structures by applying automated content analysis techniques to map co-occurrences of words in all televised presidential debates, which are later linked to each candidate’s discourse. As yet another example of the explanatory value of semantic networks, Yang and González-Bailón (2017) point to the work of Bail (2012), who identifies how anti-Muslim frames about the 9/11 attacks permeated mainstream media reporting. To do so, he combines manual content analysis with network tools, showing how the latter “can enrich researchers’ explanatory repertoire” (Yang and González-Bailón 2017).

5 Outlook and desiderata

In this chapter, we heeded the call by previous studies to provide an informed overview of the benefits and pitfalls of “integrated research designs” (Stier et al. 2020b, p. 2). In particular, we focused on the potential of mixed method approaches to optimize the use of content analysis in advancing research along three main aims. That is, we first described a series of frequently used steps for linking content data with survey data to analyze media effects. Second, we reviewed studies that cross-validate content data with data obtained through different methods with the aim of generalization. Finally, we explained how network analysis offers modeling strategies to communication scholars interested in using content data to map relationships between subjects and explain their aggregated behaviors over time.

We first outlined the methodology and applications of so-called linkage analyses that connect media content with its effects. In contrast to alternative methods that primarily offer internal validity (namely, experiments), these “real-world designs” (Schuck et al. 2016, p. 210) allow scholars to work with large samples and provide (a) the possibility to capture individuals’ information habits and their effects and (b) versatility in analyzing data at multiple units of analysis (type of source, user, country).

Until recently, the main challenge of linkage analyses was to determine whether, how frequently, and how prominently individuals encountered a particular message in a given source. More specifically, researchers using this approach faced difficulties in measuring message exposure below the level of the medium, as well as potential reliability problems with survey responses. However, the ability to measure such exposure is growing, not only because of researchers’ improved remedies for measurement errors in content analysis and self-reports (see the section on linkage analyses above), but also because of the new possibilities offered by digital trace data.

Researchers can now ask survey respondents for consent to collect their anonymized online behavioral data during a set time window. This allows them to trace individuals’ actual exposure to specific media content and link it to their attitudes, preferences, and behaviors (e.g., Peterson et al. 2019). These new possibilities come with limitations, however. Consent rates for web tracking are still particularly low (Guess et al. 2019; Stier et al. 2020b) and introduce additional sample biases into non-probabilistic online population samples (for a discussion of ways to address this issue, see Peterson et al. 2019). Furthermore, when coupling survey data with social media data, researchers face limitations arising from the low penetration rates of some of those platforms, especially those which provide easier access to their data (e.g., Twitter, Reddit), as well as limited access to users’ behavioral data due to, for example, frequent changes in the terms of use of platforms’ APIs. An additional challenge that needs to be addressed in order to take full advantage of the potential of online behavioral data is the need to simultaneously track individuals’ navigation patterns across multiple platforms and devices (desktops, mobiles), since exposure and attention to certain content might be moderated by the use of different devices (Dunaway et al. 2018; T. Yang et al. 2020). Finally, researchers need to be aware of frequent changes in the conditions of use of social platforms or search engines that can affect online behavior such as news or political information use. For example, in recent years, Facebook decided to downgrade news content in favor of close friends’ posts, and Twitter increased the number of characters allowed per tweet (Mosseri 2018; Rose 2017). As past research has shown, such changes can erroneously be interpreted as individual behavioral signals (Salganik 2017).

Overall, research attempts to analyze individual exposure to information at lower levels of analysis than the medium or the platform (e.g., message, news story) by combining behavioral data and automated content analysis seem promising, but they are still scarce. To date, most studies on media consumption have drawn on data at the domain or URL level (Allen et al. 2019; Cardenal et al. 2019; Stier et al. 2020a; Tyler et al. 2019). The analysis of media content at the URL level (Tyler et al. 2019) lacks the granularity to assess the exact type of content (e.g., frames, tone, reporting styles) that is being accessed online. Looking ahead, more researchers should aim at crawling content below the URL level to capture “real world” and frequent information habits (see Guess et al. 2018; Peterson et al. 2019 for recent examples).

For cross-validation, the main challenge going forward is one of scale: machine learning approaches increasingly allow scaling up coding done on a smaller sample. While this increases the generalizability of content analytical data (one of the main aims of cross-validation), the transferability of machine learning approaches across text types – which is crucial for cross-validation – remains challenging due to the lower performance of classifiers outside their training sample. Here, unsupervised approaches like structural topic modeling may provide an interesting avenue forward (see Heiberger et al. 2021).
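Structural topic models are typically estimated with the stm package in R; as a rough stand-in that illustrates the unsupervised route, the following sketch fits a plain LDA model with scikit-learn on a toy corpus. The corpus, the number of topics, and all parameters are arbitrary assumptions.

```python
# A minimal sketch of unsupervised topic modeling (plain LDA, not stm).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["tax cuts and the budget deficit", "budget debate in parliament",
        "championship final tonight", "the team won the final match"]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)                     # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-3:][::-1]]  # top terms per topic
    print(f"topic {k}: {top}")
```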

Finally, despite the potential of semantic network analysis, little research to date has used it to measure shared meanings of concepts or frames and their relation to (political) outcomes (Shumate et al. 2013). Most research in the field has used cross-sectional data. Notably, though, the temporal dimension encodes important information for understanding processes like political polarization and other public opinion dynamics (Yang and González-Bailón 2017), and yet few studies have taken a longitudinal perspective – the cases mentioned above being among the few notable exceptions. Furthermore, little theory has driven most of the studies in this field, which, as Shumate et al. (2013) note, have mostly inductively focused on clusters of shared meanings and frames or on prominent concepts or actors (Chung and Park 2010; Danowski and Park 2013; Farrell 2016). In the future, researchers should focus on employing semantic network analysis to understand the underlying mechanisms driving patterns of shared meanings and word use, as well as on explaining the outcomes expected from those relations. Ideally, they should also advance the consensus around the network metrics used to summarize the properties of semantic structures at the local level (e.g., node centrality) and at the meso level (e.g., reduction techniques).

On a more general and final note, future and more comprehensive literature reviews and meta-analyses of mixed-method approaches should account for methods of data collection and analysis other than those discussed in this chapter. Prospective overviews should devote more attention to identifying frequent research designs and data types that have been used in combination with content analysis to address research aims beyond the three highlighted in this chapter. For example, input–output analyses comparing politicians’ or governmental communication to their media coverage (Jungblut 2020), as well as research on journalistic values and the processes behind news decisions, have combined different kinds of texts and media content with interviews with journalists to study news coverage of relevant societal issues. Additionally, a few recent studies have employed standardized content analysis to evaluate eye-movement records from eye-trackers in experiments analyzing online searches (e.g., Kessler and Zillich 2017), or to automatically code open-ended survey questions (Hopkins and King 2010; Simon and Xenos 2004) and estimate topic prevalence contingent on each respondent’s characteristics (Roberts et al. 2014). Future overviews and empirical studies may hence resort to these uses of content analysis to complement the selection of research areas and studies showcased here and advance an articulated view of the combinatory power of mixing content analysis with other methods to study socially relevant phenomena.
