As the first step in preparing the data for topic modeling, I created a list of “bad words” to be removed from the dataset, including common words embedded in online articles (such as “date” and “advertisement”) as well as prepositions and articles (such as “on,” “the,” and “a”). This step further cleaned the dataset and minimized noise from uninformative words. The trimmed sample had 1,229,776 elements in total, with 661,518 elements from CNN and 224,400 from Fox News (Figs. 1, 2).
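The cleaning step described above can be sketched as a simple token filter. The stopword and “bad word” lists below are illustrative stand-ins, not the study’s actual lists.

```python
# Sketch of the cleaning step: dropping stopwords and site-specific "bad words".
# STOPWORDS and BAD_WORDS are illustrative, not the study's actual lists.

STOPWORDS = {"on", "the", "a", "an", "of", "to", "and", "in"}  # partial stopword list
BAD_WORDS = {"date", "advertisement"}                          # boilerplate embedded in articles


def clean_tokens(tokens):
    """Drop stopwords and known boilerplate terms; keep everything else."""
    drop = STOPWORDS | BAD_WORDS
    return [t for t in tokens if t.lower() not in drop]


sample = ["The", "advertisement", "ran", "on", "a", "page", "about", "Charlottesville"]
print(clean_tokens(sample))  # ['ran', 'page', 'about', 'Charlottesville']
```

Applying such a filter to every article yields the trimmed sample of tokens (“elements”) on which all subsequent analyses run.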
I first examined word frequencies in the whole sample, including articles from both CNN and Fox News. A word cloud shows that “Trump,” “white supremacist,” and “violence” were at the center of the dataset. Specifically, word counts for the whole sample show that the most used word, “Trump,” appears 2879 times (0.23%), followed by 2093 counts of “white” (0.17%), 840 counts of “supremacist” (0.07%), and 818 counts of “violence” (0.07%). In comparison, the average frequency of the words in the trimmed sample is 8.9. After breaking down the dataset by media source, both subsamples share the same top four words. The CNN subsample contains 0.34% “Trump,” 0.22% “white,” and 0.08% each of “supremacist” and “violence.” By comparison, the Fox News subsample includes 0.30% “Trump,” 0.29% “white,” 0.13% “supremacist,” and 0.14% “violence,” after adjusting for the total number of elements per media source.
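The frequency figures above are raw counts divided by the total number of elements in the (sub)sample. A minimal sketch of that computation, using a toy token list in place of the actual corpus:

```python
from collections import Counter

# Sketch of the word-frequency computation: raw counts and shares of the sample.
# The toy token list stands in for the trimmed corpus of 1,229,776 elements.


def word_shares(tokens, top_n=4):
    """Return the top_n (word, count, percent-of-total) tuples, most frequent first."""
    counts = Counter(tokens)
    total = len(tokens)
    return [(w, c, 100 * c / total) for w, c in counts.most_common(top_n)]


toy = ["trump"] * 5 + ["white"] * 3 + ["violence"] * 2
for word, count, pct in word_shares(toy, top_n=3):
    print(f"{word}: {count} ({pct:.1f}%)")
```

Running the same computation separately on each subsample, with each subsample’s own total as the denominator, gives the per-outlet percentages reported above.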
The word frequency and word cloud offer a descriptive visualization of the dataset, which confirms the accepted view of the Charlottesville Rally as a violent protest with a highly political nature perpetuating white supremacy in the US (Atkinson 2018; Hartzell 2018; Heim 2017; Klein 2019; Perry 2018; Tien et al. 2019).
Second, I performed standard LDA on the whole dataset and the two subsamples. The results provide two pieces of information: topics and keywords. Topics are assigned based on the general theme of each article, whereas the keywords under each topic suggest the wording preferences associated with that theme. The topic models extracted four topics with 2000 iterations of Gibbs sampling. Table 1 presents the results for the whole sample, the CNN subsample, and the Fox News subsample. After exploring LDA solutions from two-topic to six-topic groupings, the four-topic model provided the most interpretable results with the least word overlap. I then assigned labels to summarize the content of each topic. For the whole sample, Topic 1 focuses on political issues, highlighting politicians (e.g., Trump, Obama, Bannon, and Clinton), political parties (e.g., Republican and Democrat), and other politically relevant words (e.g., campaign and left). Topic 2 concerns white supremacy and includes both explicit terms, such as “white” and “supremacist,” and closely related terms, such as “neo-Nazi,” “nationalist,” and “bigotry.” Topic 3 features racial conflict, with terms such as “Black,” “race,” and “racism”; the word “war” is also relatively prominent. Topic 4 presents the most detailed information about the Charlottesville Rally, including the name of the victim, Heather Heyer, and neutral terms such as “law” and “protest.” Notably, some words are shared across topics.
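The estimation used 2000 iterations of Gibbs sampling; the study’s actual software is not specified here. As an illustration of the technique, the following is a minimal collapsed Gibbs sampler for LDA in pure Python, with toy hyperparameters and iteration counts standing in for the real configuration. It also shows how each topic’s keyword list is ordered by within-topic frequency.

```python
import random
from collections import defaultdict

# Minimal collapsed Gibbs sampler for LDA (illustrative sketch, not the
# study's actual estimation code). alpha/beta/iters are toy values.


def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """docs: list of token lists; k: number of topics.
    Returns, per topic, (word, count) pairs sorted by descending count."""
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V = len(vocab)
    ndk = [[0] * k for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(k)]  # topic-word counts
    nk = [0] * k                                # topic totals
    z = []                                      # z[d][i]: topic of token i in doc d
    for d, doc in enumerate(docs):              # random initial assignment
        zd = []
        for w in doc:
            t = rng.randrange(k)
            zd.append(t); ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                      # remove token's current assignment
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # resample topic proportional to p(topic | word, rest)
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                           for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return [sorted(nkw[j].items(), key=lambda p: -p[1]) for j in range(k)]
```

With the real corpus, `k=4` and `iters=2000` would correspond to the four-topic solution reported in Table 1; the sorted `(word, count)` lists correspond to the ranked keyword columns.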
Comparing the total sample with the two subsamples, the keywords do not vary substantially. The CNN subsample closely resembles the four topics captured in the full sample, whereas the Fox News subsample does not present the topics as clearly. Specifically, the keywords LDA captured from the Fox News subsample do not align well with those from the total sample. For instance, Topic 1 in this subsample contains the keywords “Trump” and “Republican,” which fit the political theme. At the same time, it also contains terms such as “neo-Nazi,” “white,” “supremacist,” and “hatred,” signaling inconsistency for the purpose of topic interpretation. Still, for further comparison, I aligned the four topics in the Fox News subsample with those in the whole sample based on the most similar word usage under each topic (Fig. 3).
Additionally, the position of each keyword under each topic is meaningful. Under each topic, the keywords are displayed from the highest frequency to the lowest. In other words, the more times a keyword appears under a certain topic, the higher that keyword ranks in the list. For example, “Trump” is the most common word under Topic 1, indicating that among all the articles assigned to Topic 1, “Trump” is the most repeated term.
Pairing the topic assignments and the word positions together, the two subsamples reveal distinct wording preferences under the corresponding topics. The Topic 3 results suggest that both subsamples focus on the key term “war,” using words such as “symbol” and “violent” to describe elements of the event. Beyond these similarities, CNN and Fox News are distinguishable in that the CNN subsample tends to attribute the event to “culture,” whereas the Fox News subsample deems the event “controversial.” Additionally, the term “patriot” appears only in the Fox News subsample.
The topic models answer the first research question: both media outlets of interest, CNN and Fox News, adopted fairly similar frames in covering the Charlottesville Rally. The descriptive word frequencies and word cloud both provide evidence for Hypothesis 1, namely that the media outlets framed the Charlottesville Rally as a violent event led by white supremacists. The results, however, reject Hypothesis 2, which claims that CNN and Fox News framed the Charlottesville Rally differently by emphasizing different topics. Politics, white supremacy, racial conflict, and the car incident are the four most prominent topics in the online news articles published during the week after the event. While the Fox News subsample has a mix of words spread across topics, the general interpretation does not change dramatically. Not only did both media outlets focus on highly similar topics in reporting the event, but their wording preferences are also highly similar: most of the keywords appear in both outlets. However, each subsample contains some unique words. The CNN subsample highlights “moral” and “culture,” whereas Fox News emphasizes “patriot.” In addition, even though the four topics are highly similar, the ranking of the top keywords varies by media source. For instance, under Topic 3 (i.e., racial conflict), CNN mentions “black” the most, whereas Fox News focuses on “violence.” Such differences in ranking are observed for most keywords. These differences support Hypothesis 3, which claims that CNN and Fox News adopted different wording preferences.
Unlike the topic model results, the sentiment carried in the CNN and Fox News subsamples differs significantly. Besides the positive and negative sentiment scores from LSD, I also generated a logged sentiment ratio for each article [Logged Sentiment Ratio = Log(Positive sentiment/Negative sentiment)]. The logged sentiment ratio balances out articles with strong emotions on both the positive and negative sides, and the logarithm normalizes the distribution of the measure. Table 2 provides the descriptive statistics for each news outlet by sentiment. The CNN subsample shows relatively high positive (M = 28.28, SD = 19.27) and negative emotion scores (M = 49.14, SD = 31.55) compared to the Fox News subsample (Positive M = 16.21, Negative M = 31.26). Welch t-tests confirm statistically significant between-group differences between the two subsamples for positive sentiment [t(403.87) = 7.24, p < 0.001], negative sentiment [t(403.77) = 6.52, p < 0.001], and the logged sentiment ratio [t(311.86) = 2.86, p = 0.005] (Table 2).
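The logged sentiment ratio and the Welch test can be sketched with the standard library alone; in practice a statistics package would report the p-values as well. Note that Welch’s test, unlike the pooled t-test, does not assume equal variances across the two outlets, which is why the reported degrees of freedom are fractional.

```python
import math
from statistics import mean, variance

# Sketch of the two sentiment computations (illustrative, stdlib-only).


def log_sentiment_ratio(pos, neg):
    """Logged Sentiment Ratio = log(positive / negative); assumes both scores > 0."""
    return math.log(pos / neg)


def welch_t(a, b):
    """Welch's t-statistic and Welch–Satterthwaite degrees of freedom
    for two samples with possibly unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Applied to the per-article positive scores of the two subsamples, `welch_t` yields the t-statistic and the fractional degrees of freedom of the kind reported above, e.g., t(403.87).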
Combining the results from the topic models and the sentiment analysis, I further explored the relationship between the assigned topics and the sentiment scores. I hypothesized that the topic of an article influences the sentiments it expresses; in other words, reporters tend to show different attitudes when writing about different topics. I employed ANOVA to test for differences in positive sentiment, negative sentiment, and the logged sentiment ratio, treating topic (i.e., politics, white supremacy, racial conflict, and car incident) as a between-subjects variable.
The ANOVA results suggest statistically significant differences in the usage of positive tones across topics for the whole sample [F(3,400) = 7.37, p < 0.001], the CNN subsample [F(3,230) = 7.90, p < 0.001], and the Fox News subsample [F(3,166) = 9.65, p < 0.001]. Similarly, the results for the logged sentiment ratio are significant for the whole sample [F(3,398) = 8.93, p < 0.001], the CNN subsample [F(3,230) = 12.27, p < 0.001], and the Fox News subsample [F(3,164) = 11.30, p < 0.001]. Interestingly, only the Fox News subsample presents a significant difference across topics in negative sentiment [F(3,166) = 5.33, p = 0.002] (Table 3).
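The one-way ANOVA behind these F-statistics partitions the variance in sentiment scores into a between-topic and a within-topic component; a minimal stdlib sketch (the actual analysis was presumably run in standard statistical software):

```python
from statistics import mean

# One-way ANOVA F statistic (illustrative sketch, stdlib-only).


def one_way_anova(groups):
    """groups: one list of sentiment scores per topic.
    Returns (F, between-groups df, within-groups df)."""
    all_vals = [v for g in groups for v in g]
    grand = mean(all_vals)
    k, n = len(groups), len(all_vals)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_b, df_w = k - 1, n - k
    f = (ss_between / df_b) / (ss_within / df_w)
    return f, df_b, df_w
```

With four topic groups and 404 articles, the degrees of freedom work out to (k − 1, n − k) = (3, 400), matching the whole-sample result reported above.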
To further investigate which topics carry more sentiment, I estimated multiple Ordinary Least Squares (OLS) regression models using the topic assignments as predictors and controlling for the news outlet. The results confirm that CNN articles carry more sentiment than Fox News articles; this finding holds for positive sentiment, negative sentiment, and the logged sentiment ratio. The OLS results also suggest that, for both positive sentiment and the logged sentiment ratio, articles about racial conflict and politics carry more sentiment than the reference topic of the car incident after controlling for media source. More importantly, the white supremacy topic is not a statistically significant predictor of sentiment expression.
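The regression setup can be sketched as a design matrix with an intercept, dummy variables for three of the four topics (the car incident as the reference category), and an outlet dummy. A minimal stdlib OLS solver via the normal equations (illustrative only; the actual models were presumably fit in standard statistical software, which also reports standard errors and p-values):

```python
# Sketch of OLS via the normal equations (X'X) b = X'y, solved by Gaussian
# elimination; adequate for a handful of dummy-variable predictors.


def ols(X, y):
    """X: list of rows [1, topic dummies..., outlet dummy]; y: sentiment scores.
    Returns the coefficient vector b."""
    n, p = len(X), len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    for col in range(p):                      # forward elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, p):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, p):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    beta = [0.0] * p                          # back substitution
    for r in range(p - 1, -1, -1):
        beta[r] = (xty[r] - sum(xtx[r][c] * beta[c] for c in range(r + 1, p))) / xtx[r][r]
    return beta
```

Under this coding, each topic coefficient is the estimated difference in sentiment between that topic and the car incident topic, holding the outlet constant, which is exactly the comparison the results above describe.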
The sentiment analysis offers further evidence for the research questions. Combining the topic models and the sentiment analysis, the results support Hypothesis 4, which claimed that topic choices influenced the level of sentiment expressed in the news reports, although the effect did not apply equally to every topic (Table 4).