Although we are continuing to collect tweets to add to our data collection as we follow the transition to the Biden-Harris administration, we first present an analysis of tweets from our dataset from January 2020 through the end of December 2020. This enables us to examine political discourse on Twitter through the Presidential primaries, debates and election. Highly political divisions have emerged in COVID-19 discourse, alongside conspiracy theories and public health-related trends that have emerged due to COVID-19. Our recent work on this dataset has also shown that partisan trends drive the discourse on Twitter, with conservative users posting at much higher volumes than their liberal counterparts. Conservative users also tended to share more known conspiracy-related narratives. We have also observed that there are highly connected conservative users who are more prone to spreading public health and voting misinformation.
During the 2020 Presidential election, the incumbent, former President Trump, faced little difficulty in securing the Republican nomination.Footnote 21 Although Trump did face three Republican challengers (Mark Sanford, Joe Walsh and Bill Weld), he earned 2395 delegate votes, an overwhelming majority.Footnote 22
The Democratic primaries were more competitive, with a historic 28 candidates vying for the nomination.Footnote 23 However, as national poll results began to roll in and initial primary results were tallied, candidates began to drop out of the race (see Table 5 for the dates on which candidates from both parties suspended their campaigns). The advent of COVID-19 in the United States in March 2020, and the ensuing regulations to encourage social distancing, forced the remaining campaigns to shift to a virtual model. The race narrowed down to two candidates: Vermont senator Bernie Sanders and former Vice President Joe Biden. As more primaries took place and results were reported, it became clear that Biden would win the 1991 delegates needed to become the presumptive Democratic nominee.Footnote 24 Sanders conceded to Biden on April 8, 2020 and endorsed Biden.Footnote 25\(^,\)Footnote 26
Overview of presidential candidate Twitter discourse
Our dataset specifically tracked keywords and accounts related to the 2020 US Presidential election. As a result, we expect the captured discourse to reflect major events that took place throughout our collection period. We limit our analysis to tweets from our dataset that were collected from January 2020 through December 2020.
The fight for the Democratic Presidential Nomination
We first investigate the chatter surrounding the Democratic primaries, as the race to win the nomination was competitive and multiple candidates emerged as favorites. While Biden may have held an early lead, Sanders, Elizabeth Warren and Pete Buttigieg were also serious contenders.Footnote 27 In Fig. 1, we tracked mentions of the names and Twitter handles of each Democratic presidential candidate who was still campaigning in March 2020, and computed the 7-day rolling average of the daily percentage of all collected tweets that mentioned each candidate. This particular time series ends on May 8, 2020, which is one month after Sanders conceded to Biden, and Biden became the presumptive Democratic presidential candidate.
Throughout the Democratic primary timeline in Fig. 1, we can see that the attention that specific candidates attract on Twitter fluctuates greatly. We can clearly see that Sanders and Warren initially led most of the discourse on Twitter in January 2020, but that Sanders would eventually dominate Twitter chatter throughout most of the primaries. This dominance continues until February 25, 2020, when James Clyburn, a prominent South Carolina African American Representative, endorsed Biden. From there, we see a sharp increase in Biden mentions, and Biden quickly overtook Sanders not only in polls, but also in Twitter discourse.Footnote 28 Biden continued to hold a majority in Twitter mentions throughout the rest of the primaries, through Sanders’ concession on April 8, 2020. All other candidates saw a brief increase in mention percentage when they announced that they had suspended their presidential campaigns, followed by a general decrease.
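The smoothing step behind such a time series can be sketched as follows. This is a minimal illustration rather than the exact procedure used for Fig. 1: it assumes that daily counts are already available, that the window is trailing (the last seven available days), and that the average is taken over daily percentages. The function names `daily_mention_pct` and `rolling_mean` and the sample counts are hypothetical.

```python
from collections import deque
from datetime import date, timedelta


def daily_mention_pct(mention_counts, total_counts):
    """Percentage of each day's collected tweets that mention a candidate."""
    return {
        day: 100.0 * mention_counts.get(day, 0) / total
        for day, total in total_counts.items()
        if total > 0  # skip days with no collected tweets (collection gaps)
    }


def rolling_mean(series, window=7):
    """Trailing mean over the last `window` available days (assumed window type)."""
    smoothed = {}
    recent = deque(maxlen=window)  # old values fall out automatically
    for day in sorted(series):
        recent.append(series[day])
        smoothed[day] = sum(recent) / len(recent)
    return smoothed


# Made-up counts for illustration: 1000 tweets per day, mentions growing by 10 per day.
start = date(2020, 3, 1)
total = {start + timedelta(days=i): 1000 for i in range(10)}
mentions = {start + timedelta(days=i): 100 + 10 * i for i in range(10)}
smoothed = rolling_mean(daily_mention_pct(mentions, total))
```

Note that the first six days are averaged over fewer than seven values in this sketch; a stricter variant could leave them undefined instead.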
While most of the mention percentages generally followed the popularity of certain candidates, in particular Biden, Sanders, Warren and Buttigieg, we find an increase in mentions surrounding Michael Bloomberg during the 9th Democratic debate.Footnote 29 The 9th Democratic debate was the first debate that Bloomberg was able to qualify for, but his performance was widely criticized.Footnote 30 He also attracted social media attention after having heavily funded his campaign’s ads with his personal money.Footnote 31
Chatter during the Presidential elections: Biden versus Trump
We now turn to the final race in the 2020 U.S. Presidential election between Biden and Trump. As shown in Fig. 2, the percentage of all tweets that mention Trump is significantly greater than the percentage of tweets that mention Biden (see Table 6 for keywords associated with each candidate). This gap in mentions is not unexpected, as Trump was the incumbent President and thus already had a significant presence on Twitter. While our current analysis is based on the percentage of mentions in the tweets collected, our prior work in clustering users by political affiliation based on shared media found that conservative users have a more vocal presence on the political Twitter scene. Despite Trump’s general dominance in the chatter, we see that as major events occur, such as when Democratic primaries began to be called for Biden and during the Presidential debates, Biden began to see an increase in mentions. While a tweet may be counted as mentioning both Trump and Biden, we still see a corresponding decrease in the percentage of Trump’s mentions when Biden’s mentions increase. This suggests that the discourse shifted away from Trump and towards Biden, particularly as election day neared, culminating in a similar percentage of tweets mentioning Biden and/or Trump.
The tweets collected in our dataset appear to track real-world events well. However, the sheer percentage of our collected tweets that mention a particular candidate does not necessarily represent the sentiment towards and popularity of those candidates at the time. As Twitter has evolved as a platform, its user base has also changed. This disparity between Twitter attention and real-world popularity was highlighted during the Democratic primaries. Sanders held the largest share of tweet mentions from early January through the end of February. It was not until the initial primary results began to be tallied and reported that it became clear that Biden had actually won the Democrats’ vote.Footnote 32 Sanders’ dominance in Twitter discourse underscored how Biden’s eventual momentum took much of the Democratic party by surprise.Footnote 33 This can give us insight into how news and public discourse on social media platforms can misrepresent or give a false impression of the nation’s sentiment.
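To make the counting rule concrete, the following minimal sketch tags each tweet with every candidate it mentions, so a single tweet can count towards both Trump and Biden and the percentages need not sum to 100. The keyword lists here are illustrative placeholders, not the lists in Table 6, and the helper names are hypothetical.

```python
import re

# Placeholder keywords for illustration only; the keywords actually used are in Table 6.
CANDIDATE_KEYWORDS = {
    "Trump": ["trump", "@realdonaldtrump"],
    "Biden": ["biden", "@joebiden"],
}


def compile_patterns(keywords):
    """Build one case-insensitive pattern per candidate from its keyword list."""
    return {
        name: re.compile("|".join(re.escape(k) for k in kws), re.IGNORECASE)
        for name, kws in keywords.items()
    }


def mentioned_candidates(text, patterns):
    """Return the set of candidates whose keywords appear anywhere in the text."""
    return {name for name, pattern in patterns.items() if pattern.search(text)}


def mention_percentages(tweets, patterns):
    """Percentage of all tweets mentioning each candidate (tweets may count twice)."""
    counts = {name: 0 for name in patterns}
    for text in tweets:
        for name in mentioned_candidates(text, patterns):
            counts[name] += 1
    n = len(tweets)
    return {name: (100.0 * c / n if n else 0.0) for name, c in counts.items()}


patterns = compile_patterns(CANDIDATE_KEYWORDS)
sample_tweets = [
    "Trump and Biden debate tonight",  # counts for both candidates
    "Watching @JoeBiden speak",
    "Trump rally",
    "Election day is here",            # counts for neither
]
percentages = mention_percentages(sample_tweets, patterns)
```

Plain substring matching is a simplification: it would also match unrelated words such as "trumpet", so a production matcher would likely need word boundaries or tokenization.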
Twitter Location Engagement
Every tweet we collect is returned with metadata describing the tweet itself, including Twitter’s automatic language tag and post date. Each tweet also includes information about the author, and if the tweet was a response (reply, retweet or quote) to another tweet, the tweet’s metadata also contains information on the original poster. This metadata can sometimes include a user’s location data; however, we found that less than 1% of our tweets actually contained this information. Because of this, we leverage the included “location” field that a user manually populates as a part of their profile. We tag each tweet with its country of origin and, if the tweet originates from the United States, the detected state. While some users may list locations that are not accurate, do not exist or cannot be identified by our algorithm, we leverage this field as a proxy for tweet location.
We examine the domestic geographical flow of information within the United States. By isolating only retweets and quoted tweets (retweets with a comment), we find tweets that directly represent one user re-posting the tweet of another. Retweets and quoted tweets also return the user-specified location data for both the user who retweeted or quoted the tweet and the original poster. The user who retweeted or quoted the tweet will be referred to as the retweeter for clarity. We then retain all tweets within our dataset for which we are able to identify a state for both the retweeter and the original poster, which directly implies that both the retweeter and original poster are also located in the United States. Figure 3 illustrates the flow of the top 200 most frequent state-to-state engagements, with the flow following retweets and quoted tweets from the original poster’s state to the retweeter’s state.
The states from which the most tweets originate generally coincide with the most populous states in the United States. The US Census Bureau lists California, Texas, Florida and New York as the most populous states in their 2019 estimate.Footnote 34 However, most tweets actually originate from the District of Columbia area, which is both the political center and the capital of the United States. This is consistent with the nature of the political landscape, as many politicians are located in the D.C. area. In general, Fig. 3 suggests that while there exists a substantial amount of intra-state tweet engagement, states with larger populations account for larger proportions of the measured intra-state engagement activity.
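As a rough illustration of this kind of tagging, the sketch below maps a free-text profile location to a US state code. It is a hypothetical stand-in, not our location-detection algorithm: the lookup table is deliberately tiny, and the two matching rules (a full state name anywhere in the string, or a trailing "City, ST" abbreviation) are assumptions made for the example.

```python
import re

# Deliberately small lookup table for illustration; a real matcher would cover
# every state and territory, plus common city names and nicknames.
STATE_NAMES = {
    "california": "CA",
    "texas": "TX",
    "florida": "FL",
    "new york": "NY",
    "district of columbia": "DC",
}
STATE_ABBREVS = set(STATE_NAMES.values())


def detect_state(location):
    """Return a two-letter state code for a profile location, or None if unknown."""
    if not location:
        return None
    lowered = location.lower()
    # Rule 1 (assumed): a full state name appears as a whole word or phrase.
    for name, abbrev in STATE_NAMES.items():
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            return abbrev
    # Rule 2 (assumed): the string ends in ", ST" with an uppercase state code.
    match = re.search(r",\s*([A-Z]{2})\s*$", location.strip())
    if match and match.group(1) in STATE_ABBREVS:
        return match.group(1)
    return None


examples = {
    loc: detect_state(loc)
    for loc in ["Los Angeles, CA", "Austin, Texas", "Washington, DC", "somewhere over the rainbow"]
}
```

Locations that match neither rule return `None`, which mirrors the cases noted above where a listed location does not exist or cannot be identified.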
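The aggregation described above can be sketched as a count over (original poster state, retweeter state) pairs. This is a minimal sketch under assumed inputs: the record fields `original_state` and `retweeter_state` are hypothetical names for the state tags already attached to each retweet or quoted tweet.

```python
from collections import Counter


def state_flows(retweets, top_k=200):
    """Count state-to-state engagements and return the top_k most frequent pairs.

    Each record is a dict with hypothetical keys "original_state" and
    "retweeter_state"; records missing either state are dropped.
    """
    flows = Counter()
    for record in retweets:
        source = record.get("original_state")
        target = record.get("retweeter_state")
        if source and target:  # keep only pairs where both states were identified
            flows[(source, target)] += 1
    return flows.most_common(top_k)


# Made-up records for illustration.
sample_retweets = [
    {"original_state": "DC", "retweeter_state": "CA"},
    {"original_state": "DC", "retweeter_state": "CA"},
    {"original_state": "CA", "retweeter_state": "CA"},  # intra-state engagement
    {"original_state": "NY", "retweeter_state": None},  # dropped: no retweeter state
]
top_flows = state_flows(sample_retweets, top_k=2)
```

Each returned pair is directed from the original poster’s state to the retweeter’s state, matching the direction of flow described for Fig. 3.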
While this dataset gives us a glimpse of the political chatter on Twitter, there are still limitations to this dataset that warrant discussion. Due to the nature of the keywords we were tracking, the tweets in our dataset are highly skewed towards English and tweets that originate from the United States. Another limitation of the dataset is that the users on Twitter do not necessarily represent the collective sentiment of the United States. The audience that uses Twitter, according to a 2019 study conducted by Pew Research Center, skews younger and more Democratic than the general population; the most vocal on Twitter also tend to engage in political discourse.Footnote 35
Twitter also significantly rate limits the number of tweets that one can rehydrate, and tweets that have either been removed by the user or removed because a user was banned or suspended can no longer be retrieved through Twitter’s API. Our collection was also highly contingent upon the stability of our network and hardware, which means that there may be gaps in our data collection, particularly prior to our migration to AWS. Twitter has recently released an Academic Research track that enables researchers and academics to access the full-archive search; however, this still imposes rate limits that unfortunately make it difficult to fill these gaps in a timely manner.Footnote 36
Potential research avenues
There are many potential areas that can be explored using our dataset.
Recent work using our dataset has already begun to explore the prevalence of bots and misinformation within the 2020 political landscape [6, 7]. Luceri et al. also scrutinized bot engagement in political discourse in 2018 and found that many of these bots remained active during the 2020 election cycle. Our previous work has found that, out of all major conspiracy theories that had taken root during the election, QAnon supporters were the most vocal and active. We also found that, when grouping users by their political affiliation, tweets from accounts most likely to be bots outnumber tweets from accounts that are most likely human for both the Republican and Democratic parties. Conservative accounts that are the most likely to be bots also have higher bot scores, suggesting that these accounts are more likely to be automated than their left-leaning counterparts. We used Indiana University’s Botometer, a tool that assigns a bot score to a Twitter account based on the account’s activity [14, 15]. Others have also leveraged the polarized nature of the 2020 elections to model and estimate echo chambers based on a user’s political stance.
While this is just a sampling of current literature, many other areas are also being explored, including the presence, effect and detection of trolls and foreign influence during the elections. Many promising new questions are also emerging in the wake of the elections, particularly as the COVID-19 pandemic has forced individuals to physically distance and, consequently, seek community online.
After aggressive action to mitigate misinformation and the incitement of violence on major social network platforms, many flocked to alternative social network platforms that have espoused their support for freedom of speech, such as Parler and Gab.Footnote 37 While there has been much prior work in leveraging these alternative right-wing platforms to understand fringe views in conjunction with more mainstream platforms [16,17,18], the recent high-profile suspensions of major political figures’ accounts led to increased public awareness of, and an exodus to, these platforms. Before Parler went offline, researchers even scraped post data.Footnote 38 Data collected across multiple platforms have the potential to give insight into how fringe communities not only survive these rebuffs by the community but also thrive in the controversy.
Another interesting question that arises is how the pandemic and the resulting shift to online platforms changed the nature and effectiveness of political campaigns. While some politicians quickly cancelled in-person events as the severity of COVID-19 rose, others chose to continue in-person rallies.Footnote 39\(^,\)Footnote 40 Social media became an integral part of the campaign process, more so than before, as events such as the Democratic National Convention were held virtually.Footnote 41 Cross-platform studies will be essential in beginning to understand the full scope of how and to what extent COVID-19 has fundamentally altered our elections system.