Estimating time-series changes in social sentiment @Twitter in U.S. metropolises during the COVID-19 pandemic

Saito, Ryuichi; Haruyama, Shinichiro

doi:10.1007/s42001-022-00186-4

Estimating time-series changes in social sentiment @Twitter in U.S. metropolises during the COVID-19 pandemic

Research Article
Open access
Published: 12 November 2022

Volume 6, pages 359–388, (2023)
Cite this article

Download PDF

You have full access to this open access article

Journal of Computational Social Science Aims and scope Submit manuscript

Estimating time-series changes in social sentiment @Twitter in U.S. metropolises during the COVID-19 pandemic

Download PDF

2976 Accesses
4 Citations
2 Altmetric
Explore all metrics

Abstract

Since early 2020, the global coronavirus pandemic has strained economic activities and traditional lifestyles. For such emergencies, our paper proposes a social sentiment estimation model that changes in response to infection conditions and state government orders. By designing mediation keywords that do not directly evoke coronavirus, it is possible to observe sentiment waveforms that vary as confirmed cases increase or decrease and as behavioral restrictions are ordered or lifted over a long period. The model demonstrates guaranteed performance with transformer-based neural network models and has been validated in New York City, Los Angeles, and Chicago, given that coronavirus infections explode in overcrowded cities. The time-series of the extracted social sentiment reflected the infection conditions of each city during the 2-year period from pre-pandemic to the new normal and shows a concurrency of waveforms common to the three cities. The methods of this paper could be applied not only to analysis of the COVID-19 pandemic but also to analyses of a wide range of emergencies and they could be a policy support tool that complements traditional surveys in the future.

Social media analytics: a survey of techniques, tools and platforms

Article Open access 26 July 2014

Mental Health Analysis in Social Media Posts: A Survey

Article 03 January 2023

Overview of the Twitter conversation around #14F 2021 Catalonia regional election: an analysis of echo chambers and presence of social bots

Article Open access 27 April 2024

Introduction

The SARS-CoV-2 coronavirus that first surfaced in Wuhan, China, in December 2019 spread globally and developed into a pandemic in 2020. As of June 26, 2022, two-and-a-half years after the outbreak, the cumulative number of cases worldwide is over 500 million, and the cumulative death toll is over 6 million. Over 5 billion people have been vaccinated with at least one dose according to the World Health Organization (WHO) [1], and vaccination is still a global primary agenda. During this time, the United States has experienced four or five waves of infection, behavioral restrictions, such as declarations of states of emergency in each state, a presidential election during the pandemic, and a national vaccination program [2]. In such emergencies, there is a need for a method that allows policymakers and public health professionals to quickly and accurately capture changes in citizens’ perceptions. For instance, if relaxing restrictions makes citizens feel more positive than policymakers expected, it may lead to the re-emergence of infections. By capturing the time-series of such perceptions, we can have a bird’s-eye view of social phenomena during the pandemic.

We propose a social sentiment estimation model for use in emergencies that is based on Twitter users located in U.S. metropolises during the pandemic. Many studies [3,4,5,6,7] have already attempted to estimate social sentiment in the COVID-19 pandemic, but they have the following limitations: (1) A periodic sentiment waveform that can change with the number of cases and behavioral restrictions has not been captured; (2) there have been no long-term trend analyses measuring from the pre-pandemic period to during the pandemic and then on to the new-normal period from a macro-perspective; and (3) no research has focused on large cities based on the characteristics of the coronavirus. Previous studies have evaluated text on social and news media that match keywords that directly remind us of viruses, such as “coronavirus” and “COVID-19,” so they cannot extract periodic changes in social sentiment. This is because those keywords are often used in limited contexts and emotional expressions.

In addition, according to the New York Times [8], the infection situation in the United States differs between metropolitan and rural areas, and it has been shown that since the late summer of 2020, per capita case and death rates in rural areas have outpaced those in metropolitan areas around the United States [8]. In addition, even in metropolitan areas, Rader et al. [9] have shown that the peak of the epidemic was more extreme in overcrowded cities than in less-populated cities. Therefore, when estimating social sentiment for coronavirus, it is necessary to separate metropolitan areas from rural areas and take into account cities’ sizes and characteristics. However, many previous studies have limited observational data at the linguistic, national, and state levels. To solve these problems, this research attempts the following approaches:

1.
Design mediation keywords inspired by activities of citizens limited by government-issued behavioral restrictions.
2.
Tweets collected based on the location information of New York City, Los Angeles, and Chicago are used as observation data from just before the pandemic to the new-normal period.
3.
Estimation performance is guaranteed through transformer-based neural network techniques, such as Bidirectional Encoder Representations from Transformers (BERT) and third-generation Generative Pre-trained Transformer (GPT-3).

The time-series of the extracted social sentiment was verified by the correlation coefficient with the number of confined cases, and the feature words extracted using term frequency-inverse document frequency (TF-IDF) supported the social sentiment waveform.

One limitation to note in this study is the demographic bias of Twitter users in the United States. On Twitter, it has been found that frequent users between the ages of 18 and 49 years account for 73 % of adult users as of 2021, which diverges from the demographics of the United States [10].

The contributions of our paper are as follows:

Proposal of a social sentiment time-series estimation model using mediation keywords that can be used during periods of emergency.
Long-term trend analysis of U.S. metropolises, such as New York City, Los Angeles, and Chicago, and the extraction of parallel trends of social sentiment waveforms common to all three cities.
Methodological improvements in a social sentiment estimation model using GPT-3.

The approach of this research, including keyword design, could be applied not only to the COVID-19 pandemic but also to other emergencies where citizens’ activities are restricted. In addition, by deploying and operating the model of this research on a data-streaming platform, it is possible to capture the time-series data of social sentiment in real-time emergencies.

Literature review

Coronavirus and natural language processing

Since January 2020 and the global spread of the coronavirus, many attempts have been made to use natural language processing methods to extract social insights from text information exchanged through the internet. First, Kruspe et al. [3] studied social sentiment during the pandemic using the neural network method. Kruspe et al. extracted social sentiment from Twitter in European countries, such as Italy, France, and Spain, during the initial months of the pandemic using a Multilingual Universal Sentence Encoder [11]. Caliskan [12] selected Ohio as a state with less ideological bias in the United States and multilaterally estimated tweets’ emotions using the GloVe [13] and Bidirectional Recurrent Neural Network (RNN) models. Chakraborty et al. [12] primarily indexed sentiment for a news-article dataset of the Global Database of Events, Language, and Tone (GDELT) project [14] using the AFINN Sentiment Lexicon and examined the relationship between the number of cases and deaths in China, the United States, Italy, and India. Saleh et al. [6] estimated the sentiment of tweets matching #socialdistancing and #stayathome sent between March 27 and April 10, 2020, using the AFINN Sentiment Lexicon, and then attempted to cluster topics through Latent Dirichlet Allocation (LDA). Abd-Alrazaq et al. [5] classified English tweets that match keywords, such as “corona” and “COVID-19”, into 12 topics primarily using LDA and scored sentiment by topic using the Python library TextBlob. Ridhwan et al. [7] evaluated sentiment on Twitter during the pandemic period of February through August 2020 in Singapore using both neural network-based RNN [15] and lexicon-based Valence Aware Dictionary and sEntiment Reasoner (VADER) [16] approaches. Moreover, Hussain et al. [17] visualized changes in citizens’ susceptibility to vaccines on Facebook and Twitter from March through November 2020 in the United Kingdom and the United States using VADER and BERT [18].

Our research takes a different approach from the above-mentioned methods. Previous studies have inferred the sentiment of texts from social networking services and news media that match keywords relevant to coronavirus and behavioral restrictions. However, these keywords are often only used in a limited context, so it can be difficult to capture periodic waves in tandem with increases or decreases in case numbers and the addition or relaxing of behavioral restrictions. In this study, we try to solve the above limitations by focusing on the sentiment of citizens limited by the behavioral restrictions issued by the state government.

Transformer-based neural network model

Next, we give an overview of the transformer-based neural network model on which this study relies. Transformer [19] is a model that solves the difficulty of parallelizing the training of RNN models [15, 20, 21] based on a two-part network of encoders and decoders [22] to handle tasks with different input and output lengths, such as machine translation and chatbots. The transformer uses an attention mechanism instead of a lengthy recursive network. The attention mechanism is inspired by the human eye, and can learn the relationship between distant tokens and between sentences by investigating the similarities between word vectors.

The BERT [18] language model uses Transformer’s encoder and has demonstrated state-of-the-art performance in language-understanding evaluation. BERT trains in two phases: pre-training and fine-tuning. In the pre-training phase, the attention mechanism trains a huge dataset to construct a general-purpose model, and in the fine-tuning phase, the model is adjusted according to the actual application. The pre-training phase uses two steps, Masked Language Modeling (MLN) and Next Sentence Prediction (NSP), to train sentences bidirectionally using the attention mechanism. In the fine-tuning phase, the parameters obtained by the pre-training phase are used as the initial values of the weights, and the training is specialized for the question.

GPT-3 [23] is a language model that uses Transformer’s decoder, and it was developed to support general-purpose tasks with only pre-training operation by 175 billion parameters. The model architecture of GPT-3 inherits GPT-2 [24], which was based on GPT [25]. GPT-3 achieves higher accuracy than GPT-2 by training with a larger dataset extracted from Common Crawl and Web-Text2. It has been confirmed that GPT-3 accomplishes high accuracy without fine-tuning, but in this study, we tried fine-tuning to realize even higher accuracy. In this research, we support the reliability of the estimation results using the above transformer-based neural network methods.

Methods

Initially, to extract the transition of social sentiment from the pre-pandemic period to the new-normal period, tweets inspired by citizens’ activities limited by restrictions in New York City, Los Angeles, and Chicago were retrieved from December 30, 2019, to January 2, 2022. The retrieved tweets were classified into sentiment using a neural network model that was fine-tuned on the Twitter dataset and indexed numerically. The indexed sentiment was validated for correlation with the number of confirmed cases, and then, feature words were identified using the TF-IDF to confirm the trend of tweets classified into sentiments.

Data collection

Tweets were collected using the Twitter application programming interface (API) and aggregated by type of behavioral restriction.

City and timeframe

Coronavirus infections in the United States have grown at different speeds in metropolitan and rural areas depending on the time of year [8], and it has been confirmed that infections tend to explode in overcrowded cities rather than in less-populated cities [9]. New York City, Los Angeles, and Chicago were selected as observation targets for this research based on their respective populations and the number of tweets sent in those cities. In the U.S., New York City, Los Angeles, and Chicago are the most congested cities in terms of population according to U.S. Census Data [26] and in terms of the number of tweets for each city according to Förster et al. [27].

In the actual search, the Full-archive Search API of Twitter API v2 was used to collect tweets posted within a 25-mile radius of each city’s city hall. The 25-mile radius setting was based on the Full-archive Search API limit, but we consider this reasonable for collecting tweets from the center of these large cities.

Our search period was the 2-year period from December 30, 2019, to January 2, 2022, capturing sentiment from before the coronavirus pandemic to the new normal following repeated outbreaks and behavioral restrictions. In addition, tweets were aggregated weekly to offset the weekend effect.

Keywords

Previous studies [3,4,5,6,7] used keywords that directly relate to coronavirus or behavioral restrictions to estimate citizens’ sentiment during the pandemic; however, these methods have the following two limitations.

1.
It is not possible to compare the pre-pandemic period with the pandemic period, because these keywords were either not recognized by the public before the pandemic or were used in other ways.
2.
These keywords are often used in a negative context and cannot be compared to the infection-spread period and convergence period in tandem with infection status.

Figures 1 and 2 show the results of estimating the time-series of sentiment in New York City using keywords relevant to the coronavirus and behavioral restrictions. The keywords used are shown in Table 1. These keywords were designed based on previous studies, Centers for Disease Control and Prevention (CDC) usages [28], and similarities between words using Word2Vec [29, 30]. In addition, the BERT model described in Section (1) later was used for sentiment estimation.^{Footnote 1} The higher the value of the sentiment index, the more negatively it is interpreted and vice versa.

In Fig. 1, the sentiment index value ranges between −0.05 to 0.1 during the pre-pandemic period when the coronavirus was not well recognized by the public, but it drops to less than −0.05 during the first outbreak in April 2020. In addition, the sentiment index ranges between 0.05 and 0.2 after April 2021, which is generally considered to be the time when behavioral restrictions were lifted and citizens looked ahead to the new normal. Figure 2 confirms the same trend as Fig. 1. From the above, these keywords are not appropriate for capturing social sentiment, such as fear and anxiety about the spread of infection or a sense of security about the end of infection.

Table 1 Keywords Related to Coronavirus and Behavioral Restrictions

Estimating time-series changes in social sentiment @Twitter in U.S. metropolises during the COVID-19 pandemic

Abstract

Similar content being viewed by others

Social media analytics: a survey of techniques, tools and platforms

Mental Health Analysis in Social Media Posts: A Survey

Overview of the Twitter conversation around #14F 2021 Catalonia regional election: an analysis of echo chambers and presence of social bots

Introduction

Literature review

Coronavirus and natural language processing

Transformer-based neural network model

Methods

Data collection

City and timeframe

Keywords

Collection result

Training inference

Neural network model and fine-tuning

Accuracy of models

Methods of indexing

Corroboration of index

Results

New York City

Los Angeles

Chicago

Discussion

Limitations

Conclusion

Data availability statement

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendix

Appendix

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation