Estimating time-series changes in social sentiment @Twitter in U.S. metropolises during the COVID-19 pandemic

Since early 2020, the global coronavirus pandemic has strained economic activities and traditional lifestyles. For such emergencies, our paper proposes a social sentiment estimation model that changes in response to infection conditions and state government orders. By designing mediation keywords that do not directly evoke coronavirus, it is possible to observe sentiment waveforms that vary as confirmed cases increase or decrease and as behavioral restrictions are ordered or lifted over a long period. The model demonstrates guaranteed performance with transformer-based neural network models and has been validated in New York City, Los Angeles, and Chicago, given that coronavirus infections explode in overcrowded cities. The time-series of the extracted social sentiment reflected the infection conditions of each city during the 2-year period from pre-pandemic to the new normal and shows a concurrency of waveforms common to the three cities. The methods of this paper could be applied not only to analysis of the COVID-19 pandemic but also to analyses of a wide range of emergencies and they could be a policy support tool that complements traditional surveys in the future.


Introduction
The SARS-CoV-2 coronavirus that first surfaced in Wuhan, China, in December 2019 spread globally and developed into a pandemic in 2020. As of June 26, 2022, two-and-a-half years after the outbreak, the cumulative number of cases worldwide is over 500 million, and the cumulative death toll is over 6 million. Over 5 billion people have been vaccinated with at least one dose according to the World Health Organization (WHO) [1], and vaccination is still a primary global agenda. During this time, the United States has experienced four or five waves of infection, behavioral restrictions, such as declarations of states of emergency in each state, a presidential election during the pandemic, and a national vaccination program [2]. In such emergencies, there is a need for a method that allows policymakers and public health professionals to quickly and accurately capture changes in citizens' perceptions. For instance, if relaxing restrictions makes citizens feel more positive than policymakers expected, it may lead to the re-emergence of infections. By capturing the time-series of such perceptions, we can have a bird's-eye view of social phenomena during the pandemic.
We propose a social sentiment estimation model for use in emergencies that is based on Twitter users located in U.S. metropolises during the pandemic. Many studies [3][4][5][6][7] have already attempted to estimate social sentiment in the COVID-19 pandemic, but they have the following limitations: (1) a periodic sentiment waveform that can change with the number of cases and behavioral restrictions has not been captured; (2) there have been no long-term trend analyses measuring from the pre-pandemic period to during the pandemic and then on to the new-normal period from a macro-perspective; and (3) no research has focused on large cities based on the characteristics of the coronavirus. Previous studies have evaluated text on social and news media that matches keywords that directly remind us of viruses, such as "coronavirus" and "COVID-19," so they cannot extract periodic changes in social sentiment. This is because those keywords are often used in limited contexts and emotional expressions.
In addition, according to the New York Times [8], the infection situation in the United States differs between metropolitan and rural areas, and it has been shown that since the late summer of 2020, per capita case and death rates in rural areas have outpaced those in metropolitan areas around the United States [8]. In addition, even in metropolitan areas, Rader et al. [9] have shown that the peak of the epidemic was more extreme in overcrowded cities than in less-populated cities. Therefore, when estimating social sentiment for coronavirus, it is necessary to separate metropolitan areas from rural areas and take into account cities' sizes and characteristics. However, many previous studies have limited observational data at the linguistic, national, and state levels. To solve these problems, this research attempts the following approaches:
1. Design mediation keywords inspired by activities of citizens limited by government-issued behavioral restrictions.
2. Tweets collected based on the location information of New York City, Los Angeles, and Chicago are used as observation data from just before the pandemic to the new-normal period.
3. Estimation performance is guaranteed through transformer-based neural network techniques, such as Bidirectional Encoder Representations from Transformers (BERT) and third-generation Generative Pre-trained Transformer (GPT-3).
The time-series of the extracted social sentiment was verified by the correlation coefficient with the number of confirmed cases, and the feature words extracted using term frequency-inverse document frequency (TF-IDF) supported the social sentiment waveform. One limitation to note in this study is the demographic bias of Twitter users in the United States. On Twitter, it has been found that frequent users between the ages of 18 and 49 years account for 73% of adult users as of 2021, which diverges from the demographics of the United States [10].
The contributions of our paper are as follows:
• Proposal of a social sentiment time-series estimation model using mediation keywords that can be used during periods of emergency.
• Long-term trend analysis of U.S. metropolises, such as New York City, Los Angeles, and Chicago, and the extraction of parallel trends of social sentiment waveforms common to all three cities.
• Methodological improvements in a social sentiment estimation model using GPT-3.
The approach of this research, including keyword design, could be applied not only to the COVID-19 pandemic but also to other emergencies where citizens' activities are restricted. In addition, by deploying and operating the model of this research on a data-streaming platform, it is possible to capture the time-series data of social sentiment in real-time emergencies.

Coronavirus and natural language processing
Since January 2020 and the global spread of the coronavirus, many attempts have been made to use natural language processing methods to extract social insights from text information exchanged through the internet. First, Kruspe et al. [3] studied social sentiment during the pandemic using a neural network method. Kruspe et al. extracted social sentiment from Twitter in European countries, such as Italy, France, and Spain, during the initial months of the pandemic using a Multilingual Universal Sentence Encoder [11]. Caliskan [4] selected Ohio as a state with less ideological bias in the United States and multilaterally estimated tweets' emotions using the GloVe [13] and Bidirectional Recurrent Neural Network (RNN) models.
Chakraborty et al. [12] primarily indexed sentiment for a news-article dataset of the Global Database of Events, Language, and Tone (GDELT) project [14] using the AFINN Sentiment Lexicon and examined the relationship between the number of cases and deaths in China, the United States, Italy, and India. Saleh et al. [6] estimated the sentiment of tweets matching #socialdistancing and #stayathome sent between March 27 and April 10, 2020, using the AFINN Sentiment Lexicon, and then attempted to cluster topics through Latent Dirichlet Allocation (LDA). Abd-Alrazaq et al. [5] classified English tweets that match keywords, such as "corona" and "COVID-19", into 12 topics primarily using LDA and scored sentiment by topic using the Python library TextBlob. Ridhwan et al. [7] evaluated sentiment on Twitter during the pandemic period of February through August 2020 in Singapore using both neural network-based RNN [15] and lexicon-based Valence Aware Dictionary and sEntiment Reasoner (VADER) [16] approaches. Moreover, Hussain et al. [17] visualized changes in citizens' susceptibility to vaccines on Facebook and Twitter from March through November 2020 in the United Kingdom and the United States using VADER and BERT [18]. Our research takes a different approach from the above-mentioned methods. Previous studies have inferred the sentiment of texts from social networking services and news media that match keywords relevant to coronavirus and behavioral restrictions. However, these keywords are often only used in a limited context, so it can be difficult to capture periodic waves in tandem with increases or decreases in case numbers and the addition or relaxing of behavioral restrictions. In this study, we try to solve the above limitations by focusing on the sentiment of citizens limited by the behavioral restrictions issued by the state government.

Transformer-based neural network model
Next, we give an overview of the transformer-based neural network model on which this study relies. Transformer [19] is a model that solves the difficulty of parallelizing the training of RNN models [15,20,21] based on a two-part network of encoders and decoders [22] to handle tasks with different input and output lengths, such as machine translation and chatbots. The transformer uses an attention mechanism instead of a lengthy recursive network. The attention mechanism is inspired by the human eye, and can learn the relationship between distant tokens and between sentences by investigating the similarities between word vectors.
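The similarity-based weighting described above can be illustrated with a minimal sketch of scaled dot-product self-attention. This is a simplification under stated assumptions: the learned query, key, and value projections and the multi-head structure of the Transformer [19] are omitted, and the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw similarity scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Each token vector serves as its own query, key, and value
    (an assumption made only to keep the sketch short).
    Returns (outputs, weights); weights[i][j] is how strongly
    token i attends to token j.
    """
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * v[j] for w, v in zip(row, vectors)) for j in range(dim)]
        )
    return outputs, weights


# Toy word vectors: the first two are similar, the third is not.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
```

In this toy input, the first token assigns more weight to the second (similar) token than to the third, regardless of their distance in the sequence.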
The BERT [18] language model uses Transformer's encoder and has demonstrated state-of-the-art performance in language-understanding evaluation. BERT trains in two phases: pre-training and fine-tuning. In the pre-training phase, the model is trained on a huge dataset to construct a general-purpose model, and in the fine-tuning phase, the model is adjusted according to the actual application. The pre-training phase uses two tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), to learn from sentences bidirectionally using the attention mechanism. In the fine-tuning phase, the parameters obtained in the pre-training phase are used as the initial values of the weights, and the training is specialized for the target task.
GPT-3 [23] is a language model that uses Transformer's decoder, and it was developed to support general-purpose tasks through pre-training alone, with 175 billion parameters. The model architecture of GPT-3 inherits that of GPT-2 [24], which was based on GPT [25]. GPT-3 achieves higher accuracy than GPT-2 by training with a larger dataset extracted from Common Crawl and WebText2. It has been confirmed that GPT-3 achieves high accuracy without fine-tuning, but in this study, we tried fine-tuning to realize even higher accuracy. In this research, we support the reliability of the estimation results using the above transformer-based neural network methods.

Methods
Initially, to extract the transition of social sentiment from the pre-pandemic period to the new-normal period, tweets inspired by citizens' activities limited by restrictions in New York City, Los Angeles, and Chicago were retrieved from December 30, 2019, to January 2, 2022. The retrieved tweets were classified into sentiment using a neural network model that was fine-tuned on the Twitter dataset and indexed numerically. The indexed sentiment was validated for correlation with the number of confirmed cases, and then, feature words were identified using the TF-IDF to confirm the trend of tweets classified into sentiments.

Data collection
Tweets were collected using the Twitter application programming interface (API) and aggregated by type of behavioral restriction.

City and timeframe
Coronavirus infections in the United States have grown at different speeds in metropolitan and rural areas depending on the time of year [8], and it has been confirmed that infections tend to explode in overcrowded cities rather than in less-populated cities [9]. New York City, Los Angeles, and Chicago were selected as observation targets for this research based on their respective populations and the number of tweets sent in those cities. In the U.S., New York City, Los Angeles, and Chicago rank highest in terms of population according to U.S. Census Data [26] and in terms of the number of tweets for each city according to Förster et al. [27].
In the actual search, the Full-archive Search API of Twitter API v2 was used to collect tweets posted within a 25-mile radius of each city's city hall. The 25-mile radius setting was based on the Full-archive Search API limit, but we consider this reasonable for collecting tweets from the center of these large cities.
Our search period was the 2-year period from December 30, 2019, to January 2, 2022, capturing sentiment from before the coronavirus pandemic to the new normal following repeated outbreaks and behavioral restrictions. In addition, tweets were aggregated weekly to offset the weekend effect.

Keywords
Previous studies [3][4][5][6][7] used keywords that directly relate to coronavirus or behavioral restrictions to estimate citizens' sentiment during the pandemic; however, these methods have the following two limitations.
1. It is not possible to compare the pre-pandemic period with the pandemic period, because these keywords were either not recognized by the public before the pandemic or were used in other ways.
2. These keywords are often used in a negative context, so sentiment in the infection-spread period cannot be compared with that in the convergence period in tandem with infection status.
The keywords used are shown in Table 1. These keywords were designed based on previous studies, Centers for Disease Control and Prevention (CDC) usages [28], and similarities between words using Word2Vec [29,30]. In addition, the BERT model described in a later section was used for sentiment estimation. The higher the value of the sentiment index, the more negatively it is interpreted and vice versa. In Fig. 1, the sentiment index value ranges between −0.05 and 0.1 during the pre-pandemic period when the coronavirus was not well recognized by the public, but it drops to less than −0.05 during the first outbreak in April 2020. In addition, the sentiment index ranges between 0.05 and 0.2 after April 2021, which is generally considered to be the time when behavioral restrictions were lifted and citizens looked ahead to the new normal. Figure 2 confirms the same trend as Fig. 1. From the above, these keywords are not appropriate for capturing social sentiment, such as fear and anxiety about the spread of infection or a sense of security about the end of infection.
We focus on the activities of citizens limited by behavioral restrictions. These include common activities, such as commuting, dating, and traveling, even during the pre-pandemic period. By designing keywords based on such activities, we hypothesized that we could extract social sentiment according to the number of cases and the status of behavioral restrictions. These keywords are related to coronavirus and behavioral restrictions, but are also used in contexts outside of coronavirus or behavioral restrictions. In other words, they are the keywords that mediate between coronavirus and inference. In addition, we proposed mediation keywords inspired by the activities of citizens limited by government restrictions.
In Hallas et al. [31], as part of the response of U.S. state governments to the pandemic, the transition of restrictions was investigated by dividing them into three types: containment and closure, economic response, and health systems. The present study designed mediation keywords based on citizens' activities as limited by this containment and closure, which represents behavioral restrictions. In Hallas et al., containment and closure were subcategorized into eight restrictions, such as workplace closings and restrictions on gathering size, but in the present study, we recategorized them into three restrictions, as shown in Table 2. Keywords that are restricted activities were designed according to the CDC's descriptions [28] for Stay-at-home-ordered and Travel Restrictions and to detailed Alabama State orders [32], as well as the CDC description for Restrictions on Gatherings. Also, since the meaning of a single word changes depending on the context, we designed phrases consisting of two or three words related to state-level restrictions.
To estimate the sentiment of citizens limited by behavioral restrictions, the collected tweets did not need to have been posted in the context of coronavirus or behavioral restrictions. This is because our study does not estimate opinions on coronavirus and behavioral restrictions as in the results of Figs. 1 and 2, but rather estimates the daily emotional changes that citizens have in both normal and pandemic periods. On the other hand, changes in the overall theme exchanged in the weekly aggregated tweet space are supported by the TF-IDF as described in a later section. Finally, retweet information was excluded from the search results.
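The collection conditions described in this section (phrase keywords, a 25-mile radius around each city hall, and the exclusion of retweets) can be sketched as search parameters for the Full-archive Search API. The sketch only assembles the request parameters and makes no network call. The coordinates are approximate values assumed for illustration, the phrase "work from home" is a hypothetical stand-in for the keywords in Table 2, and the end time assumes an exclusive upper bound; `point_radius` and `-is:retweet` are query operators of Twitter API v2.

```python
# Approximate (longitude, latitude) of each city hall.
# These values are assumptions for illustration, not taken from the paper.
CITY_HALLS = {
    "New York City": (-74.0060, 40.7128),
    "Los Angeles": (-118.2427, 34.0537),
    "Chicago": (-87.6319, 41.8838),
}


def build_search_params(phrase, city, radius_miles=25):
    """Assemble full-archive search parameters for one phrase and one city.

    The phrase is quoted for exact matching, point_radius restricts the
    search area, and -is:retweet excludes retweets.
    """
    lon, lat = CITY_HALLS[city]
    query = f'"{phrase}" point_radius:[{lon} {lat} {radius_miles}mi] -is:retweet'
    return {
        "query": query,
        "start_time": "2019-12-30T00:00:00Z",
        # Assumed exclusive upper bound, so that January 2, 2022 is included.
        "end_time": "2022-01-03T00:00:00Z",
    }


# Hypothetical phrase; the actual mediation keywords are listed in Table 2.
params = build_search_params("work from home", "Chicago")
```

Issuing one such request per phrase and city, and then grouping the results by restriction type, would correspond to the aggregation described above.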

Collection result
This section describes the tweets obtained through the Twitter API. The total number of tweets is 309,425, the number of unique users is 102,807, and the total file size is 88.4 MB. Table 3 shows the number of tweets, and Table 4 shows the number of unique users in each city for each restriction type. Table 5 indicates the number of tweets for each number of words in the keyword. As originally expected, phrases with fewer words matched a higher number of tweets. In this study, tweets were retrieved from June 17 to August 30, 2022. Yoshida [33] reported that, when attempting to retrieve tweets by tweet ID from a public Twitter dataset associated with COVID-19, 15.3% of all tweets were inaccessible. Accordingly, it should be noted that we may not have retrieved all tweets posted from December 30, 2019, to January 2, 2022.

Training inference
In this study, a transformer-based neural network model is tested as a sentiment-inference method. Although there is a method to classify sentiment based on a pre-registered dictionary, this method cannot classify texts that do not match the dictionary. For results filtered in advance by keywords, such as those in Table 2, classification based on contextual information using a neural network is appropriate.

Neural network model and fine-tuning
In this study, we used the BERT and GPT-3 models, both based on the Transformer architecture. For BERT, we used BERT-Base [34], which was pre-trained on a dataset consisting of English Wikipedia and 11,038 unpublished books. As machine learning libraries, PyTorch 1.7.1 and torchtext 0.8.1 were used for the training and inference processes. The number of epochs in training and validation tasks was 14. For GPT-3, we used OpenAI GPT-3 Curie [35], which is a model optimized for language translation, complex classification, text sentiment, and summarization.
For training data, the Sentiment140 dataset [36], which labels tweets with emotions, was used for fine-tuning. After removing URLs and mention information starting with @ from the Sentiment140 tweet dataset, the positive and negative data were divided equally, and then, the dataset was divided into 80% training data and 20% validation data. The split data are shown in Table 6.
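The preprocessing and split described above can be sketched as follows. This is a minimal illustration, assuming that "divided equally" means balancing the positive and negative classes; the regular expressions, the function names, and the random seed are hypothetical choices rather than the authors' implementation.

```python
import random
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def clean_tweet(text):
    """Remove URLs and @mentions, then collapse repeated whitespace."""
    text = URL_PATTERN.sub("", text)
    text = MENTION_PATTERN.sub("", text)
    return " ".join(text.split())


def balanced_split(rows, train_ratio=0.8, seed=0):
    """Balance the classes, clean the text, and split into train/validation.

    rows: list of (text, label) pairs with label "positive" or "negative".
    Returns (train, validation) lists of (cleaned_text, label) pairs.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[1] == "positive"]
    negatives = [r for r in rows if r[1] == "negative"]
    rng.shuffle(positives)
    rng.shuffle(negatives)
    per_class = min(len(positives), len(negatives))  # equal class sizes
    balanced = positives[:per_class] + negatives[:per_class]
    data = [(clean_tweet(text), label) for text, label in balanced]
    rng.shuffle(data)
    cut = int(len(data) * train_ratio)
    return data[:cut], data[cut:]
```

For example, `clean_tweet("@bob check https://t.co/abc now")` returns `"check now"`.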

Accuracy of models
In addition to the Sentiment140 dataset used for training, the collected dataset was manually labeled and used for testing. For labeling, an author and two collaborators labeled sentiment for the same data, and a majority vote determined the final label for the test data. The test results are shown in Table 7. Although the GPT-3 training data are approximately 1/10 of the BERT training data, the accuracy of the BERT model was 77.1% and that of the GPT-3 model was 89.5% on the Sentiment140 dataset. Furthermore, on the collection dataset, the accuracy of the BERT model was 72.4% and that of the GPT-3 model was 81.0%. This result confirms the superiority of GPT-3's performance in the sentiment classification of tweets.
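The labeling and evaluation procedure can be sketched as a majority vote over three annotators followed by a simple accuracy calculation. The function names and the handling of ties (returning `None` when no label wins a strict majority) are assumptions for illustration.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by a strict majority of annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def accuracy(predicted, gold):
    """Share of predictions that match the gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Three annotators label the same two tweets (toy data).
annotations = [
    ["negative", "negative", "positive"],
    ["positive", "positive", "positive"],
]
gold = [majority_label(votes) for votes in annotations]
```

With two classes and three annotators a strict majority always exists; the `None` case only arises if more than two labels are allowed.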

Methods of indexing
The retrieved tweets were converted to sentiment using a classifier application that implemented the BERT and GPT-3 models. In the BERT classifier, the accuracy of the model could only be guaranteed at 72.4% on the collection dataset, so tweets with an estimated probability of less than 0.70 were assigned to neutral sentiment. In the BERT classifier, each tweet was classified as 0 for positive, 1 for neutral, or 2 for negative, and 0 was normalized to -1, 1 to 0, and 2 to 1 to make neutral 0. In the GPT-3 classifier, the accuracy of the model was sufficiently guaranteed to be 81.0% on the collection dataset, so positive was defined as 0 and negative was defined as 2 for polar classification, and finally, 0 was normalized to -1 and 2 was normalized to 1. Tables 8 and 9 show examples of sentiment classified and indexed by a classifier application that implemented the BERT and GPT-3 models. The resulting sentiment index is the arithmetic mean of the classified values, aggregated on a weekly basis from Monday to Sunday. The higher the value of the index, the more pessimistic the sentiment is throughout the week, and the lower the value, the more optimistic the sentiment.
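The indexing rule above can be sketched as follows: labels are mapped to −1, 0, and 1, low-confidence BERT predictions fall back to neutral, and the values are averaged per Monday-to-Sunday week. The function names and the record format are hypothetical; only the mapping, the 0.70 threshold, and the weekly mean come from the description above.

```python
from collections import defaultdict
from datetime import date, timedelta

INDEX_VALUES = {"positive": -1, "neutral": 0, "negative": 1}


def to_index(label, probability=None, threshold=None):
    """Map a predicted label to its index value.

    When a threshold is given (0.70 for the BERT classifier), predictions
    whose probability falls below it are treated as neutral.
    """
    if threshold is not None and probability is not None and probability < threshold:
        return 0
    return INDEX_VALUES[label]


def week_start(day):
    """Return the Monday of the week containing the given date."""
    return day - timedelta(days=day.weekday())


def weekly_index(records):
    """Arithmetic mean of index values per Monday-to-Sunday week.

    records: iterable of (date, index_value) pairs.
    """
    buckets = defaultdict(list)
    for day, value in records:
        buckets[week_start(day)].append(value)
    return {week: sum(vals) / len(vals) for week, vals in sorted(buckets.items())}


records = [
    (date(2020, 3, 9), to_index("negative", 0.91, threshold=0.70)),   # Monday
    (date(2020, 3, 15), to_index("positive", 0.88, threshold=0.70)),  # Sunday
    (date(2020, 3, 16), to_index("negative")),                        # next week
]
index_by_week = weekly_index(records)
```

For the GPT-3 classifier, `to_index` would be called without a threshold, so every tweet is either −1 or 1.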

Corroboration of index
To confirm the usefulness of the extracted sentiment index, it is validated in two ways. First, a correlation coefficient is used to examine the time-series relationship between the extracted sentiment and the number of cases. A significantly higher value of the correlation coefficient indicates a higher sensitivity of citizens to the number of infected cases in each city during that period, and vice versa. The second is the extraction of feature words of classified tweets using TF-IDF. TF-IDF is a feature vectorization method widely used to identify the importance of a term to a document in a corpus, and feature words are used to identify unique words of tweets extracted on a weekly basis. If many feature words related to coronaviruses are extracted, it is interpreted as many tweets being exchanged in the context of coronaviruses, and vice versa. The extracted words characterize the entire theme exchanged during that period. In this research, we have already filtered tweets using detailed keywords in Table 2, so feature words are extracted using the TF-IDF method, not topic extraction such as LDA.
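The feature-word extraction can be sketched by treating all tweets of one week as a single document and ranking its terms by TF-IDF. The weighting used here (relative term frequency multiplied by the logarithm of the inverse document frequency, without smoothing) is one common variant and is an assumption; the variant actually used is not specified in this section. The function name and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def top_feature_words(weekly_tokens, k=3):
    """Rank each week's words by TF-IDF and return the top k per week.

    weekly_tokens: dict mapping a week label to the list of tokens from
    all tweets of that week (one document per week).
    """
    n_docs = len(weekly_tokens)
    doc_freq = Counter()
    for tokens in weekly_tokens.values():
        doc_freq.update(set(tokens))  # count each word once per week

    result = {}
    for week, tokens in weekly_tokens.items():
        term_freq = Counter(tokens)
        total = len(tokens)
        scores = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[week] = ranked[:k]
    return result


weeks = {
    "2020-03-09": ["quarantine", "quarantine", "home", "work"],
    "2020-11-23": ["thanksgiving", "home", "work", "travel"],
}
features = top_feature_words(weeks, k=1)
```

Words that appear in every week, such as "home" and "work" in this toy input, receive a score of zero, so the words unique to a week rise to the top.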

Results
This section shows a time-series of the sentiment of citizens limited by the restrictions of each metropolis of New York City, Los Angeles, and Chicago from December 30, 2019, to January 2, 2022. This section focuses on a time-series analysis of the sentiment index within these same regions. The indexes extracted using GPT-3, which showed high accuracy, were drawn, and the indexes extracted using BERT were drawn as a reference. For the number of new infections, the New York Times COVID-19 Data [37] categorized at the county level as of April 3, 2022 were used and aggregated weekly. The correlation coefficient was used to confirm the relationship between sentiment index and infection status, and the 4-week average analysis was completed in addition to the weekly analysis. Furthermore, each city's timeline and feature words were referenced to ascertain the relevance of the sentiment index to events, such as state government orders.

New York City
New Cases on the left axis of Fig. 3 shows the number of cases identified in New York City, and the right axis shows the index of sentiment extracted by the sentiment classifier. By January 2022, New York City had experienced four waves of infection in the spring and winter of 2020 and the fall and winter of 2021. In New York State, unlike in the South and West, there was no spike in infections in the summer of 2020, and since July 2020, California has surpassed New York in terms of the number of infected cases. First, we describe the plotted waveform of sentiment. The correlation coefficient between the BERT sentiment and GPT-3 sentiment is 0.77. Table 10 shows the correlation coefficient between the sentiment index obtained and the number of confirmed cases. Lag means the time lag in the number of cases against the sentiment index. For instance, a lag of 2 weeks refers to the correlation coefficient with the number of cases 2 weeks after the week in which the sentiment index was extracted. In Table 10, Total means the total sentiment extracted for each restriction type, and a positive correlation is confirmed in the lag of 1 week to 2 weeks until May 2021, but no significant correlation is confirmed after July 2021. Additionally, from examining the correlation coefficient by restriction type, we can see that tweet sentiment by type is associated with the number of confirmed cases in the lag of 1 week to 2 weeks until May 2021. On the other hand, the total 4-week average as a trend line confirms a significant positive correlation throughout the period from February 2020 to December 2021, and in each period, the total 4-week average values show a higher correlation than the weekly values.
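The lagged correlation and the 4-week average can be sketched as follows. Pearson's correlation coefficient and a trailing 4-week window are assumptions made for illustration, since the exact coefficient and averaging window alignment are not stated in this section; the function names and toy series are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def lagged_correlation(sentiment, cases, lag_weeks):
    """Correlate sentiment in week t with cases in week t + lag_weeks."""
    if lag_weeks > 0:
        return pearson(sentiment[:-lag_weeks], cases[lag_weeks:])
    return pearson(sentiment, cases)


def moving_average(values, window=4):
    """Trailing moving average; the first window - 1 weeks are dropped."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Toy weekly series in which cases follow sentiment with a 2-week delay.
sentiment = [0, 1, 0, -1, 0, 1, 0, -1]
cases = [200, 200, 200, 300, 200, 100, 200, 300]
r_lag2 = lagged_correlation(sentiment, cases, lag_weeks=2)
```

In this toy input the coefficient at a lag of 2 weeks is 1, because the case series is a shifted and rescaled copy of the sentiment series.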
Next, we verify the sentiment waveform. In Fig. 3, sentiment spiked in the week of March 9, 2020, and continued to peak until the week of March 23. The first cases of infection were confirmed in New York State on March 3, after which Governor Cuomo announced the New Rochelle containment area on March 10 and WHO declared a global COVID-19 pandemic on March 11. As of March 25, the number of infected cases in New York accounted for more than 7% of the total number of cases worldwide, and Governor Cuomo stated that the closure of schools and gatherings dramatically delayed the exponential increase in infections. In this first wave of infections from March, the peak of sentiment index values overlaps with the above period. Tables 11 and 12 show the top feature words extracted by the TF-IDF regarding sentiment. These feature words support that, during this first wave, negative tweets using keywords related to Stay-at-home-ordered Restrictions, Restrictions on Gatherings, and Travel Restrictions were exchanged in contexts related to COVID-19 using words such as "coronavirus," "canceled," "quarantine," "essential," and "distancing."
In the summer of 2020, infections subsided in New York City, and the sentiment index rose to 0.00 by the week of July 6, 2020. In the feature words of the same period, shown in Table 11, the keywords related to behavioral restrictions decreased, while those related to coronavirus were still prevalent. Infection spikes in the South and West, rather than in New York City, might have affected the city's sentiment index. In the second wave, citizens' awareness may have risen, as Governor Cuomo tightened regulations on schools and places of worship on October 6 in response to increasing cases in parts of New York City. Then, through November, behavioral restrictions increased as hospitalization rates broke records. According to Fig. 3, the value of the sentiment index decreased from the week of November 23 to the week of November 30, but according to the feature words in Tables 11 and 12, there is a high possibility that the Thanksgiving holidays had an effect. In addition to the spread of coronavirus infection, as seen in the feature words, the winter storm that occurred in mid-December might have contributed to the rise in the sentiment index in December. (A similar winter snowstorm effect was confirmed from feature words in the week of February 1, 2021.) It should be noted that holidays make citizens feel positive about going out, gatherings, and travel, and storms are natural behavioral restrictions; therefore, our study's keywords were sensitive to these events. In addition, focusing on the feature words of positive sentiment in late April 12, when the number of cases and the sentiment index decreased, the topic of vaccination increased. After June 2021, no significant correlation between the number of cases and the sentiment index could be confirmed, except for the 2-week time lag in the winter wave of 2021. However, as can be seen from the feature words in Table 11, it is highly possible that tweets related to coronavirus affected the sentiment index even during this period. In particular, keywords related to vaccination from the week of August 2 to 16, 2021 and keywords related to Omicron from the week of December 20 to 27 stand out.

Los Angeles
The left axis of Fig. 4 shows the number of infection cases in Los Angeles County, and the right axis shows the sentiment index. Unlike New York State, California has experienced five waves of infection, including in the summer of 2020, with Los Angeles County having the highest cumulative number of confirmed cases at the county level in the United States in June 2020 [38]. Additionally, in California, since Governor Newsom issued a stay-at-home order in March 2020, he has been intermittently adding and easing behavioral restrictions, eventually announcing re-opening on June 15, 2021, without capacity restrictions or distancing requirements. First, we explain the sentiment waveform. The correlation coefficient between the BERT sentiment and GPT-3 sentiment is 0.71. The sentiment index correlated with the number of cases except in the second wave. Furthermore, even after June 2021, when socioeconomic activities resumed, the correlation between the number of cases and the sentiment index continued to be confirmed, which is a unique feature of Los Angeles. Then, the sentiment waveform is verified. The first wave began with an infection on the cruise ship Grand Princess in the week of March 2, 2020, and in the same week, Governor Newsom declared a State of Emergency [39]. The number of cases peaked 2 weeks later during the week of March 30. According to Table 13, the correlation coefficient is 0.95 in total with a lag of 2 weeks, and the correlation coefficient for each restriction type is also a significant value. Keywords such as "cancel," "quarantine," "lockdown," and "closed" also stand out from Table 15, which confirms that negative tweets posted in the context of behavioral restrictions contributed to the waveform.
In the next wave of infections from spring to summer of 2020, infection numbers peaked during the week of July 13. In the following week, the number of cases in California exceeded that in New York State, reaching the highest level in the United States. In Los Angeles, no significant correlation was found between confirmed cases and the sentiment index from spring to summer 2020. According to Table 15, tweets on the theme related to "mask" were prevalent. By June, Governor Newsom had announced that Californians would be required to wear face masks in public, and public awareness had likely risen, coupled with increased infections (Table 16). In the spread from October 2020, infections peaked from December 2020 to January 2021. In addition, the total correlation coefficient has a high value of 0.73 to 0.74, and unlike in the other two cities, there is a possibility that citizens' awareness was highly sensitive regarding coronavirus during this period in Los Angeles. The high level of public interest in the coronavirus can be seen from the feature words of the same period in Table 15. According to Tables 13 and 14, the correlation for both the June 14 to October 11 and November 15 to December 27 periods is significant, although there are differences in the time lag. Citizens may have been more concerned about the coronavirus during both the pandemic and new-normal periods in Los Angeles than in the other two cities, according to their feature words and correlation coefficients.

Chicago
New Cases in Fig. 5 display a weekly time-series of the number of cases in Cook County, Illinois, where the city of Chicago is located. Citizens of Chicago have experienced four major waves of infection from 2020 to 2021 at the same times as New York City. 16 Interestingly, Illinois consistently had the lowest levels of cases and deaths in the country from late spring to early summer of 2020. However, the city also faced the highest level of deaths per week of all states during the second surge in the winter of 2020. 17 First, we can describe the sentiment waveform. The correlation coefficient between the BERT sentiment and GPT-3 sentiment is 0.76. 18 On the other hand, a strong correlation between the sentiment index and the number of cases could not be confirmed, as shown in Table 17. During the Omicron wave, which started in October 2021, the correlation coefficient of 0.87 in total shows a strong positive correlation only when the lag between the sentiment index and cases is not set. Second, the sentiment waveform is verified. The sentiment waveform's peak in March 2020, shown in Fig. 5, could be explained by the announcement of behavioral restrictions by Governor Pritzker. Orders issued on March 20 required citizens to stay at home and non-essential businesses to be closed statewide, which restriction was extended until May 29, 2020. 19 On May 5, 2020, Governor Pritzker announced a re-opening plan consisting of 20 and in the same week, sentiment responded positively. Additionally, the negative reaction in sentiment in early June may have been heavily influenced by posts about Blackout Tuesday, according to Table 18. During the second wave, which began in November 2020, the restriction level moved into Tier 2 on November 11 and then into the highest Tier 3 on November 20 in response to a surge in cases. 21 Sentiment responded negatively as the restriction level shifted. 
Then, toward the end of November, sentiment responded positively, and the feature words for November 2020 in Table 19 show that this was the effect of the Thanksgiving holidays. In addition, on January 15, 2021, mitigations at Tier 3 restriction levels were resumed, 22 and in Table 19, the keywords "mitigations" and "tier" stand out as positive words. For 2020, we could not confirm a correlation between the sentiment index and the number of cases in Chicago, but we could indicate an association between sentiment and the ordering and relaxing of behavioral restrictions by the State government, although the relationship between behavioral restrictions and the sentiment time-series was disrupted by Blackout Tuesday and the Thanksgiving holidays. On the other hand, for 2021, the relationship between the sentiment index and coronavirus-related events was not confirmed, which could have been due to the lifting of orders and restrictions.
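The feature words cited in this section are described in the Conclusion as being extracted with TF-IDF. The following is a minimal sketch of that idea, assuming that all tweets of one period are pooled into a single token list and that the plain tf × log(N/df) weighting is used; the paper's tokenization, weighting variant, and period boundaries are not given here, and the sample tokens are hypothetical.

```python
from collections import Counter
from math import log


def tfidf_top_words(docs, k=3):
    """Rank words per period by TF-IDF.

    docs: dict mapping a period label to a list of tokens (one pooled
    "document" per period). Returns a dict mapping each period to its
    top-k words. Ties are broken alphabetically.
    """
    n_docs = len(docs)
    # Document frequency: in how many periods does each word occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    top = {}
    for period, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            word: (count / total) * log(n_docs / df[word])
            for word, count in counts.items()
        }
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        top[period] = ranked[:k]
    return top


# Hypothetical pooled tokens per period (not taken from the paper's data).
docs = {
    "2020-11": ["thanksgiving", "family", "home", "thanksgiving", "tier"],
    "2021-01": ["mitigations", "tier", "home", "mitigations"],
    "2021-02": ["snow", "home", "snow", "family"],
}
feature_words = tfidf_top_words(docs, k=2)
```

A word that appears in every period, such as "home" in this toy input, receives a score of zero, so the ranking favors words that are specific to one period.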

Discussion
While the Results section analyzed the relationship between the number of infections and sentiment trends within each region, this section compares the sentiment waveforms across the cities. First, as a general trend of the waveform common to all three cities, sentiment spiked in the first wave of infection and then gradually stagnated over time. In each figure, the sentiment index exceeded 0.1 in March 2020, and after that, sentiment gradually declined according to the 12-week moving average lines. Hallas et al. [31] present the time series of their stringency index, showing that the stringency of the policy response was relaxed toward December 2020, despite increasing infections per capita in each state of the United States. Although there are other possible factors, such as vaccinations and reduced lethality of mutant strains, it is conceivable that relaxing mitigation policies might have turned around the sentiment of citizens limited by behavioral restrictions. Chakraborty et al. [12] show that the degree of negativity in English news articles related to the coronavirus decreased over a short period of 60 days; therefore, the influence of changes in news media should be considered at the same time. Furthermore, the sentiment index in January 2020, before the pandemic, was around −0.2 in each city, whereas it remained higher than that in New York City and Los Angeles after April 2021, when citizens became conscious of the new normal. In these cities, sentiment affected by behavioral restrictions might not have returned to pre-pandemic levels even in the latter half of 2021. Table 20 shows the correlation coefficients between the sentiment indexes of each city. Significance is shown at the 5% significance level in the two-sided test between cities. In particular, a strong relationship between New York City and Los Angeles was confirmed, and a parallel trend of waveforms was observed across the cities.
In the stringency index by Hallas et al. [31], New York and California have been considered relatively strict States in their policy response to the pandemic, but in terms of infection status, Los Angeles is different from New York City, for example, in experiencing a wave of infections in the summer of 2020. On the other hand, as seen in Tables 10, 13, and 14, the significance of the correlation between the number of cases and the sentiment index varies with the observation period in New York City and Los Angeles. From the above, we can see that the sentiment waveform of each city was not solely influenced by its own infection situation.
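The two computations this discussion relies on, the 12-week moving average and the pairwise correlation between city sentiment indexes, can be sketched as follows. This sketch assumes a trailing (not centered) window and approximates the two-sided significance test with the Fisher z-transform; the paper does not state which window alignment or which test statistic was used, so both choices, the function names, and the sample series are assumptions.

```python
from itertools import combinations
from math import atanh, sqrt
from statistics import NormalDist


def moving_average(values, window=12):
    """Trailing moving average; None until a full window is available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)


def approx_two_sided_p(r, n):
    """Approximate two-sided p-value for H0: no correlation (Fisher z)."""
    if n <= 3:
        raise ValueError("Fisher z approximation needs n > 3")
    if abs(r) >= 1.0:
        return 0.0
    z = atanh(r) * sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def pairwise_correlations(series_by_city, alpha=0.05):
    """Map each city pair to (r, significant at level alpha)."""
    out = {}
    for a, b in combinations(sorted(series_by_city), 2):
        x, y = series_by_city[a], series_by_city[b]
        r = pearson(x, y)
        out[(a, b)] = (r, approx_two_sided_p(r, len(x)) < alpha)
    return out


# Hypothetical weekly sentiment indexes (not the paper's data).
series = {
    "Los Angeles": [0.2, 0.6, 0.4, 1.0, 0.8, 1.2, 1.6, 1.4],
    "New York City": [0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 0.8, 0.7],
}
smoothed = moving_average(series["New York City"], window=4)
table = pairwise_correlations(series)
```

With the 105 weekly values used in the study, the normal approximation should be close to an exact t-test, but an exact test would be preferable for short sub-periods.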

Limitations
There are three limitations to this study. The first is the limitation of the sentiment classification model. The BERT model classified tweets into three polarities (positive, neutral, and negative) based on the probability of inference, but the GPT-3 model could not return inference probabilities due to the specifications of its API.
More accurate results could be derived by creating Twitter training data classified into three polarities. The second limitation regards the extracted sample. In this study, posted tweets were evaluated, but the number of tweets retrieved decreased over time. 23 A decrease in the number of tweets might indicate a decrease in citizens' interest, and the characteristics of the sample population might have changed over time. Therefore, in the future, a multifaceted approach, such as an evaluation of user bias, should be used.
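To make the first limitation concrete, the sketch below shows one way a three-polarity label can be derived from inference probabilities, which is possible for a model that exposes probabilities and not for an API that returns only a label. It assumes the label is simply the most probable class; the paper's actual decision rule is not given here. The `net_sentiment` function is a hypothetical index used only for illustration and is not the paper's sentiment index.

```python
LABELS = ("negative", "neutral", "positive")


def classify(probs):
    """Return the most probable of the three polarity labels.

    probs: dict mapping each label in LABELS to a probability.
    Assumption: argmax decision rule (ties resolve in LABELS order).
    """
    missing = [label for label in LABELS if label not in probs]
    if missing:
        raise ValueError(f"missing probabilities for: {missing}")
    return max(LABELS, key=lambda label: probs[label])


def net_sentiment(labels):
    """Hypothetical index: (positive count - negative count) / total count."""
    if not labels:
        raise ValueError("no labels to aggregate")
    pos = sum(1 for label in labels if label == "positive")
    neg = sum(1 for label in labels if label == "negative")
    return (pos - neg) / len(labels)


# Hypothetical per-tweet probabilities for one week.
week = [
    {"negative": 0.2, "neutral": 0.5, "positive": 0.3},
    {"negative": 0.7, "neutral": 0.2, "positive": 0.1},
    {"negative": 0.1, "neutral": 0.2, "positive": 0.7},
    {"negative": 0.6, "neutral": 0.3, "positive": 0.1},
]
labels = [classify(p) for p in week]
index = net_sentiment(labels)
```

When only labels are available, as with the GPT-3 API described above, the aggregation step can still be applied, but the confidence behind each label cannot be inspected or thresholded.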
The third limitation regards the sentiment captured. In this study, we proposed a social sentiment estimation model set in three U.S. metropolises during the COVID-19 pandemic period from December 30, 2019, to January 2, 2022. However, in New York City, the influences of the winter storm in December 2020 and the blizzard in February 2021 were also confirmed from the extracted feature words. Events that limit citizen behavior, such as natural disasters, also bring pessimistic feelings to citizens, just like the coronavirus. However, the results of this experiment also suggest

Conclusion
We proposed a social sentiment estimation model based on Twitter use in New York City, Los Angeles, and Chicago during the coronavirus pandemic. By designing mediation keywords that are related to the coronavirus but do not explicitly mention it, we could estimate sentiment in response to infection numbers and the level of behavioral restrictions. In addition, these estimation results are verified by the performance of the transformer-based GPT-3 model. Using these results, we were able to capture long-term trends in the sentiment of citizens in large cities during a pandemic for the first time in history. In our results, the correlation between the sentiment index and infection numbers differed for each city. In Los Angeles, a relatively positive correlation between the sentiment index and the number of cases was confirmed over 2 years, but the same was not confirmed in Chicago. On the other hand, in each city, the relationship between the timeline of events related to COVID-19 and the waveform was confirmed, and this result was supported by feature words extracted using TF-IDF. In addition, we identified concurrency between the New York City and Los Angeles waveforms, suggesting a general and universal trend in citizens' sentiment during this period. Our model is applicable not only to COVID-19 pandemic situations but also to general emergencies that restrict the activities of citizens, such as natural disasters. Furthermore, estimating the time-series of social sentiment in an emergency from a macro-perspective will allow us to confirm the periodicity and inertia of the sentiment wave at that time. In addition, implementing these estimation models on data-streaming platforms could support policymakers' understanding of citizen sentiment during policy-making and provide feedback after policy implementation in an emergency.

Figures 6, 7, and 8 show the time series of the number of tweets retrieved for each type of state-level restriction.
In all three cities, tweets related to Stay-at-home-ordered and Travel Restrictions peaked during the weeks of March 2 to March 16, 2020, and tended to decline thereafter. For these two restriction types, fewer tweets were collected after June 2020 than between January and February 2020. Also, by restriction type, the number of tweets related to Travel Restrictions was the highest throughout this period, followed by Stay-at-home-ordered Restrictions and Restrictions on Gatherings. Table 21 shows the summary statistics of the 105 weekly values generated by indexing tweets retrieved from December 30, 2019, to January 2, 2022, in each city.
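The weekly binning behind these counts can be sketched as follows. The sketch assumes that weeks run Monday to Sunday starting on December 30, 2019, which is consistent with the stated period containing exactly 105 weeks; the function names and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date

START = date(2019, 12, 30)  # Monday, first day of the study period
END = date(2022, 1, 2)      # Sunday, last day of the study period


def week_index(day):
    """Zero-based index of the Monday-to-Sunday week containing `day`."""
    if not START <= day <= END:
        raise ValueError("date outside the study period")
    return (day - START).days // 7


def weekly_counts(days):
    """Count items per week over the whole study period (zeros included)."""
    n_weeks = week_index(END) + 1
    counts = Counter(week_index(d) for d in days)
    return [counts.get(w, 0) for w in range(n_weeks)]


# Hypothetical tweet dates for one restriction type.
tweet_days = [date(2020, 3, 2), date(2020, 3, 8), date(2020, 3, 9)]
counts = weekly_counts(tweet_days)
```

Keeping empty weeks as zeros gives every city and restriction type a series of the same length, which simplifies later comparisons and summary statistics.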