1 Introduction

Social media is more ubiquitous than ever, which made it a valuable tool for staying connected during the pandemic. Using automatic data processing for Twitter messages, the Social Response to COVID-19 SMART (Social Media Analytic and Research Testbed) Dashboard helps researchers search Tweets in different cities, filter noise (for example, by removing redundant retweets and using machine learning methods to improve precision), analyze social media data from a spatiotemporal perspective, and visualize social media data in various ways (such as weekly and monthly trends, top URLs, top retweets, top mentions, and top hashtags). The Dashboard uses multiple data mining programs, GIS methods, and advanced geo-targeted social media APIs to track selected topics in space and over time. Several components work together to search, process, and visualize social media messages from the Twitter Standard Search application programming interface (API). The filtered statistics for the focus topics and geo-targeted cities are visually represented in the SMART Dashboard.

The daily, near-live monitoring capability of the Dashboard has great potential for local, state, and public health agencies and practitioners to integrate real-time information when investigating large-scale disease outbreaks. For example, the Social Response to COVID-19 SMART Dashboard can be used to study sentiments on COVID-19 and vaccines in Italian cities in response to new policy mandates and curfews. Because the Dashboard captures the temporal and spatial nature of COVID-related policies, behaviors, beliefs, and sentiments through Twitter content, and so reveals trends across diverse geographic areas, community leaders can use this tool to stay closely connected to their constituents and mitigate social issues before they become full-blown movements. Another potential use is to monitor public opinion towards crisis events such as the COVID-19 outbreak. The Dashboard visualizes the most popular media shared on Twitter about the COVID-19 pandemic in real time.

2 Literature Review

To provide more background on the Dashboard, the following areas are discussed in detail: the impact of COVID-19 on the 10 metropolitan cities in Italy, to establish the geographical and temporal constraints; social media analytics and their use cases; and SMART Dashboard 2.0, to trace the history of the Dashboard.

2.1 Impact of COVID-19 on the 10 Italian Cities

The COVID-19 pandemic has turned the once tourist-filled cities of Italy into ghost towns due to quarantine measures. COVID-19 is caused by the coronavirus SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), and it presents with symptoms that include “fever or chills, cough, shortness of breath, difficulty in breathing, fatigue or tiredness, muscle or body aches, headaches, new loss of taste or smell, sore throat, congestion or runny nose, nausea or vomiting, and diarrhea” [12]. At the beginning of March 2020, Italy came to the forefront of world health news due to its rapidly rising COVID-19 cases and deaths, and because it was the first country outside of Asia to record such high numbers of cases and deaths.

To provide more background, Italy’s first confirmed COVID-19 case was reported in the Province of Lodi, Lombardy region, on February 20th, 2020 [11]. The next day, the first COVID-19-related death in Italy and all of Europe was announced in the province of Padova, Veneto region [1]. Because Italy has an aging population, and older residents are more likely to have comorbidities, a large share of residents are at risk from the disease [2].

Other Twitter dashboard studies have focused on real-time Twitter trend analysis using big data analytics and machine learning techniques [3]. For instance, Garg and Kaur [4] explained the analysis of Twitter data using components of the Cloudera distribution of Hadoop. The study’s objective was to assign a polarity to each tweet. The MapReduce and Apache Spark frameworks were used for sentiment analysis, and the results showed that Apache Spark performed better than MapReduce. Saad and Yang [5] performed sentiment analysis of Twitter data using ordinal regression, while Ahmed and Rodriguez-Diaz [6] performed sentiment analysis on online customer reviews as a form of visualization. Finally, Rathod and Barot [7] researched the same field to predict public opinion on ongoing events by analyzing tweet sentiments using machine learning classifiers such as SVM, Naive Bayes, logistic, and KNN classifiers. SVM was found to be the best classifier, with the least mean square error for the classifications. Garg et al. [8] identified trending patterns in Twitter using Spark. These patterns were obtained by collecting tweets in real time and identifying trending hashtags at the same time, implemented with the big data technology Spark Streaming. This type of technique can help governments or companies learn more about the behaviors and trends surrounding a given campaign or program, as well as brand or product awareness and customer needs.

The time frame chosen to provide a proof of concept for the Dashboard is from March 3rd to June 25th, 2020. This period is divided into Phases 0, 1, and 2. Phase 0 lasted from the report of the first case until the start of the lockdown. Phase 1, or the lockdown phase, started on March 11th, 2020 and ended on May 4th, 2020 [9]. Phase 2 lasted from May 4th, 2020 until June 3rd, 2020 [2].

Phase 1 was marked by increased restrictions in Milan in response to the pandemic. Specifically, educational institutions and cultural centers were closed, and religious events and all other events and places that required gathering were prohibited [9]. This included professional sporting events. Visits to family and relatives were prohibited, as was patronizing bars and restaurants. Dining establishments were allowed to offer takeout with limited hours. Face masks were required in all public spaces, indoors and outdoors. In addition, the government required residents to fill out a self-certification form and keep it on their person whenever they left their homes, which enabled contact tracing measures [9]. The lockdown was tightened further when military force was ordered to keep lockdown measures in place. Due to travel restrictions, no airports were open for use. The only travel of any kind allowed was to the grocery store, pharmacy, or hospital.

Phase 2 marked the easing of the Phase 1 restrictions. Businesses opened without limits on their hours of operation [9]. Some airports opened, enabling reduced international travel. Public parks also opened, as did public transportation with reduced capacity [9].

2.2 Metropolitan Italian Cities

For this study, major Italian metropolitan cities were explored to understand the interconnections between geographical location, number of COVID-19 cases, social response to the pandemic, and locally enforced measures based on Twitter data. Table 1 shows the 10 cities that were selected across Italy. These cities were chosen by the Crowdfight International Team, a multidisciplinary research group, based on economic and cultural factors. Since the outbreak started in the North, the team decided to start there, and other cities were added over time. Milan and Turin represent the Northwestern region. Venice and Bologna represent the Northeastern region. Florence and Rome represent the Central region. Naples and Bari represent the Southern region. Palermo and Cagliari represent the Islands.

Table 1. Socio-demographics for the 10 Italian metropolitan cities [13]

2.3 SMART Dashboard

The idea of the “prototype created by the Center for Human Dynamics in the Mobile Age at SDSU was to facilitate the rapid dissemination of official alerts and warnings notifications from OES during disaster events via multiple social media channels to targeted demographics” [15]. The platform can identify and recruit top 1000 social media volunteers based on their social network influence factors and can aid government agencies to communicate more effectively to the public [14].

In our study, this same Dashboard was refitted to 10 metropolitan cities in Italy, specifically cities in the north, center, south, and islands of Italy [14]. The backend was improved and mounted on larger servers.

3 Methodology

3.1 Data Collection

To support the analysis, the team began by collecting Twitter data through the Twitter Standard Search API. This involved creating a Twitter Developer account and requesting access tokens and keys, followed by authentication of those keys. The API allows specific metadata fields to be collected, so the researchers were free to choose which ones to use for the study. Table 2 shows the keywords that were used to harvest the Tweets. These were chosen by the Crowdfight International Team in partnership with the Metabolism of Cities Living Lab under the Center for Human Dynamics in the Mobile Age (HDMA), after discussions with Italian colleagues as well as medical professionals. Keywords were selected for their popularity, as judged by hashtag use and word of mouth.

Table 2. Social response to Covid-19 smart dashboard selected keywords
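As an illustration of the collection step, the following minimal Python sketch builds the request parameters for one geo-targeted keyword search. It assumes the `q`, `geocode`, `count`, `result_type`, and `tweet_mode` parameters of the v1.1 Standard Search endpoint as documented at the time of the study. The function name, the keywords, and the 30 km radius are hypothetical placeholders, not the values used by the Dashboard, and authentication is left out.

```python
from typing import Dict, List

# v1.1 Standard Search endpoint (the request itself is not sent in this sketch).
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_params(keywords: List[str], lat: float, lon: float,
                        radius_km: float, count: int = 100) -> Dict[str, str]:
    """Build query parameters for one geo-targeted keyword search.

    Keywords are combined with OR; multi-word keywords are quoted so they
    are matched as exact phrases.
    """
    query = " OR ".join(f'"{k}"' if " " in k else k for k in keywords)
    return {
        "q": query,
        # geocode is "latitude,longitude,radius" with the radius in km or mi.
        "geocode": f"{lat},{lon},{radius_km}km",
        # Standard Search returns at most 100 Tweets per request.
        "count": str(min(count, 100)),
        "result_type": "recent",
        "tweet_mode": "extended",
    }


# Hypothetical example: keywords and radius are placeholders (Milan city center).
params = build_search_params(["covid19", "coronavirus", "io resto a casa"],
                             45.4642, 9.19, 30)
```

The resulting dictionary could be passed to any authenticated HTTP client together with `SEARCH_URL`; repeating the call per city would give one geo-targeted stream per metropolitan area.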

3.2 Data Framework

To understand how the data is analyzed, it is important to understand the client and server framework in Fig. 1 below.

Fig. 1. Data framework [4]

The server side of the Dashboard is explained below. Because social media data tends to be unstructured, a NoSQL database, specifically MongoDB, was used [10]. The Twitter Search Engine, coded in Python, was used to specify keywords and time period and to automate collection [10]. The web server is written in NodeJS so that there is no need to switch to other server-side languages to implement the server [10]. This choice also means that JavaScript and Node modules can be used to expand the functionality. Using NodeJS for the server also made REST API creation easier, since the API is also built with NodeJS [10]. The client side of the framework is built upon HTML5 (HyperText Markup Language 5), JavaScript (JS), and CSS3 (Cascading Style Sheets, Version 3) as the base, on top of which sit the various JavaScript libraries listed in Table 3.

Table 3. JavaScript libraries in the client side [10]

3.3 Dashboard Features

Thanks to the flexibility of the original SMART Dashboard 2.0, the Social Response to COVID-19 Dashboard was created by first changing which geo-tagged tweets were targeted during data collection, then changing the keywords, and finally filtering out specific links deemed inappropriate or unrelated to the cause. Each section of the COVID Dashboard is discussed below.

The first components that the user sees are on the screen in Fig. 2 below, which contains the Dashboard Toolbar on the far left, the SMART Index at the top, and the Trend and Top Media sections below the SMART Index. It also houses the “Stop Auto Refresh” button, which lets researchers stop the feed and conduct analyses.

Fig. 2. Initial interface of the Social Response to COVID-19 SMART Dashboard

The SMART Dashboard 2.0 Toolbar, on the far left, contains shortcuts to each component of the Dashboard. It also houses the keywords used to extract the Tweets. In addition, it contains the “Download” button for accessing the data in the Dashboard, as well as the Privacy Policy and Feedback buttons. The “Home” button enables the selection of keywords and the filtering of Tweets that may be inappropriate or that add noise to the findings. Keywords are selected simply by checking and unchecking them in the toolbar.
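The noise filtering mentioned above (dropping redundant retweets and Tweets that link to unwanted sites) could be sketched as follows. This is a simplified illustration, not the Dashboard's actual filter: the function names and the blocklist are hypothetical, and the input is assumed to follow the v1.1 Tweet JSON layout (`retweeted_status.id_str`, `entities.urls[].expanded_url`).

```python
from urllib.parse import urlparse


def _domain(url):
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def filter_tweets(tweets, blocked_domains):
    """Drop Tweets linking to blocked domains and repeated retweets.

    Only the first retweet of each original Tweet is kept; later retweets
    of the same original are treated as redundant.
    """
    blocked = {d.lower() for d in blocked_domains}
    seen_originals = set()
    kept = []
    for tweet in tweets:
        urls = [u.get("expanded_url", "")
                for u in tweet.get("entities", {}).get("urls", [])]
        if any(_domain(u) in blocked for u in urls):
            continue  # links to a site on the (hypothetical) blocklist
        original = tweet.get("retweeted_status")
        if original is not None:
            original_id = original["id_str"]
            if original_id in seen_originals:
                continue  # redundant retweet of an already-seen original
            seen_originals.add(original_id)
        kept.append(tweet)
    return kept
```

A machine learning classifier, as mentioned in the Introduction, would be a separate step applied to the Tweets this function keeps.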

The SMART Index, which consists of the four multi-colored blocks across the top, shows the most current metrics as of the last 10-min refresh. The blocks are discussed from left to right. The first block (blue) shows, from top to bottom, the number of Tweets harvested within the past hour, the date they were extracted, and the distribution of the times at which the Tweets were extracted. The second block (green) shows the number of Tweets extracted in the past 24 h, the current date, and the distribution of the Tweets over time. The third block (yellow) contains the number of Tweets since the day before the current date, along with the distribution of Tweets across the day before and the current date. The fourth block (pink/salmon) shows the number of Tweets since the beginning of collection and the distribution of Tweets from the beginning of the Tweet harvest until the current date.

The Trend Section shows the frequency of Tweets generated by the keywords over time through a series of line graphs. Users can hover over any section of the graph to see the Tweets, both filtered and unfiltered, in that time frame. Clicking any point on a line shows the Tweets from the time frame that point represents. In addition, the tabs at the top change the time frame. As shown in Fig. 2, users can view Twitter metrics at 10-min, hourly, daily, weekly, and monthly intervals, which in turn shrinks the graph towards the left. The sliding scale at the bottom can also change the span of the timeline shown in the graph.
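The four SMART Index counters described above can be approximated with a few window counts over Tweet timestamps. The sketch below is only illustrative: the function and key names are hypothetical, and it assumes that "since the day before the current date" means since midnight at the start of the previous day.

```python
from datetime import datetime, timedelta


def smart_index(timestamps, now):
    """Count Tweets in the four SMART Index windows.

    `timestamps` is an iterable of datetime objects (one per Tweet) and
    `now` is the time of the latest refresh.
    """
    timestamps = list(timestamps)
    hour_ago = now - timedelta(hours=1)
    day_ago = now - timedelta(hours=24)
    # Assumption: the third block starts at midnight of the previous day.
    start_of_yesterday = (now - timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return {
        "past_hour": sum(1 for t in timestamps if hour_ago <= t <= now),
        "past_24h": sum(1 for t in timestamps if day_ago <= t <= now),
        "since_yesterday": sum(1 for t in timestamps
                               if start_of_yesterday <= t <= now),
        "all_time": sum(1 for t in timestamps if t <= now),
    }
```

In a deployment backed by a database, the same windows would more likely be expressed as range queries on an indexed timestamp field rather than computed in memory.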

The Top Media Section, on the lower left of Fig. 2, shows the most shared images posted within the time frame. The user can change the time frame to show all media since the beginning of extraction, the past week, the past day, or the current date only.

The Top URL Section shows the most posted links or web pages within the time frame. The user can change the time frame to show all URLs since the beginning of extraction, the past week, the past day, or the current date only. Figure 4 shows it all.

A unique feature of the Social Response to COVID-19 Dashboard is the Word Cloud Section in Fig. 3. It includes a word cloud and a table of the most frequent vocabulary words within the selected time period. The word cloud displays the most frequent vocabulary words within the corpus for any chosen time period. Larger words indicate higher frequency, while words in smaller fonts are less frequent. Word clouds are an intuitive, decorative, and convenient way to see the most common keywords in a corpus. Future developments for the word clouds could include filtering out stopwords, that is, words used so commonly that they provide little to no value to the visualization. In English, for example, these include articles and prepositions such as “the” and “into.” Stopword filtering naturally requires selecting a particular language, and harvesting geo-tagged Tweets does not guarantee that all Tweets are in one specific language.

Fig. 3. Word cloud
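The word frequencies behind a word cloud, together with the stopword filtering proposed as future work, can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the Dashboard's implementation: the tokenization rule, the function name, and the stopword list are all hypothetical.

```python
import re
from collections import Counter


def word_frequencies(texts, stopwords=frozenset(), top_n=10):
    """Return the `top_n` most frequent tokens across `texts`.

    URLs are removed, text is lower-cased, and tokens found in `stopwords`
    are skipped. Hashtags and mentions keep their leading '#' or '@'.
    """
    counts = Counter()
    for text in texts:
        text = re.sub(r"https?://\S+", " ", text.lower())
        tokens = re.findall(r"[#@]?\w+", text)
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(top_n)


# Hypothetical example with a tiny English stopword list.
sample = [
    "The lockdown in Milano",
    "Lockdown extended in the north https://t.co/abc",
    "Milano lockdown",
]
print(word_frequencies(sample, stopwords={"the", "in"}, top_n=2))
```

Because a geo-targeted corpus can mix languages, one option would be to pass the union of several per-language stopword lists; that choice is left open here.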

In addition, the Vocabulary Frequency Table shows the most frequent words in the corpus in the selected time frame. The information is presented as a bar chart arranged from most frequent to least frequent.

Another unique feature of the Dashboard is the Tweets in Cities section, which shows tweeting rates normalized by city population within the selected time period. Basemaps can be changed to the user’s preference. In our example in Fig. 4, the map is centered on Milan, Italy, and the selected time period is all Tweets since the beginning of the extraction. Other options include a week from the current date, a day from the current date, and the current date. Also notable for researchers is the geographic visualization of where the Tweets were collected and the collection radius, along with other useful statistics such as the total number of Tweets collected in the selected time period and the latest population information that the API can find.

Fig. 4. Tweets in cities
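Normalizing Tweet counts by population, as the Tweets in Cities section does, lets large and small cities be compared on one scale. The sketch below expresses the rate as Tweets per 100,000 residents. The scale factor, the function name, and all figures in the example are assumptions for illustration; they are not the Dashboard's values or real population data.

```python
def tweets_per_100k(tweet_counts, populations):
    """Return Tweets per 100,000 residents for each city.

    Cities with a missing or zero population figure are skipped rather
    than reported with a misleading rate.
    """
    rates = {}
    for city, count in tweet_counts.items():
        population = populations.get(city)
        if not population:
            continue
        rates[city] = count * 100_000 / population
    return rates


# Placeholder numbers only: not real Tweet counts or populations.
counts = {"CityA": 500, "CityB": 300, "CityC": 10}
pops = {"CityA": 1_000_000, "CityB": 250_000}
print(tweets_per_100k(counts, pops))
```

With these placeholder figures, the smaller city shows the higher normalized rate even though it has fewer Tweets, which is the comparison the normalized map is meant to support.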

The most common Retweets from the selected time period are displayed in the Top Retweets section. As in the other sections, users can select the time period they want: all Tweets since the beginning of extraction, a week from the current date, the day before the current date, or the current date. Each Retweet has its frequency next to it. Retweets are important because they are quantifiable measures of influence. They also heavily affect a corpus if the study does not require original Tweets.

The Top Mentions section shows the most frequent user references (beginning with ‘@’) in the selected time period. This section is notable because mentions are quantifiable measures of reference: they show the frequency of interaction between Twitter users within the collected corpus. The corresponding frequency is displayed next to each user that was mentioned. Users can select the same time periods as in the Top Retweets section.

The Top Hashtags, which refer to an idea or theme of a Tweet, are shown below the Top Mentions section. Users create hashtags with the pound sign (‘#’) to refer to certain movements. Like mentions, hashtags are quantifiable measures of reference and of the level of interaction between users and hashtags. The corresponding frequency is displayed next to each frequent hashtag in the time period, and the same time periods can be selected.

The last section shows the Geocode Status of the Tweets collected in the selected time period. This is meaningful for researchers because it gives context on how many Tweets in the corpus were successfully geocoded. It can give insight into error rates, so that future experiment parameters can be adjusted accordingly. Corresponding counts and percentages are displayed next to each status (Fig. 5).

Fig. 5. Top hashtags
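The Top Hashtags, Top Mentions, and Geocode Status panels all reduce to frequency counts. The following sketch shows one possible way to compute them. It assumes the v1.1 Tweet JSON layout (`entities.hashtags[].text`, `entities.user_mentions[].screen_name`); the function names and the status labels are hypothetical, since the source does not list the Dashboard's status values.

```python
from collections import Counter


def top_entities(tweets, top_n=10):
    """Return (top hashtags, top mentions) as lists of (label, count)."""
    hashtags = Counter()
    mentions = Counter()
    for tweet in tweets:
        entities = tweet.get("entities", {})
        # Lower-case so that "#COVID19" and "#covid19" are counted together.
        hashtags.update("#" + h["text"].lower()
                        for h in entities.get("hashtags", []))
        mentions.update("@" + m["screen_name"].lower()
                        for m in entities.get("user_mentions", []))
    return hashtags.most_common(top_n), mentions.most_common(top_n)


def status_shares(statuses):
    """Map each status label to (count, percentage of all Tweets)."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return {s: (n, round(100 * n / total, 1)) for s, n in counts.items()}
```

Top Retweets could be counted the same way by keying a `Counter` on the `id_str` of each Tweet's `retweeted_status`.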

4 Discussion and Future Work

This type of dashboard is successful in filtering certain websites and content, and its unique combination of visualizations increases the potential of the tool to be used in many different settings. For the purposes of social response to COVID-19, it allows policymakers to understand the current behaviors of society and can be used to observe public opinion during and after crisis events or disease outbreaks. The SMART Dashboard is available for use to support response and assistance efforts during the pandemic. Real-time public health information and major events captured using social media are now at the forefront of behavioral measurement, disease surveillance, health promotion, and more. Different cities and regions may reveal different patterns of social media messages and trends. By analyzing the context of social media messages and linking place and time together, we can discover more meaningful patterns and insights, depending on the goals of the study of disease outbreaks and social media activities.

The limitations of this study depend on the Twitter Standard Search API, the capacity of the server to store data, the extraction parameters used in data collection, and the specific keywords used in the study. With the constant sharing of ideas online, it is impossible to capture the totality of themes online. In addition, the natural language processing behind the word cloud can be improved by implementing stopword filtering, so that it surfaces specific keywords rather than articles and prepositions. Further work can make these techniques, including the stopword adjustments, language-agnostic.