1 Introduction

A number of novel big data sources have the potential to be useful for socio-economic analyses [9]. These alternative sources of information include, for example, administrative data (e.g., tax and hospital records), commercial data sets (e.g., consumer panels, credit or debit card transactions), and textual data (e.g., social media, web searches, news data). In some cases, these data sets are structured and ready for analysis, while in other cases, for instance with text, the data are unstructured and require some preliminary steps to extract and organize the relevant information [2]. These unconventional data sources have been particularly relevant during the COVID-19 pandemic [14, 24], when they were used to integrate and augment the official statistics produced by national and international statistical agencies [4]. In general, the evolution of this field is contributing to the development of various decision-making instruments that help policymakers design policy interventions with the potential to foster economic growth and societal well-being. These trends are inspiring the research activities of the Competence Center on Composite Indicators and Scoreboards (COIN)Footnote 1 at the European Commission’s Joint Research Centre (JRC)Footnote 2. This contribution describes our ongoing research, which aims to develop a tracker of economic activities and societal issues by obtaining policy-relevant insights from data sets that are considered unconventional in the social sciences, and to stimulate the adoption of cutting-edge modeling technologies in the EU institutions.

2 Google Search Data

Beginning with the work in [7], Google Search data have been used as a proxy for a variety of economic measures, especially in contexts in which official statistics are not easily available. The JRC has studied the use of Google Search data to monitor the interests of European citizens in three main fields related to the pandemic crisis: health, economy and social isolation.Footnote 3 The usefulness of web searches depends heavily on how closely they are linked to the underlying phenomenon. As a result, researchers must find the most relevant set of queries for each language and institutional environment. This task is especially difficult in a cross-country context, since locating the relevant queries is either time-consuming or even impossible (due to language barriers). To overcome this issue, the authors of [5, 6, 23] recently exploited Google Trends topics, which are language-independent aggregations of queries that belong to the same concept from a semantic perspective and thus enable cross-country studies. Through the Google Trends APIFootnote 4, it is possible to access Google Search data in the form of the Search Volume Index (SVI) of both queries and topics, normalized by query time and location. Each data point, filtered by time range (daily, weekly or monthly) and geography (country or ISO 3166-2 subdivision), is divided by the total number of searches to obtain a measure of relative popularity. The figures are based on a uniformly distributed random sample of Google Searches, updated once per day and available from 2004 onwards; because of the sampling, results may differ slightly between similar requests. Google also displays, when possible, the top 25 searches and topics linked to any particular topic or query. Top queries and topics are those most frequently searched by users in the same session at any particular time and location.
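The normalization described above can be illustrated with a minimal sketch. The raw counts are invented (raw search volumes are not published), the function name `search_volume_index` is hypothetical, and the final rescaling to a 0-100 range, with 100 marking the peak of the series, is an assumption about how the index is presented rather than something stated in this section.

```python
def search_volume_index(query_counts, total_counts):
    """Return a 0-100 index of relative popularity for one query or topic.

    query_counts[i]: searches for the query in period i (hypothetical raw counts)
    total_counts[i]: all searches in the same period and geography
    """
    if len(query_counts) != len(total_counts):
        raise ValueError("both series must cover the same periods")
    # Step 1: relative popularity = share of all searches in each period.
    shares = [q / t if t > 0 else 0.0 for q, t in zip(query_counts, total_counts)]
    peak = max(shares, default=0.0)
    if peak == 0.0:
        return [0 for _ in shares]
    # Step 2 (assumption): rescale so that the most popular period equals 100.
    return [round(100 * s / peak) for s in shares]


# Invented weekly counts for one topic in one country.
weekly_query = [120, 300, 450, 150]
weekly_total = [10_000, 12_000, 15_000, 10_000]
svi = search_volume_index(weekly_query, weekly_total)
```

The index conveys relative popularity only: two series with identical index values can correspond to very different absolute search volumes.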

3 DNA: Dow Jones Data, News and Analytics

We also consider newspaper articles as an alternative dataset. Several papers have tried to understand the predictive value of news for measuring financial and economic activities, such as GDP, stock returns, unemployment, or inflation [3, 15, 20, 21, 22]. In particular, many works have used the sentiment extracted from news as a useful addition to the toolset of predictors that are commonly used to monitor and forecast the business cycle [1, 8, 11, 12, 13]. For this task, we rely on a commercial dataset of economic news obtained from the Dow Jones Data, News and Analytics (DNA) platform.Footnote 5 In particular, we use the articles published by Thomson Reuters News, consisting of several million full-text news articles published since 1988. The content covers a wide set of topics, ranging from financial matters to macro-economic announcements and political implications for national economies. We use this news data set to build a set of real-time economic sentiment indicators for the EU27 countries and the UK, focusing on a number of topics of interest [3, 10]. The sentiment indicators are: (i) fine-grained, i.e., they are bounded in the [-1, +1] interval; (ii) aspect-based, meaning that they are computed only with respect to the specific topic of interest [3, 10]. Sentiment indicators are computed for the different European countries by filtering on a direct mention of the country in the text of the articles. Along with this extracted sentiment signal, for each filtered topic and country we also report the volume time series, that is, the number of sentences dealing with the specific topic-country pair under analysis, which measures the popularity of the topic in the selected country. For each time series, daily averages of sentiment and volume scores are calculated. Lower-frequency aggregations, at monthly or quarterly frequency, are also possible.
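The aggregation step can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used on the DNA data: the sentence-level records are invented, the helper `aggregate` is hypothetical, volume is taken to be the count of matching sentences, and lower-frequency values are obtained by pooling sentences directly (averaging the daily averages would be an equally plausible reading).

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical sentence-level records: (publication date, sentiment in [-1, +1]).
# Each record stands for one sentence mentioning the topic and country of interest.
sentences = [
    (date(2021, 3, 1), 0.40),
    (date(2021, 3, 1), -0.20),
    (date(2021, 3, 2), 0.10),
    (date(2021, 4, 5), -0.60),
]


def aggregate(records, key):
    """Group records by key(date); return {period: (mean sentiment, volume)}."""
    buckets = defaultdict(list)
    for day, score in records:
        buckets[key(day)].append(score)
    return {p: (mean(scores), len(scores)) for p, scores in sorted(buckets.items())}


daily = aggregate(sentences, key=lambda d: d)
monthly = aggregate(sentences, key=lambda d: (d.year, d.month))
quarterly = aggregate(sentences, key=lambda d: (d.year, (d.month - 1) // 3 + 1))
```

Each result maps a period to a pair of mean sentiment and sentence count, so the sentiment and volume series stay aligned on the same time index.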

4 GDELT: Global Database of Events, Language and Tone

GDELTFootnote 6 is the Global Database of Events, Language and Tone, which is maintained by Google [17, 18]. It is an open big data platform of news collected worldwide, containing structured data mined from broadcast, print and web sources in more than 65 languages. It connects people, organizations, quotes, locations, themes, and emotions associated with events happening across the world. It describes societal behavior through the eyes of the media, making it an ideal data source for measuring social factors. The data set starts in February 2015 and is freely available to users via REST APIs.Footnote 7 GDELT processes over 88 million articles a year from more than 150,000 news outlets, updating the output every 15 minutes.Footnote 8 We use GDELT themes to select news related to certain social or economic topics (e.g., “industrial production”, “unemployment”, “cultural activities”, etc.), restricting the selection to news about the European country of interest. After this processing, we compute as output (i) the Article Tone, that is, a score between \(-1\) and \(+1\) expressing whether a certain message conveys a positive or negative sentiment with respect to a certain topicFootnote 9; and (ii) the Topic Popularity rate, that is, the number of articles referring to the searched topic, normalized by the total number of articles in the period.
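The two output measures can be illustrated with a short sketch. The article records, the theme labels and the function `topic_indicators` are hypothetical, the tone values are assumed to be already rescaled to the \(-1\) to \(+1\) range, and the denominator of the popularity rate is taken to be all articles for the same country and period.

```python
def topic_indicators(articles, topic_themes):
    """Return (average tone, topic popularity rate) for one country and period.

    articles:     list of dicts with keys "themes" (set of str) and "tone" (float)
    topic_themes: set of theme labels that define the topic of interest
    """
    # Keep the articles that carry at least one theme of the topic.
    selected = [a for a in articles if a["themes"] & topic_themes]
    if not selected:
        return None, 0.0
    avg_tone = sum(a["tone"] for a in selected) / len(selected)
    # Share of the period's articles that refer to the topic.
    popularity = len(selected) / len(articles)
    return avg_tone, popularity


# Invented articles for one country and one period; theme labels are illustrative.
period_articles = [
    {"themes": {"UNEMPLOYMENT", "ECONOMY"}, "tone": -0.30},
    {"themes": {"SPORT"}, "tone": 0.20},
    {"themes": {"UNEMPLOYMENT"}, "tone": -0.10},
    {"themes": {"CULTURE"}, "tone": 0.50},
]
tone, popularity = topic_indicators(period_articles, {"UNEMPLOYMENT"})
```

In this toy period, two of the four articles match the topic, so the popularity rate is one half and the tone is the mean of the two matching scores.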

In our application, we first select a list of representative keywords for the topic of interest, along with the country to focus on and the period of extraction. The list of curated keywords is then extended programmatically with synonyms, which are computed using the Sense2Vec Python libraryFootnote 10. Using word embeddings [16] from the pre-trained GloVe model [19], we select only those GDELT articles whose topics are related to one of the selected themes of interest. Once the relevant news data have been collected, we calculate the Article Tone score and the Topic Popularity rate by averaging the measures obtained from GDELT for the selected articles over the period of extraction.
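A minimal sketch of the embedding-based selection step is given below. The text does not specify the similarity measure or the cut-off, so cosine similarity and the threshold of 0.8 are assumptions. The three-dimensional vectors are toy stand-ins for pre-trained GloVe embeddings, the synonym is hard-coded instead of being obtained from Sense2Vec, and `is_relevant` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 3-dimensional vectors standing in for pre-trained word embeddings.
toy_embeddings = {
    "unemployment": [0.9, 0.1, 0.0],
    "jobless": [0.8, 0.2, 0.1],
    "layoffs": [0.7, 0.3, 0.0],
    "football": [0.0, 0.1, 0.9],
}


def is_relevant(article_topics, keywords, embeddings, threshold=0.8):
    """True if any article topic is close enough to any keyword of interest."""
    return any(
        cosine(embeddings[t], embeddings[k]) >= threshold
        for t in article_topics if t in embeddings
        for k in keywords if k in embeddings
    )


keywords = ["unemployment", "jobless"]  # curated keyword plus one synonym
keep = is_relevant(["layoffs"], keywords, toy_embeddings)
drop = is_relevant(["football"], keywords, toy_embeddings)
```

Words missing from the embedding table are skipped rather than raising an error, which is one simple way to handle out-of-vocabulary terms.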

5 Data Visualization and Analytics

We construct alternative indicators from the described datasets on various social and economic topics, representing broad categories of variables such as “economy”, “industrial production”, “unemployment”, “inflation”, “capital market”, “cultural activities”, “housing market”, “international trade”, “monetary policy” or “loneliness”. We are building a number of services to provide access to the processed data along with intuitive and user-friendly visualizations. We rely on Business Intelligence (BI) tools and have constructed an interactive dashboard on the Microsoft Power BI infrastructure.Footnote 11 The dashboard allows users to choose which data to visualise by filtering by country, topic and time, and is available at https://knowledge4policy.ec.europa.eu/composite-indicators/socioeconomic-tracker_en.

We are also running a number of empirical exercises to analyse the relationships between the information extracted from our unconventional datasets and official releases of social and economic variables. We are particularly interested in nowcasting social and economic variables, that is, forecasting the value of a variable during period \(t\) when the official release of the value will occur only in period \(t^{*}\), with \(t^{*}>t\). For European countries, the typical delay in the release of official statistics ranges from 30 to 45 days. The goal of our studies is therefore to nowcast the value of the economic or social variable in real time, before the official release by the statistical agencies. We use standard forecasting models augmented with the alternative indicators as additional regressors and compare their performance with that of the same models without them. Timely and reliable forecasts of these signals play a relevant role in planning policies in support of the most vulnerable [6]. Given the delayed and infrequent publication of official figures by statistical agencies, reliable unconventional indicators become even more important in times of high uncertainty, as the recent COVID-19 pandemic has shown. Our early results, which we plan to report extensively in an extended paper, show that our unconventional variables are relevant predictors in various nowcasting applications.
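The comparison between a model with and without an alternative indicator can be sketched as follows. This is only an illustration under stated assumptions: the text does not name the forecasting models, so a first-order autoregression fitted by ordinary least squares stands in for the "standard" model, the data are synthetic and deterministic, and the functions `ols`, `rmse` and `compare` are hypothetical helpers. The indicator `x[t]` is assumed to be observable in period \(t\), before the official value `y[t]` is released.

```python
import math


def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * ac for a, ac in zip(A[r], A[c])]
                b[r] -= f * b[c]
    return [b[i] / A[i][i] for i in range(k)]


def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def compare(y, x, n_train):
    """Out-of-sample RMSE of an AR(1) without and with the indicator x."""
    rows_base = [[1.0, y[t - 1]] for t in range(1, len(y))]
    rows_aug = [[1.0, y[t - 1], x[t]] for t in range(1, len(y))]
    target = y[1:]
    out = []
    for rows in (rows_base, rows_aug):
        beta = ols(rows[:n_train], target[:n_train])  # fit on the first rows
        preds = [sum(bv * v for bv, v in zip(beta, r)) for r in rows[n_train:]]
        out.append(rmse([p - a for p, a in zip(preds, target[n_train:])]))
    return tuple(out)


# Synthetic series: the target depends on its own lag and on the indicator.
x = [math.sin(1.7 * t) for t in range(40)]
y = [1.0]
for t in range(1, 40):
    y.append(0.5 + 0.6 * y[-1] + 0.8 * x[t] + 0.05 * math.cos(2.3 * t))
rmse_base, rmse_aug = compare(y, x, n_train=30)
```

By construction the indicator carries information about the target here, so the augmented model attains the lower out-of-sample error; with real data, whether it does so is precisely the empirical question.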

6 Conclusions and Future Work

We present our work in progress on the development of alternative economic and social indicators from various unconventional data sets, including GDELT, Google Search, and newspaper articles. The ongoing project aims to provide intuitive and user-friendly access to the analysed data through an interactive BI dashboard, and to produce improved nowcasting and forecasting methods for analysing various socio-economic measures for countries in the EU. Once the results are mature, we will discuss our nowcasting applications in an extended version of this work, which we plan to submit to a scientific outlet.

In particular, we are working on a specific use case: nowcasting inflation in different EU countries. For this purpose, we intend to use advanced neural forecasting methods based on deep learningFootnote 12 to obtain improved performance over classical forecasting approaches. Preliminary results suggest that the information extracted from the alternative datasets considered has predictive power for inflation in several EU countries. However, a thorough statistical analysis of these results is needed before we can draw any robust conclusion on the subject.