On the Development of a European Tracker of Societal Issues and Economic Activities Using Alternative Data

Consoli, Sergio; Colagrossi, Marco; Panella, Francesco; Barbaglia, Luca

doi:10.1007/978-3-031-23633-4_3

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1753))

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

We provide an overview of the development of a tracker of economic activities and societal issues across EU member states that mines alternative data sources and can be used to complement official statistics. The alternative datasets considered include Google Searches, Dow Jones Data, News and Analytics (DNA), and the Global Dataset of Events, Language and Tone (GDELT). After outlining the methodology currently under development, we present some preliminary findings.

1 Introduction

A number of novel big data sources have the potential to be useful for socio-economic analyses [9]. These alternative sources of information include, for example, administrative data (e.g., tax and hospital records), commercial data sets (e.g., consumer panels, credit or debit card transactions), and textual data (e.g., social media, web searches, news data). In some cases, these data sets are structured and ready for analysis, while in other cases, for instance text, the data are unstructured and require some preliminary steps to extract and organize the relevant information [2]. These unconventional data sources have been particularly relevant during the COVID-19 pandemic [14, 24], when this information was used to integrate and augment the official statistics produced by national and international statistical agencies [4]. In general, the evolution of this field is contributing to the development of various decision-making instruments that help policymakers design policy interventions with the potential to foster economic growth and societal well-being. These trends are inspiring the research activities of the Competence Center on Composite Indicators and Scoreboards (COIN)^{Footnote 1} at the European Commission's Joint Research Centre (JRC)^{Footnote 2}. This contribution describes our ongoing research work, which aims to develop a tracker of economic activities and societal issues by obtaining policy-relevant insights from data sets that are considered unconventional in the social sciences, and to stimulate the adoption of cutting-edge modeling technologies in the EU institutions.

2 Google Search Data

Beginning with the work in [7], Google Search data have been used as a proxy for a variety of economic measures, especially in contexts in which official statistics are not easily available. The JRC has studied the use of Google Searches to monitor the interests of European citizens in three main fields related to the pandemic crisis: health, economy and social isolation.^{Footnote 3} The usefulness of web searches depends heavily on their link with the underlying phenomenon. As a result, researchers need to find the most relevant set of queries in each language and institutional environment. This task is especially difficult in a cross-country context, since locating the relevant queries is either time-consuming or even impossible (due to language barriers). To overcome this issue, the authors of [5, 6, 23] recently exploited Google Trends topics, which are language-independent aggregations of queries that belong to the same concept from a semantic perspective, thereby enabling cross-country studies. Through the Google Trends API^{Footnote 4}, it is possible to access Google Search data in the form of the Search Volume Index (SVI) of both queries and topics, normalized by query time and location. Each data point, filtered by time range (daily, weekly or monthly) and geography (country or ISO 3166-2 subdivision), is divided by the total number of searches to obtain a measure of relative popularity. The figures are based on a uniformly distributed random sample of Google Searches, updated once per day, with data available from 2004; as a result, there may be some differences between similar requests. When possible, Google also displays the top-25 searches and topics linked to any particular topic or query. Top queries and topics are those most frequently searched by users in the same session at any particular time and location.
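
The normalization described above can be illustrated with a minimal sketch. It is not the Google Trends implementation: the function name `search_volume_index` and the raw counts are hypothetical, and the final rescaling to a 0-100 range is an assumption based on how Google Trends publicly presents its index.

```python
def search_volume_index(topic_counts, total_counts):
    """Illustrative SVI: share of searches per period, rescaled so the peak is 100.

    topic_counts: searches for the topic in each period (hypothetical raw counts)
    total_counts: all searches in the same period and geography
    """
    # Relative popularity: topic searches divided by total searches.
    shares = [c / t if t else 0.0 for c, t in zip(topic_counts, total_counts)]
    peak = max(shares, default=0.0)
    if peak == 0.0:
        return [0 for _ in shares]
    # Assumed rescaling: the most popular period in the window gets 100.
    return [round(100 * s / peak) for s in shares]


# Three periods: the second has the highest share of searches.
svi = search_volume_index([50, 120, 30], [1000, 1200, 1000])
print(svi)
```

Because the index is relative to the peak within the requested window and geography, values from two separate requests are not directly comparable unless they share a common reference.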

3 DNA: Dow Jones Data, News and Analytics

We also consider newspaper articles as an alternative dataset. Several papers have tried to understand the predictive value of news for measuring financial and economic activities, such as GDP, stock returns, unemployment, or inflation [3, 15, 20,21,22]. In particular, many works have used the sentiment extracted from news as a useful addition to the tool-set of predictors that are commonly used to monitor and forecast the business cycle [1, 8, 11,12,13]. For this task, we rely on a commercial dataset of economic news obtained from the Dow Jones Data, News and Analytics (DNA) platform.^{Footnote 5} In particular, we use the full text of the articles published by Thomson Reuters News, which amount to several million news texts since 1988. The content covers a wide set of topics, ranging from financial matters to macro-economic announcements and political implications for national economies. We use this news data set to build a set of real-time economic sentiment indicators for the EU27 countries and the UK, focusing on a number of topics of interest [3, 10]. The sentiment indicators are: (i) fine-grained, i.e. they take real values bounded in the [-1, +1] interval; (ii) aspect-based, meaning that they are computed only with respect to the specific topic of interest [3, 10]. Sentiment indicators are computed for the different European countries by filtering on direct mentions of the country in the text of the articles. Along with this extracted sentiment signal, for each filtered topic and country we also report the volume time-series, that is, the number of sentences dealing with the specific topic-country pair under analysis, which represents a measure of the popularity of the topic in the selected country. For each time-series, daily averages of sentiment and volume scores are calculated. Lower-frequency aggregations, at monthly or quarterly frequency, are also allowed.
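
The aggregation step can be sketched as follows. This is an illustration only: the sentence-level records and the `aggregate` helper are hypothetical, and the monthly figure is computed here as the mean over all sentences in the month, which is one possible choice among several (averaging the daily means would be another).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sentence-level records: (date, country, topic, sentiment in [-1, +1]).
sentences = [
    ("2022-03-01", "IT", "inflation", -0.4),
    ("2022-03-01", "IT", "inflation", -0.2),
    ("2022-03-02", "IT", "inflation", 0.1),
    ("2022-04-01", "IT", "inflation", 0.5),
]


def aggregate(records, date_prefix_len):
    """Group by (period, country, topic); report mean sentiment and sentence count.

    date_prefix_len=10 keeps 'YYYY-MM-DD' (daily); 7 keeps 'YYYY-MM' (monthly).
    """
    groups = defaultdict(list)
    for date, country, topic, score in records:
        groups[(date[:date_prefix_len], country, topic)].append(score)
    return {
        key: {"sentiment": mean(scores), "volume": len(scores)}
        for key, scores in groups.items()
    }


daily = aggregate(sentences, 10)
monthly = aggregate(sentences, 7)
print(daily[("2022-03-01", "IT", "inflation")])
print(monthly[("2022-03", "IT", "inflation")])
```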

4 GDELT: Global Dataset of Events, Language and Tone

GDELT^{Footnote 6} is the Global Dataset of Events, Language and Tone that is maintained by Google [17, 18]. It is an open big data platform of news collected at a worldwide level, containing structured data mined from broadcast, print and web sources in more than 65 languages. It connects people, organizations, quotes, locations, themes, and emotions associated with events happening across the world. It describes societal behavior through the eyes of the media, making it an ideal data source for measuring social factors. The data set starts in February 2015 and is freely available to users via REST APIs.^{Footnote 7} GDELT processes over 88 million articles a year from more than 150,000 news outlets, updating the output every 15 minutes.^{Footnote 8} We use GDELT themes to select the news related to certain social or economic topics (e.g., “industrial production”, “unemployment”, “cultural activities”, etc.), restricting the selection to the news of the European country we are interested in. After this processing, we compute as output (i) the Article Tone, that is, a score between \(-1\) and \(+1\) expressing whether a certain message conveys a positive or negative sentiment with respect to a certain topic^{Footnote 9}; and (ii) the Topic Popularity rate, that is, the number of articles referring to the searched topic normalized by the total number of articles in the period.
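
The two output measures can be illustrated with a small sketch. The article records, the theme labels and the `topic_metrics` function are hypothetical placeholders chosen for illustration; they are not taken from the GDELT API or from the authors' code.

```python
from statistics import mean


def topic_metrics(articles, topic_themes):
    """Return (average tone, popularity rate) for one country and one period.

    articles: list of dicts with a 'themes' set and a 'tone' score in [-1, +1]
    topic_themes: set of theme labels that define the topic of interest
    """
    if not articles:
        return None, 0.0
    # Keep the articles tagged with at least one theme of the topic.
    selected = [a for a in articles if a["themes"] & topic_themes]
    popularity = len(selected) / len(articles)
    tone = mean(a["tone"] for a in selected) if selected else None
    return tone, popularity


# Hypothetical articles for one country in one period (labels are illustrative).
articles = [
    {"themes": {"UNEMPLOYMENT"}, "tone": -0.6},
    {"themes": {"ECON_INFLATION", "UNEMPLOYMENT"}, "tone": -0.2},
    {"themes": {"SPORTS"}, "tone": 0.5},
    {"themes": {"CULTURE"}, "tone": 0.3},
]
tone, popularity = topic_metrics(articles, {"UNEMPLOYMENT"})
print(tone, popularity)
```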

In our application, we first select a list of representative keywords for the topic of interest, along with the country to focus on and the period of extraction. The list of curated keywords is further extended programmatically by means of synonyms, which are computed using the Sense2Vec Python library^{Footnote 10}. By using the Word Embeddings [16] from the pre-trained GloVe model [19], we select only the articles from GDELT whose topics are related to one of the selected themes of interest. Once the relevant news data have been collected, we calculate the Article Tone score and the Topic Popularity rate by averaging the measures obtained from GDELT for the selected articles over the period of extraction.
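
One common way to decide whether two terms are related in an embedding space is cosine similarity with a cut-off. The sketch below illustrates that idea only: the three-dimensional vectors are toy values rather than GloVe embeddings, the threshold of 0.8 is arbitrary, and the source does not state which similarity rule is actually applied.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_terms(seed, vectors, threshold=0.8):
    """Terms whose vector is at least `threshold`-similar to the seed keyword."""
    return sorted(
        word
        for word, vec in vectors.items()
        if word != seed and cosine(vectors[seed], vec) >= threshold
    )


# Toy vectors standing in for pre-trained embeddings (values are made up).
toy_vectors = {
    "unemployment": [0.9, 0.1, 0.0],
    "jobless": [0.8, 0.2, 0.1],
    "layoffs": [0.7, 0.3, 0.0],
    "football": [0.0, 0.1, 0.9],
}
print(related_terms("unemployment", toy_vectors))
```

In practice the threshold trades recall for precision: a lower value admits more loosely related terms and therefore more articles.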

5 Data Visualization and Analytics

We construct alternative indicators using the described datasets on various social and economic topics, representing broad categories of variables, such as: “economy”, “industrial production”, “unemployment”, “inflation”, “capital market”, “cultural activities”, “housing market”, “international trade”, “monetary policy” or “loneliness”. We are building a number of services in order to provide access to the processed data along with intuitive and user-friendly visualizations. We rely on Business Intelligence (BI) and construct an interactive dashboard by means of the Microsoft Power BI infrastructure.^{Footnote 11} The dashboard allows users to choose which data to visualise by filtering the country, topic and time, and is available at https://knowledge4policy.ec.europa.eu/composite-indicators/socioeconomic-tracker_en.

We are also running a number of empirical exercises to analyse the relationships between the information extracted from our unconventional datasets and official releases of social and economic variables. We are particularly interested in nowcasting social and economic variables, that is, forecasting the value of a variable during period t when the official release of the value will occur only in period \(t^{*}\), with \(t^{*}>t\). For European countries, the typical delay in the release of official statistics ranges from 30 to 45 days. The goal of our studies is therefore to nowcast the value of the economic or social variable in real time, before the official release by the statistical agencies. We use standard forecasting models augmented by the alternative indicators as additional regressors and compare their performance with that of the models without them. Timely and reliable forecasts for these signals play a relevant role in planning policies in support of the most vulnerable [6]. Given the delay and infrequent publication of official figures from statistical agencies, the importance of reliable unconventional indicators is even more prominent in times of high uncertainty, as also emphasized by the recent COVID-19 pandemic. Our early results, which we plan to report extensively in an extended paper, show that our unconventional variables are relevant predictors in various nowcasting applications.
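
The comparison between a baseline model and one augmented with an alternative indicator can be sketched as below. Everything here is an assumption made for illustration: the data are synthetic, the baseline is a simple AR(1) regression fitted by ordinary least squares, and the helper names `ols` and `rmse` are hypothetical. The source does not specify which forecasting models are used.

```python
import random


def ols(X, y):
    """Least squares via the normal equations, solved by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def rmse(X, y, beta):
    """Root mean squared error of the linear predictions X @ beta."""
    errors = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    return (sum(e * e for e in errors) / len(errors)) ** 0.5


# Synthetic data: the target depends on its own lag and on a timely indicator.
rng = random.Random(0)
n = 120
indicator = [rng.gauss(0, 1) for _ in range(n)]
target = [0.0]
for t in range(1, n):
    target.append(0.5 * target[t - 1] + 2.0 * indicator[t] + rng.gauss(0, 0.1))

rows_base, rows_aug, y = [], [], []
for t in range(1, n):
    rows_base.append([1.0, target[t - 1]])                # intercept + lag
    rows_aug.append([1.0, target[t - 1], indicator[t]])   # + alternative indicator
    y.append(target[t])

split = 90  # fit on the first 90 observations, evaluate on the rest
beta_base = ols(rows_base[:split], y[:split])
beta_aug = ols(rows_aug[:split], y[:split])
rmse_base = rmse(rows_base[split:], y[split:], beta_base)
rmse_aug = rmse(rows_aug[split:], y[split:], beta_aug)
print(f"baseline RMSE: {rmse_base:.3f}, augmented RMSE: {rmse_aug:.3f}")
```

The indicator for period t enters the regression because it is assumed to be observable before the official figure for t is released; with real data, this publication timing needs to be checked for each series.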

6 Conclusions and Future Work

We present our work in progress on the development of alternative economic and social indicators from various unconventional data sets, including GDELT, Google Search, and newspaper articles. The ongoing project aims to provide intuitive and user-friendly access to the analysed data through an interactive BI dashboard, and to produce improved nowcasting and forecasting methods for various socio-economic measures for countries in the EU. When the work is mature, we will discuss the results of our nowcasting applications in an extended version of this paper, which we plan to submit to a scientific outlet.

In particular, we are working on a specific case with the goal of nowcasting inflation in different EU countries. For this purpose, we intend to use advanced neural forecasting methods based on deep learning^{Footnote 12} to obtain improved performance over classical forecasting approaches. The preliminary results obtained seem to show that the information extracted from the considered alternative datasets has predictive power for the inflation indicator in several EU countries. However, a thorough statistical analysis of these results needs to be performed before we can draw any robust conclusion on the subject.

Notes

1. European Commission’s Competence Center on Composite Indicators and Scoreboards (COIN): https://composite-indicators.jrc.ec.europa.eu/.
2. The Joint Research Centre (JRC) of the European Commission (EC): https://ec.europa.eu/info/departments/joint-research-centre_en.
3. See https://knowledge4policy.ec.europa.eu/projects-activities/tracking-eu-citizens%E2%80%99-concerns-using-google-search-data_en.
4. https://trends.google.com/trends/.
5. DNA platform: https://www.dowjones.com/dna/.
6. GDELT website: https://blog.gdeltproject.org/.
7. See https://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/.
8. See http://data.gdeltproject.org/gdeltv2/lastupdate.txt for the English data, and http://data.gdeltproject.org/gdeltv2/lastupdate-translation.txt for the translated data.
9. https://blog.gdeltproject.org/vader-sentiment-lexicon-now-available-in-gcam/.
10. Sense2Vec library: https://pypi.org/project/sense2vec/.
11. Microsoft Power BI: https://powerbi.microsoft.com/.
12. https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html.

References

1. Barbaglia, L., Consoli, S., Manzan, S.: Forecasting GDP in Europe with textual data. Available at SSRN, 3898680:1–38 (2021)
2. Barbaglia, L., Consoli, S., Manzan, S., Reforgiato Recupero, D., Saisana, M., Tiozzo Pezzoli, L.: Data science technologies in economics and finance: a gentle walk-in. In: Consoli, S., Reforgiato Recupero, D., Saisana, M. (eds.) Data Science for Economics and Finance, pp. 1–17. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-66891-4_1
3. Barbaglia, L., Consoli, S., Manzan, S.: Forecasting with economic news. J. Bus. Econ. Stat. 1–12 (2022). (in press). https://doi.org/10.1080/07350015.2022.2060988
4. Barbaglia, L., Frattarolo, L., Onorante, L., Pericoli, F., Ratto, M., Tiozzo Pezzoli, L.: Testing big data in a big crisis: nowcasting under COVID-19. Working paper available at SSRN, 4066479:1–38 (2022)
5. Brodeur, A., Clark, A.E., Flèche, S., Powdthavee, N.: COVID-19, lockdowns and well-being: evidence from Google Trends. J. Public Econ. 193, 104346 (2021)
6. Caperna, G., Colagrossi, M., Geraci, A., Mazzarella, G.: A babel of web-searches: Googling unemployment during the pandemic. Labour Econ. 74, 102097 (2022)
7. Choi, H., Varian, H.: Predicting the present with Google Trends. Econ. Record 88, 2–9 (2012)
8. Consoli, S., Tiozzo Pezzoli, L., Tosetti, E.: Emotions in macroeconomic news and their impact on the European bond market. J. Int. Money Finan. 118, 102472 (2021)
9. Consoli, S., Reforgiato Recupero, D., Saisana, M. (eds.): Data Science for Economics and Finance. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-66891-4
10. Consoli, S., Barbaglia, L., Manzan, S.: Fine-grained, aspect-based sentiment analysis on economic and financial lexicon. Knowl.-Based Syst. 247, 108781 (2022). https://doi.org/10.1016/j.knosys.2022.108781
11. Consoli, S., Tiozzo Pezzoli, L., Tosetti, E.: Neural forecasting of the Italian sovereign bond market with economic news. J. Royal Stat. Soc. Ser. A Stat. Soc. 1–28 (2022). (in press)
12. Dridi, A., Atzeni, M., Reforgiato Recupero, D.: FineNews: fine-grained semantic sentiment analysis on financial microblogs and news. Int. J. Mach. Learn. Cybern. 10(8), 2199–2207 (2018). https://doi.org/10.1007/s13042-018-0805-x
13. Gentzkow, M., Kelly, B., Taddy, M.: Text as data. J. Econ. Lit. 57(3), 535–74 (2019)
14. Goodell, J.W.: Covid-19 and finance: agendas for future research. Financ. Res. Lett. 35, 101512 (2020)
15. Hansen, S., McMahon, M.: Shocking language: understanding the macroeconomic effects of central bank communication. J. Int. Econ. 99, S114–S133 (2016)
16. Kusner, M., Sun, Y., Kolkin, N., Weinberger, K.: From word embeddings to document distances. In: 32nd International Conference on Machine Learning (ICML 2015), vol. 2, pp. 957–966, United States, ACM (2015)
17. Kwak, H., An, J.: A first look at global news coverage of disasters by using the GDELT dataset. In: Aiello, L.M., McFarland, D. (eds.) SocInfo 2014. LNCS, vol. 8851, pp. 300–308. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-13734-6_22
18. Leetaru, K., Schrodt, P.A.: GDELT: global data on events, location and tone. Technical report, KOF Working Papers, pp. 1979–2012 (2013)
19. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: EMNLP 2014–2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp. 1532–1543, United States, ACL (2014)
20. Shapiro, A.H., Sudhof, M., Wilson, D.: Measuring news sentiment. Federal Reserve Bank of San Francisco Working Paper (2018)
21. Tetlock, P.C.: Giving content to investor sentiment: the role of media in the stock market. J. Financ. 62(3), 1139–1168 (2007)
22. Thorsrud, L.A.: Words are the new numbers: a newsy coincident index of the business cycle. J. Bus. Econ. Stat. 38(2), 1–17 (2018)
23. Alberti, V.: Tracking EU Citizens' Interest in EC Priorities Using Online Search Data - The European Green Deal. Publications Office of the European Union, Luxembourg (Luxembourg) (2021)
24. Zhang, D., Hu, M., Ji, Q.: Financial markets under the global pandemic of COVID-19. Financ. Res. Lett. 36, 101528 (2020)

Author information

Authors and Affiliations

European Commission, Joint Research Centre (DG JRC), Ispra, VA, Italy
Sergio Consoli, Marco Colagrossi, Francesco Panella & Luca Barbaglia

Corresponding author

Correspondence to Sergio Consoli.

Editor information

Editors and Affiliations

University of Sydney, Sydney, Australia
Irena Koprinska
University of Bari Aldo Moro, Bari, Italy
Paolo Mignone
University of Pisa, Pisa, Italy
Riccardo Guidotti
Warsaw University of Technology, Warsaw, Poland
Szymon Jaroszewicz
Heidelberg University, Heidelberg, Germany
Holger Fröning
UniCredit, Rome, Italy
Francesco Gullo
University of Lisbon, Lisbon, Portugal
Pedro M. Ferreira
Roche, Basel, Switzerland
Damian Roqueiro
Barcelona Supercomputing Center, Barcelona, Spain
Gaia Ceddia
Halmstad University, Halmstad, Sweden
Slawomir Nowaczyk
University of Porto, Porto, Portugal
João Gama
University of Porto, Porto, Portugal
Rita Ribeiro
UPC BarcelonaTech, Barcelona, Spain
Ricard Gavaldà
University of Naples Federico II, Naples, Italy
Elio Masciari
University of North Carolina, Charlotte, USA
Zbigniew Ras
ICAR-CNR, Rende, Italy
Ettore Ritacco
University of Pisa, Pisa, Italy
Francesca Naretto
Aalen University of Applied Sciences, Aalen, Germany
Andreas Theissler
Warsaw University of Technology, Warsaw, Poland
Przemyslaw Biecek
KU Leuven, Leuven, Belgium
Wouter Verbeke
University of Duisburg-Essen, Essen, Germany
Gregor Schiele
Graz University of Technology, Graz, Austria
Franz Pernkopf
AMD, Dublin, Ireland
Michaela Blott
UniCredit, Rome, Italy
Ilaria Bordino
UniCredit, Milan, Italy
Ivan Luciano Danesi
National Agency for New Technologies, Rome, Italy
Giovanni Ponti
UniCredit, Rome, Italy
Lorenzo Severini
University of Bari Aldo Moro, Bari, Italy
Annalisa Appice
University of Bari Aldo Moro, Bari, Italy
Giuseppina Andresini
University of Lisbon, Lisbon, Portugal
Ibéria Medeiros
University of Lisbon, Lisbon, Portugal
Guilherme Graça
Northwestern University, Chicago, USA
Lee Cooper
Roche, Basel, Switzerland
Naghmeh Ghazaleh
University of Lausanne, Lausanne, Switzerland
Jonas Richiardi
Novartis, Basel, Switzerland
Diego Saldana
Novartis, Basel, Switzerland
Konstantinos Sechidis
Fondazione IRCCS Ca’ Granda Ospedale Maggiore Policlinico, Milan, Italy
Arif Canakoglu
Politecnico di Milano, Milan, Italy
Sara Pido
Politecnico di Milano, Milan, Italy
Pietro Pinoli
University of Waikato, Hamilton, New Zealand
Albert Bifet
Halmstad University, Halmstad, Sweden
Sepideh Pashami

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

Consoli, S., Colagrossi, M., Panella, F., Barbaglia, L. (2023). On the Development of a European Tracker of Societal Issues and Economic Activities Using Alternative Data. In: Koprinska, I., et al. Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2022. Communications in Computer and Information Science, vol 1753. Springer, Cham. https://doi.org/10.1007/978-3-031-23633-4_3

DOI: https://doi.org/10.1007/978-3-031-23633-4_3
Published: 31 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23632-7
Online ISBN: 978-3-031-23633-4
eBook Packages: Computer Science, Computer Science (R0)
