Abstract
We provide an overview of the development of a tracker of economic activities and societal issues across EU member states that mines alternative data sources and can be used to complement official statistics. The alternative datasets considered include Google Searches; Dow Jones Data, News and Analytics (DNA); and the Global Dataset of Events, Language and Tone (GDELT). After outlining the methodology currently under development, we report some preliminary findings.
1 Introduction
A number of novel big data sources have the potential to be useful for socio-economic analyses [9]. These alternative sources of information include, for example, administrative data (e.g., tax and hospital records), commercial data sets (e.g., consumer panels, credit or debit card transactions), and textual data (e.g., social media, web searches, news data). In some cases these data sets are structured and ready for analysis; in others, text for instance, the data are unstructured and require preliminary steps to extract and organize the relevant information [2]. These unconventional data sources have been particularly relevant during the COVID-19 pandemic [14, 24], when they were used to integrate and augment the official statistics produced by national and international statistical agencies [4]. More generally, the evolution of this field is contributing to the development of decision-making instruments that help policymakers design policy interventions with the potential to foster economic growth and societal well-being. These trends inspire the research activities of the Competence Center on Composite Indicators and Scoreboards (COIN) (Note 1) at the European Commission’s Joint Research Centre (JRC) (Note 2). This contribution describes our ongoing research, which aims to develop a tracker of economic activities and societal issues by obtaining policy-relevant insights from data sets considered unconventional in the social sciences, and to stimulate the adoption of cutting-edge modeling technologies in the EU institutions.
2 Google Search Data
Beginning with the work in [7], Google Search data have been used as a proxy for a variety of economic measures, especially in contexts in which official statistics are not readily available. The JRC has studied the use of Google Searches to monitor the interests of European citizens in three main fields related to the pandemic crisis: health, economy and social isolation (Note 3). The usefulness of web searches depends heavily on how closely they are linked to the underlying phenomenon. Researchers therefore need to identify the most relevant set of queries in each language and institutional environment. This task is especially difficult in a cross-country context, since locating the relevant queries is time-consuming or even impossible (due to language barriers). To overcome this issue, the authors of [5, 6, 23] recently exploited Google Trends topics, which are language-independent aggregations of queries that belong to the same concept from a semantic perspective, thus enabling cross-country studies. Through the Google Trends API (Note 4), Google Search data can be accessed as the Search Volume Index (SVI) of both queries and topics, normalized by the time and location of the query. Each data point, filtered by time range (daily, weekly or monthly) and geography (country or ISO 3166-2 subdivision), is divided by the total number of searches to obtain a measure of relative popularity. The figures are based on a uniformly distributed random sample of Google Searches, updated once per day, with data available from 2004; as a result, similar requests may return slightly different values. When possible, Google also displays the top 25 searches and topics linked to any particular topic or query. Top queries and topics are those most frequently searched by users in the same session at any particular time and location.
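The normalization described above can be illustrated with a minimal sketch. The function name and the input counts are hypothetical, since raw search counts are not exposed by Google; the rescaling of the relative shares to a 0–100 range follows our understanding of how Google Trends presents its index and should be read as an assumption rather than as the exact procedure used by Google.

```python
def search_volume_index(term_counts, total_counts):
    """Toy SVI: share of searches per period, rescaled so the peak equals 100.

    term_counts: searches for the query/topic in each period (hypothetical).
    total_counts: all searches in the same period and geography (hypothetical).
    """
    # Relative popularity: divide each data point by the total search volume.
    shares = [c / t if t else 0.0 for c, t in zip(term_counts, total_counts)]
    peak = max(shares, default=0.0)
    if peak == 0.0:
        return [0 for _ in shares]
    # Assumed presentation step: index the series to its maximum (0-100).
    return [round(100 * s / peak) for s in shares]


svi = search_volume_index([50, 120, 30], [1000, 1200, 1000])
print(svi)
```

Because the index is relative to the peak of the requested window, two requests covering different time ranges or geographies are not directly comparable without rescaling.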
3 DNA: Dow Jones Data, News and Analytics
We also consider newspaper articles as an alternative dataset. Several papers have investigated the predictive value of news for measuring financial and economic variables such as GDP, stock returns, unemployment, or inflation [3, 15, 20–22]. In particular, many works have used the sentiment extracted from news as a useful addition to the toolset of predictors commonly used to monitor and forecast the business cycle [1, 8, 11–13]. For this task, we rely on a commercial dataset of economic news obtained from the Dow Jones Data, News and Analytics (DNA) platform (Note 5). In particular, we use the articles published by Thomson Reuters News, which comprise several million full-text news items dating back to 1988. The content covers a wide set of topics, ranging from financial matters to macroeconomic announcements and political implications for national economies. We use this news data set to build a set of real-time economic sentiment indicators for the EU27 countries and the UK, focusing on a number of topics of interest [3, 10]. The sentiment indicators are: (i) fine-grained, i.e., they take real values in the [–1, +1] interval; (ii) aspect-based, meaning that they are computed only with respect to the specific topic of interest [3, 10]. Sentiment indicators are computed for the different European countries by retaining only the articles whose text directly mentions the country. Along with the extracted sentiment signal, for each filtered topic and country we also report the volume time series, that is, the number of sentences dealing with the specific topic-country pair under analysis, which measures the popularity of the topic in the selected country. For each time series, daily averages of sentiment and volume scores are calculated. Aggregation at lower frequencies, such as monthly or quarterly, is also supported.
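As an illustration of the aggregation step only, the following sketch averages sentence-level sentiment scores and counts sentences per topic-country pair. The record layout and the function name are hypothetical, and the sentiment extraction itself [3, 10] is not reproduced here. Monthly values are obtained by pooling all sentences in the month, which is one possible choice and not necessarily the one adopted in our pipeline.

```python
from collections import defaultdict
from statistics import mean


def aggregate_sentiment(records, freq="D"):
    """Average sentence-level sentiment and count sentences per period.

    records: iterable of (iso_date, country, topic, score) tuples, where
    score is a sentence-level sentiment in [-1, +1] (hypothetical layout).
    freq: "D" for daily buckets, "M" for monthly buckets.
    """
    buckets = defaultdict(list)
    for iso_date, country, topic, score in records:
        # "YYYY-MM-DD" for daily buckets, "YYYY-MM" for monthly ones.
        period = iso_date if freq == "D" else iso_date[:7]
        buckets[(period, country, topic)].append(score)
    return {
        key: {"sentiment": mean(scores), "volume": len(scores)}
        for key, scores in buckets.items()
    }


# Made-up sentence scores, for illustration only.
records = [
    ("2022-03-01", "IT", "inflation", 0.5),
    ("2022-03-01", "IT", "inflation", -0.25),
    ("2022-03-02", "IT", "inflation", 0.75),
    ("2022-03-02", "IT", "inflation", 0.0),
]
daily = aggregate_sentiment(records)
monthly = aggregate_sentiment(records, freq="M")
```

Here `volume` is the number of sentences falling in each bucket, so it can be read alongside the average sentiment as a measure of how much coverage the topic received.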
4 GDELT: Global Dataset of Events, Language and Tone
GDELT (Note 6) is the Global Dataset of Events, Language and Tone, maintained by Google [17, 18]. It is an open big data platform of news collected worldwide, containing structured data mined from broadcast, print and web sources in more than 65 languages. It connects people, organizations, quotes, locations, themes, and emotions associated with events happening across the world. It describes societal behavior through the eyes of the media, making it an ideal data source for measuring social factors. The data set starts in February 2015 and is freely available to users via REST APIs (Note 7). GDELT processes over 88 million articles a year from more than 150,000 news outlets, updating its output every 15 minutes (Note 8). We use GDELT themes to select news related to certain social or economic topics (e.g., “industrial production”, “unemployment”, “cultural activities”, etc.), restricted to the news of the European country we are interested in. After this processing, we compute as output (i) the Article Tone, that is, a score between \(-1\) and \(+1\) expressing whether a message conveys a positive or negative sentiment with respect to a certain topic (Note 9); and (ii) the Topic Popularity rate, that is, the number of articles referring to the searched topic, normalized by the total number of articles in the period.
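The two output measures can be sketched as follows. The article records, the theme labels and the function name are hypothetical and are not taken from the GDELT schema; the sketch only shows how an average tone and a popularity rate follow from a set of articles already tagged with themes and tone scores.

```python
def topic_metrics(articles, theme):
    """Return (average tone, popularity rate) for one theme in one period.

    articles: list of dicts with a "themes" set and a "tone" score in
    [-1, +1]; this layout is an assumption made for illustration.
    """
    matched = [a for a in articles if theme in a["themes"]]
    # Topic Popularity rate: matching articles over all articles in the period.
    popularity = len(matched) / len(articles) if articles else 0.0
    # Article Tone: mean tone over the matching articles (None if no match).
    tone = sum(a["tone"] for a in matched) / len(matched) if matched else None
    return tone, popularity


# Made-up articles for a single country and period.
articles = [
    {"themes": {"UNEMPLOYMENT"}, "tone": 0.5},
    {"themes": {"UNEMPLOYMENT", "INFLATION"}, "tone": -0.25},
    {"themes": {"CULTURE"}, "tone": 0.75},
    {"themes": {"INFLATION"}, "tone": -0.5},
]
tone, popularity = topic_metrics(articles, "UNEMPLOYMENT")
```

Normalizing by the total number of articles makes the popularity rate comparable across periods and countries with very different overall news volumes.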
In our application, we first select a list of representative keywords for the topic of interest, along with the country to focus on and the extraction period. The curated keyword list is then extended programmatically with synonyms, which are computed using the Sense2Vec Python library (Note 10). Using the word embeddings [16] from the pre-trained GloVe model [19], we select only those GDELT articles whose topics are related to one of the selected themes of interest. Once the relevant news data have been collected, we calculate the Article Tone score and the Topic Popularity rate by averaging the measures obtained from GDELT for the selected articles over the extraction period.
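A minimal sketch of embedding-based matching between keywords and themes is given below. It uses cosine similarity with a fixed threshold, which is one common way of comparing word vectors and is an assumption here, not a description of our exact selection rule. The three-dimensional vectors are toy stand-ins for pre-trained GloVe embeddings, and the theme labels, function names and the 0.8 threshold are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def related_themes(keyword_vecs, theme_vecs, threshold=0.8):
    """Themes whose vector is close to at least one keyword vector."""
    return sorted(
        theme
        for theme, theme_vec in theme_vecs.items()
        if any(cosine(theme_vec, kv) >= threshold for kv in keyword_vecs.values())
    )


# Toy vectors standing in for real embeddings (illustrative values only).
keyword_vecs = {
    "unemployment": [0.9, 0.1, 0.0],
    "jobless": [0.8, 0.2, 0.1],  # e.g., a synonym added during expansion
}
theme_vecs = {
    "UNEMPLOYMENT": [0.85, 0.15, 0.05],
    "SPORTS": [0.0, 0.1, 0.9],
}
selected = related_themes(keyword_vecs, theme_vecs)
```

Articles tagged with any of the selected themes would then be retained for the tone and popularity computations.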
5 Data Visualization and Analytics
We construct alternative indicators from the datasets described above on various social and economic topics, representing broad categories of variables such as “economy”, “industrial production”, “unemployment”, “inflation”, “capital market”, “cultural activities”, “housing market”, “international trade”, “monetary policy” or “loneliness”. We are building a number of services to provide access to the processed data along with intuitive and user-friendly visualizations. We rely on Business Intelligence (BI) tools and have constructed an interactive dashboard using the Microsoft Power BI infrastructure (Note 11). The dashboard allows users to choose which data to visualise by filtering by country, topic and time, and is available at https://knowledge4policy.ec.europa.eu/composite-indicators/socioeconomic-tracker_en.
We are also running a number of empirical exercises to analyse the relationships between the information extracted from our unconventional datasets and official releases of social and economic variables. We are particularly interested in nowcasting social and economic variables, that is, forecasting the value of a variable during period \(t\) when the official release of that value will occur only in period \(t^{*}\), with \(t^{*}>t\). For European countries, the typical delay in the release of official statistics ranges from 30 to 45 days. The goal of our studies is therefore to nowcast the value of the economic or social variable in real time, before the official release by the statistical agencies. We use standard forecasting models augmented with the alternative indicators as additional regressors and compare their performance with that of the same models without them. Timely and reliable forecasts of these signals play a relevant role in planning policies in support of the most vulnerable [6]. Given the delayed and infrequent publication of official figures by statistical agencies, reliable unconventional indicators become even more important in times of high uncertainty, as the recent COVID-19 pandemic has emphasized. Our early results, which we plan to report extensively in an extended paper, show that our unconventional variables are relevant predictors in various nowcasting applications.
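The comparison between a baseline model and its augmented counterpart can be sketched on synthetic data. Everything in the example is an assumption made for illustration: the baseline is taken to be a first-order autoregression, the alternative indicator is a simulated series that is observed in period \(t\), the evaluation uses an expanding estimation window, and the data-generating process is invented. It does not reproduce our empirical exercises or results.

```python
import random
from math import sqrt


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def expanding_rmse(X, y, start):
    """Out-of-sample RMSE, refitting on all observations before each period."""
    errors = []
    for t in range(start, len(y)):
        beta = ols(X[:t], y[:t])  # only data available before period t
        pred = sum(b * x for b, x in zip(beta, X[t]))
        errors.append(y[t] - pred)
    return sqrt(sum(e * e for e in errors) / len(errors))


def simulate(n=80, seed=0):
    """Synthetic target driven by its own lag and a timely indicator x."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [0.0]
    for t in range(1, n):
        y.append(0.5 * y[t - 1] + 0.8 * x[t] + rng.gauss(0, 0.1))
    return x, y


x, y = simulate()
target = y[1:]
# Baseline regressors: constant and lagged target.
baseline = [[1.0, y[t - 1]] for t in range(1, len(y))]
# Augmented regressors: baseline plus the indicator observed in period t.
augmented = [[1.0, y[t - 1], x[t]] for t in range(1, len(y))]
rmse_base = expanding_rmse(baseline, target, start=40)
rmse_aug = expanding_rmse(augmented, target, start=40)
print(f"relative RMSE (augmented / baseline): {rmse_aug / rmse_base:.2f}")
```

A relative RMSE below one indicates that the additional regressor improves on the baseline in this simulated setting; with real data, such a gain would need to be assessed with formal tests of predictive accuracy.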
6 Conclusions and Future Work
We present our work in progress on the development of alternative economic and social indicators from various unconventional data sets, including GDELT, Google Search, and newspaper articles. The ongoing project aims to provide intuitive and user-friendly access to the analysed data through an interactive BI dashboard, as well as to produce improved nowcasting and forecasting methods for analysing various socio-economic measures for countries in the EU. Once the nowcasting applications are mature, we will discuss their results in an extended version of this work, which we plan to submit to a scientific outlet.
We are focusing in particular on the specific case of nowcasting inflation in different EU countries. To this end, we intend to use advanced neural forecasting methods based on deep learning (Note 12) to improve on the performance of classical forecasting approaches. Our preliminary results suggest that the information extracted from the alternative datasets considered has predictive power for the inflation indicator in several EU countries. However, a thorough statistical analysis of these results is needed before we can draw any robust conclusion on the subject.
Notes
- 1.
European Commission’s Competence Center on Composite Indicators and Scoreboards (COIN): https://composite-indicators.jrc.ec.europa.eu/.
- 2.
The Joint Research Centre (JRC) of the European Commission (EC): https://ec.europa.eu/info/departments/joint-research-centre_en.
- 3.
- 4.
- 5.
DNA platform: https://www.dowjones.com/dna/.
- 6.
GDELT website: https://blog.gdeltproject.org/.
- 7.
- 8.
See http://data.gdeltproject.org/gdeltv2/lastupdate.txt for the English data and http://data.gdeltproject.org/gdeltv2/lastupdate-translation.txt for the translated data.
- 9.
- 10.
Sense2Vec library: https://pypi.org/project/sense2vec/.
- 11.
Microsoft Power BI: https://powerbi.microsoft.com/.
- 12.
References
Barbaglia, L., Consoli, S., Manzan, S.: Forecasting GDP in Europe with textual data. Available at SSRN, 3898680:1–38 (2021)
Barbaglia, L., Consoli, S., Manzan, S., Reforgiato Recupero, D., Saisana, M., Tiozzo Pezzoli, L.: Data science technologies in economics and finance: a gentle walk-in. In: Consoli, S., Reforgiato Recupero, D., Saisana, M. (eds.) Data Science for Economics and Finance, pp. 1–17. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-66891-4_1
Barbaglia, L., Consoli, S., Manzan, S.: Forecasting with economic news. J. Bus. Econ. Stat. 1–12 (2022). (in press). https://doi.org/10.1080/07350015.2022.2060988
Barbaglia, L., Frattarolo, L., Onorante, L., Pericoli, F., Ratto, M., Tiozzo Pezzoli, L.: Testing big data in a big crisis: nowcasting under COVID-19. working paper available at SSRN, 4066479:1–38 (2022)
Brodeur, A., Clark, A.E., Flèche, S., Powdthavee, N.: COVID-19, lockdowns and well-being: evidence from google trends. J. Public Econ. 193, 104346 (2021)
Caperna, G., Colagrossi, M., Geraci, A., Mazzarella, G.: A babel of web-searches: Googling unemployment during the pandemic. Labour Econ. 74, 102097 (2022)
Choi, H., Varian, H.: Predicting the present with google trends. Econ. Record 88, 2–9 (2012)
Consoli, S., Pezzoli, L., Tosetti, E.: Emotions in macroeconomic news and their impact on the European bond market. J. Int. Money Finan. 118, 102472 (2021)
Consoli, S., Reforgiato Recupero, D., Saisana, M. (eds.): Data Science for Economics and Finance. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-66891-4
Consoli, S., Barbaglia, S., Manzan, S.: Fine-grained, aspect-based sentiment analysis on economic and financial lexicon. Knowl.-Based Syst. 247, 108781 (2022). https://doi.org/10.1016/j.knosys.2022.108781
Consoli, L., Pezzoli, T., Tosetti, E.: Neural forecasting of the Italian sovereign bond market with economic news. J. Royal Stat. Soc. Ser. A Stat. Soc. 1–28 (2022). (in press)
Dridi, A., Atzeni, M., Reforgiato Recupero, D.: FineNews: fine-grained semantic sentiment analysis on financial microblogs and news. Int. J. Mach. Learn. Cybern. 10(8), 2199–2207 (2018). https://doi.org/10.1007/s13042-018-0805-x
Gentzkow, M., Kelly, B., Taddy, M.: Text as data. J. Econ. Lit. 57(3), 535–74 (2019)
Goodell, J.W.: Covid-19 and finance: agendas for future research. Financ. Res. Lett. 35, 101512 (2020)
Hansen, S., McMahon, M.: Shocking language: understanding the macroeconomic effects of central bank communication. J. Int. Econ. 99, S114–S133 (2016)
Kusner, M., Sun, Y., Kolkin, N., Weinberger, K.: From word embeddings to document distances. In: 32nd International Conference on Machine Learning (ICML 2015), vol. 2, pp. 957–966, United States, ACM (2015)
Kwak, H., An, J.: A first look at global news coverage of disasters by using the GDELT dataset. In: Aiello, L.M., McFarland, D. (eds.) SocInfo 2014. LNCS, vol. 8851, pp. 300–308. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-13734-6_22
Leetaru, K., Schrodt, P.A.: GDELT: global data on events, location and tone. Technical report, KOF Working Papers, pp. 1979–2012 (2013)
Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: EMNLP 2014–2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp. 1532–1543, United States, ACL (2014)
Shapiro, A.H., Sudhof, M., Wilson, D.: Measuring news sentiment. Federal Reserve Bank of San Francisco Working Paper (2018)
Tetlock, P.C.: Giving content to investor sentiment: the role of media in the stock market. J. Financ. 62(3), 1139–1168 (2007)
Thorsrud, L.A.: Words are the new numbers: a newsy coincident index of the business cycle. J. Bus. Econ. Stat. 38(2), 1–17 (2018)
Alberti, V.: Tracking EU Citizens’ Interest in EC Priorities Using Online Search Data - The European Green Deal. Publications Office of the European Union, Luxembourg (Luxembourg) (2021)
Zhang, D., Hu, M., Ji, Q.: Financial markets under the global pandemic of COVID-19. Financ. Res. Lett. 36, 101528 (2020)
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this paper
Cite this paper
Consoli, S., Colagrossi, M., Panella, F., Barbaglia, L. (2023). On the Development of a European Tracker of Societal Issues and Economic Activities Using Alternative Data. In: Koprinska, I., et al. Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2022. Communications in Computer and Information Science, vol 1753. Springer, Cham. https://doi.org/10.1007/978-3-031-23633-4_3
DOI: https://doi.org/10.1007/978-3-031-23633-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23632-7
Online ISBN: 978-3-031-23633-4
eBook Packages: Computer Science (R0)