1 Introduction

Computational social science (CSS) can be broadly defined as the area of the social sciences in which computing power is an essential tool for conducting the analysis. The field has a long tradition in economics that goes back to the 1970s, when economists started to use computers to solve economic models numerically. Since then, applications have grown rapidly, as documented by four Handbooks of Computational Economics published between 1996 and 2018 (see Amman et al., 1996; Hommes & LeBaron, 2018; Schmedders & Judd, 2013; Tesfatsion & Judd, 2006). Computational economics can be broadly divided into three main areas of activity: numerical methods to solve economic models, agent-based models, and computationally intensive techniques to analyse and model big datasets. The limited goal of this chapter is to provide an overview that focuses on the analysis and modelling of large datasets, while I refer to Fontana and Guerzoni (2023) in this Handbook for a review of agent-based models (ABM) and their use in economic problems. The availability of big data offers the possibility to investigate long-standing questions using more detailed information about economic behaviour. In addition, these datasets allow researchers to uncover new empirical facts that were not previously known due to lack of information.

What exactly is “big data” in the context of economic applications? It can be defined as datasets that require advanced computing hardware and/or software tools to conduct the analysis. One such tool is distributed computing, which shares the processing of a task across several machines instead of running it on a single machine, as economists typically do. Examples of large datasets used in economic analysis are administrative data (e.g. tax records for the whole population of a country), commercial datasets (e.g. consumer panels), and textual data (e.g. Twitter or news data), to mention just a few. In some cases, the datasets are structured and ready for analysis, while in other cases (e.g. text), the data are unstructured and require a preliminary step to extract and organize the relevant information. As discussed in Einav and Levin (2014), economists are still in the early stages of analysing big data and are learning from developments in other disciplines. In particular, there is renewed interest in machine learning (ML) algorithms after the early applications of the 1990s (Kuan & White, 1994). Varian (2014) discusses techniques that can be used to analyse large datasets.

How can big data contribute to a better understanding of the economy and support policy? In the highly aggregate context of macroeconomic analysis, big data offer the opportunity to bring to light the heterogeneity in consumers and firms that is typically neglected in official statistics. The high granularity of big datasets can be exploited to construct indicators that are better designed to explain certain phenomena, for example, along a geographic or demographic dimension. In addition, many economic models make assumptions about deep behavioural parameters that are difficult to estimate without detailed datasets. An example is the work of Chetty et al. (2014b), where individual information about the school performance of a child is matched to his or her path of future earnings derived from tax data of the Internal Revenue Service (IRS). In other situations, big data allow us to measure quantities that could not be measured until now. A field that is benefiting from these alternative sources of data is development economics. For instance, Storeygard (2016) uses night-light satellite data to estimate the income of sub-Saharan African cities.

Another important dimension in which big data can contribute to economic analysis is by offering information that is not only more granular but also more frequent in the time dimension. At times when economic conditions are rapidly changing, policy-makers need an accurate measure of the state of the economy to design the appropriate policy response. An example is provided by the early days of the Covid-19 pandemic in March 2020 when policy-makers felt the pressure to act in support of the economy despite the lack of official statistics to measure the extent of the slump, as discussed by Barbaglia et al. (2022). Many relevant economic indicators are observed infrequently, such as gross domestic product (GDP) at the quarterly frequency and the unemployment rate and the industrial production index at the monthly frequency. In addition, these variables are released with delays that range from a few days to several months. For these reasons, big data have the potential to produce indicators of business conditions that are more accurate and timely.

More generally, private companies are amassing significant amounts of data that could be used to complement official statistics and inform economic policy. As discussed by Bostic et al. (2016), the approach of governmental agencies to producing official statistics is based, to a large extent, on consumer and business surveys. This approach guarantees the accuracy and the representativeness of the sample, although it comes at the cost of being an expensive and time-consuming exercise. Hence, the availability of alternative datasets offers the possibility of extracting information that can complement the evidence obtained from the surveys (a deeper analysis of the use of digital trace data and unconventional data in official statistics can be found in Signorelli et al., 2023).

However, we are also faced with a new set of issues regarding data governance and ethics, as discussed by Taylor (2023) in this Handbook.

The chapter is organized as follows. I first review some of the recent work in economics and finance that leverages large datasets and emphasize the role of big data in allowing the researcher to conduct the analysis. I then draw some conclusions and discuss areas of potential development of the field.

2 Big Data in Economics

In this section, I discuss the main findings of recent applications of big data in economics and finance. I organize the discussion by data source with the aim of providing a more consistent review of the results. The goal of this section is not to be exhaustive, but rather to offer a concise overview of some of the main applications of big data to economics.

2.1 Administrative Data

Administrative data refer to data collected by governmental agencies as part of their mandate. As discussed by Card et al. (2010), the main advantages of administrative data, relative to surveys, are their large samples, the low attrition and non-response rates, and the small measurement error. In addition, administrative datasets are very detailed in terms of the information available regarding individuals. However, the researcher is confronted with significant challenges in conducting the analysis given the restricted access to the data. Typically, the researcher is required to provide the code to the government agency that actually conducts the analysis, which significantly slows down the development of the research project.

An influential paper using tax record data is Chetty et al. (2014). The goal of the paper is to investigate intergenerational mobility in the USA. They use a sample of 40 million children born between 1980 and 1982 and relate their income at age 30 to their parents’ income. This administrative dataset represents a unique setting to evaluate intergenerational mobility since it provides a large sample going back to 1980 and allows children and parents to be linked with very high accuracy.

Information from the Social Security Administration (SSA) is used in Kopczuk et al. (2010) to investigate income inequality and social mobility in the USA starting in 1937. They find that inequality decreased up to the early 1950s and has increased steadily since then. In terms of social mobility, they show that it has been relatively constant over time, including at the top end of the income distribution.

Big administrative datasets are also used to evaluate educational attainment and teaching effectiveness. Dobbie and Fryer (2011) use administrative data from the New York City Department of Education to evaluate the effect of charter school programmes on students’ achievement. The evidence suggests that charter schools have a significant positive effect on the academic performance of poor children across several metrics. One of the possible explanations for these improvements is that the schools employ high-quality teachers. The issue of measuring the quality of teachers and their impact on student performance is investigated by Chetty et al. (2014a) and Chetty et al. (2014b). They use a sample of one million students and match data from the school districts and tax records to track the evolution of earnings for the children in the sample. They find that measures of teachers’ value added (VA), such as students’ test scores, do not show a significant bias as proxies of teachers’ quality. In addition, by matching students to their subsequent tax records, Chetty et al. (2014b) find that elementary school teachers with higher VA have a positive effect on college attendance and average earnings, among other measures.

Another source of administrative data is the credit register used by Jiménez et al. (2014) to evaluate the effect of monetary policy on banks’ lending behaviour. The credit register records all loans and contracts between the public and the banking sector in a country. They show that a lower interest rate increases banks’ risk-taking behaviour, which leads to an increase in the supply of credit, in particular to riskier borrowers.

2.2 Financial Data

Financial transaction data represent a prominent source of big data in economic analysis. An early application is Gross and Souleles (2002), who use a random sample of 24 thousand credit card accounts to investigate the effect on debt of changes to credit limits. Their results show that individuals respond to an increase in credit limits by borrowing more, in particular those who started near the limit. A more recent application using credit card data is Gallagher and Hartley (2017), who use a random 5% sample of individuals with a credit history. They use Hurricane Katrina as a natural experiment and find that households that lived in the areas most affected by the flood experienced large reductions in debt, mostly due to the decline in home loan obligations. Horvath et al. (2021) use credit card data to evaluate the behaviour of consumers during the 2020 pandemic. They find that credit card spending and balances declined rapidly during March/April 2020, in particular in areas with the highest incidence of cases. The recovery in spending started in May 2020, with riskier borrowers leading the way relative to those with high credit scores. Dunn et al. (2020) use daily credit card data to assess the geographical and sectoral impact of the pandemic on consumer spending. They show that their measure of spending closely tracks the official monthly retail trade statistic, which demonstrates the benefit of using big data to monitor the economy in real time. A similar analysis is provided in Bodas et al. (2019) and Carvalho et al. (2020) for Spain.

Calvet et al. (2009) use administrative data on the asset holdings and demographic information of all taxpayers in Sweden. The aim of the paper is to evaluate the financial sophistication of households in avoiding investment mistakes, such as under-diversification, inertia in risk taking, and holding losing stocks while selling winning stocks. They find that households with higher wealth and education levels are more sophisticated and less prone to investment mistakes.

2.3 Labour Markets

Labour market statistics have historically been data-rich due to the direct involvement of government agencies in the administration of unemployment benefits. Recently, private companies have started collecting information about the labour market. A natural question is how representative these private datasets are of the overall labour market and the US economy. Horton and Tambe (2015) survey the various sources of alternative labour market data that have emerged in recent years and provide a detailed discussion of the advantages and disadvantages of using such data. Napierala and Kvetan (2023) in this Handbook provide a complementary analysis of the role that big data can play in the analysis of the evolution of job skills.

An example of the use of alternative labour market data for policy is provided by Cajner et al. (2019). They use payroll data from the private company ADP to construct employment measures similar to those constructed by the Bureau of Labor Statistics (BLS) using the Current Employment Statistics (CES). They find that the two measures of employment complement each other and jointly provide information about the dynamics of the labour market. This is a very important contribution since it shows that alternative data can provide information that is complementary to, and highly correlated with, official statistics. The additional advantage of these private data sources is that they are available at higher frequencies and allow the researcher to segment the sample geographically and by demographic characteristics. This benefit is discussed in Cajner et al. (2020), who show the real-time behaviour of the weekly employment measure during the Covid-19 pandemic relative to the monthly official statistic from CES. Similar results are also obtained by Gregory and Zhu (2014).

2.4 Textual Data

An alternative source of data that is gaining interest in economics and finance is textual data. In this case, the goal is to use text from newspapers, speeches, company reports, and Twitter, among others, to construct measures that help understand economic behaviour or predict economic variables. Gentzkow et al. (2019) provide a recent overview of the work done so far.

An important source of text data is newspaper articles, which might be considered a proxy for the information set available to the public when making an economic decision. An early paper is Tetlock (2007), who extracts sentiment from a column of The Wall Street Journal and finds that it is useful to predict daily returns of the aggregate market. Baker et al. (2016) aim at measuring economic and political uncertainty by counting the number of articles that contain a set of keywords associated with uncertainty. They show that their measure is highly correlated with other measures of uncertainty. Other recent applications analyse news to construct proxies for economic sentiment (see Barbaglia et al., forthcoming; Larsen & Thorsrud, 2019; Shapiro et al., 2020; Thorsrud, 2020). Monitoring the sentiment of consumers and businesses has a long tradition in economics, and it is typically based on surveys. The contribution of these papers is to show that sentiment based on newspaper articles behaves similarly to survey-based sentiment. These indicators are found to have forecasting power for several macroeconomic variables that is incremental relative to the typical macroeconomic predictors (Barbaglia et al., forthcoming). Larsen and Thorsrud (2019) investigate the relation between news and consumer expectations and find that the topics extracted from the news help to explain consumers’ decisions to update their inflation expectations.
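The keyword-counting idea behind newspaper-based uncertainty measures can be illustrated with a minimal sketch. It is not the procedure of Baker et al. (2016): the keyword list, the function names, and the toy articles below are all hypothetical, and published indices rely on carefully audited term sets and additional normalization steps. The sketch only shows the core step of computing, for each month, the share of articles that mention at least one uncertainty-related term.

```python
from collections import defaultdict

# Hypothetical keyword list, chosen only for illustration.
UNCERTAINTY_TERMS = ("uncertain", "unpredictable")


def mentions_uncertainty(text, terms=UNCERTAINTY_TERMS):
    """Return True if the article text contains any of the keywords.

    Matching is by lower-cased substring, so "uncertain" also
    matches "uncertainty".
    """
    lowered = text.lower()
    return any(term in lowered for term in terms)


def uncertainty_index(articles):
    """Share of articles per month that mention uncertainty.

    `articles` is an iterable of (month, text) pairs, e.g. ("2020-03", "...").
    Dividing by the monthly article count keeps the index comparable
    when the volume of published news changes over time.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for month, text in articles:
        totals[month] += 1
        if mentions_uncertainty(text):
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}


# Toy corpus, invented purely for illustration.
toy_articles = [
    ("2020-02", "Retail sales grow steadily"),
    ("2020-02", "Firms report uncertainty over trade rules"),
    ("2020-03", "Markets fall amid uncertainty about the outlook"),
    ("2020-03", "Households are uncertain about future income"),
]

monthly_index = uncertainty_index(toy_articles)
print(monthly_index)
```

Scaling the count by the total number of articles is the design choice that matters here: a raw count would rise mechanically whenever a newspaper publishes more articles.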

Another line of research has investigated the role of communication in the implementation of monetary policy. Hansen and McMahon (2016) use the text of verbal and written communication by the Federal Reserve to understand its role in predicting economic variables. They find that the forward guidance embedded in the central bank statements is more relevant than the communication of the state of the economy. Hansen et al. (2018) investigate the role of increasing transparency in central bank communication by analysing the internal deliberations of the policy-makers. They find that their communication patterns changed significantly after transparency was introduced.

The GDELT project is another source of textual data that has been used in several applications. Consoli et al. (2021) use sentiment analysis to understand the dynamics of sovereign yields in Europe. Acemoglu et al. (2018) use GDELT to identify events of political and social unrest in Egypt and to evaluate their effect on stock returns.

A data source that is gathering momentum in economic and financial analysis is Twitter. Baker et al. (2021) use Twitter messages to construct a Twitter Economic Uncertainty (TEU) indicator similar to the EPU indicator proposed by Baker et al. (2016), which is based on newspaper articles. Their results show that there is a very high correlation between TEU and EPU.

2.5 Mobile Phone Data

Mobile phone data represent an additional source of big data for economic analysis. This type of data is potentially very high dimensional since it tracks the location of a user over time. An economic application is Blumenstock et al. (2015), who use mobile phone data to measure the socio-economic status of the caller. This is particularly useful for developing countries, where official statistics are not very reliable or well developed. Milusheva (2020) uses mobile phone data to track the effect of the movement of people from high-disease areas to low-disease areas on the spread of malaria. A similar idea is developed in Iacus et al. (2020), who investigate the effect of containment measures on the spread of the Covid-19 virus. Their findings suggest that a measure of mobility constructed from mobile phone data is a highly accurate predictor of the initial spread of the virus in Italy and France.

2.6 Internet Data

The emergence of the internet has created the opportunity for researchers to collect online data to proxy for economic variables of interest (see Edelman, 2012, for a detailed discussion). An example is the emergence of eBay as a marketplace for the exchange of goods, which allowed economists to test market design mechanisms and to investigate the behaviour of bidders and sellers. An early paper is Bajari and Hortacsu (2003), who examine the empirical regularities of eBay auctions and estimate a model of bidding.

An area of intense recent work has been measuring social ties based on online platforms, such as Facebook. Bailey et al. (2018a) discuss the construction of the Social Connectedness Index (SCI), which measures the friendship connections between Facebook users living in different geographical areas of the USA and abroad. An application of the SCI to explain the housing market is provided in Bailey et al. (2018b). They find that social connections help to explain the surge in house prices, which they argue is the result of the similarity of experience and expectations about the housing market.

Cavallo and Rigobon (2016) use price data scraped from online stores to construct measures of inflation. These measures are found to track the official statistics well and have the advantage that they can be calculated at high frequencies. Goolsbee and Klenow (2018) use a large dataset of e-commerce transactions to calculate the inflation rate. They find that during the period 2014–2017, the inflation rate was 3% lower than the one based on the official Consumer Price Index (CPI).
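To make the idea of an inflation measure built from scraped prices concrete, the following minimal sketch computes a Jevons index, that is, the geometric mean of price relatives for products observed in two periods. This is a textbook elementary index used here as an assumption for illustration; it is not presented as the method of the papers cited above, and the function name, product names, and prices are hypothetical.

```python
import math


def jevons_index(base_prices, current_prices):
    """Geometric mean of price relatives over products seen in both periods.

    Both arguments map a product identifier to its price. Products that
    appear in only one period are ignored (a matched-model simplification).
    """
    matched = sorted(set(base_prices) & set(current_prices))
    if not matched:
        raise ValueError("no products are observed in both periods")
    log_relatives = [
        math.log(current_prices[p] / base_prices[p]) for p in matched
    ]
    return math.exp(sum(log_relatives) / len(log_relatives))


# Toy scraped prices, invented purely for illustration.
period_0 = {"milk": 1.00, "bread": 2.00, "coffee": 5.00}
period_1 = {"milk": 1.10, "bread": 2.00, "tea": 3.00}

price_index = jevons_index(period_0, period_1)
inflation_pct = 100 * (price_index - 1)
print(f"Index: {price_index:.4f}, inflation: {inflation_pct:.2f}%")
```

Because only matched products enter the calculation, the entry of new goods and the exit of old ones are ignored in this sketch; handling product turnover is one of the practical difficulties of working with online prices.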

Another big dataset that has recently gained interest among economists is Google Trends. It represents a measure of the intensity of queries in the Google search engine regarding a set of keywords in a certain geographic area. The big data feature of Google Trends is that the time series for the search terms is the outcome of the aggregation across millions of queries by Google users around the world. Google Trends can be interpreted as a sentiment measure since it captures the public interest in a specific topic at a certain point in time. An early contribution using Google Trends is Choi and Varian (2012), who find that including appropriately selected trends improves the accuracy of nowcasts for several economic variables. D’Amuri and Marcucci (2017) use job search-related queries to forecast the unemployment rate in the USA. Their results show that forecasts using Google Trends are more accurate than those of professional forecasters and perform particularly well around turning points, which are difficult to predict in real time. Castelnuovo and Tran (2017) use series from Google Trends to construct an indicator, which they call Google Trends Uncertainty (GTU), that aims at capturing Economic and Political Uncertainty (EPU) in the spirit of Baker et al. (2016).

2.7 Other Data

An interesting application of seismic data to economics is represented by Tiozzo Pezzoli and Tosetti (2021). They use seismic data to identify vibrations produced by human activity, such as air and road traffic and manufacturing activity among others. They find that the indicator they construct is strongly correlated with several official measures of economic activity.

Another source of alternative big data is satellite images, which are used in a variety of CSS applications. However, only recently have economists realized the potential of satellite image data for economic analysis. Donaldson and Storeygard (2016) and Gibson et al. (2020) provide overviews of the application of satellite data in economics and a primer on remote sensing.

Chen and Nordhaus (2011) use night-light satellite data to improve GDP measures for developing countries, which is particularly relevant when official statistics are missing. The paper shows that luminosity provides informational value that can help improve the accuracy of output measures. Galimberti (2020) performs a similar exercise with a focus on the forecasting ability of measures of economic activity based on the luminosity data. The results indicate that these measures are useful to improve the accuracy of simple forecasting models, although country-specific models deliver better forecast performance than the pooled model. In a similar context, Hu and Yao (2021) propose an econometric methodology to use luminosity to improve GDP measures. Henderson et al. (2011) provide a detailed discussion of applications of night lights to measure national income, in particular in the case of developing economies. Another application of night-light data is Storeygard (2016), who evaluates whether the distance of cities from a port influenced their growth in sub-Saharan African countries. The role of the satellite data in this case is to provide a measure of economic activity at the city level that is not otherwise available from official statistics.

3 Conclusion

The discussion in this chapter demonstrates how big data can be valuable to answer long-standing questions and to test the validity of economic assumptions. An illustration is the work with administrative data discussed earlier, which shows the great potential of giving economic researchers access to these data but also highlights the severe limitations in scaling up their availability to a wider audience of users. Another challenge is that many of these alternative datasets are collected by private companies that might have little incentive to share the data with researchers. However, big data have a significant public role to play, which calls for a framework that facilitates sharing of the information. An example of the public relevance of big data is the production of real-time indicators of business conditions. In this respect, the collaboration between the Federal Reserve and the payroll processor ADP (Cajner et al., 2019) indicates how a private big dataset can complement the existing information provided by statistical agencies to support economic policy in real time. This collaboration is likely to set the path for more extensive partnerships between the private sector and statistical agencies. As argued in Bostic et al. (2016), in the current model the production of economic data is the domain of governmental agencies that fund and run the collection of data, typically in the form of consumer and business surveys. This model is likely to evolve in the future as companies collect increasing amounts of economic data that are valuable for, and most likely a cheaper input to, the production of official statistics.