1 Overview

Financial technology has often been touted as revolutionary for financial services. The promise of financial technology can be ascribed to a handful of key ideas: cloud computing, smart contracts on the blockchain, machine learning/AI, and finally—big and alternative data. This chapter focuses on the last concept, big and alternative data, and unpacks the meaning behind this term as well as its applications.

Before defining alternative data, it is useful to understand what non-alternative datasets are. Our definition of a conventional, non-alternative dataset is one created for the purpose of formal reporting, either by mandate or on a voluntary basis, meant to measure an economic concept useful to investors. Conventional data are sources of information widely used and understood by practitioners, policymakers, and researchers. While conventional data may capture a specific economic concept well, its measurement may be improved, as the data might suffer from a long reporting lag, a lack of granularity, or imprecision.

Oftentimes, big data and alternative data are used synonymously, although the terms merit distinction. In particular, big data are simply data that are large along one of the following four dimensions, typically called the four V’s: volume, velocity, variety, and veracity. Meanwhile, the definition of “alternative data” pertains to the nature of the data: alternative data are data that are not typically assembled for the purpose of financial analysis. In particular, alternative data have three key properties. First, the alternative data should provide information about a fundamental and value-relevant economic construct. Second, the alternative data should be reliable and not easily manipulated by interested parties. Examples of alternative data include geolocation foot traffic data, credit card transactions, Web site visits, and shipping container receipts. Finally, alternative data should have novelty or rarity. When data are adopted by the mainstream, the edge provided by such data may diminish. And so, what is alternative today may not be alternative in a few years. For example, with the introduction of the iPhone in 2007, stock analysts covering Apple used various creative means to estimate the number of iPhones being sold using packaging data, a type of alternative data. Soon after, Apple itself started reporting the number of iPhones being manufactured and shipped, effectively reducing the value of the alternative data from packaging once the company reported the figure directly. Similarly, in the 2000s, it was probably relatively novel for quantitative traders to analyze the text of company reports or news. Now, such methodologies are par for the course. More recently, in 2018, Apple stopped reporting iPhone unit sales in its financial statements, so analysts have turned to alternative data again.

This chapter will discuss the sources of alternative data, different types of alternative data that can be used in various contexts, the market for alternative data, as well as the economic framework behind the use of alternative data in financial markets. Finally, we also share several tips when working with financial data, from the data acquisition process to exploratory data analysis to deployment for live use. Little statistical background is required to understand this chapter and the discussion is primarily at a conceptual level, but technical details are provided. The key conceptual idea is that alternative data solves one of two problems faced by a user of data: forecasting the future or measuring something happening today.

Finally, in the remaining sections of this chapter, we discuss the economic framework for using alternative data in the context of specific domains. While all financial sub-industries are likely to use alternative data, we focus on applications that may feature alternative data more prominently: quantitative trading, credit scoring, and macroeconomic forecasting. We then conclude with brisk coverage of how alternative data are used in other areas of finance, such as real estate and insurance, and how such data were employed during COVID-19 to measure various aspects of economic activity.

2 Where Does Alternative Data Come from?

2.1 Sources of Alternative Data

There are three sources of alternative data. First, even publicly available information can serve as alternative data. Examples include company regulatory filings or other public documents, the news, or social media. One drawback of these data sources is that all investors have access to the same information, so either the information advantage garnered from these sources is low, or it takes significant sophistication to take advantage of the information when few other industry players can. An example of this is patent filings, which are difficult to link to business databases because the United States Patent and Trademark Office tracks organization names rather than standard identifiers, forcing a researcher to trawl through millions of records while accounting for transfers of patent ownership or changes in company ownership (for example, when Microsoft acquires LinkedIn, any IP of LinkedIn now belongs to Microsoft). Another source of information that is difficult to process is shipment data from customs and border control agencies, which involve millions of companies and hundreds of millions of shipment records. Other creative examples are Web data providers such as BuiltWith, which trawls Web sites on a continuous basis, collecting information about installations of company applications on those sites.

Whereas public data provide a level playing field for all players, proprietary data are becoming increasingly common. One type of proprietary data is collected by cooperatives. Dun and Bradstreet and Cortera operate data-sharing cooperatives, whereby suppliers report their customers’ payments—how much do customers order and how timely are their payments on previous orders? Suppliers operate on a “give-to-get” basis. Moody’s operates a cooperative for data sharing among banks to pool loan outcome data. Although this model can acquire and aggregate data quickly and build positive network effects, the data reported to the aggregator may also be subject to firm-specific bias, distortions, and measurement error.

Finally, data generated by business processes, typically known as exhaust data, are produced in an automated fashion as the by-product of corporate or government recordkeeping for other primary purposes. These include supermarket scanner data at the stock keeping unit (SKU) level and banking records like credit card transactions or bank transfers. Relatedly, sensors on various devices such as satellites or mobile phones also collect data. Mobile phone data have provided features such as the movement patterns of individuals over time, which, when aggregated, provide information such as the frequency of customer visits to a retail shop. Recently, these mobile phone data have been very important in characterizing movement and social distancing patterns during the COVID-19 pandemic.

Sometimes, companies with exhaust data sell their data. These data are classified as alternative because few players in the financial industry have begun applying them, yet they hold great economic value. Other times, companies have exhaust data that are valuable but do not realize it. For example, publishers operating financial Web sites may see their content as valuable, but what may be equally interesting is identifying the viewership of that content. Knowing that large, sophisticated investors are reading about stock A but not stock B could be valuable information.

2.2 Types of Alternative Data

Here, we list a number of different data categories, based on classifications commonly used by alternative data providers as well as in academic research. These categories are generally but not entirely mutually exclusive.

  1. Advertising—Data on how companies are spending advertising dollars, and across which venues. Examples include Kantar Adspender.

  2. B2B Commerce—This refers to commercial transactions in the supply chain between businesses, captured by sources such as Dun and Bradstreet, or by other supply chain data such as truck shipments or bills of lading. At the aggregate level, this can also be referred to as trade data. At the firm level, this may also include indicators such as buyer intent sourced from digital research activity.

  3. Consumer Transactions—Data on consumer transactions sourced from e-commerce Web sites, credit card transaction records, or email receipts.

  4. Employment—Job postings data such as LinkUp or Burning Glass Technologies, data from crowdsourced employee surveys such as Glassdoor, data from recruitment data aggregators such as Entelo, and possibly payroll data aggregators.

  5. ESG—Data on the corporate social responsibility performance of the firm, sourced from the efforts of ESG-specific data providers such as MSCI KLD ratings, Refinitiv Asset4, RepRisk, Trucost, and others.

  6. Event Detection—News sites and social media sites can be scanned by algorithms meant to detect anomalous events tied to a specific company. This can be useful for high-frequency trading, medium-frequency trading, credit scoring, or general business intelligence.

  7. Expert Views—Premised on the idea that intelligent, informed pundits provide value-relevant information on the Web, there are a few ways to aggregate their opinions. Primarily, social media platforms such as Twitter, Estimize, Seeking Alpha, and StockTwits give pundits a platform with large reach. However, not all contributors necessarily have useful information.

  8. Geolocation—Using readings provided by mobile phones and other sensors, this type of data captures the location of an agent. Sometimes this agent is an individual, sometimes a truck or shipment. What is of interest is not the agent but the locations visited. For example, the number of people entering Walmart stores may be correlated with Walmart’s revenues.

  9. Mobile App Usage—How often are users in an app? How many have downloaded the app? Such estimates could be proportional to revenues for companies that earn a significant share of their revenues from in-app purchases or advertisements, such as Zynga.

  10. Reviews and Ratings—Consumer ratings provide information on how a product is perceived. Positive ratings can also predict future sales.

  11. Satellite and Weather—Various sensors provide measurements of different kinds. Night-time light readings capture the luminosity of the earth’s surface (a proxy for economic activity), infrared scans can be used to estimate pollution, and photos can be used to measure whether a store’s parking lot is full or whether oil storage tanks are empty or full.

  12. Sentiment—What is the current mood of investors in general or surrounding a specific stock? This type of data can be collated from social media Web sites, where investors openly discuss stocks.

  13. Store Locations—Companies such as SafeGraph provide comprehensive data on store locations, which can be useful to investors, corporate competitors, and other constituents. These data are particularly useful when combined with geolocation data.

  14. Internet of Things—The Internet of Things refers to measurements from Internet-connected devices, including information on geolocation, usage statistics, and search histories.

  15. Online Search and Digital Web Traffic—Platforms such as Google provide indicators such as Google Trends, which can be used to make economic inferences from trends in search interest. Web traffic data provided by publishers themselves, or by companies that aggregate from publishers, can also be used. For example, Bloomberg’s NEWS_HEAT index measures interest in specific securities among investors. Many additional data providers exist in this space.

  16. Pricing—Consumer or B2B transactions provide estimates of the revenues accruing to a firm, but profit margins also depend on pricing and the marginal cost of products. Therefore, tracking pricing could potentially be useful. Pricing can be inferred from transaction records (for example, email receipts, which list items and prices), or by scraping company Web sites and pricing schedules.

  17. Public Sector/Government Filings—Government filings are often publicly accessible and provide useful information such as company registrations, property records, health benefits paid to employees (which can be drivers of cost), balance sheet and financial information, and health and safety records. In addition, for publicly listed firms, corporate disclosures oftentimes cover key material events such as announcements of future business dealings like corporate acquisitions or buybacks. While this approach has long been used by investors, the data are considered alternative in two respects: (1) there remain many corporate filings in many countries that have not yet been harvested, and (2) the cost of implementation can be fairly high, and thus, adoption is not complete.

What datasets are most popular? As of 2018, Eagle Alpha produced the following infographic based on their own customers’ interest. Interestingly, categories related to consumer retail (reviews and ratings, consumer transactions, geolocation) are most popular among investors, while B2B and satellite data are least popular. We do not believe these data lack broader application; rather, owing to the scarcity of the data, academic research is limited. Thus, prospective buyers of these data have little guidance on whether such datasets pass the cost-benefit test. As we allude to throughout this chapter, we believe these data have substantial value and that these characterizations will change.

figure a

Source Eagle Alpha

2.3 The Market for Alternative Data

One large friction with alternative data is that the existence of so many datasets makes search costs high. It is difficult to know what is out there, and even where to start looking. While there is no doubt those operating in the early days of alternative data had to pay high search costs to discover alternative data, the world looks much different today. Veldkamp [64] argues that when information is costly to produce, information markets naturally emerge as consolidators of information. One need look no further than the existence of the Wall Street Journal or Bloomberg to see evidence of basic information markets. Alternative data are arguably even costlier to produce, and so intermediaries should naturally emerge to facilitate this market.

There is an emerging ecosystem of service providers in the alternative data space that helps answer the following questions:

  • What datasets are out there?

  • How do I use these datasets?

  • How can I operationalize these datasets, and what service providers can assist my adoption of alternative data?

One example is Eagle Alpha. Founded in 2012, Eagle Alpha was an early entrant in the alternative data space and is now one of the largest players in the industry. Eagle Alpha has numerous product offerings, but generally operates as a broker between buy-side clients such as hedge funds or private equity firms and data providers. As of June 1, 2020, they maintain a database of over 1000 data owners seeking to monetize their data on the platform. They also provide consulting services to help with dataset identification and data management. Their list of providers is not public, but they offer the community freemium access to conferences, whitepapers, webinars, and other content.

Other platforms have emerged to consolidate and organize the space of data providers. Alternativedata.org is an open offering that collates and lists different datasets; over 400 data providers list their data there for purchase or acquisition. Adaptive Management is a closed offering that focuses on identifying datasets but also provides extract-transform-load services to help investors ingest data into their databases. There are likely many more not named here. Traditional data providers have also entered the space. Examples include FactSet’s “Open Marketplace,” a platform which at the time of this writing includes over 100 alternative datasets in addition to standardized datasets like firm fundamentals and stock prices. CapitalIQ and Refinitiv (formerly Thomson Reuters) also curate a select set of alternative data offerings. As these platforms also provide traditional data, a key benefit they offer is integration with the existing workflows of many large institutional investors.

The above solutions only address how to find and implement alternative data. However, what data are useful? Many data brokers do not assess all possibilities with alternative data; resources are limited, and the use of alternative data may differ for each user. However, data providers can reduce R&D costs through partnerships with academic researchers or third-party agencies such as ExtractAlpha, which specialize in the research and evaluation of data. Relative to data providers or data brokers, though, these research resources are relatively rare, with most players focusing on aggregation and intermediation.

2.4 Characteristics of an Appealing Alternative Dataset

Beyond financial cost or legal considerations, the ideal dataset has the following key features:

  1. Entity resolution for target customers—In our experience, this is the costliest aspect of working with an alternative dataset in terms of time. Whether the goal is to trade stocks, score the credit risk of individuals, or monitor households, the object of interest should be mapped to a standard “identifier” with which the user can easily link records.

    For users at most investment firms, the application is quantitative trading. While a company can be identified by a number of symbols, in the case of stock trading the object of interest would be the International Securities Identification Number (ISIN), the Stock Exchange Daily Official List (SEDOL) code, or a platform-specific identifier such as the Bloomberg ticker. We advise against using exchange tickers because they can change over time and the same ticker may represent different securities across exchanges. Multiple identifiers in the same database are useful, particularly for linking to other types of databases. For example, in the USA, understanding the government contracts a firm receives requires tracking its DUNS number (the Dun and Bradstreet unique nine-digit identifier for businesses), and understanding its benefits packages requires its Employer Identification Number (EIN). In addition, much data today are tracked digitally, so having the related Web sites for the company can be useful. The more data the better. Other useful fields include address and phone number for cross-checking across databases.

    As it relates to business data, another important property is “point-in-time” identification of ownership structure. Company ownership changes over time, as with LinkedIn first as a standalone company and then as a subsidiary of Microsoft. Thus, not only would LinkedIn now have to be binned into Microsoft’s account, so too would the subsidiaries of LinkedIn. If we receive data on a LinkedIn subsidiary without considering the implications for its “ultimate owner,” this can be problematic for analysis. Ideally, LinkedIn would be mapped as a standalone company prior to its acquisition and linked to Microsoft only after the acquisition.

  2. Thoughtfully featurized/flexibly featurizable—The data provider should offer features (variables in the dataset) that a user would immediately find useful, but also retain flexibility for users to consider different measurements or permutations. Consider two popular providers of job posting data. On one end, LinkUp gathers job postings of major employers in the USA. It provides useful features, such as the duration of the job posting, the location of the posting, and the occupation code, but also provides the raw job descriptions for full flexibility. On the other end, Burning Glass Technologies goes further in identifying credentials or job skills required in a posting, but does not provide the posting itself for data mining. It is more feature-rich but less flexible. The user must trade off a more feature-rich data source against one with fewer features but more flexibility. Whether one is better than the other is application-specific.

  3. Meaningful economic content—Clearly, there is no point in paying a non-trivial cost to measure something that does not capture a powerful economic construct with some advantage over existing data. Congruent with the four V’s of big data, it is important to understand whether the advantage of the data is its granularity of subject, its speed of reporting, or its ability to measure something more precisely than previously possible. For example, in the case of data aggregators such as NovaCredit, Dun and Bradstreet, and Moody’s, the value proposition of the data provider is to aggregate individual firm-level contributions into data that would otherwise not be available to any single firm.

    Another approach is to measure a concept less well but in a timelier fashion. For example, one object of interest to investors is a company’s earnings, which are announced on a quarterly basis. While we cannot know a company’s profits better than the company itself, we can forecast them early using many different signals. For example, if credit card statements show American consumers spending more at Walmart, or if mobile phone data suggest Walmart visits have increased, we may forecast that Walmart’s revenue will increase.

    In some cases, there are constructs that were not measurable before, such as investor attention. Bloomberg’s NEWS_HEAT index provides such a measure. Job postings provide a firm-level measure of the demand for workers; previously, vacancy data were only available at the industry level.

  4. Novelty and differentiation—This can be considered the “wow” factor, which depletes over time with mainstream adoption. In the 2000s, it was probably relatively novel for quantitative traders to analyze the text of company reports or news. Now, such methodologies are par for the course.

3 Economic Framework

Alternative data does not mean alternative economics. Alternative data simply provide another methodology through which to capture variation in the same economic variables that may have already been measured. Such data may be beneficial for at least two reasons: (i) they present a more precise way of measuring existing economic concepts, or (ii) they present a timelier measure compared to conventional data sources. That is, whether the outcome variable of interest is a company’s stock returns, a borrower’s likelihood to default, or some other economic or financial variable, the point of alternative data is to take a foundational economic building block and measure it better: either along the dimension of precision or along the dimension of time.

Consequently, these two benefits of alternative data serve two distinct purposes. The first, viewed through the lens of increased precision of existing economic concepts, is forecasting. The second, viewed through the lens of more timely measurement of existing economic concepts, is nowcasting. The figure below shows these two methodologies across an example with two time periods.

Nowcasting originated in the field of meteorology for predicting current weather patterns in areas where there may not be many sensors. It has been applied to financial and economic settings because many key variables are reported with a lag. For example, for publicly listed companies in the USA, annual 10-K financial statement filings may be due as long as 90 days after the end of the fiscal period. Nowcasting attempts to forecast the current values, which are reported with a lag, based on more timely data from the same period.

Forecasting involves the prediction of future values based on values in the current time period. This specific framework for forecasting already incorporates the potential lags in the economic or financial variable reporting.

Although both applications are set up as statistical problems, they fundamentally represent different uses of the data. Forecasting pertains to whether the inclusion of alternative data improves predictions of future economic concepts. For example, changes in prices of items at the stock-keeping unit-level acquired from store scanners may be used to predict future inflation. Nowcasting pertains to whether the alternative data predicts the current ongoing economic concept which is typically reported with some delay. For example, the amount of credit card transactions in a given month may be used to predict a firm’s current revenues.

Through the traditional lens of data science, this framework for nowcasting versus forecasting also maps into the concepts of statistical inference and predictive analysis. On the one hand, nowcasting and forecasting can both improve statistical inference, as nowcasted and forecasted variables provide additional variables for empirical tests of whether statistical relations are strong and robust. In particular, they may be used to address concerns related to reverse or simultaneous causality. On the other hand, nowcasting and forecasting can both improve predictive analyses, as nowcasted values can be used as additional explanatory variables, or the alternative data may directly be used to predict future values in a forecasting framework.

Those interested in a cursory statistical primer may read ahead. Otherwise, the remainder of this chapter discusses the conceptual basis for statistical modeling. It should be largely intelligible to those without formal statistical training, and the key distinction to remember is simply that alternative data are a drop-in for an economic construct implicit within traditional quantitative models.

3.1 Discretionary Versus Quantitative Trading

The application of data is generally thought to be the province of “quants,” technically skilled individuals who can program computer code to execute strategies in a systematic fashion. Such individuals begin with hypothesis testing, gathering data to build a simple model that tests whether the data reliably correlate with a specified outcome. Once the model is tested for statistical significance, statistically robust relationships enter more complex testing. Those that survive vetting are employed in “production” use and implemented in practice.

While quants will no doubt be the key consumers of any explosion in alternative data, less technically skilled individuals will increasingly find alternative data useful. There is a surge in efforts geared toward building tools allowing non-technical users to access the insights from data, by abstracting away data processing through a user interface. For example, companies such as Thinknum take publicly scrapable data and provide a Web site portal where users can identify specific companies and quickly benchmark unique data points about a company such as its Glassdoor ratings or job postings against industry peers, or against its own historical trends.

3.2 Forecasting

Consider a variable of interest \( Y \), capturing a specific economic concept. The goal is to build a model using the histories of \( Y \) and of variables \( X_{1} ,X_{2} , \ldots ,X_{p} \) that can predict the future value \( Y_{t + k} \) for \( k \ge 1 \), where \( t \) captures a time index,

$$ Y_{t + k} = f\left( {\left\{ {Y_{t - \tau } } \right\}_{\tau = 1}^{t - 1} ,\left\{ {X_{1,t - \tau } } \right\}_{\tau = 1}^{t - 1} ,\left\{ {X_{2,t - \tau } } \right\}_{\tau = 1}^{t - 1} , \ldots \left\{ {X_{p,t - \tau } } \right\}_{\tau = 1}^{t - 1} } \right). $$

In words, the equation above simply takes the stance that the outcome variable at some future date is a function of some, any, or all historical values of a set of explanatory variables, which may or may not include the outcome variable itself. In other words, forecasting is a very general problem that simply uses historical values to predict future values.

In practice, we start with the simplest possible model, which considers just one lag, using a linear regression specification of the form:

$$ Y_{t + 1} = \alpha + \gamma Y_{t} + \beta_{1} X_{1,t} + \beta_{2} X_{2,t} \ldots + \beta_{p} X_{p,t} + \varepsilon , $$

where \( \varepsilon \) is some noise term with mean zero. Ordinary least squares (OLS) regressions of the form above are commonly used as a starting point to understand the relation between different variables. Although the model is basic and most statistical packages come with built-in functions for this analysis, it is powerful. In fact, the Gauss–Markov theorem states that if the true underlying model is linear and the error terms have mean zero, constant variance, and are uncorrelated with one another, then OLS is the best linear unbiased estimator. We discuss the importance of bias versus variance in the statistical modeling subsection below. But put briefly, OLS requires few assumptions, is easily computed in the data, and is a useful starting point for statistical inference and prediction.
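To make the specification concrete, the sketch below fits the one-lag regression above on simulated data using statsmodels; the variable names and the data generating process are hypothetical stand-ins for a real outcome series and candidate predictors.

```python
# A one-lag forecasting regression, mirroring the specification above.
# Simulated monthly data; "y", "x1", "x2" are hypothetical column names.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 240  # twenty years of monthly observations
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
# Simulated outcome that truly depends on lagged predictors plus noise.
df["y"] = 0.3 * df["x1"].shift(1) + 0.1 * df["x2"].shift(1) + rng.normal(scale=0.5, size=n)

# Design matrix for predicting Y_{t+1} from Y_t, X_{1,t}, X_{2,t}.
data = pd.DataFrame({
    "y_next": df["y"].shift(-1),  # Y_{t+1}
    "y_lag": df["y"],             # Y_t
    "x1": df["x1"],               # X_{1,t}
    "x2": df["x2"],               # X_{2,t}
}).dropna()

model = sm.OLS(data["y_next"], sm.add_constant(data[["y_lag", "x1", "x2"]])).fit()
print(model.params)
print(model.tvalues)
```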

However, additional key statistical properties are required for valid statistical modeling in time series settings: the series must be stationary as well as ergodic. For simplicity, we avoid the exact mathematical definitions and instead discuss what the two requirements imply.

Stationarity. The first property, specifically defined as covariance stationarity, ensures that the statistical model is stable over time and that its statistical properties do not change. Simply put, a stationary time series has a finite and stable variance going forward, as well as stable covariances with its own historical values. Non-stationary time series, by contrast, can have variance that grows without bound. Since practically every statistical model must estimate some variance parameter of the data, this property ensures that the model is estimating something that is not blowing up. Although that variance does not have to be constant and can move around, it must exist.

In practice, we never observe estimated values equal to infinity. So, what does the latter issue mean? Although the notion of stationarity requires some assumption on the underlying data generating process, in practice, statistical models estimated on non-stationary time series tend to generate vastly different estimates across different time horizons and subsets of the data.

Without stationarity, time series predictive models tend to generate spurious correlations, the phenomenon where a statistical model shows some relationships which are not actually there. This issue falls under the class of overfitting issues, which we discuss more below. In fact, with the wrong variable transformations, the probability of observing a spurious correlation gets larger the more time series data there are!

Ergodicity. The second property ensures that additional time series data, also known as sample paths, make the estimates of model parameters more precise. In practice, this means that if we continually re-estimate model parameters as more data become available through time, the statistical model gets closer to the true underlying relationship, and gets better in a predictive sense. Without ergodicity, a time series statistical model effectively will not get better with a longer sample of data!

Although these two properties pertain to different characteristics of the time series data, you can see why they must go together. The first ensures that statistical model parameters actually exist and can be estimated. The second ensures that longer time series data are actually useful to collect and process. Without these properties, different sampling periods of data can result in truly different data generating parameters, and in fact, including additional data does not guarantee that the overall performance of the model improves.

The role of the data scientist is to evaluate, by using domain knowledge based on economic theory or visual representations of the data, whether specific variables are stationary and ergodic. Thankfully, many non-stationary variables can be transformed to become stationary. For example, stock prices are non-stationary since prices can theoretically increase forever and therefore have unbounded variance, but stock returns are stationary. Therefore, an astute data scientist will not set up a statistical model predicting stock prices directly, but will instead predict returns (or capital gains/losses). Should the end goal be a target stock price, the forecasted return can be converted into a predicted price level by multiplying one plus the forecasted return by the previous end-of-period stock price. Ways to transform non-stationary variables into stationary ones include (i) taking the first difference (subtracting the lagged value from the current value) and (ii) considering the percentage change. These transformations may also be applied recursively: in some situations, a variable that is initially non-stationary may still be non-stationary after first differencing and must be differenced again.
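The sketch below illustrates these transformations on a simulated price series, assuming pandas and statsmodels are available; the augmented Dickey-Fuller test is one common, though not the only, diagnostic for non-stationarity.

```python
# Transforming a non-stationary price series into (approximately) stationary
# returns, with an augmented Dickey-Fuller (ADF) test as a rough diagnostic.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
# A random-walk "price" series is non-stationary by construction.
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=1000))))

returns = prices.pct_change().dropna()     # percentage change
log_diff = np.log(prices).diff().dropna()  # first difference of log prices

for name, series in [("prices", prices), ("returns", returns), ("log_diff", log_diff)]:
    stat, pvalue, *_ = adfuller(series)
    print(f"{name:9s} ADF statistic {stat:8.2f}   p-value {pvalue:.3f}")
# The price level typically fails to reject the unit-root null (large p-value),
# while returns and log differences reject it, consistent with stationarity.
```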

When applied to financial markets, data scientists typically consider financial outcome variables. For example, for a stock trader, the key outcome variable of interest is a stock or market index return, which is typically treated as stationary and ergodic. So, we replace Y with security i’s stock return at time t + 1,

$$ r_{it + 1} = \alpha + \beta_{1} X_{1,it} + \beta_{2} X_{2,it} \ldots + \beta_{p} X_{p,it} + \epsilon_{it} . $$

What types of variables do we use? In the absence of alternative data, we would typically use fundamental variables that we expect to predict stock returns. For example, the finance literature has considered hundreds of headline financial statement variables, normalized to be comparable across companies. In addition, non-headline items add over one thousand more candidate variables from the financial statements alone. When including alternative data, the dimensionality of the data increases even more.

3.3 Nowcasting

Compared to forecasting, nowcasting exploits differential delays in data reporting, using data generated in a timelier manner to predict outcome variables.

Consider a variable of interest \( Y_{t} \), capturing a specific economic concept of interest. The goal is to build a model using variables \( X_{1,t} ,X_{2,t} , \ldots X_{p,t} \) that can predict the contemporaneous value \( Y_{t} \), where \( t \) captures a time index,

$$ Y_{t} = f\left( {X_{1,t} ,X_{2,t} , \ldots X_{p,t} } \right). $$

Using the simplest model possible, consider a linear regression specification of the form:

$$ Y_{t} = \alpha + \beta_{1} X_{1,t} + \beta_{2} X_{2,t} \ldots + \beta_{p} X_{p,t} + \varepsilon , $$

where \( \varepsilon \) is some noise term with mean zero. The setup for nowcasting looks highly similar to that for forecasting; the only difference is that the explanatory variables come from the same time period as the outcome.
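As a minimal sketch of this setup, the snippet below regresses same-period revenue growth on same-period alternative data growth rates; the data are simulated and the column names are hypothetical.

```python
# A nowcasting regression: same-period revenue growth on same-period foot
# traffic and Web traffic growth (all in percentage changes). Simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 200
panel = pd.DataFrame({
    "foot_traffic_growth": rng.normal(0.02, 0.05, size=n),
    "web_traffic_growth": rng.normal(0.05, 0.10, size=n),
})
panel["revenue_growth"] = (
    0.6 * panel["foot_traffic_growth"]
    + 0.2 * panel["web_traffic_growth"]
    + rng.normal(0, 0.03, size=n)
)

# All variables are from the same period t: a nowcast, not a forecast.
nowcast = smf.ols("revenue_growth ~ foot_traffic_growth + web_traffic_growth",
                  data=panel).fit()
print(nowcast.params)
```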

So, what exactly are the differences beyond the time subscript of the data? Fundamentally, the different timing of the statistical problem uses different sources of data and requires different assumptions. Forecasting uses historical values to predict future values. In contrast, nowcasting uses some current values to predict other current values. Although there is still a lead-lag element in time, because the explanatory variables are realized and recorded faster than the outcome variable, the data fundamentally pertain to the same period. To see this more clearly, we discuss the concept of sources of variation to understand what statistical models use to estimate parameters and create predictions.

Sources of Variation. For statisticians and data scientists, information is contained within the variation in the data. (For those readers who have taken a mathematical statistics class, it is no coincidence that in maximum likelihood estimation, the variance of the score function is called the “information matrix”!) Variation in the data comes from only two sources which capture all of physical existence: space and time.

Space captures the different positions and characteristics of different entities or objects at the same point in time, whereas time captures how the position and characteristics of a specific entity or object change through time. Variation in space is also called a “cross-section,” which you can imagine as considering different objects within the same slice of time. Therefore, cross-sectional analyses compare objects within the same time period, and time series analyses compare the same object through time.

With this vernacular, we can now discuss the exact conceptual differences between forecasting and nowcasting. The source of variation for forecasting comes from the time series. The source of variation for nowcasting comes from the cross-section. That is, nowcasting only uses contemporaneous variables and considers how a set of explanatory variables affects the outcome variable (it does not consider the outcome variable’s history or the explanatory variables’ histories).

One may think that, because a company’s sales level and its foot or Web traffic both grow through time to unbounded values due to economic growth and inflation, and therefore keep pace with each other, non-stationarity causes no problems for the statistical model. However, although the main source of variation is the cross-sectional relation between sales and Web traffic data, the data used to estimate the model still come from multiple snippets of time. Therefore, if variables are non-stationary, as in this case, the variance of the estimates and of the model can still blow up. Nowcasting therefore still requires stationary variables: in the example with sales and foot traffic data, both variables should be converted into percentage changes.

Astute readers may realize that one can simply combine forecasting and nowcasting, effectively combining variation from both space and time together. Indeed, such analyses may uncover additional statistical relations that individual sources of variation (space or time) may not be able to identify. Such data structures are called panel data, where information is available for the same set of entities across different periods of time. In terms of the required statistical properties, including the lagged terms of additional data sources turns a nowcasting exercise into a forecasting exercise.

With this conceptual framework in mind, we are one step closer to operationalizing alternative data. However, before putting the data to work, we must first discuss additional complications of using large amounts of data below.

3.4 Statistical Modeling and High Dimensional Data

Even if one has novel and economically important data, alternative data come with an additional potential issue: too many features to choose from. An important follow-up question is how to combine different data sources, or different methods of capturing a given data source. For that, we consider a class of variable selection methods, often referred to as “shrinkage methods,” that innovate upon the standard linear regression. All such statistical methodologies can be represented through an optimization process. For example, with ordinary least squares, the objective is to minimize the sum of squared errors and thereby estimate the conditional mean, while with median (quantile) regression, the objective is to minimize the sum of absolute deviations and thereby estimate the conditional median. Both have their advantages and disadvantages based on the properties of the data and the related assumptions on the data generating process.

Methods like OLS or quantile regressions suffer from a dimensionality problem. As the number of features (loosely speaking, the number of columns within a dataset where each row is an observation) increases, these models can quickly overfit the data. Overfitting is a modeling error that occurs when a specific objective function is fitted too closely to a limited set of data points, resulting in an overly complex model that appears to fit the data well, but only due to artifacts of the optimization. Models that are overfitted to a specific dataset tend to perform poorly along many measures of accuracy, such as the \( R^{2} \), mean absolute error, or root-mean-squared error, when confronted with new data, even if the new data were generated from the exact same data generating process.

As the number of estimated parameters approaches the number of observations, the degrees of freedom, which is the remaining source of variation in the data that is not mechanically captured by the model, decreases. As the degrees of freedom approach zero, each parameter is effectively estimated by one particular observation. In these scenarios, essentially any objective function can suffer from overfitting. Therefore, additional methods were developed to guard against overfitting and to preserve the degrees of freedom left within the data, allowing for proper tests of model fit and model quality.

To understand the importance of the degrees of freedom and model complexity, consider the variance-bias trade-off. Errors in statistical models come from three sources: (i) bias from model misspecification (also known as underfitting), (ii) the sensitivity of predicted values to variation in the data used to fit the model (also known as overfitting), and (iii) an irreducible component fundamental to the statistical problem itself. The last component can never be eliminated by any statistical model; hence, the choice among statistical techniques focuses on the first two components. To see this, consider the expected squared error of a predicted value \( \hat{y} = \hat{f}\left( x \right) \) generated from an estimated model \( \hat{f} \) for a given set of inputs, where the true data generating model is \( y = f\left( x \right) + \varepsilon \) with the noise term \( \varepsilon \) being mean zero. The total mean squared error is then given by,

$$ E\left[ \left( y - \hat{y} \right)^{2} \right] = \underbrace{\left( E\left[ \hat{y} \right] - f\left( x \right) \right)^{2}}_{\text{Bias}} + \underbrace{\text{var}\left( \hat{y} \right)}_{\text{Variance}} + \underbrace{\sigma^{2}}_{\text{Fundamental}}, $$

where \( \sigma^{2} \) is the variance of the noise term \( \varepsilon \). To see how the right-hand side of the equation follows from the left, replace the true value \( y = f\left( x \right) + \varepsilon \), recall that the noise term is uncorrelated with the estimated function by construction, and add and subtract the expected fitted value from the model \( E\left[ \hat{y} \right] \). This trade-off is related to, but distinct from, measures of statistical fit like the \( R^{2} \) and is a fundamental trade-off in statistical modeling.

Conceptually, statistical models with high bias may miss relevant correlations in the data, while those with high variance are susceptible to noise. All statistical methods trade off between these two sources of error, potentially achieving different levels of variation in forecast errors. On the one hand, models with lower bias in parameter estimation (where bias is the difference between the expected value of the estimator and the true underlying parameter value) tend to capture true relations in the data but may suffer from overfitting, generating noisy predictions in response to small changes in the input data. On the other hand, models with more bias tend to be more simplified than those with low bias, but the predicted values they generate tend to be less sensitive to small variations in the input data, meaning that the model tends to have lower variance.
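A small simulation, sketched below under arbitrary assumptions about the true function and model complexity, makes the decomposition concrete by refitting a simple and a flexible model on many fresh training samples and measuring bias and variance at a single test point.

```python
# Monte Carlo illustration of the bias-variance decomposition: refit a simple
# and a flexible model on fresh training samples and decompose the prediction
# error at one test point. True function, noise, and degrees are arbitrary.
import numpy as np

rng = np.random.default_rng(3)
f = np.sin            # true function f(x)
sigma = 0.3           # noise standard deviation
x0 = 1.5              # fixed evaluation point
n_train, n_sims = 30, 2000

def bias_variance(degree):
    preds = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.uniform(0, np.pi, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        preds[s] = np.polyval(np.polyfit(x, y, degree), x0)
    return (preds.mean() - f(x0)) ** 2, preds.var()

for degree in (1, 9):
    bias_sq, variance = bias_variance(degree)
    print(f"degree {degree}: bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"total MSE approx {bias_sq + variance + sigma**2:.4f}")
# The simple (degree 1) model has larger bias^2; the flexible (degree 9)
# model has larger variance.
```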

Overfitting is also alleged in academic publications on empirical asset pricing. Harvey et al. [38] document over 240 academic publications proposing new empirical asset pricing factors. They argue that statistically insignificant results would not be published in academic outlets, and that statistical inferences should account for the missing strategies that were back-tested but not published. They argue most research findings in the field are likely false due to multiple testing and overfitting of the same datasets.

One commonly used set of tools to address overfitting is shrinkage estimators. These are effectively models that introduce some penalty term in the optimization function. Although conceptually one can always overfit a statistical model no matter the penalty that is introduced, the penalty terms and additional empirical methodologies add discipline for statistical practice.
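As a preview of one such penalized regression, the LASSO (revisited in the credit scoring discussion later in the chapter), the sketch below selects among a wide set of mostly irrelevant candidate signals using a cross-validated penalty; the data are simulated and scikit-learn is assumed.

```python
# A LASSO with a cross-validated penalty applied to a wide set of candidate
# signals, most of which are pure noise. Simulated data.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n_obs, n_features = 500, 100
X = rng.normal(size=(n_obs, n_features))
beta = np.zeros(n_features)
beta[:5] = [0.5, -0.4, 0.3, 0.2, -0.2]  # only five features truly matter
y = X @ beta + rng.normal(0, 1.0, size=n_obs)

X_std = StandardScaler().fit_transform(X)  # penalties assume comparable scales
lasso = LassoCV(cv=5).fit(X_std, y)

selected = np.flatnonzero(lasso.coef_)
print(f"chosen penalty alpha = {lasso.alpha_:.4f}")
print(f"kept {selected.size} of {n_features} features:", selected)
```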

When discussing credit scoring, we discuss the LASSO methodology. When discussing macroeconomic forecasting, we discuss partial least squares.

4 Quantitative Trading

Nowhere has the use of alternative data gained more notoriety than in quantitative trading. For example, on the award-winning television show Billions, hedge fund manager Bobby Axelrod often mentions novel sources of information such as “satellite data” to predict movements in a company’s quarterly earnings based on the number of cars entering its parking lots. Terms such as “quantamental” have sprouted up to describe the enhancement of fundamental analysis traditionally executed manually by humans, with data-driven processes now being executed systematically by computer algorithms rather than on a bespoke or manual basis. However, the use of alternative data need not replace human processes: more recently, as alternative data providers have sought to reach a wider audience, discretionary investors have also started to adopt alternative data, and industry observers have noticed a distinct rise in interest from them.

Alternative data are relevant for quantitative investing in various ways, from informing decisions at the asset class level (such as currencies and commodities) to the security level (such as predicting a firm’s earnings). We first describe the historical context behind the explosion in quantitative strategies, and then discuss a number of use cases for quantitative strategies. As we describe in Sect. 4.2, Generation 1 of “quantamental” investing focuses on predicting the revenues of companies. Generation 2, meanwhile, expands beyond predicting a company’s revenues.

4.1 Strategy Decay and the Impetus for Alternative Data

The impetus for applying alternative data to quantitative trading arises from the competitive nature of financial markets. The widespread use of quantitative strategies boomed in the early 1990s with the emergence of academic research documenting that simple strategies provide investors high average returns that “beat the market.”Footnote 1 For example, Jegadeesh and Titman [46] document that investors can make sizeable returns by simply identifying stocks that had performed well in the previous 11 months, skipping the most recent month, and holding those stocks for at least one month. Further findings include the value and size style investment factors [35]. The value strategy rewards investors who buy stocks with relatively low valuation ratios. What is more, these simple strategies appear to perform well even in large-capitalization stocks, giving hope that investors can apply them at scale without fear of moving the market.

Through the lens of classic efficient markets theory, the momentum strategy can be interpreted in three ways. First, in the “risk premia” view, momentum embodies a risk for which investors are fairly compensated. Daniel and Moskowitz [29] document that momentum experiences severe crashes which can nearly wipe out investors in the strategy. Second, from the view of “behavioral finance,” momentum reflects persistent investor mistakes that one can repeatedly exploit to obtain extra returns. Indeed, many have ascribed momentum to investors who misinterpret or remain inattentive to past prices. Third, in the “mispricing” view, momentum’s high returns should erode once enough investors become aware of them and deploy capital to compete away the profits.

Whichever of these philosophies one subscribes to, this line of empirical asset pricing research has spawned some of the largest investment firms and hedge funds on the street. Long Term Capital Management was formed with key members of its founding team coming from academia, applying quantitative methods aimed at “arbitraging” mispricings in fixed income instruments. Dimensional Fund Advisors, founded by David Booth in 1981, applied a combination of value and size investment styles in mutual fund and ETF construction. Then, in the late 1990s, Cliff Asness, one of three researchers credited with discovering momentum, started AQR, which today manages over $100 billion of capital. Abis [1] notes a more than threefold increase in the number of mutual funds classified as following a “quantitative strategy” from the year 2000 to 2016.

These initial discoveries propelled more. Hundreds of papers documented investment strategies providing statistical evidence, over their sample periods, that a simple, parsimonious trading rule could earn returns that beat the market. By the late 2000s, John Cochrane, then president of the American Finance Association, used the term “factor zoo” to refer to the over 200 investment strategies that had been discovered in the literature and published in top journals. This total does not include the strategies that industry practitioners potentially discovered but did not disclose.

The existence of the “factor zoo” gives rise to two questions. First, were these strategies persistent phenomena (the risk premia or behavioral finance views), or fleeting mispricings? While some strategies, such as momentum, appear to have remained profitable since their discovery, there is a fair bit of evidence that many such strategies were mispricings in which investors had limited capacity to profit. McLean and Pontiff [56] replicate 96 investment strategies and find that the performance of mispricings documented in academic publications declines by about a quarter out-of-sample and by about 60% after the publication of the studies. The decline in performance is larger the better the in-sample returns are, as shown in the figure below. Of course, one alternative explanation is overfitting by academic researchers. However, the decline in returns is greater among stocks that are more liquid with lower idiosyncratic risk, which appears more consistent with investors learning about such mispricings from academic research. Jacobs and Muller (2020) note, however, that this appears to be a USA-specific phenomenon, with no evidence of post-publication decline internationally.

figure c

Source McLean and Pontiff [56]

The second question is about strategy correlations. Are there really 200 simple, mutually distinct investment strategies that investors can choose from? Or do investors crowd into the same strategy, whether knowingly or not? The financial crisis provided a test for quantitative investment strategies—and many strategies failed this test. Khandani and Lo [50], for example, document evidence that in August 2007, at the outset of the financial crisis, many quantitative strategies suffered large consecutive losses in the span of a few days. Interestingly, many of these strategies rebounded quickly after a few days’ losses. The figure below from their study shows this pattern for some major types of quantitative strategies executed by investors.

figure d

Source Khandani and Lo [50]

This pattern could be consistent with investors selling off and closing out their positions, leading to losses on the stocks typically traded by a strategy. As investors finish selling, the pressure subsides and the market recovers. Vinesh Jha of ExtractAlpha reproduces this analysis for the main equity factors.

figure e

Source Jha [47]

4.2 The Alternative Data Framework for Quantitative Trading

The fundamental goal of an investor is to predict risk-adjusted returns. A return refers to the percentage change in wealth an investor receives for investing in a security. For an equity security, it differs from the price change once dividends and splits/reverse splits are factored in. Additional considerations might be transaction costs, which pure return calculations ignore. With respect to the framework in Sect. 3, the simple change is to make the outcome variable \( r_{it} \):

$$ r_{it} = \alpha + \beta_{1} X_{1,it} + \epsilon_{it} $$

The basic idea is that X is a variable that measures an economic concept we expect to be related to returns, and the alternative data are meant to measure that concept more closely. For example, suppose we believe that positive investor sentiment about a company is positively related to its future returns. Traditionally, sentiment is difficult to measure directly, so it has been inferred indirectly. Baker and Wurgler [6], for example, argue that investor sentiment is reflected in the “closed-end fund discount,” the discount at which closed-end funds trade relative to the value of their underlying assets: when investors want to be in the market, the discount is smaller. Now, with the advent of social media, where investor discussion around a stock can be monitored, we may be able to measure investor sentiment more directly. Thus, we add value by replacing a building block of an existing investment strategy with a better measurement derived from alternative data.

Equity returns move for many reasons. Sometimes, investors trade for liquidity reasons, which may induce non-fundamental movements in prices. Some alternative datasets are available for ten years or longer, allowing one to average out this noise over a long period. However, one of the challenges with predicting stock returns using alternative data is that histories are often short. Therefore, in order to gain statistical power, a common approach is to focus instead on days when returns are driven primarily by information about the firm, namely, scheduled earnings announcements.

Earnings consist of revenues minus expenses. Ultimately, an investor takes home the portion of earnings the company decides to distribute. Therefore, earnings are the major object of interest for investors. However, a firm that reduces expenses beyond expectations or increases sales beyond expectations can also be attractive.

How does one define market expectations? Generally, the views of all market participants are not known. However, many stocks are covered by stock analysts, usually working for “sell-side” firms such as Goldman Sachs and JP Morgan, who write reports detailing company business models. Such reports usually produce forecasts of quantities such as sales and earnings per share, which are aggregated into databases by companies such as Refinitiv and Zacks to form a consensus forecast.

The consensus forecast is the average of forecasts produced by analysts. How do we compare one firm beating expectations by 5 cents per share with another beating by $5 per share? We can scale by the disagreement in analyst forecasts.

$$ {\text{SUE}}_{it} = \frac{\text{earnings}_{it} - \mu_{it}^{\text{earnings}}}{\sigma_{it}^{\text{earnings}}}, $$

where \( \text{earnings}_{it} \) are the realized earnings that the company announces, \( \mu_{it}^{\text{earnings}} \) represents the average forecasted value across analysts, typically computed from each analyst’s most recent forecast, and \( \sigma_{it}^{\text{earnings}} \) is the standard deviation of the forecasts across analysts.

Thus, rather than predicting returns, we build a model to predict earnings surprises, expecting that beating earnings expectations will, on average, lead to outperformance around the days of earnings announcements.

$$ {\text{SUE}}_{it} = \alpha + \beta X_{it} + \epsilon_{it} . $$
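A minimal sketch of this pipeline is given below: compute SUE from a consensus of analyst forecasts and regress it on an alternative data signal. The inputs are simulated and the variable names are hypothetical; real inputs would come from an estimates database and an alternative data vendor.

```python
# Compute SUE from a consensus of analyst forecasts and relate it to an
# alternative data signal, following the two equations above. Simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 1000  # firm-quarter observations

signal = rng.normal(0, 1, size=n)                # X_{it}: e.g., foot traffic growth
consensus_mean = rng.normal(1.00, 0.25, size=n)  # mean analyst EPS forecast
consensus_std = rng.uniform(0.02, 0.10, size=n)  # dispersion across analysts
# Realized EPS tends to beat the consensus when the signal is strong.
actual_eps = consensus_mean + consensus_std * (0.4 * signal + rng.normal(0, 1, size=n))

df = pd.DataFrame({"signal": signal})
df["sue"] = (actual_eps - consensus_mean) / consensus_std  # SUE definition above

fit = smf.ols("sue ~ signal", data=df).fit()
print(fit.params)
print(fit.tvalues)
```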

A number of studies focus on earnings announcements and show that a given alternative data indicator provides a statistically significant prediction of company earnings surprises. We now walk through the findings from this literature, starting with earnings surprises. Then, we discuss cross-momentum, another common application of alternative data. Finally, we list a few other investment strategies.

Earnings surprises. By predicting whether a company’s earnings will exceed expectations, we can predict company returns around announcements. The economic logic is simple: companies outperform when they deliver better-than-expected earnings. If alternative data are available to some investors before others, those investors can position ahead of the announcement and profit when the market reacts to the released earnings.

The focus of the first generation of earnings announcement strategies has been to target revenues, which are easier to predict than earnings because costs come from many different sources. This logic is not perfect. For example, if a retailer offers large discounts to attract customers, revenues may rise while earnings fall if the margin given up per unit outweighs the additional quantities sold. That said, the approach has worked well enough to become a staple of quantamental investment strategies.

The prevalence of revenue surprise strategies owes largely to the availability of consumer retail data. Consumers often receive excellent services without paying a dime, exchanging their data in return. Examples include Internet search, credit cards, and apps such as GPS maps, Web browsers, and ad-blockers. These free or low-cost services attract large user bases, thereby creating substantial data assets. Examples abound. Internet search allows companies to build data products such as Google Trends, which reveal which products or services are receiving abnormal interest. Mobile apps ask for GPS readings, giving them access to the location and movement of anonymized users; such data are aggregated by companies such as Veraset to identify stores and brands receiving more footfall. Yodlee aggregates credit card spending across major financial institutions, enabling it to track the revenues of retailers such as Target, and so on.

Consider the case of Netflix. Netflix is more or less one product: a subscription service with a handful of subscription prices. We can therefore roughly estimate Netflix’s revenues by estimating the number of subscribers, which can be done in a number of ways. Most straightforwardly, data on email receipts or credit card statements provide estimates of how many users pay for Netflix. However, credit card panels are often biased, coming from a single bank or a handful of banks, while Netflix is a global player. Thus, this estimate may be buttressed by estimates of Web traffic from Alexa or SimilarWeb, which collate data from browser extensions. Of course, these Web traffic panels—while probably informative—may not cover viewership of Netflix on mobile phones, so one can also acquire data on app downloads or app usage.
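As a purely illustrative sketch of this triangulation logic, the snippet below blends several hypothetical panel-based subscriber estimates into a rough revenue figure; every number and weight is invented for illustration and is not an estimate of Netflix’s actual figures.

```python
# All figures below are invented for illustration only.
panel_estimates = {            # extrapolated subscriber counts, in millions
    "credit_card_panel": 205.0,
    "email_receipts": 198.0,
    "web_traffic": 215.0,
    "app_usage": 210.0,
}
weights = {                    # rough priors on each panel's coverage and quality
    "credit_card_panel": 0.35,
    "email_receipts": 0.15,
    "web_traffic": 0.25,
    "app_usage": 0.25,
}

subs_mm = sum(panel_estimates[k] * weights[k] for k in panel_estimates)
arpu = 13.0                                    # assumed average revenue per user, USD per month
quarterly_revenue = subs_mm * 1e6 * arpu * 3   # rough quarterly revenue in USD
print(f"blended subscribers: {subs_mm:.1f}mm; est. quarterly revenue: ${quarterly_revenue / 1e9:.1f}bn")
```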

A number of academic papers and industry writings confirm the relation between consumer activity indicators and revenue surprises. For example, Da et al. [28] study the relation between Google Trends and revenue surprises. Revenue surprises can either be defined relative to the consensus of analyst forecasts or, in some academic papers, relative to prior sales growth.Footnote 2 Da et al. [28] find a positive relation between search intensity and revenue surprises, shown in the figure below for the revenues of Garmin Ltd. and the search term “GARMIN.”

figure f

Source Da et al. [28]

Mobile phone data have been shown to predict company revenues as well. Froot et al. [36] measure foot traffic through destination lookups on mobile phones and show a strong relationship between their foot traffic index and announcement returns. Agarwal et al. [2] conduct a similar analysis using a nationally representative panel of credit card users from a bank in the USA. The “digital revenue signal” offered by industry vendor ExtractAlpha is an amalgamation of search, social, and Web traffic data combined via machine learning; the vendor’s white papers document the continued, strong outperformance of this signal. Huang [43] scrapes Amazon ratings, finding that companies which receive increases in ratings outperform in terms of both revenues and earnings the following month.

Most of these studies focus on measuring the revenues of retail-oriented businesses; business-to-business or expense-oriented studies are relatively rare. Tracking B2B expenditures could be highly value-relevant, but these data are relatively difficult to obtain, as companies would prefer not to disclose their expenses unless required by financial reporting conventions. However, as mentioned before, suppliers oftentimes report into co-ops such as Dun and Bradstreet or Cortera. Hirshleifer et al. [41] examine trade credit data. They argue that when a company makes more purchases from its top trade creditor, it signals positive fundamentals, as top trade creditors would only provide a company with materials or services if they sincerely believed in its success.

Such companies outperform. Further, Kwan [52] documents the role of digital “intent” data, a type of data used by marketers to identify B2B sales leads, in predicting corporate policies. These data map organizations to the Internet content they consume by topic: for example, is Microsoft more interested in blockchain than it used to be, or in office furniture? Kwan [52] finds that companies which read more also spend more, particularly on intangible items such as research and development or sales, general, and administrative expenses. This consumption is value-relevant, predicting negative earnings surprises in the future.

Finally, labor cost is a significant component of earnings, so it stands to reason that real-time data tracking employment could be interesting. We speculate—being unaware of any research on the topic—that various predictors of hiring could negatively predict earnings surprises. For example, high job posting demand could lead to greater hiring, indicating greater expenses. Of course, this hiring could be replacement hiring, in which case earnings may actually improve temporarily while departed workers have not yet been replaced. Job recruitment platforms such as Entelo or Human Predictions, which aggregate worker profiles across various platforms, provide data for recruiters. Data from Internet venues such as Github or LinkedIn can provide further insight into earnings.Footnote 3

In addition, mapping mobile data to organization offices could theoretically provide a strong correlate of hiring activity. Of course, one has to be careful about filtering out visitors, focusing on those mobile devices that spend consistent time at an office location. We are excited to see any future research on this topic.

Cross-momentum. Cross-momentum captures the idea that two companies are economically connected, for example through supply chain relationships, common ownership, or geographical proximity. If company A is economically linked to company B, news about company B is also relevant for company A. If B is a major customer of firm A, then when company B does poorly, company A eventually does poorly as well—but with a delay before company A realizes the negative returns.

Formally, for two assets A and B:

$$ r_{t + 1}^{A} = \alpha + \beta r_{t}^{B} + \epsilon_{t} $$

In the equation above, B is the economically linked firm whose information at day, week, or month t should affect the future returns of A. A value β>0 means that a positive return for asset B predicts a more positive future return for asset A, so traders who notice a positive return for asset B will invest more in asset A. A value β<0 means that a positive return for asset B predicts a more negative future return for asset A, so traders who notice a positive return for asset B will divest from or short asset A.

Conceptually, any statistically significant and reliable non-zero estimate of β violates the market efficiency hypothesis, as relevant news is not priced in the same time period but instead manifests over time, generating profitable trading strategies.
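A minimal sketch of estimating this lead-lag coefficient, assuming a simple CSV of monthly returns for the two linked firms, might look as follows.

```python
import pandas as pd
import statsmodels.api as sm

returns = pd.read_csv("monthly_returns.csv", parse_dates=["date"])  # date, ret_A, ret_B
returns = returns.sort_values("date")
returns["ret_B_lag"] = returns["ret_B"].shift(1)   # last period's return of the linked firm B

leadlag = sm.OLS(returns["ret_A"],
                 sm.add_constant(returns["ret_B_lag"]),
                 missing="drop").fit()
print(leadlag.params)   # a positive coefficient on ret_B_lag corresponds to the beta > 0 case above
```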

An example of such an economic link arises from supply chains: poor performance by a company’s customer may affect that company’s own future performance. Cohen and Frazzini [27] document a lead-lag relationship running from customers to the returns of suppliers. In the figure below, from the publicly available version of their paper, Callaway (ELY) is the customer of Coastcast (PAR), and the downgrade of Callaway predicts a negative stock return for Coastcast, related to a negative earnings per share realization in the future. When formed as a portfolio, the original paper reports an average monthly long-short excess return of 1.58%, a monthly three-factor alpha of 1.55%, and a four-factor alpha of 1.38%, equivalent to annual excess returns of over 16%.

figure g

Source Cohen and Frazzini [27]

Other studies also explore the idea of cross-momentum measuring economic links in other ways. Lee et al. [53] explore how similar patenting by firms yields predictable returns across technology peers at the monthly level. Hoberg and Phillips [42] explore industry momentum using their unique industry classifications, which are based on the product descriptions that USA publicly listed firms provide in their annual statements.

4.3 Additional Uses of Alternative Data

There are many other novel applications of alternative data in quantitative trading that we do not have sufficient space to discuss. However, we highlight one particularly interesting recent development: the use of alternative data to identify investor attention and underreaction. We discuss three use cases below. Companies such as Ravenpack started around the turn of the century to focus on quantifying the news, which investors have found very useful in modifying and supplementing existing trading strategies. In this light, the emergence of a news article can be thought of as arriving information to which investors may underreact.

The momentum strategy we discussed before in Sect. 4.1 is considered a hallmark investment strategy. One leading explanation of why momentum stocks earn profits is that investors underreact to news. Until recently, however, news data were not available to investors for systematic analysis. One of the leading vendors in this space is Ravenpack, a big data analytics firm focusing on media signals. Ravenpack’s main product monitors news and social media sources, tagging articles with topics and sentiment within milliseconds of their publication.

Using the news data from Ravenpack, Jiang et al. [48] decompose momentum into two pieces: the part attributable to news and the part which is not. For returns attributable to news, they calculate momentum as the cumulative returns over a week during the 15-min intervals in which a stock has a news article observed in Ravenpack. Stocks with high “news momentum” are those with the highest stock returns during news periods, whereas low “news momentum” stocks are those with the lowest stock returns during news periods.

Figure 3 of their paper, reproduced below, shows the returns to the “traditional” momentum strategy, which buys winners and sells losers from the past 11 months (skipping the most recent month), versus their modified strategy, which buys high news momentum stocks and shorts low news momentum stocks. Over the same period, the modified strategy delivers roughly an eightfold improvement over the traditional momentum strategy.

figure h

Source Jiang et al. [48]

Another phenomenon emergent since the mid-2000s is the rise of social media. As the Internet has become a place for people to meet and engage in social activities, it has also become a place to exchange investment ideas. Early explorations of stock message boards found inconclusive evidence on the relationship between Internet messaging and stock returns. After all, it is not obvious why someone with value-relevant information would disclose it publicly to other investors for free. However, social media platforms now operate unique business models and draw upon larger segments of society, potentially enticing contributions through various incentives.

Chen et al. [22] explore Seeking Alpha, a Web platform that invites contributions from the investment community and filters for high-quality contributions through its editorial board. They find that stocks identified with positive tone tend to outperform at a three-month horizon, while those with negative tone underperform. The authors speculate that the filtering mechanism may be the source of value-relevance on Seeking Alpha. Of special note is Estimize, a platform which invites crowdsourced earnings estimates. Traditionally, earnings consensuses are derived from the contributions of broker estimates. However, market participants on the “buy-side”—those who are actually investing in stocks—also engage in forecasting. It is arguable that such investors tend to be more sophisticated, with skin in the game, particularly after the Global Settlement, which limited the compensation banks could give to analysts, sparking fears of a brain drain from the sell-side.

Of course, these measurements of investor attention are indirect, inferred from the content. Much of the content may not actually be read. Other sources of data include intent data [52], Bloomberg’s NEWS HEAT index, and possibly content provided by publishers themselves.

There remain many other opportunities in applying alternative data to quantitative trading, which we discuss in the conclusion.

5 Credit Scoring

Credit fuels economic activity, facilitating transactions between counterparties. Consumers use credit to make large purchases such as cars or homes, but even simple transactions such as groceries or gas are fueled by credit when the consumer uses a credit card. Traditionally, such consumer transactions require credit checks and a credit history. From the lender’s perspective, requiring some credit history and a decent credit score is sensible given the risk that the borrower will not repay a loan.

Corporations also use credit. Corporate employees can use corporate credit cards for smaller purchases such as catering or restaurant dining. Corporations finance inventory, capital expenditures, and sometimes even R&D using credit instruments such as debt or formal bank loans. They also borrow implicitly via trade credit, through which customers receive credit by delaying payments to suppliers.

A credit score is a quantitative or ordinal categorical metric designed to characterize either the relative (comparing one entity to another at a point in time) or absolute (probability of default) credit risk of an individual or entity. It is primarily used by lenders in assessing whether to lend to a potential borrower. Traditionally, obtaining a credit score requires a credit history, and building a credit history requires inclusion in the formal financial system. It begins with a customer opening a bank account. Obtaining larger credit lines requires the individual to slowly build a credit history, to have verifiable assets or income, or to otherwise signal creditworthiness. Building a credit history takes time. There are three consumer credit rating agencies in the United States: Experian, TransUnion, and Equifax. For large corporations, the Big Three credit rating agencies, Moody’s, S&P, and Fitch, control around 95% of the credit ratings in financial markets for formal instruments such as debt. For small and medium-sized enterprises, additional data providers play a role: Cortera and Dun and Bradstreet capture trade credit repayments, and additional credit scoring is done by various credit bureaus.

Although these credit rating agencies gather large amounts of information, they only capture information on those formally included in the financial system, potentially leaving out creditworthy borrowers who have no formal history. The objective of alternative data in credit is to improve credit scoring models in terms of their ability to predict a borrower’s default or delinquency behavior. By identifying which borrowers are likelier to default than others, lenders can extend credit to those predicted to be less risky, including borrowers who would not have had a credit score if not for the use of alternative data.

Moreover, alternative data are beneficial for those who are excluded from the financial system. The World Bank has a series of global financial inclusion initiatives, including the Global Financial Inclusion Index. Leaps have been made recently. The World Bank’s 2017 Global Findex database—the latest year available—shows that 1.2 billion adults have obtained an account since 2011, including 515 million since 2014. This charge has been led by emerging economies such as India, which implemented a demonetization of a large fraction of outstanding currency in 2016. In conjunction with other reforms, these policies promoted the widespread use of bank accounts among Indian citizens, bringing many of them into the fold of the financial system.

The existence of a “credit gap” also applies to small and medium enterprises. Corporations have two main avenues of access to debt markets: direct lending from banks and bond markets. Bond markets are only available to those with credit ratings, again intermediated by big providers such as S&P, Moody’s, and Fitch. Because losses given default are large when a corporate borrower does not repay its loans, lenders tend to be unwilling to lend blindly to borrowers who do not have a rating. Moody’s Ultimate Recovery Database from 2007 through 2009 reports that the average recovery rate on a defaulted loan in the USA was around 30 cents on the dollar. However, the credit rating companies do not rate every borrower, so many companies may lack access to financing due to their inability to obtain a rating. The World Bank estimates that over half of small and medium enterprises do not have access to formal credit.

5.1 The Canonical Credit Scoring Model

In order to rank borrowers by credit risk, credit rating agencies and data scientists typically use a statistical model designed to predict default events. Credit scoring models can be applied to many types of entities, such as countries (sovereign ratings), companies (corporate ratings), and individuals (FICO scores). For example, Moody’s long-term corporate ratings range from Aaa down to C, with the former representing the safest rating and the latter typically representing default. Ratings from Aaa down to Baa3 are known as “investment grade,” whereas anything below carries more credit risk and is labeled “non-investment grade,” colloquially known as “junk” status. Other corporate credit rating agencies use slightly different labels but capture very similar thresholds, which makes ratings comparable across agencies. The mapping from predicted probabilities to letter grades is based on the relative predicted probability of default among peers. For example, Moody’s credit rating for Tesla, the electric car producer, is B2, which is within the “junk” category and six (out of twenty) rating steps away from the lowest ratings, reserved for firms actually in default.

Meanwhile, the FICO score, named after the original Fair, Isaac and Company, ranges from 300 to 850. The same range is also used by Experian, TransUnion, and Equifax, the big three credit score companies for individuals. Borrowers with scores above roughly 660 are known as “prime,” and those with scores above 750 as “super-prime.” Most prime borrowers and above will be able to receive a loan, while borrowers below the prime cutoff are known as “subprime.” Subprime borrowers are more commonly denied loans, and when they do receive a loan, it tends to carry a higher interest rate to compensate for the higher perceived riskiness. The Financial Crisis of 2008 is commonly known as the “subprime mortgage crisis,” in which defaults among subprime mortgage borrowers precipitated the crash of the financial system.

No matter the nomenclature, the classifications are based on the relative predicted probability of a default event occurring, using a host of financial variables. The relevant outcome variables are the probability that a default event occurs and the severity of the default should it occur. For brevity, we focus on the former in the canonical setup of the credit scoring model, which forecasts a binary response variable, defined as a “default event,” using the specification below,

$$ \begin{aligned} & 1\left\{ {{\text{Default}}\;{\text{Event}}_{t + k} } \right\} \\ & \quad = f\left( {\left\{ {1\left\{ {{\text{Default}}\;{\text{Event}}_{t - \tau } } \right\}} \right\}_{\tau = 1}^{t - 1} ,\left\{ {X_{1,t - \tau } } \right\}_{\tau = 1}^{t - 1} ,\left\{ {X_{2,t - \tau } } \right\}_{\tau = 1}^{t - 1} , \ldots \left\{ {X_{p,t - \tau } } \right\}_{\tau = 1}^{t - 1} } \right) \\ \end{aligned} $$

where the variables \( X_{1,t} ,X_{2,t} , \ldots X_{p,t} \) are explanatory variables and the target variable \( 1\left\{ {{\text{Default}}\;{\text{Event}}_{t + k} } \right\} \) is an indicator of a default event occurring within \( k \) future time periods relative to the current time period \( t \). Examples of commonly used default events are whether a loan becomes 30-days delinquent (that is, payments are 30 days late), 90-days delinquent, or whether a company declares Chapter 7 or Chapter 11 bankruptcy. Depending on the credit model, previous default events may also be included as explanatory variables to predict future default events.

Because the outcome variable in a credit scoring model is a binary response, we interpret predicted values from the function \( f\left( \cdot \right) \) as the probability of a default event occurring, based on the definition of the default event. Since the response is binary, we also want the functional form \( f\left( \cdot \right) \) to account for the fact that the response only varies between 0 and 1. A linear specification such as those used to predict equity returns in Sect. 4 may not be well-behaved, as predicted probabilities may fall below zero or above one due to the linear functional form. Therefore, commonly used functional forms include the class of “generalized linear models,” which introduce linking functions to account for the boundedness of the outcome variable. An example of a commonly used functional form is the logistic function, which takes the form

$$ f\left( {X_{1} ,X_{2} } \right) = \frac{1}{{1 + \exp \left( { - \left( {\beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} } \right)} \right)}} $$

where the explanatory variables \( X_{1} \) and \( X_{2} \) can take any value on the real line. The specification can be generalized to accommodate an arbitrary number of explanatory variables. Graphically, the logistic equation compresses the space of explanatory values, ranging from minus infinity to infinity, into an outcome space ranging from zero to one.

The elegance of the logistic functional form is that it permits the interpretation that the linear index \( \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} \) is the predicted log odds ratio, where the odds ratio is the ratio of the probability of a default event to the probability of a non-default event. In words, the odds ratio may be, for example, that Company A is five times more likely to be delinquent than not over the next 90 days, where the default event is the 90-day delinquency binary response variable.
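As an illustration, the sketch below fits a logistic default model with statsmodels on a hypothetical loan table; the file and feature names are assumptions, and the exponentiated coefficients give the odds-ratio interpretation described above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

loans = pd.read_csv("loans.csv")                       # default_90d, income, utilization (hypothetical columns)
X = sm.add_constant(loans[["income", "utilization"]])
logit_fit = sm.Logit(loans["default_90d"], X).fit()

print(logit_fit.params)          # coefficients on the log-odds scale
print(np.exp(logit_fit.params))  # odds ratios per one-unit change in each feature
p_hat = logit_fit.predict(X)     # predicted default probabilities, bounded in (0, 1)
```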

Apart from the logistic regression, another common functional form is the probit, which is based on the cumulative distribution function of the standard normal distribution:

$$ f\left( {X_{1} ,X_{2} } \right) =\Phi \left( {\beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} } \right), $$

where \( \Phi \) is the cumulative distribution function of the standard normal distribution. Conceptually, the probit produces a similar mapping to the logistic equation, converting explanatory variables ranging over the whole real line into predicted probabilities ranging from zero to one. Although the probit specification is similarly simple, the interpretation of its coefficients is less intuitive than the log-odds interpretation of the logit.

Both the logistic and probit functions range from 0 to 1 on the y-axis, but the logistic curve has a more stretched-out S-shape than the probit, all else equal (holding fixed the same parameters \( \beta_{0} , \beta_{1} \), and \( \beta_{2} \)). If the findings of a model are robust, the predicted values should not differ vastly across linking functions, though the quantitative performance in terms of forecast precision may vary slightly. One linking function does not dominate another ex ante; the best-fitting choice depends on the context. In cases where you are more concerned about the tails of the distribution (data points clustered at very low levels of “X” for one response value and high levels of “X” for the other), you may prefer the probit specification as it reaches the extremes faster. For settings where the explanatory variable space is more uniformly distributed, a logistic specification may perform better. Other linking functions exist, but typically share similar sigmoidal (colloquially, S-shaped) features that compress outcomes to the zero-to-one range.
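The following short sketch, with arbitrary illustrative parameters, compares the two linking functions numerically and shows that both map the real line into the unit interval.

```python
import numpy as np
from scipy.stats import norm

b0, b1 = 0.0, 1.0                      # illustrative parameters
x = np.linspace(-6, 6, 241)
z = b0 + b1 * x                        # linear index

p_logit = 1.0 / (1.0 + np.exp(-z))     # logistic CDF
p_probit = norm.cdf(z)                 # standard normal CDF

# Both map the real line into (0, 1); the logistic curve approaches 0 and 1
# more slowly (fatter tails) than the probit for the same linear index.
print(np.column_stack([x, p_logit, p_probit])[::40])
```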

In addition to predicting binary default events, credit scoring models may be expanded to forecast the loss given default should a default occur. Typically, such exercises study differences across defaulted entities over time, since loss given default is not a well-defined concept for non-defaulted firms.

Now, we turn our attention to the selection of the explanatory variables \( X_{1,t} ,X_{2,t} , \ldots X_{p,t} \). Traditionally, a dichotomy has been drawn between human and machine information in the provision of credit. Two types of information enter the canonical credit scoring model: hard information, which is quantitative information pertaining to financial health, and soft information, which is qualitative or discretionary information about the creditworthiness of a borrower or of a particular loan. The main reason financial inclusion matters is that collecting records on income or past repayment can only begin with formal financial inclusion, that is, when a financial intermediary maintains records about an individual or entity.

To illustrate these definitions by way of example, if referring to a mortgage loan, hard information may include things such as the current market value of the house, hard characteristics such as a borrower’s income and repayment history and existing debts. Soft information may include things more difficult to objectively quantify, such as a loan officer’s “gut-feeling” about the borrower, knowledge about the neighborhood where a house is located such as its proximity to an upcoming construction project. Soft information may be thought of as that which is traditionally difficult to codify in an existing credit scoring model.

Alternative data can blur this distinction by codifying soft information systematically. Examples of alternative data used in credit scoring include information on a company’s suppliers and customers, and even psychometrics. Although not every borrower has proper financial statements, nearly every borrower nowadays has a mobile phone, a personality, and a social media account, and all of these can be sources of relevant information for predicting credit quality. But before going into more detail on specific examples of alternative data for credit scoring, we must first discuss some statistical methods commonly used in this field. Because alternative data can contain many features, and the credit default events needed for prediction may not occur very frequently, we must consider additional statistical methods to handle the high dimensionality of the prediction problem.

Shrinkage Estimators for High Dimensional Data. One simple methodology for addressing high dimensional data is the least absolute shrinkage and selection operator (LASSO) and its elastic net extension. Both methods belong to the same family of extensions of the OLS objective function that add a penalty term on the size of the estimated coefficients, of the form:

$$ \hat{\beta } = \mathop {{\text{argmin}}}\limits_{\beta } \left\| {Y - X\beta } \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{{L_{i} }} , $$

where the penalty norm \( L_{i} \) is the \( l_{1} \)-norm \( \sum_{j} \left| {\beta_{j} } \right| \) for LASSO and the squared \( l_{2} \)-norm \( \sum_{j} \beta_{j}^{2} \) for the ridge estimator; elastic net specifications include both the LASSO and the ridge penalty terms, effectively mixing the two methods. The benefit of the combination is that whereas LASSO can select at most as many non-zero coefficients as there are data points, the elastic net can retain more non-zero coefficients than the number of observations, similar to the ridge estimator, while still imposing a stricter, sparsity-inducing penalty than the ridge specification alone. These shrinkage specifications fall under the class of linear models and may also be combined with a linking function such as the logistic or probit equation, in which case the model is typically estimated through penalized maximum likelihood.

Conceptually, this family of shrinkage estimators takes the stance that not every variable in the statistical exercise fundamentally matters or contains incremental information for predicting the outcome. The penalty term therefore shrinks estimates toward zero. Unlike standard OLS, this approach generates purposely biased estimators in the hope of mitigating overfitting. In terms of the bias-variance trade-off, these methods will tend to underfit rather than overfit the data compared to an OLS benchmark. However, out-of-sample prediction errors can improve dramatically, as OLS with large numbers of explanatory variables tends to overfit the data severely.
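A minimal sketch of a penalized logistic default model in scikit-learn is shown below; the dataset and feature names are assumptions, and the penalty settings are illustrative rather than tuned.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

loans = pd.read_csv("loans_with_many_features.csv")   # hypothetical wide feature table
y = loans["default_90d"]
X = loans.drop(columns=["default_90d"])

# L1 (LASSO-style) penalty: many coefficients are shrunk exactly to zero.
lasso_logit = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000),
).fit(X, y)

# Elastic net: mixes the l1 and l2 penalties via l1_ratio.
enet_logit = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.1, max_iter=5000),
).fit(X, y)
```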

More sophisticated statistical models contain similar features, a main target objective function, as well as a penalty function for model complexity. As the exact specification of the objective function differs, the performance in different contexts also differs, and no single model tends to dominate across multiple contexts.

For credit scoring in particular, models typically consider a delinquency as a binary response variable taking the value zero if a loan does not default and one if the loan has defaulted. Other specifications may also include modeling the recovery rate or number of days that a loan is delinquent directly as a continuous variable. However, since defaults do not occur frequently and recovery rates or number of days delinquent are not well defined for well-performing loans, such statistical modeling requires additional considerations by using methods which account for censored or truncated data such as a Tobit regression or the Cox proportional hazard model used in survival analyses.

To keep the discussions as broad as possible without getting mired down in the statistical details, we discuss the applications of alternative data for credit scoring which focus on default as a binary response variable.

5.2 The Advantage of Having Data

As noted before, credit scoring has traditionally been built on tracking the historical records of individuals and firms, and we also mentioned that in some cases these data are rare. At the individual level, the data may be rare because, for example, financial inclusion in a particular country is low. Another issue is human migration. With the rise of globalization over the past two decades, cross-border migration has become more common than in the past. In the USA, immigration has been rising, with the Pew Research Center reporting that in 2018, 44.8 million people in the USA were immigrants. Many of these immigrants come to the USA for high-income jobs. Yet, without a formal credit history transported from their home country, such ostensibly creditworthy individuals are left out of the financial system.

The three main credit agencies—TransUnion, Experian, and Equifax—each operate internationally to some extent. Thus, one solution would be for these agencies to collect and transport records across borders. However, for individual credit scoring, the data on specific citizens are subject to privacy laws and regulations. This is the issue tackled by Nova Credit, a financial services firm founded in 2015 by immigrants to the USA. Nova Credit works with regulatory authorities and credit rating agencies in several countries, normalizing their data so that American underwriters have information comparable to a domestic American credit score. The countries covered include Australia, Brazil, Canada, India, Mexico, Nigeria, South Korea, and the UK.

Merely having data is particularly useful in the corporate credit sector, where such information is highly secretive. Usually, the data are acquired on a give-to-get model. For example, Dun and Bradstreet and Cortera, among others, provide data on corporate trade credit transactions. Trade credit is the credit implicit in a sale of inventory by a supplier to a customer. For example, suppose Kowloon Dairy, a dairy goods provider in Hong Kong, sells its products at 7-11 and Circle-K in Hong Kong. Rather than demanding payment upfront, the supplier will give the buyer, say, 30 days to pay, with penalties for paying late. Dun and Bradstreet and Cortera gather information from the supplier—in this case Kowloon Dairy—on the average repayment behavior of its customers. When aggregated across suppliers, who are thus anonymized, this forms a composite of the creditworthiness of stores like 7-11 and Circle-K.

Why are suppliers willing to provide data? They receive services from the trade credit reporting agency in return. Of course, not all suppliers are willing to report. We suspect, for example, that Foxconn, a major supplier of Apple, would not be able to mask itself were it to report truthfully about Apple, as their contract likely amounts to billions of dollars in accounts. Thus, there may be a selection bias in trade credit data obtained in this way. It is also not a plausible data collection practice in certain countries, for example Japan, where collecting this data is taboo.

Another example of this model is Moody’s, which operates a consortium in which banks provide anonymized data on default rates and firm characteristics for corporate loans. Banks participate because it gives them access to data that enables loan risk benchmarking.

5.3 Mobile Data

Today, while not everyone may have a credit score or a bank account, many people have access to a smartphone and a social media account. Statista reports the number of smartphone users around the world at around 3.5 billion and the number of monthly active Facebook users at over 2.7 billion as of the second quarter of 2020.

For a perspective on how important smartphones and social media are, over 75% of Americans have smartphones and 47% of smartphone users say they could not live without their devices. In China, Statista reports 99.3% of all Internet users go online using their mobile devices, or about 897 million Chinese people out of the 1.4 billion in China. Out of all mobile phone users, only 3% do not use a smartphone.

Data from mobile phones and social media are also likely to be up-to-date and feature-rich, potentially useful for both forecasting and nowcasting. Vox reports that over 65% of Americans check their phones up to 160 times a day, and Digital 2019 reported that 56% of all Web site traffic worldwide was generated through mobile phones.

Such metadata on mobile phone usage do not include the exact contents of communications, due to privacy concerns, but do include information about the use of mobile devices and social networks. This type of metadata is collected by both public and private enterprises. In the public realm, Section 215 of the USA Patriot Act of 2001 permitted the collection of such metadata on USA citizens by the National Security Agency, a program famously revealed by Edward Snowden, then a contractor for the agency. In the private realm, mobile data providers store this information and may sell it for use.

Unsurprisingly, applications have emerged studying the relation between mobile phone usage and consumer loan quality. San Pedro et al. [60] study data from a Latin American country and consider the effects of both mobile phone usage and social network usage on loan delinquency. The mobile phone usage data include metadata on the number of daily text messages and phone calls, the duration of calls, and the daily time between consecutive text messages and calls. Social network features include the number of unique correspondents among contacts, the number of reciprocated social network interactions such as calls or “likes,” the fraction of reciprocated interactions relative to total interactions, and the median time between reciprocated interactions.

They also considered features of the mobile device itself, such as the device brand, operating system, device type, phone number type, status, late payments, and months since the initial SIM card activation. Finally, they supplemented their data with mobility features captured by the smartphones’ sensors, such as the radius of gyration, distance traveled, and the most popular antennas to which the cellphone connected. The goal of assembling this mountain of data was to evaluate whether alternative data can supplement or replace existing credit scoring systems, and if so, which sources of data appear most useful for predicting credit events.

Finally, the main outcome variables come from a bank and are binary indicators of whether a loan account is 30 days or 90 days past due. A potential issue is that, since defaults may take time to manifest, researchers must be cognizant of when their sample was collected and how much time elapsed between loan origination and the point at which delinquency is measured. In this setting, for the 30-days past due measure of delinquency, the researchers consider accounts at the end of their sixth month. Although loans can last more than six months and discarding subsequent data loses some information, truncating the data in this way permits an easily interpretable empirical exercise: in words, “studying the probability of an account being 30-days delinquent within the first six months of the loan.” When studying the 90-day past due measure, they consider loans over their first nine months.
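As an illustration of this target construction, the sketch below builds a “30 days past due within the first six months” indicator from a hypothetical account-level delinquency table; the file and column names are assumptions.

```python
import pandas as pd

accts = pd.read_csv("accounts.csv", parse_dates=["origination_date"])   # account_id, origination_date
dpd = pd.read_csv("delinquencies.csv", parse_dates=["as_of_date"])      # account_id, as_of_date, days_past_due

merged = dpd.merge(accts, on="account_id")
merged["months_on_book"] = (merged["as_of_date"] - merged["origination_date"]).dt.days / 30.44

# Keep only the first six months on book, then flag accounts that ever hit 30+ days past due.
early = merged[merged["months_on_book"] <= 6]
target = (early.groupby("account_id")["days_past_due"]
               .max()
               .ge(30)
               .astype(int)
               .rename("dpd30_within_6m"))
print(target.value_counts())
```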

Given all of the features observed by the econometrician, the authors build a model to identify those that most accurately predict default. The main empirical models include several dimension reduction techniques, but the main method is the LASSO discussed before.

5.4 Measurement of Predictability

The researchers consider multiple measures of precision, but the main measure is the AUC-ROC, the area under the receiver operating characteristic curve, which captures the ability of a statistical model to rank users according to their probability of default. The measure is well-defined for binary outcome variables and is a commonly used performance metric in the credit scoring literature. The AUC-ROC value ranges from 0 to 1: a model whose predictions are 100% wrong has a value of 0, a model whose predictions are 100% correct has a value of 1, and random coin-flip guessing produces a value of 0.5. The measure is beneficial for two main reasons: (1) it is scale-invariant, meaning that rescaling the predicted scores does not change the performance metric, and (2) it is classification-threshold-invariant, meaning that changing the cutoff at which a predicted probability of default is classified as a predicted default does not affect the measure.
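For concreteness, the snippet below computes the AUC-ROC with scikit-learn on a toy set of labels and predicted probabilities and illustrates its invariance to monotone rescaling of the scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])                           # observed delinquency flags (toy data)
p_hat = np.array([0.10, 0.30, 0.80, 0.20, 0.60, 0.65, 0.05, 0.70])    # model-predicted default probabilities

print(roc_auc_score(y_true, p_hat))        # 0.5 = random ranking, 1.0 = perfect ranking
# Invariant to monotone rescaling of the scores: only the ranking matters.
print(roc_auc_score(y_true, 100 * p_hat))
```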

They find that the AUC-ROC of credit scores from the credit bureau is 0.571 for the 30-day delinquency measure and 0.591 for the 90-day delinquency measure. In comparison, a classification model using detailed call records along with mobile phone and socioeconomic features achieves an AUC-ROC of 0.716 for the 30-day measure and 0.725 for the 90-day measure, a relative improvement of over 20%. In other words, the alternative data outperformed even the credit bureau data based on an individual’s historical financial transactions.

Björkegren and Grissen [11] find similar results using a digital lender, focusing specifically on the incremental predictability results for unbanked individuals in a Latin American country. Although their sample consists of individuals with average incomes slightly higher than the average national income level, their conclusion is that the digitization of lending and the alternative data can increase financial access to individuals who were previously not part of the financial system.

5.5 Digital Footprint Data

An innovation upon mobile data is digital footprint data—the digital exhaust of how users interact with digital services such as websites or applications. Digital footprints encompass a wide range of potential modeling features, such as the time users spend on a page, the types of content they consume, interactions with this content, information they enter into forms or provide to the digital service, or other characteristics of the user such as the type of device being used (laptop or mobile) or the speed of internet connection. The basic idea is that these characteristics contain rich economic content about the economic conditions of the user.

Berg et al. [10] demonstrate the power of the digital footprint approach using data from a German furniture retailer. The premise of the study is to build a credit scoring model using digital footprint data to identify customers likely to default on large furniture purchases. The data are rich, including whether a borrower arrived at the loan application webpage through a price comparison site or directly through the provider’s smartphone application. Other features include the user’s device type, such as PC versus mobile, and among mobile devices, the operating system, such as iOS versus Android. The data also include proxies for reputation and character, such as whether a borrower’s email address contains their full name. Customers with their names in their email address are deemed more honest and, consistent with this interpretation, are 30% less likely to default. Using this rich set of features, the authors build a simple machine learning model to predict customer default.

In their baseline model based on digital footprint features alone, the authors achieve an AUC-ROC value of 0.696. Meanwhile, the credit bureau score alone produces an AUC-ROC value of 0.683. Interestingly, combining the two sources of information produces even better performance, improving accuracy by 5.3%. Importantly, the digital footprint data also appear to work equally well for non-credit-scored and credit-scored customers, again echoing earlier results suggesting that alternative data can expand financial inclusion and credit access.

figure j

5.6 Psychometric Data

Conditional on a particular set of hard information, two borrowers can differ in their social values, aversion to risk, personality, or ethics. To the extent such differences lead to different repayment behaviors among people in otherwise similar financial condition, a psychological profile may be a powerful predictor of repayment probabilities. Supporting this premise, academic studies have shown that psychometric profiling, the broad class of techniques for inferring personality characteristics, is a potentially powerful approach to credit scoring.

A number of papers demonstrate the power of psychometric profiling for credit scoring; Arraiz et al. [4] and Klinger et al. [51] are two of the first. The psychometric tools they employ are simple surveys. In these surveys, some, but not all, of the questions are related to financial conditions or financial sophistication; the rest are related to personality traits, values, or beliefs. Importantly, while the answers to the questions themselves may be informative, the manner in which the questions are answered is also important.

Examples of questions are shown in the reproduced figure below. At first blush, some survey questions have obvious value to a credit scoring model, while the value of others is less obvious. First, certain responses, such as reporting not working many hours a week, immediately provide hard information useful in models. Second, some of the information speaks to financial sophistication. In example 3, the choice between 100 gold coins in two years and 50 gold coins today reveals a person’s discount rate; answered truthfully, it provides very useful information about the urgency of the person’s financial needs. Of course, a financially sophisticated person might understand the intention behind such questions and feign financial stability despite being in financial distress. Third, other questions are more ambiguous in nature, implying psychological characteristics with non-obvious value and thus being harder for the respondent to game or manipulate.

What is also interesting about psychometric tests is not just the content, but the secondary characteristics implied by the responses. For example, consider the digital footprint of the response: did the person take a long time to answer, or a brief second, implying the question set is less useful or the person is more careless? Is the answer consistent with other answers? In some questions, the respondent is asked to report what color or shape the person saw on the previous page and is given the option to “go back” to the previous page. Clicking this button would potentially be considered cheating, which one might imagine is telling of whether a potential borrower is a poor credit risk.

From these questions and answers, one can encode multiple personality characteristics, which can then be used to predict default in a sample of bank customers. Klinger et al. [51] build a model that correlates the psychometric characteristics of respondents with default behavior in a training sample, and this model can then be applied to any future survey respondent. If the model predicts default reasonably well out of sample, it can be generalized. Because surveys can be answered by anyone, they can be applied to essentially any respondent, including those without previous financial history. The study shows that a model built on citizens of one country appears able to predict credit default out-of-sample for other citizens, with an AUC of around 70% in some samples. Interestingly, there also seems to be predictive power across countries, with a model trained on African citizens able to predict the credit repayment behavior of users in Peru [63].

figure k

Source Arraiz et al. [4]

This powerful idea of using psychometrics has made it into practice. LenddoEFL (formed from the merger of Lenddo and the Entrepreneurial Finance Lab, which grew out of the Harvard Kennedy School) has pioneered commercial collaborations with banks and financial institutions from around the world, leading to partnerships with banks across several continents.

Another, somewhat humorous example explores the physical appearance of individuals. Psychologists have long known that people form impressions of others’ behavior based on their appearance, hence the emphasis on “making a good first impression.” First impressions are formed rapidly and have been documented to predict social outcomes such as elections [63]. With the rise of fintechs, studying whether human biases are implicit in the marketplace is important for understanding how credit gets distributed across different borrowers.

Duarte et al. [34] extend this line of research to study how first impressions affect financial decisions in the peer-to-peer consumer loan marketplace Prosper.

The researchers address two questions: (i) do a borrower’s facial features relate to whether or not a loan gets funded, and (ii) is a borrower’s appearance actually related to their credit grade and default probability? If trustworthiness by way of appearance only predicts funding probability, without a corresponding improvement in loan performance, it could be argued that these physical features simply reflect lender biases. The researchers asked 25 human raters to rate borrowers who uploaded a picture in terms of trustworthiness and the willingness (not ability) of the potential borrower to repay. They found that faces rated as more trustworthy were more likely to get a loan approved, and that borrowers whose photos received higher trustworthiness ratings received interest rates about 0.5 percentage points lower than the average borrower.

The study finds that more trustworthy-looking borrowers pay lower interest rates than less trustworthy-looking borrowers and also have lower probabilities of default. However, the lower interest rate does not fully offset the lower default probability, meaning that lenders who lent to more trustworthy-looking borrowers could earn higher excess returns net of average default losses. To disentangle trustworthiness from other potential interpretations, the researchers also asked the raters whether they found the face attractive and whether the photograph suggested the person being rated was rich. Even after controlling for these features (averaged across all raters), the trustworthiness measures retained predictive power.

figure l

Source Duarte et al. [34]

In sum, Duarte et al. [34] suggest that visual information and human judgments may both be useful for assessing creditworthiness. Anecdotally, Ping An, the insurance giant in China, has applied enhanced versions of this technique in its onboarding process, even using live video to assess facial and verbal reactions to questions. This provides some testament to the validity of the idea.

We have discussed many features that show promise in credit scoring. However, to differentiate the best features, we may need to discriminate which features provide incremental predictive power through shrinkage estimation. It may be the case that while many of these features improve upon traditional credit bureau models, only a handful of these features stand out when thrown together into a model.

6 Application in Macroeconomics

Macroeconomic measurement is an important area of application for big data analytics. Central Banks collect information on the state of the macroeconomy. The statistics they document such as gross domestic product, inflation, and employment—as well as their forecasts of these quantities—provide policymakers, business owners, and investors baseline inputs they can use to make important decisions. For example, a retailer worried about consumer demand may refer to consumer sentiment indices or economic growth statistics before deciding to expand, if they believe their business to be sufficiently exposed to the macroeconomy. Investors may buy securities that appreciate during volatile or bad times, such as gold or VIX-indexed securities, in anticipation of poor economic fundamentals. Or top-down “global macro” investors may use these statistics as inputs into economic models, taking active positions on currency baskets or equities they expect to outperform where they disagree with the consensus view.

In this section, we focus on measurement rather than forecasting, for three main reasons. First, in many emerging markets, macroeconomic measurement is often unreliable, with some governments even lying about statistics for political purposes. Second, for the purpose of decision-making, it is not obvious we want to forecast the official numbers rather than have better, more complete, or more granular numbers. For example, suppose a Central Banker aims to raise interest rates due to excessive lending in the banking sector, but a weak labor market makes it harder to tighten. If Central Bankers had alternative data indicating that the labor market weakness owed to a specific sector, perhaps they could offset this sector-specific weakness with another policy while raising rates. Third, alternative data series are often too short to forecast macroeconomic time series. GDP in most countries is released quarterly or semi-annually, and the relationship between GDP and a specific predictor may vary between booms and troughs, so a long history may be essential.

However, forecasting is still important. When governments announce macroeconomic statistics that are surprisingly poor, market returns may be negative, and the cost of capital for firms may rise as investors fear worsening economic conditions. In practice, there are many economic quantities one could predict. But which matter most? Joo et al. [49] study emerging market currency returns around macroeconomic announcements in the USA. For each economic quantity they study—non-farm payrolls, GDP, unemployment rates, trade, and others—they collect the consensus forecast and calculate “surprises” as the deviation of the realized number from consensus. They find that non-farm payrolls—monthly announcements of non-farm employment—and quarterly GDP surprises move emerging market foreign exchange returns the most. However, this study covers the period 2000–2006, which may be considered stale today, and it does not consider all macroeconomic indicators of import, such as oil prices.
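A minimal sketch of this surprise calculation, assuming a simple table of announcements with consensus forecasts, is shown below; one common refinement, scaling by each indicator’s historical surprise volatility, is included so that surprises are comparable across indicators.

```python
import pandas as pd

ann = pd.read_csv("macro_announcements.csv", parse_dates=["date"])
# assumed columns: date, indicator (e.g. "nonfarm_payrolls"), realized, consensus
ann["surprise"] = ann["realized"] - ann["consensus"]

# Scale by each indicator's historical surprise volatility to compare across series.
ann["std_surprise"] = ann.groupby("indicator")["surprise"].transform(lambda s: s / s.std())
print(ann.sort_values("std_surprise").tail())
```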

Reflective of the changing nature of financial markets, there is some evidence that the efforts of alternative data practitioners are already having a marked impact on the market. Henderson et al. [40] documented that satellite data are useful for predicting economic growth. Mukherjee et al. [58] then show that satellite data can preempt government announcements. Satellites photograph the earth’s surface, and such imagery is oftentimes used by hedge funds to measure, for example, oil reserves. Using these data, the researchers show that quarters with higher cloud cover, when satellites observe less, are associated with larger announcement surprises, whereas quarters with low cloud cover are associated with smaller surprises. This mirrors anecdotes throughout the industry that traditionally impactful macroeconomic indicators in the USA are losing their market impact as macroeconomic information is factored into prices before the government announces it. Thus, while statistics such as GDP or non-farm payrolls likely continue to bear special relevance, the availability of alternative data for an indicator may reduce the value of forecasting it for trading purposes.

6.1 Statistical Framework

While we emphasize measurement over forecasting, there will gradually be more time series with histories long enough for macroeconomic forecasting. Suppose one is forecasting GDP. Over ten years, there are only 40 quarterly GDP releases. If one has ten or twenty different datasets, or even more, a good question is how these data should be combined. Traditionally, we want far more observations in a model than parameters to estimate. For time series, the relevant class of methods is called dimension reduction.

Principal Components Analysis/Principal Component Regression. An alternative to the family of shrinkage estimators above is principal component analysis (PCA), a variance decomposition approach typically used as part of exploratory data analysis. Applied to a matrix of data, PCA is designed to capture the maximum amount of variation within the data for each number of principal components considered, with the idea of using fewer principal components than the number of variables in the original dataset. Theoretically, the maximum number of principal components that may be extracted is the rank of X: if there are ten columns but the last one is a linear combination of the others, PCA can only generate nine principal components. In practice, however, even highly correlated data contain slight noise in each variable’s measurement—the multicollinearity problem in a standard OLS setup—so a data matrix with many correlated variables may still numerically appear full rank.

The first principal component loading vector \( w_{1} \) solves the following optimization problem:

$$ w_{1} = \operatorname*{argmax}_{\left\lVert w \right\rVert = 1} \frac{w^{\prime} \left( X^{\prime} X \right) w}{w^{\prime} w} $$

where \( X^{\prime} X \) is proportional to the variance-covariance matrix of a set of variables (typically the explanatory variables in a prediction setting, or all the variables in a more exploratory setting), assuming the columns of X have been demeaned. Then, defining the residual matrix as

$$ \hat{X}_{k} = X - \sum_{i = 1}^{k - 1} X w_{i} w_{i}^{\prime} $$

where \( X w_{i} \) is the i-th principal component (score vector), which is simply the projection of the original data matrix onto the loading vector. The remaining principal component loadings then come from solving

$$ w_{k} = \operatorname*{argmax}_{\left\lVert w \right\rVert = 1} \frac{w^{\prime} \left( \hat{X}_{k}^{\prime} \hat{X}_{k} \right) w}{w^{\prime} w}. $$

Solving this optimization problem amounts to computing the eigenvectors and corresponding eigenvalues of the variance-covariance matrix, with the eigenvectors ordered so that the first corresponds to the largest eigenvalue (all eigenvalues of a variance-covariance matrix are non-negative) and the last to the smallest.

Grounded in linear algebra, PCA amounts to a rotation of the data into a new, orthogonal (uncorrelated) basis. Each principal component is a linear combination of all the variables. This allows data scientists to compress a matrix with p columns down to a handful of principal components, chosen based on the amount of variation the data scientist wants captured in the analysis. In addition, when used in OLS on the same sample used to conduct the PCA, the orthogonality of the principal components means that adding or removing components does not change the estimated coefficients on the others, leading to potentially more stable results. However, when applied out-of-sample, the in-sample weights no longer guarantee that the transformed out-of-sample data will be orthogonal, although the correlation should be low if the out-of-sample variance-covariance matrix is numerically close to the in-sample one.

The transformed principal components are then used in a standard OLS or more sophisticated statistical models to predict some target outcome variable.
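To make the two-step procedure concrete, the following sketch extracts a handful of principal components from a panel of candidate predictors and regresses a target on them. It uses Python's scikit-learn; the simulated data, the choice of three components, and the variable names are purely illustrative.

```python
# Minimal principal component regression (PCR) sketch on simulated data.
# The data, the three-component choice, and variable names are illustrative only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_obs, n_vars = 40, 20                       # e.g., 40 quarterly releases, 20 candidate indicators
X = rng.standard_normal((n_obs, n_vars))
y = X[:, :3].sum(axis=1) + rng.standard_normal(n_obs)   # target driven by a few common factors

# Step 1: compress the predictor matrix into a handful of principal components.
pca = PCA(n_components=3)
scores = pca.fit_transform(X)                # X projected onto the loading vectors w_i
print("Share of variance captured:", pca.explained_variance_ratio_.sum())

# Step 2: regress the target on the component scores with standard OLS.
ols = LinearRegression().fit(scores, y)
print("In-sample R^2:", ols.score(scores, y))

# Out-of-sample use: apply the *in-sample* loadings to new data before predicting.
X_new = rng.standard_normal((4, n_vars))
print("Forecasts:", ols.predict(pca.transform(X_new)))
```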

Conceptually, this variance decomposition approach takes the stance that, to some extent, all variables in the original dataset may carry useful information through their correlation structure. This approach is fundamentally different from shrinkage estimators: whereas shrinkage estimators take the stance that only a handful of variables are useful for prediction, PCA takes the stance that potentially all of the data are useful and tries to extract common components from them.

However, a key limitation of PCA for prediction is that the outcome to be forecast appears nowhere in the objective function. The implicit assumption is that the principal components happen to be useful for predicting a specific outcome. Therefore, while PCA can be useful for decomposing common components to better understand the underlying correlation structure in the data, its usefulness for prediction may be limited, as the extracted common components may not necessarily predict a particular output. As a result, statisticians have developed a conceptually similar approach that focuses on variables' predictive power for an outcome variable, called partial least squares, which we briefly discuss below.

Partial Least Squares. Similar to PCA in that it permits all variables to contribute to predicting a specific outcome, and in that it reduces the dimensionality of the data by exploiting fundamental similarities across variables, partial least squares (PLS) adopts a slightly different objective function. Rather than decompose the variance-covariance matrix of a set of explanatory variables, PLS seeks a linear decomposition of both the explanatory variable matrix X and the outcome variable matrix Y such that

$$ X = T_{X} W_{X}^{\prime} + E_{X} $$
$$ Y = T_{Y} W_{Y}^{\prime} + E_{Y} $$

where \( T_{X} \) and \( T_{Y} \) are the score matrices for X and Y, respectively; \( W_{X} \) and \( W_{Y} \) are the loading matrices containing the weights for X and Y, respectively, with individual loading vectors \( w_{x,i} \) and \( w_{y,j} \) for components \( i, j \) running from 1 to the number of extracted components; and \( E_{X} \) and \( E_{Y} \) are the residual matrices for X and Y, respectively. The PLS method is set up to maximize the covariance between \( T_{X} \) and \( T_{Y} \), and this setup applies regardless of whether Y is a matrix of outcome variables or a single vector.

Also grounded in linear algebra, through the lens of an eigenvalue decomposition algorithm, PLS is equivalent to computing the eigenvector and eigenvalue decomposition of the matrix \( \left( Y^{\prime} X \right)^{\prime} \left( Y^{\prime} X \right) \) for the loadings on X and of the matrix \( \left( X^{\prime} Y \right)^{\prime} \left( X^{\prime} Y \right) \) for the loadings on Y, where \( X^{\prime} Y \) is proportional to the covariance between X and Y.

As with PCA, the estimation of the loadings is iterative. The first pair of loadings comes from the following objective functions:

$$ w_{x,1} = \operatorname*{argmax}_{\left\lVert w \right\rVert = 1} \frac{w^{\prime} \left( Y^{\prime} X \right)^{\prime} \left( Y^{\prime} X \right) w}{w^{\prime} w} $$
$$ w_{y,1} = \operatorname*{argmax}_{\left\lVert w \right\rVert = 1} \frac{w^{\prime} \left( X^{\prime} Y \right)^{\prime} \left( X^{\prime} Y \right) w}{w^{\prime} w} $$

Then, as with PCA, define the residual matrices for X and Y as

$$ \hat{X}_{k} = X - \sum_{i = 1}^{k - 1} X w_{x,i} w_{x,i}^{\prime} $$
$$ \hat{Y}_{k} = Y - \sum_{i = 1}^{k - 1} Y w_{y,i} w_{y,i}^{\prime} $$

where \( X w_{x,i} \) and \( Y w_{y,i} \) are the component score vectors, obtained by projecting the original data matrices onto the loading vectors. The subsequent loadings then come from solving

$$ w_{x,k} = \operatorname*{argmax}_{\left\lVert w \right\rVert = 1} \frac{w^{\prime} \left( \hat{Y}_{k}^{\prime} \hat{X}_{k} \right)^{\prime} \left( \hat{Y}_{k}^{\prime} \hat{X}_{k} \right) w}{w^{\prime} w} $$
$$ w_{y,k} = \operatorname*{argmax}_{\left\lVert w \right\rVert = 1} \frac{w^{\prime} \left( \hat{X}_{k}^{\prime} \hat{Y}_{k} \right)^{\prime} \left( \hat{X}_{k}^{\prime} \hat{Y}_{k} \right) w}{w^{\prime} w}. $$

Although the mathematics involve transformations of both X and Y, the conceptual framework is similar to that of PCA. Rather than decompose the variance-covariance matrix, PLS tries to decompose something more like a predictability matrix: the covariance of the explanatory variables with the target variables. The algorithm places the most weight on the predictors most strongly correlated with Y (i.e., those with the highest predictive power), effectively setting each variable's weight based on the coefficient from a univariate regression of the outcome on that variable.

In practice, PLS tends to predict better than a regression on principal components, since its objective function takes the desired predictability into account. Similar to the LASSO method, the PLS and PCA approaches can also be combined with a link function such as the logistic or probit, extending them to generalized linear model settings.
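A minimal sketch of PLS in Python, using scikit-learn's PLSRegression on simulated data, is shown below; the number of components and the data-generating process are arbitrary choices for illustration, with a principal-component regression fit shown alongside for comparison.

```python
# Partial least squares sketch using scikit-learn's PLSRegression on simulated data.
# The two-component choice and the data-generating process are arbitrary.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_obs, n_vars = 40, 20
X = rng.standard_normal((n_obs, n_vars))
y = X @ rng.standard_normal(n_vars) * 0.1 + rng.standard_normal(n_obs)

# PLS chooses loadings to maximize covariance between the X-scores and y,
# so the extracted components are oriented toward predicting y.
pls = PLSRegression(n_components=2).fit(X, y)
print("PLS in-sample R^2:", pls.score(X, y))

# For comparison, PCA ignores y when building components; OLS then uses them.
pcr_scores = PCA(n_components=2).fit_transform(X)
pcr_fit = LinearRegression().fit(pcr_scores, y)
print("PCR in-sample R^2:", pcr_fit.score(pcr_scores, y))
```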

6.2 National Accounts

A country's gross domestic product (GDP) is designed to measure the output of goods and services produced within an economy. It is also often interpreted as a measure of income. GDP per capita serves as a proxy for the amount of resources the average individual within an economy has access to.

For decades, economists have relied on a combination of GDP per capita measures from national accounts and supplemental surveys of households and businesses to better understand the well-being of residents in an economy. However, administering such surveys is costly and complex, requiring many field workers to visit or phone randomly selected residents. Moreover, because surveying takes time, GDP statistics often come with a reporting lag. When economic conditions turn sharply, GDP statistics become a long-lagged indicator, rendering them of little use for short-term applications.

In addition, biases in such sampling methodologies or manipulation of official statistics by government agencies may lead to biased inferences. For example, in 2013, Ghana decided to rebase the calculation of its official GDP numbers from the original 2006 base "to better reflect recent activity in the petroleum, communication technology, and construction sectors," according to the statistics office. The seemingly arbitrary and unexpected rebasing increased Ghana's measured GDP by 40 percent, following a rebasing exercise in 2010 which had increased GDP by 60 percent. The rebasing automatically made the fiscal health of the government appear stronger, as budget deficits and debt levels are typically scaled by GDP.

Finally, GDP is a national statistic, and does not allow for more granular statistics at the regional or sector level. Such statistics could potentially be quite desirable.

GDP Growth and Economic Welfare. Academic research has explored various approaches to measuring GDP growth. The most successful has been the application of satellite data, particularly night-time lights data. Since 1972, the Defense Meteorological Satellite Program (DMSP) has conducted low earth-orbit measurements of the earth's surface. One of these programs—the Operational Linescan System—measures the radiance of the earth's surface. While the primary purpose of these scans was originally to measure cloud cover, in 1994 the Earth Observation Group began collaborating with the DMSP to produce night-time light composites. A number of researchers explored the idea that night-time lights provide a unique indicator of GDP: the surface of the earth, from outer space, is usually only observable where there is a strong source of light. While wildfires or fishing boats can be seen from outer space, the most prominent sources of lighting are human settlements.

One of the many groups exploring the application of night-time lights to economic measurement were Clark et al. [26]. The figure below, excerpted from their study, shows the growth of South Asia based on night-lights data from 1994 to 2010. Pictured in the center is India. One can see the remarkable growth of South Asia through the increased luminosity of the earth's surface.

figure m

Source Pinkovskiy and Sala-i-Martin [59]

However, the main focus of this study is to benchmark estimates of economic growth used by governments. Typically, economists have relied upon either national accounts data or government surveys. Both methodologies are flawed—the key is to quantify these flaws using the objective measure of night-lights as a benchmark. They find that survey data are unreliable relative to national accounts and that poverty rates have been falling faster than surveys would suggest.

In addition, their results are driven particularly by poorer economies that grow rapidly and for which lags and issues in survey design may introduce a systematic bias, leading global policymakers to view these countries as poorer than they actually are. In addition, Fig. 3 from their analysis, reproduced below, shows that the log of light per capita (with luminosity measured on a scale from 0 to 63) correlates positively with log GDP per capita, suggesting that national accounts measures such as GDP per capita are broadly appropriate.

Beyond the evaluation of economic development, alternative data have also been used where official government statistics may be unreliable or manipulated, as alternative data can provide a basis for constructing a less biased measure of economic activity.

Chen et al. [23] find that local governments in China biased their GDP growth numbers, resulting in GDP growth from 2010 to 2016 overstated by 1.8 percentage points and investment and saving rates in 2016 overstated by 7 points relative to the corrected numbers. The locally reported GDP data are adjusted by the National Bureau of Statistics, a federal agency, which has made an average adjustment of about 5% since the mid-2000s. As collecting the data directly is infeasible, the central government relies on statistics reported by local governments. It is speculated that local governments in China are incentivized to overstate their GDP growth, as fast-growing regions are rewarded with more power and monetary incentives by the central government. The gap between the sum of local estimates and the national estimate is largely attributable to differences in local and national estimates of industrial output. Thus, in order to better assess the veracity of national versus local statistics, Chen et al. [23] use night-time lights data as a benchmark.

figure n

Source Clark et al. [26]

The alternative data used to validate the reported and estimated numbers combine the night-time lights with additional variables, such as firm-level data on value-added taxes and local economic indicators that are less likely to be manipulated by local governments.

Consumption. Consumption is the biggest driver of GDP for many countries around the world. Yet measured consumption is noisy and subject to reporting lags because statistics offices must consolidate the data. In addition, many theories of financial market returns are grounded in the relation between consumption growth and realized returns. Intuitively, stocks must offer a premium because they tend to perform poorly precisely when consumption falls—that is, when an extra dollar of consumption is most valuable. Yet Mehra and Prescott [57] show that consumption expenditure growth, as captured by government statistical agencies, simply exhibits too little variation to explain the large equity premium—the return above risk-free bonds—that one earns for simply investing in the stock market. Said differently, if households' consumption varies very little over time, why should they be concerned about short-run fluctuations in the stock market? Why, then, don't households purchase more stocks and drive this premium down? For decades, academic financial economists tried to develop more sophisticated theories to rationalize this "equity premium puzzle."

One might draw the conclusion that this particular financial theory is wrong. Alternatively, measured consumption may simply be a poor proxy for actual consumption, being noisy and subject to the adjustments of statistical agencies. Savov [61] considers a closely related variable that is less subject to such adjustments: garbage. The study finds that garbage growth correlates with future consumption growth. Replacing consumption growth with garbage growth, Savov [61] finds that the data fit consumption-based asset pricing theories better than the consumption data from statistical agencies.

The figure below, reproduced from Fig. 1 in the paper, shows the relation between market returns and garbage growth as well as consumption expenditure growth, all in percentages. The gray shaded areas are recessions as defined by the National Bureau of Economic Research. The first point to notice is that consumption expenditure growth appears very smooth and does not show a consistent pattern with returns. The second is that garbage growth appears to correlate positively with returns: the higher the garbage growth—and, by association, actual consumption in the economy—the higher the returns during that period, consistent with economic theory.

figure o

Source Savov [61]

6.3 Inflation

Consumer inflation captures the degree of price changes faced by consumers within an economy, while producer inflation tracks the prices of producers' outputs, which are often inputs for other producers. Although inflation is an important economic concept that affects monetary policy, financial markets, and households' consumption and savings decisions across time, countries, and assets, its measurement has been a notorious area of debate among academics and policymakers since the development of macroeconomics [13].

Traditional, government-provided measures of inflation track the prices of some basket of products. Inflation statistics are typically produced at a higher frequency than GDP or other government statistics—in the US, monthly. They are also more granular: inflation statistics are produced for baskets of products such as housing, clothing, food, and other components of consumer consumption. Despite this increased frequency and granularity, inflation measurement is still problematic in several ways. First, while monthly data are better than less frequent data, there are obviously times at which inflation would ideally be measured more frequently. Another issue is granularity. For example, the assertion that there is low inflation in apparel may mask greater inflation for low-income households, for whom apparel has become unaffordable. Therefore, more granular inflation statistics could be valuable.

One final concern is that inflation is directly relevant for governments that need to raise capital. If the inflation rate is 3%, then, absent an adjustment in foreign exchange rates, $1 lent to a government is worth roughly 97 cents the following year. The government must therefore offer at least a 3% return for a foreign investor to even consider lending money rather than holding another currency. Consequently, governments have an incentive to report lower inflation.

The digitization of commerce has facilitated big data measurement of inflation. Perhaps the most famous example is a study by two MIT economics professors, Alberto Cavallo and Roberto Rigobon. Cavallo and Rigobon [17] founded the Billion Prices Project in 2008. The study was likely motivated by concerns over Argentina's inflation data, which emerged in 2007. By then, the Argentine government had gone through many rounds of debt restructuring, owing large amounts of external debt amid uneven economic growth.

Although focused on Argentina, Cavallo and Rigobon web-scrape data from various online retailers around the world. Their data include local retailers as well as large retailers like Tesco and Walmart. The basic idea is that while certainly not all commerce is digital, if online and offline prices are linked sufficiently tightly, one can approximately replicate the inflation index from the prices of goods online. The key advantage of such a measure is that it is timelier and not subject to manipulation by the government. After this exhaustive data collection effort, the study found that the online approach matched official inflation estimates for several Latin American countries other than Argentina. For Argentina, however, the data suggested that the online inflation rate was three times higher than official estimates.
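To illustrate the mechanics of turning scraped prices into an index, the sketch below chains an unweighted geometric mean of daily price relatives (a Jevons-style elementary index). This is a simplification of what any production online price index would do; the data layout, product identifiers, and prices are hypothetical.

```python
# Sketch: chain scraped online prices into a daily index using an unweighted
# geometric mean of price relatives (a Jevons-style elementary index).
# The input layout (date, product_id, price) and the prices are hypothetical.
import numpy as np
import pandas as pd

prices = pd.DataFrame({
    "date":       pd.to_datetime(["2021-01-01"] * 3 + ["2021-01-02"] * 3),
    "product_id": ["milk", "bread", "soap"] * 2,
    "price":      [1.00, 2.00, 3.00, 1.02, 1.98, 3.12],
})

# Pivot to a products-by-date price matrix and compute day-over-day price relatives.
panel = prices.pivot(index="product_id", columns="date", values="price").sort_index(axis=1)
relatives = panel.div(panel.shift(periods=1, axis=1))   # p_t / p_{t-1} per product

# Unweighted geometric mean across products each day, chained into a level (base = 100).
daily_factor = np.exp(np.log(relatives).mean(axis=0))
index_level = 100 * daily_factor.fillna(1.0).cumprod()
print(index_level)
```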

Goolsbee and Klenow [37] is another study that uses digitally collected prices to study inflation. They use online transaction data from an aggregator of e-commerce data, which has the advantage that prices can be weighted by the quantities actually transacted. They find that inflation from 2014 to 2017 using online prices was a full percentage point lower than the corresponding consumer price index (CPI), potentially due to fiercer competition on online platforms compared to traditional retailing. In addition, because of additional features of the data, the researchers could track specific product changes and find that the introduction of new products (updates to existing products or totally new items) is important for understanding inflation. For example, a monitor may have the same name and basic specifications, such as size and weight, but the resolution and quality of the screen may differ. Such differences may not be detectable in traditional measures of inflation. The researchers find that accounting for product changes would have made online inflation an additional 1.5–2.5 percentage points lower than when the data were matched to price indices like the CPI.

A final data source common in studies of inflation comes from Nielsen, a global data analytics company (and S&P 500 constituent) widely used for market research. Known as the Consumer Panel data, the dataset comes from a panel of 250,000 households in 25 countries in which participants are financially compensated for taking part. The panel uses point-of-sale technology to capture sales and price data from major retail chains, and the electronic scanner data are supplemented with field audits. Silver and Heravi [62] find that with appropriate weighting adjustments, the scanner data match the inflation statistics produced by government statistics offices. A key benefit of scanner data is timeliness: whereas inflation data may be released with several weeks' lag, scanner data can be updated weekly or even daily with the right subscription.

6.4 The Labor Market

Alternative data may also be informative about the condition of the labor market. As part of the hiring process, people search for jobs and companies post job advertisements to attract applicants and hire the best people. Information covering either side of the market can be informative about the health of a company and, more broadly, the health of the labor market overall.

Non-Farm Payroll. Perhaps the key variable of interest related to the labor market is non-farm payrolls (NFP). NFP measures workers in the USA excluding farm workers and those in some additional categories, such as military personnel and employees of certain government agencies, including the Central Intelligence Agency, Defense Intelligence Agency, and National Security Agency. Sole proprietors, non-profit employees, and private household/domestic employees are also excluded. NFP is a broad measure of core employment contributing to GDP and is produced by the Bureau of Labor Statistics. According to the bureau, the workers counted in NFP account for approximately 80% of the workers who contribute to GDP.

The NFP is a key economic indicator for the USA, and quantitative trading around NFP announcements is a well-known and popular strategy. A higher NFP is better for the economy, so if NFP is predicted to come in higher than expected, traders will go long the stock market and short bonds, and vice versa if it is predicted to come in lower. Traders have also implemented this view through the foreign exchange market, since the NFP announcement typically occurs at 8:30 am, before the stock market opens, while the foreign exchange market is open 24 hours a day. Typically, a better NFP reading is associated with a strengthening of the US dollar. At high frequencies—the 5–15 minute horizon—traders may also trade momentum in the currency around the announcement time.
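As a stylized illustration of this announcement-surprise logic, the snippet below standardizes the gap between a released NFP figure and the consensus forecast and maps it to a directional position; the numbers, thresholds, and instruments are hypothetical, not a recommended trading rule.

```python
# Hypothetical announcement-surprise signal around an NFP release.
# All numbers, thresholds, and instruments are illustrative only.
released_nfp = 250_000      # jobs added, as announced
consensus    = 180_000      # pre-announcement consensus forecast
surprise_vol = 60_000       # assumed historical std. dev. of surprises

z = (released_nfp - consensus) / surprise_vol   # standardized surprise

if z > 0.5:        # stronger-than-expected labor market
    position = {"equity_index": "long", "treasuries": "short", "usd": "long"}
elif z < -0.5:     # weaker-than-expected labor market
    position = {"equity_index": "short", "treasuries": "long", "usd": "short"}
else:
    position = {}  # surprise too small to trade

print(f"standardized surprise = {z:.2f}", position)
```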

In a detailed analysis, a 2015 study from Deutsche Bank titled "Macro and Micro JobEnomics" shows that, using alternative data from LinkUp containing information on firm-level job postings and job creation, predicted NFP is highly correlated with actual realized NFP. In fact, the alternative data source generates predictions close to the Bloomberg consensus as well. This alternative data source is therefore relevant for the macroeconomy.

figure p

Source Deutsche Bank [30]

In addition, when combining the alternative data with conventional data sources and historical data in an autoregressive integrated moving average (ARIMA) model, the study finds that the combined model produces the most precise estimates, with an average error of only around 52,000 jobs compared to a Bloomberg consensus prediction error of almost 60,000 jobs. The data have also been used to predict other economic variables such as inflation, consumer sentiment, and the unemployment rate: more active jobs on LinkUp predicts higher non-farm payrolls, more inflation, lower consumer sentiment, and lower unemployment rates.

figure q

Source Deutsche Bank [30]
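The sketch below illustrates the general idea of folding an alternative data series into an ARIMA-type model as an exogenous regressor, using statsmodels on simulated data; the (1,0,1) order, the simulated job-postings series, and the variable names are assumptions for illustration and not the bank's actual specification.

```python
# Sketch: fold a job-postings indicator into an ARIMA-type model of NFP changes
# as an exogenous regressor. Simulated data and an arbitrary (1,0,1) order;
# this is not the Deutsche Bank specification.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(2)
n = 120                                                   # ten years of monthly data
postings = pd.Series(rng.standard_normal(n)).rolling(3, min_periods=1).mean()
nfp_change = 150 + 40 * postings.shift(1).fillna(0) + rng.normal(0, 30, n)  # thousands of jobs

model = SARIMAX(endog=nfp_change, exog=postings, order=(1, 0, 1), trend="c")
result = model.fit(disp=False)

# A one-step-ahead forecast requires the next value of the exogenous series.
next_exog = np.array([[postings.iloc[-1]]])
forecast = result.forecast(steps=1, exog=next_exog)
print("Predicted change in non-farm payrolls (thousands):", float(np.asarray(forecast)[0]))
```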

Job posting data are also useful for stock-level quantitative trading. For example, using data from LinkUp, which contains firm-level employment postings by month, a study by Deutsche Bank finds that a portfolio formed over the sample period through 2015 that buys stocks of companies with the highest employment growth relative to market capitalization and shorts companies with the lowest employment growth relative to market capitalization generates a positive and statistically significant return premium. The argument is that job creation is a signal of future company performance, predicting higher cashflow growth in the future.

figure r

Source Deutsche Bank [30]
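A minimal sketch of such a signal-sorted long-short portfolio is below, assuming a hypothetical monthly panel with columns for job-posting growth, market capitalization, and next-month returns; the quintile construction and equal weighting are simplifying choices, not the study's exact methodology.

```python
# Sketch: monthly long-short portfolio sorted on job-posting growth scaled by market cap.
# The panel layout (columns: date, ticker, posting_growth, mktcap, next_ret) is hypothetical,
# and quintile sorting with equal weights is a simplification of the study's methodology.
import pandas as pd

def long_short_returns(panel: pd.DataFrame, n_bins: int = 5) -> pd.Series:
    df = panel.copy()
    df["signal"] = df["posting_growth"] / df["mktcap"]
    # Assign stocks to signal quintiles within each month.
    df["bin"] = df.groupby("date")["signal"].transform(
        lambda s: pd.qcut(s, n_bins, labels=False, duplicates="drop")
    )
    # Equal-weighted next-month return of each bin, then top minus bottom.
    bin_ret = df.groupby(["date", "bin"])["next_ret"].mean().unstack()
    return bin_ret[n_bins - 1] - bin_ret[0]

# Usage with a suitably formatted panel `panel_df`:
# spread = long_short_returns(panel_df)
# print(spread.mean(), spread.std())
```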

Another provider of job postings data is Burning Glass Technologies, which provides company-level job postings. The data differ from LinkUp in that they focus on the skills and certifications mentioned in the text of job postings. Job postings only reflect demand for workers, which may reflect job growth or the replacement of departing workers. Current employment levels and changes in the workforce are also potentially important. Ideally, one would know the number of workers on payroll by type, as well as workers' skills and roles. In principle, one could observe payrolls from payroll management companies like ADP or SAP. However, it is unclear to what degree these providers will be able to monetize such insights given the privacy restrictions surrounding this data.

It is now common for employees to maintain social media accounts for their professional accomplishments. For example, software developers may maintain GitHub pages and professionals may post their career histories on LinkedIn. Agarwal et al. [3] find that employee turnover observed through online social profiles is negatively related to company performance. The data have implementation issues, however, as individuals often update their professional profiles with a delay. Other sources include Glassdoor, which provides data on employee satisfaction and comments.

6.5 Uncertainty

Not only are predictions of future values important for financial markets and the macroeconomy, so is uncertainty. Managing risk is arguably the most important facet of finance, both in financial markets and in corporate decision-making. Because uncertainty about the macroeconomy affects decision-makers at all levels—from policymakers to CEOs to investment managers—constructing measures of uncertainty from alternative data can dramatically improve predictions of various real economic outcomes across many facets of the economy.

Over the past several years, uncertainty measures have been made available, such as the Google Uncertainty Index developed by Castelnuovo and Tran [15], which is based on the Google Search Volume Index for a set of search keywords related to risk and uncertainty. Additional alternative data used in this setting are typically based on textual analysis of news articles. In this section, we discuss the application of textual analysis to the construction of news-implied financial market volatility as well as economic policy uncertainty.

Financial Market Uncertainty. When studying trends and relationships between economic variables, we typically take a time series dataset's start date as given by when the data began to be collected. In financial markets, the volatility index (VIX) captures the implied volatility based on a panel of S&P 500 stock options with a maturity of 30 days. In other words, it captures the expected standard deviation of the stock market index over the next 30 days. The creation of the VIX was motivated by the publication of Black and Scholes' [12] paper in the Journal of Political Economy, which presented a closed-form solution for pricing European stock options.

Based on this research, the VIX was developed by the Chicago Board Options Exchange (CBOE) and has been disseminated in real time since September 22, 2003, with historical data calculated back from 1990 through 2003. On February 24, 2006, the CBOE listed options on the VIX itself. Since then, the methodology of the VIX has been updated slightly, and in addition to the VIX, the CBOE uses the same methodology to create two-week, three-month, six-month, and one-year volatility indices. No VIX data are available before 1990.

One might wonder how to study uncertainty over a longer period of time, or in markets or periods of history where the inputs to the VIX are not available. Manela and Moreira [55] is an academic study that aims to fill this gap. They construct a long monthly time series of news-implied uncertainty from 1890 to 2009. The idea is that there is co-movement between the front-page coverage of the Wall Street Journal and the VIX. They employ a class of statistical models called support vector machines over the sample for which VIX data exist, then apply the fitted model backward in time to create a news-implied VIX. The statistical model takes the raw text data and weights words based on their importance and usefulness for predicting actual VIX values in the training sample for which the researchers have both news and VIX data. The figure below, created by the authors, visualizes the effect of the support vector machine in transforming the variables.

figure s

Source Manela and Moreira [55]
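A heavily simplified sketch of this approach—fitting a support vector regression from word frequencies to the VIX and then scoring earlier text—is shown below using scikit-learn; the toy corpus, the TF-IDF featurization, and the hyperparameters are illustrative stand-ins rather than the authors' specification.

```python
# Simplified sketch of mapping news text to the VIX and extrapolating backward.
# A linear support vector regression on TF-IDF word frequencies; the toy corpus
# and hyperparameters are stand-ins, not the authors' specification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVR

# Training sample: periods with both front-page text and an observed VIX value.
train_text = ["war fears rattle markets", "calm trading ahead of earnings", "bank panic spreads"]
train_vix = [35.0, 12.0, 48.0]

model = make_pipeline(TfidfVectorizer(), LinearSVR(C=1.0, max_iter=10_000))
model.fit(train_text, train_vix)

# Apply the fitted mapping to historical periods where no VIX exists.
historical_text = ["railroad stocks collapse amid bank runs"]
print("News-implied VIX:", model.predict(historical_text))
```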

Armed with this model, the authors apply it backward in time to create a fitted series of news-implied VIX values. Over this longer period, they are able to assess whether their news VIX spikes during episodes of history where one might expect it to. Note that the objective of this study is not to build a trading strategy; one should be cautious interpreting these results for trading purposes, as the measure constructed for historical analysis has an inherent look-ahead bias.

The underlying statistical assumptions for such an exercise are that the relation between news text and the VIX is stationary and ergodic, similar to standard forecasting regressions. In addition, practical assumptions must also hold, such as that the business press's word choice remains fairly stable over time and that the words themselves provide a stable reflection of investors' concerns.

Having such a long historical time series allows the study of additional economic, political, and military events on financial markets, such as the Spanish Flu of 1918, World War I, and World War II. The study finds that among the text categories most related to stock returns, war-related words explain over 40% of the variation and government-related words explain over 20%, while other categories are statistically insignificant. That more than half the variation in the data is due to rare disaster-type events suggests that financial market performance is inherently linked to such rare disasters. A similar result is documented in Liu and Matthies [54], who use a similar news-based measure of long-run disaster risk affecting consumption and, consequently, stock market returns.

Drilling down specifically into government policy-related uncertainty, we discuss a news-based measure of economic policy uncertainty below.

Economic Policy Uncertainty. How people make consumption-savings decisions and how firms make corporate investment decisions depend on perceptions of future economic policy. Yet no official government statistics are produced on economic policy uncertainty.

Economic policy uncertainty is inherently different than economic policy itself. For example, firms expecting tax cuts next year can make their investment decisions based on that forecast and may be more likely to invest now since profits generated in the future from current investments will be taxed at a lower rate. However, if the economic policy uncertainty around the level of the tax is high, even if firms expect there to be a tax cut but are still unsure, they may decide to hold off on investment decisions for fear of making a wrong decision.

Recognizing the role policy uncertainty may play and the gaps in available data, Baker et al. [5] create an index of economic policy uncertainty. The researchers construct a normalized index that captures discussion involving specific words related to policymaking—such as "legislation," "congress," "regulation," and "White House"—in well-respected media outlets in the USA. Similar bags of words are used for other countries in their respective languages, across samples of likewise locally well-respected media outlets.
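The following sketch illustrates the counting logic in the spirit of this approach: an article contributes to the index if it contains at least one term from each of an economy, an uncertainty, and a policy bucket. The term lists, articles, and normalization shown are illustrative and not the authors' exact dictionaries or scaling procedure.

```python
# Simplified counting sketch in the spirit of a news-based policy uncertainty index:
# an article counts if it mentions a term from each of three buckets. The term lists,
# articles, and scaling are illustrative, not the authors' dictionaries or normalization.
ECONOMY     = {"economy", "economic"}
UNCERTAINTY = {"uncertain", "uncertainty"}
POLICY      = {"congress", "legislation", "regulation", "white house", "deficit", "federal reserve"}

def is_epu_article(text: str) -> bool:
    t = text.lower()
    return (any(w in t for w in ECONOMY)
            and any(w in t for w in UNCERTAINTY)
            and any(w in t for w in POLICY))

articles = [
    "Economic uncertainty rises as Congress debates new regulation.",
    "Local team wins championship in overtime thriller.",
]

# Share of qualifying articles in a period; in practice the series is standardized
# per newspaper and averaged across outlets before being rescaled into an index.
share = sum(is_epu_article(a) for a in articles) / len(articles)
print(share)
```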

The economic policy uncertainty index (EPU) is correlated with the VIX, but also differs slightly during times when the financial markets do not perceive much uncertainty but there are high amounts of policy-specific uncertainty. The figure below shows the visual correlations of the VIX measure and the EPU.

figure t

Source Baker et al. [8]

Empirical analyses of the importance of EPU show that periods of higher EPU correlate with lower corporate investment and hiring, an important consideration for policymakers seeking to reduce uncertainty in their policy deliberations and communication.

In addition, the study finds that EPU appears to be a driving force of financial market volatility. Firms more exposed to government policies—for example, those deriving a larger share of revenue from federal contracts (per the Federal Registry of Contracts) or from government health care spending—have higher stock price volatility when economic policy uncertainty is high. Similar results hold at the industry level.

In addition, the EPU measure is also available at more granular levels, decomposing variation in the EPU into trade-related policy uncertainty, monetary policy-related uncertainty, and other components. Using the more granular data on trade and non-trade policy uncertainty, Charoenwong et al. [18] show that higher trade policy uncertainty predicts supply chain diversification by firms, particularly those more reliant on foreign suppliers.

After the publication of these results, policymakers around the world, hedge funds, and data providers such as Bloomberg and Refinitiv adopted the EPU measure, suggesting that the data and research indeed fulfilled a useful social role.

Additional data sources that may supplement the EPU itself include sentiment analysis of social media or of texts from firms themselves. Hassan et al. [39] construct a firm-level equivalent of the EPU index from firms' financial statement filings and earnings conference call transcripts. The basic idea is to quantify the extent of political risk faced by a given firm in a given quarter by measuring the share of the conversation between call participants and firm management that centers on risks associated with politics. They find that firms exposed to more political risk retrench hiring and investment and increase lobbying and donations to politicians. In addition, the study highlights the importance of firm-level variation in political risk, as opposed to industry-specific or aggregate political risk.
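A simplified sketch of such a share-of-conversation measure appears below: it scores one transcript by the fraction of sentences mentioning both a political term and a risk synonym. The term lists and the toy transcript are hypothetical, and the actual methodology in Hassan et al. [39] relies on a more sophisticated, training-library-based bigram approach.

```python
# Simplified "share of conversation" political risk score for one transcript:
# the fraction of sentences mentioning both a political term and a risk synonym.
# Term lists and the toy transcript are hypothetical; Hassan et al. [39] use a
# more sophisticated bigram approach trained on political and non-political texts.
import re

POLITICAL = {"regulation", "congress", "tariff", "election", "government"}
RISK      = {"risk", "uncertainty", "uncertain", "exposure", "threat"}

def political_risk_share(transcript: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", transcript.lower()) if s.strip()]
    hits = sum(
        1 for s in sentences
        if any(p in s for p in POLITICAL) and any(r in s for r in RISK)
    )
    return hits / len(sentences) if sentences else 0.0

call = ("Revenue grew ten percent this quarter. "
        "We see meaningful uncertainty around the new tariff regulation. "
        "Margins should improve next year.")
print(political_risk_share(call))   # 1 of 3 sentences -> roughly 0.33
```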

In the next section of this chapter, we discuss additional opportunities for using alternative data in a financial context.

7 Additional Opportunities in Alternative Data

Due to space limitations, we provide shorter notes on various sub-domains of the financial industry and the types of data most likely to be useful in each.

7.1 ESG

Over the past two decades, there has been a surge of interest in measuring the impact firms have on society beyond the returns they produce for shareholders. Corporate social responsibility (CSR) or environment, social, and governance (ESG) ratings refer to efforts by data providers to quantify the performance of firms along many disparate dimensions.

There are roughly two approaches to constructing ratings: a manual approach, in which an agency reviews a firm's circumstances, filings, disclosures, and conduct, and a systematic approach involving automated readings of a firm's news coverage and regulatory filings. Necessarily, the first type of rating is slow (usually annual) and incomplete in coverage (focusing on the largest, most liquid stocks). MSCI KLD ratings are perhaps the first such measure. MSCI conducts an annual review on a subset of the stock universe, focusing on the largest and most liquid stocks, identifying stocks with "strengths" in over 40 categories of ESG issues and "concerns" in others. Refinitiv Asset4 is Thomson Reuters' offering, also based on an annual review by Refinitiv. In contrast, systematic solutions trade granularity for speed and breadth of coverage. Perhaps the most well-known are RepRisk and Trucost, which claim broader coverage through textual analysis; RepRisk in particular claims global coverage, including private firms. The readings are less precise and biased toward firms with more public documentation, but allow one to examine firms that are omitted by other ratings.

One key challenge in this literature is that ESG ratings appear to disagree on which firms are socially responsible [33]. In addition, it is not obvious which measures investors use or whether such measures correspond to firms' actions. Huang et al. [44] is a novel study that uses intent data to dissect the behavior of firms and investors around ESG ratings. Investors experiencing surges in ESG-related research activity tend to sell stocks with poor ESG records—particularly according to the KLD and Refinitiv measures—while apparently ignoring the RepRisk measure. In contrast, firms' ESG-relevant reading appears correlated with improvements in RepRisk and Refinitiv scores, but not KLD scores. Thus, there may be a gap between what investors use and what firms do.

In the future, a key area of focus will be the "E" component of ESG, particularly as it relates to climate change. The Paris Climate Agreement signals the urgency among policymakers toward identifying ways to reduce carbon emissions, part of which may require firms to actively disclose their carbon and environmental footprint. For example, France's Article 173 is thought to have wide-ranging effects in imposing a mandatory disclosure requirement upon French investors. It remains to be seen whether such regulations will have the extensive impact that some predict.

7.2 Data on Private Firms

The wealth of information being collected about publicly tradeable firms is quite remarkable, but the vast, vast majority of firms in the world are not public. In 2019, the NYSE and Nasdaq reported 3712 distinct common stocks (excluding ADRs and OTC stocks). Yet, any casual observer could tell you there are millions of companies in the USA. Private firms, which do not have publicly listed equity, are relevant to a variety of financial stakeholders: banks, venture capitalists, and private equity firms, which would like to invest in these private firms. In addition, alternative data could be useful for industry participants such as competitors, acquirers, suppliers, collaborators, or customers.

What is key about private firms is that they do not have to disclose facts like a publicly listed firm does. Therefore, even basic facts such as the firm’s financials, terms of the firm’s last equity round or other financial obligations and headcount are relatively difficult to get, whereas having these data available is a foregone conclusion for the public firm investor. Therefore, there lies substantial opportunity in having data that provide even an approximate coverage of firm facts, such as financial condition, employment, and spending.

The first question regarding private firms is which firms exist in the economy and how we keep track of them. These data are broadly referred to as firmographics. The traditional providers are companies such as Dun and Bradstreet or Orbis, which gather comprehensive country-wide datasets categorizing the ownership structure and key decision-makers at firms. These datasets offer a long history with rich features, such as industry codes, historical parent relationships, ownership structure, and key contacts of the company. Firmographic data are so broadly of interest, however, that many solutions exist, particularly if one is simply interested in more recent data. A relative newcomer providing comprehensive global coverage is OpenCorporates in the UK. Other examples include Crunchbase, which focuses on firms of interest to venture capital investors, and PrivCo, which focuses on private companies. One must note that although these offerings are comprehensive, the level of detail rarely matches that of databases covering publicly listed firms.

When examining a firm, a key starting point is its economic condition as reported in financial statements. Coverage of private firms depends heavily on the country. In many European countries, private firms must file financial statements with public registers, so data aggregation companies such as Orbis are able to gather this information across countries. In Japan, Tokyo Shoko Research has comprehensive coverage of supply chain networks and private firm financials.

Although firms in the United States do not have to report their financials publicly, there are a number of data sources one can turn to. Trade credit reporting agencies such as Cortera and Dun and Bradstreet are of immediate use here. These firms collect bankruptcy filings—indicators that a company is not healthy—and monitor the payment timeliness of firms to suppliers in their respective reporting cooperatives. Because the coverage comes from external counterparties of private firms, concerns that the covered firm misreports are ameliorated. Pitchbook uses a variety of methods to obtain the terms of financing deals between companies and their venture capital investors. It claims large coverage of venture capital deals in the USA, sourced through methods such as partnerships with investment firms and independent research. However, we caution that because the terms of such venture capital investments are not mandatory disclosures, there may be unknown selection biases lurking in the data.

In terms of tracking company spending, there exist many industry databases that cover substantial portions of a company's spending activity. D&B and Cortera offer trade credit datasets, which can be interpreted as corporate spending activity. However, for public firms, we observe from our own research that what is covered in these trade credit reporting programs is usually only a fraction of firms' accounts payable. In other words, while informative of a firm's financial condition and spending growth, the actual fraction of spending covered is likely quite low. Aberdeen's Computer Intelligence Database has broad-based coverage of software and hardware installations in the USA and select international markets, including European countries, Canada, and various countries in South America and Asia. Finally, private firms may also be well tracked by digital exhaust datasets. Kwan [52] shows that firms' Internet research activity can be used to infer growth, although smaller firms have fewer employees and are thus captured more sparsely than larger firms by these methodologies.

Finally, revenues and traction may be of interest to an investor. For companies with physical, brick-and-mortar locations where they do business with customers, the company's success or traction may be inferred from its foot traffic as measured by mobile phones. For firms with a purely digital presence, App Annie and others have good estimates of app downloads and app usage. For specific industries, there are usually one or two database offerings that cover the competitive landscape and market shares of each participant in some shape or form. For example, the NPD Group offers data on video game sales, among other consumer trends, through its various retail partners, covering over 60% of sales in the market.

7.3 Operations of Financial Institutions

Beyond the application of data to the evaluation of investment risk or economic conditions, banks are businesses, and corporations more broadly have uses for alternative data for operational purposes beyond industry intelligence. One type of alternative data uniquely useful to financial institutions is beneficial ownership data, which traces out the ownership of companies.

Many countries maintain lists of foreign or domestic nationals who are barred from doing business. In the USA, one such list is maintained and promulgated by the Office of Foreign Assets Control. Being on this list means being barred from certain transactions with US financial institutions.

Foreign beneficial ownership thresholds and rules vary by country. Generally, these restrictions tightened in the 2000s due to the Patriot Act. One concern behind the Patriot Act was that financial institutions unknowingly facilitated international terrorism by allowing terrorists to move money freely across borders. Anti-money laundering laws typically require that companies audit the ownership structure of all entities registering to do business with a bank. That is, not only is the sanctioned entity barred from doing certain transactions, the key is to ensure that no “beneficial owner” of the company is on a list of restricted individuals. Thus, if person A is on the restricted persons list, not only is person A on this list, but also any company to which person A has a large financial stake. In Hong Kong, the Financial Action Task Force requires banks to ensure compliance with respect to any beneficial owner of at least 25%. This threshold is stricter than many European jurisdictions where beneficial ownership thresholds are set at 50% or more.

The challenge of identifying corporate relationships can be daunting, as corporate ownership structures can be complex. A firm may be owned by one individual, or it may be owned by other entities, which are in turn owned by other entities, and ultimately by an individual. For example, WhatsApp Inc is owned by Facebook, which is in turn controlled by Mark Zuckerberg. This is a two-tier structure, but some organizational structures go much deeper. While traversing one or two links may not be difficult in the USA, if these links run between foreign entities in different countries, the exercise can be much harder. In addition, in some cases, these links are intentionally obfuscated.
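To illustrate the traversal involved, the sketch below walks a hypothetical ownership edge list, multiplying stakes along each chain to obtain every individual's effective (look-through) ownership and flagging those above a 25% threshold; real compliance systems must additionally handle cycles, missing data, and entity resolution.

```python
# Sketch: compute each individual's effective (look-through) stake in a target entity
# by multiplying ownership fractions along every chain, then flag stakes above a
# beneficial-ownership threshold. The edge list is hypothetical, and the walk
# assumes there are no circular ownership links.
from collections import defaultdict

# owners[entity] = list of (owner, fractional stake in that entity)
owners = {
    "TargetCo": [("HoldCo A", 0.60), ("Person X", 0.40)],
    "HoldCo A": [("HoldCo B", 0.30), ("Person Y", 0.70)],
    "HoldCo B": [("Person Z", 1.00)],
}

def effective_stakes(entity, stake=1.0, out=None):
    """Walk up the ownership chain, accumulating multiplied stakes per ultimate owner."""
    out = defaultdict(float) if out is None else out
    for owner, share in owners.get(entity, []):
        if owner in owners:                    # intermediate holding company
            effective_stakes(owner, stake * share, out)
        else:                                  # natural person / ultimate owner
            out[owner] += stake * share
    return out

THRESHOLD = 0.25
stakes = effective_stakes("TargetCo")          # Person X: 0.40, Person Y: 0.42, Person Z: 0.18
flagged = {p: s for p, s in stakes.items() if s >= THRESHOLD}
print(flagged)                                 # owners to screen against restricted-person lists
```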

The first line of defense comes from databases on beneficial ownership. Databases such as Dun and Bradstreet, Orbis, Refinitiv, or OpenCorporates are particularly helpful here. These offerings have solutions for mapping out ownership structures—identifying not only who owns the firm but how much, and identifying key officers. The databases can be useful for automating or facilitating a process that would otherwise be infeasible at scale.

One thing to note is that the coverage of these databases varies, as government registers for firms vary in quality globally. Even to the extent that these databases are incomplete, they provide a valuable first step in focusing organizational resources away from less problematic cases and toward potential compliance risks.

7.4 COVID-19

As of the time of writing, the COVID-19 pandemic is far from over. And, as authors, we hope for the obsolescence of the foregoing, but even as a historical artifact, we believe the COVID-19 pandemic is a setting in which the power and importance of data are on full display. Fundamentally, there are two big questions that business data can help answer.

First, what can business data tell us about the pandemic itself, such as the rate of social distancing? Before inoculation, the best line of defense against a contagious disease is social distancing. Social distancing is a population-scale effort and identifying problem areas requires fine geographic or individual granularity. On the topic of social distancing, the clear and obvious winner has been mobility data. Companies such as Safegraph have made their data publicly available. Safegraph’s COVID-19 data consortium has over 5000 members in its ranks who have produced dozens of papers using this data. Charoenwong et al. [21], for example, explore the relationship between social distancing and information received through social media about the pandemic, finding that information about the severity of the pandemic increases social distancing.

The second question that data can help answer is what the current state of the economy during the pandemic is. This is a key input to decision-making. Government officials balance economic considerations from imposing mobility restrictions with public health considerations. Moreover, if investors can anticipate the spread of the pandemic or the decisions of government officials, they can potentially make investments that are best positioned to ride out any potential movements in the markets.

Naturally, there has been widespread interest from practitioners and academics in tracing out the economic recovery. Data providers from many industries have had unique perspectives on the issue. Fundamentally, the first impact of COVID-19 was a reduction in mobility. A loss of mobility may result in a lack of demand for retail, restaurants, and shopping, causing a "demand shock." This demand shock potentially reverberates upstream through the supply chain, impacting those who sell their products through retail stores. Any closure of these stores also affects consumer incomes and purchasing power. On the other hand, mobility restrictions may lead to the closure of production facilities such as factories. A key consideration is the extent to which employers are capable of mobilizing their workforce from home.

For tracing the loss in consumer demand, mobility data, which we mentioned earlier, have been instrumental in estimating the magnitude of the fallout to key sectors, and are particularly important in restaurants and retail. Data providers in this space include UberMedia, Veraset, Safegraph, and Unacast in the USA alone. Mobility is useful for tracking visits to stores. However, one limitation of mobility data is that it is costly to draw the “polygon” that describes the physical boundaries of an establishment. Thus, it is not easy to track any arbitrary location, although most mobility data providers have inventories of retail establishments and homes which allow them to measure social distancing or visitations to retail locations.

Traditional data sources for measuring demand also remained useful. Chen et al. [24] use data from UnionPay to measure the offline impact of COVID-19 on consumer behavior in China, finding sharp drops in in-store consumption of roughly 90%. Baker et al. [7] study data from Mint, finding a large drop in US consumers' spending, particularly at restaurants and in retail.

Regarding the labor market, by April 2020 the USA had experienced one of the sharpest rises in unemployment in history. The job market was impacted severely, with 20.5 million jobs lost initially and many speculating that those jobs may never return. Campello et al. [14] use LinkUp data to study this issue, finding a dramatic drop in corporate hiring, particularly in local labor markets lacking depth and in low-skilled positions. Of course, job postings are flows, not stocks: the stock of employment may remain unchanged, so a reduction in hiring could mean that workers transition between jobs less, not necessarily that they are laid off. While these data are likely not commercially available, academic studies employ ADP payroll data to study the impact of the pandemic on specific sectors and types of workers [25].

Survey data have been instrumental in further understanding the expectations of workers throughout this time period, as traditional sources of data cannot capture expectations of the pandemic or the impact of the pandemic on worker health or productivity. Bartik et al. [9], for example, issue a nationally representative survey which estimates the fraction of jobs being done from home. Many papers also use the Current Population Survey, which is collected on an ongoing basis in the USA, to track unemployment and labor conditions.

Most of the above-mentioned data can be applied analogously to firms, although in general consumer data are easier to come by than firm data. Tracking spending in the B2B sector can be achieved through aggregators such as Dun and Bradstreet and Cortera. However, spending coverage is relatively incomplete compared to that for consumers. Whereas databases like Yodlee capture the likely primary credit card of many US consumers and online account aggregators such as Mint capture the whole financial picture of their users, firms tend to be more secretive about their data. Surveys of firm managers are also potentially useful for capturing overall firm health or investment plans [20], although such plans may be strategically disclosed or reported to the public [19]. Here, we may instead consider data that are not easily manipulated, like mobility data (for now). In theory, mobility data can be used to gauge office attendance and the extent of work-from-home activity. However, mobile phone data have limitations, such as the inability to accurately measure altitude, which renders them less useful for particular applications.

Beyond aggregate trends, sector-specific views are provided by a variety of industry-specific datasets; Eagle Alpha's COVID-19 whitepaper provides many examples. One hard-hit sector was the airline industry. There are data providers who track airline tickets as well as airline flights, which are generally a matter of public record. Changing consumer preferences can be mapped with Google Trends, which shows time series changes in interest in specific types of products and services. Digital industries, which may actually have benefitted from the pandemic, can be measured in various ways, such as Web traffic, app usage and downloads, and so on.

Overall, COVID-19 provides a catalyst to public awareness and adoption of novel data sources, which we expect to have long-lasting effects.

8 Cautions About Alternative Data

8.1 Pitfalls of Alternative Data

As described throughout this chapter, alternative data hold promise for numerous players in the financial industry and for policymakers. However, alternative data are not without pitfalls. Some datasets are poorly designed or have inextricable flaws despite the best intentions behind how the data were collected. We identify a number of problems to consider when evaluating the use of alternative data.

First, alternative data can have short histories, and certain applications are impossible with a short history. For example, predicting quarterly GDP of a country would be difficult with 2 years of data: with eight observations, it would be difficult to produce a statistically significant relationship. In credit scoring, by contrast, 2 years of data may easily be enough, as tens of thousands of households or small-medium enterprises are sampled each quarter. As a rough rule of thumb, absent rigorous evidence, 2–3 years of alternative data may be necessary to demonstrate viability for quantitative trading, though this depends on the trading frequency.

Second, the data may have non-trivial sampling issues which can bias analyses. For example, suppose one is attempting to measure the strength of household consumption to build a better macroeconomic model. Credit card data may be an attractive solution, as they capture household consumption at a high frequency and for a granular set of purchases. This approach may have promise, but credit card panels may be subject both to selection bias, whereby only people of sufficient means to hold credit cards enter the dataset, and to survivorship bias, whereby households who canceled their cards or became delinquent on their payments drop out of the panel as their accounts are suspended. Both types of bias would distort the analysis, whether one is measuring consumption or predicting credit risk, and they are germane to most credit card data in general.
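The toy simulation below illustrates how these two biases can distort a simple estimate of average household spending. The income distribution, the card-ownership rule, and the attrition rule are all invented for illustration; the only point is that the surviving panel is not representative of the population.

    # Toy illustration of selection and survivorship bias in a spending panel;
    # every distribution and threshold below is an assumption chosen for illustration.
    import numpy as np

    rng = np.random.default_rng(1)
    n_households = 100_000

    income = rng.lognormal(mean=10.5, sigma=0.6, size=n_households)
    spending = 0.6 * income + rng.normal(scale=2_000, size=n_households)

    # Selection: only higher-income households hold a credit card (stylized rule).
    has_card = income > np.quantile(income, 0.30)

    # Survivorship: among cardholders, the lowest spenders are assumed more likely
    # to cancel or become delinquent, so they drop out of the observed panel.
    drop_out = has_card & (spending < np.quantile(spending[has_card], 0.15))
    in_panel = has_card & ~drop_out

    print(f"Population mean spending:  {spending.mean():,.0f}")
    print(f"Cardholder mean spending:  {spending[has_card].mean():,.0f}")
    print(f"Surviving-panel mean:      {spending[in_panel].mean():,.0f}")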

Third, retrieving data can require a substantial financial cost. Corporate subscriptions to datasets can run tens of thousands of dollars; famously, Yodlee has sold its credit card data to hedge funds for several hundred thousand dollars. Whether a cost is acceptable depends on the value a specific investor can extract from it. Suppose a dataset on email receipts costs $20,000 USD and a fund wants to use it to predict the revenues of Amazon. A fund with $1 million under management would pay 2% of its AUM, rendering such a dataset a difficult proposition, whereas a fund with $100 million in AUM may find the cost negligible. But the larger firm may already have a substitutive dataset: returning to the credit card example, perhaps the fund already has access to data on consumer product searches, website visits, or email receipts. After running a back-test, the firm may find that the new dataset adds little incremental profit, in which case the financial cost may not be worthwhile even for the larger investor.
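The back-of-the-envelope arithmetic above can be expressed as a small helper, sketched below. The breakeven logic is deliberately simplistic and ignores performance fees, implementation costs, and capacity; the figures are the illustrative ones from the text.

    # Back-of-the-envelope check of a data subscription's cost relative to fund size;
    # deliberately ignores fees, taxes, and implementation costs.

    def breakeven_return(annual_cost: float, aum: float) -> float:
        """Incremental annual return (as a fraction of AUM) needed just to cover the data cost."""
        return annual_cost / aum

    for aum in (1_000_000, 100_000_000):
        print(f"AUM ${aum:>11,}: a $20,000 dataset costs {breakeven_return(20_000, aum):.2%} of AUM per year")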

Fourth, there may be significant manpower involved. One key challenge is evaluating a dataset. Conducting entity resolution, that is, mapping the entities in a dataset to the relevant securities or counterparties, can take time, and many data providers are not financial institutions themselves, so it is difficult for them to know exactly how to apply their data to the financial domain. Partnerships with academic researchers or with third-party agencies such as ExtractAlpha, which specialize in the research and evaluation of data, can be useful, but these resources are scarce. For many of the same reasons, once an organization commits to acquiring a dataset, whether by purchasing it or building it in-house, operationalizing the dataset for consumers within the organization may be costly.
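As a flavor of what entity resolution involves, the sketch below fuzzy-matches raw company names from a hypothetical data feed onto a hypothetical ticker reference table using Python’s standard library. Production-grade entity resolution typically also relies on identifiers, addresses, and manual review; every name and threshold here is invented for illustration.

    # Minimal entity-resolution sketch: fuzzy-match raw vendor names to a ticker map.
    # Both the vendor names and the reference table are made up for illustration.
    import difflib

    reference = {
        "Apple Inc.": "AAPL",
        "Amazon.com, Inc.": "AMZN",
        "Delta Air Lines, Inc.": "DAL",
    }

    vendor_names = ["APPLE INC", "Amazon.com Inc", "Delta Airlines", "Unknown Widgets LLC"]

    def resolve(name, ref, cutoff=0.6):
        """Return (reference name, ticker) for the closest match, or (None, None)."""
        match = difflib.get_close_matches(name.lower(), [k.lower() for k in ref], n=1, cutoff=cutoff)
        if not match:
            return None, None
        key = next(k for k in ref if k.lower() == match[0])   # recover original-cased key
        return key, ref[key]

    for raw in vendor_names:
        print(raw, "->", resolve(raw, reference))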

Finally, there may be legal issues, which we discuss in more detail next.

8.2 Legal Considerations

Before even considering acquiring data, whether by Web scraping or by purchasing it, one must check the legality of the data. In some cases, data providers may be offering data to which they do not have the rights. Identifying and avoiding such companies is part of data risk management.

Unfortunately, as the market moves quickly, both in terms of data demand and the proliferation of alternative data sources, the legal framework has not kept pace. The legal framework for data sharing and privacy depends on the jurisdiction in which the data are collected and the jurisdiction in which they are used.

Legal and ethical concerns pertain to two parts of alternative data: (1) Web scraping and (2) the protection of individuals’ privacy in data sharing agreements. The former concerns whether automated data collection programs, which explore publicly available information or information hidden behind logins, violate the terms of use of various sites. This has been the topic of at least 19 high-profile court cases in the USA, whose details are available in the public domain.

For example, the Infocomm Media Development Authority in Singapore launched a Trusted Data Sharing Framework as part of its regulatory sandbox; the framework includes templates for data sharing agreements, with specific clauses for the protection, privacy, and security of the data used.

Internationally, the Investment Data Standards Organization, a USA-based non-profit organization established on October 23, 2017, publishes standards for alternative data. It is made up of companies in the alternative data industry, including providers, intermediaries, and users of alternative data. Its guidelines cover the processes and risk management strategies for identifying, maintaining, and securing personally identifiable information (PII) in datasets used for investment management. In addition, the organization has also published guidelines on Web crawling.

In addition, for the alternative investment industry specifically, the Standards Board for Alternative Investment (SBAI), formerly the Hedge Fund Standards Board, is a global standard-setting agency and an affiliate member of the International Organization of Securities Commissions. It has published the Standardized Trial Data License Agreement, which addresses issues relating to new data trialing processes, including personal data protection, clauses pertaining to the prevention of insider trading, and “right-to-use.”

Users of alternative data should carefully study the regulatory regime that governs them prior to using the data, even if the data originate in another jurisdiction.

9 Conclusion

In this chapter, we covered many different types of alternative data through the lens of a common statistical framework. We discussed three primary domains: quant trading, credit scoring, and macroeconomic forecasting. Due to limited space, we devoted less discussion to issues that deserve no less attention: alternative data for ESG, for the private firm, for compliance in financial institutions, and for corporate strategy. We also largely overlooked the housing and insurance markets in this iteration.

To conclude, we conjecture about a few areas where opportunity remains large for prospective alternative data consumers and suppliers. As predictions are difficult in general, we note that we make these predictions as of the third quarter of 2020.

Although alternative data have already achieved major penetration in quantitative trading, we believe substantial opportunities remain. The USA is a relatively sophisticated market, so some, although not all, of the strategies we mentioned here may already be captured by market participants. We believe, however, that significant opportunities remain in mapping out B2B spending, either by using the digital approach or by using data from supplier cooperatives or other methods. These data are not easily attainable, and thus it is possible that opportunities will persist for a while. In addition, outside of the USA, penetration of alternative data in capital markets is likely lower, given evidence that opportunities to earn returns using traditional strategies are higher in emerging markets than in developed markets. Thus, we suspect one might find success in adapting the approaches taken even in the early generations of alternative data to other contexts, such as companies listed in Europe, Hong Kong, Japan, or mainland China. Of course, these will be significant undertakings, but that barrier to entry is itself part of the attraction.

Much as with equity markets, we believe that alternative data in less developed markets could be a ripe opportunity. An entrepreneur or data provider wishing to embark on the application or provision of alternative data may seek to apply a well-tested business model to a new market, provided American or European data providers do not already have reach in that market. For example, Dun and Bradstreet and Orbis have global reach within their data verticals. However, company filings differ by exchange and region, and a localized approach to analyzing disclosure rules is likely to present ample opportunity even for major exchanges outside the USA.

In addition, much of the same value-relevant information is also useful to private equity or venture capital investors. Thus, while many data providers focus on publicly listed firms, chasing the allure of fees from high-flying hedge fund clients, substantial opportunities remain in the private space. This is potentially increasingly true as the pools of capital directed toward unlisted companies deepen and many companies choose to forgo public markets in the USA [32].

Also, at the risk of sounding obsolete by the time of publication, we suspect the COVID-19 pandemic will stimulate many responses in the private sector in the use and adoption of data practices and standards. First, whether a company conducts itself in a socially responsible manner will assume greater importance going forward, with companies with high ESG ratings having outperformed during the recent pandemic [31]. Second, much of the data consumed by investment firms or banks today will soon be demanded by corporations or governments directly. For example, mobile data are useful for predicting company revenues, but also for monitoring social distancing or the absence thereof. In addition, real-time nowcasting of the economic impact of a future mobility restriction or pandemic could be useful in policy decision-making. Although unfortunate in its genesis, we believe one silver lining of a world in which face-to-face interaction has become less safe is the acceleration of the digitization of firms, which should serve as a catalyst for the development of data collection and financial technology in general.

We look forward to the next wave of developments in the alternative data landscape a year or two from now. We believe that we will be pleasantly surprised by future developments in many ways.