1 Introduction

In this paper we quantify the ability of Gross Domestic Product (GDP henceforth) and other aggregate income and expenditure variables to summarise well-being, defined by a broad set of economic, social, environmental sustainability and demographic indicators.Footnote 1 One of the main characteristics that allowed GDP to gain a central role among all the possible economic measures of well-being is its quantitative, monetary and synthetic nature. Although other variables share these three desirable properties – e.g., Gross National Income (GNI), household income, and household spending – no other variable plays as predominant a role in policy decisions as GDP. Together with the system of national accounting (SNA), GDP can be calculated across different countries on the same scale, thus allowing comparisons among them (Hoekstra 2019). But the monetary nature of GDP is also its main limitation. It is indeed difficult to assign a monetary value to all those human activities that affect living standards and well-being but lack a regular market. And even where markets do exist, they might be highly imperfect, with prices that do not incorporate all the possible effects that the production and consumption of such goods and services generate – this is especially true of activities generating significant externalities. For all these reasons a number of scholars and policy-making institutions have challenged the idea of relying solely upon GDP as a tool to quantify well-being and to evaluate policies (Michalos 1982; Stiglitz et al. 2009). It is worth noting that the system of national accounting itself explicitly states that “[...] GDP is often taken as a measure of welfare, but the SNA makes no claim that this is so and indeed there are several conventions in the SNA that argue against the welfare interpretation of the accounts” (United Nations 2009).
For all these reasons, in the past decades a literature has developed around the creation of new well-being indicators. The reference to well-being (also when evaluating policies based solely upon GDP) allows policymakers to account for a “fully rounded humanity” rather than only for the mere production of goods and services. On the one hand, the shift in political and policy focus towards the positive aspects of a “rounded humanity” can be considered a positive development of the economic discipline because it encourages the contemplation of being well, rather than only doing well. On the other hand, however, some perils also arise from an extreme focus on well-being. Because many broad measures are often based on individual questionnaires asking respondents for some form of psychological introspection (e.g., questions on life satisfaction, happiness, ...), they might divert the policy agenda from goals that benefit society at large. In particular, they might focus on needs that tend to satisfy the private sphere of individuals rather than the public one (Taylor 2011).

As a consequence, dashboards of indicators have been created to better evaluate well-being in a society, by taking into account dimensions such as environmental sustainability, social equity, cultural development, economic vulnerability as well as the demographic one (see OECD 2011; Pinar et al. 2014; Roser 2014; Fitoussi and Durand 2018a, b; Ferran et al. 2018; Bacchini et al. 2020; Kalimeris et al. 2020, among others). All these works focus on the multifaceted and complex nature of well-being and, at the normative level, do not imply the complete replacement of GDP with the new metrics. Rather, they suggest that GDP should be accompanied by alternative statistics.

We contribute to this literature by quantitatively evaluating the ability of GDP to capture the information embedded in a large set of social, economic and ecological indicators that are constitutive of well-being (they enter the construction of the most important well-being indicators; see OECD 2011; UNDP 2022). We make use of a widely known dimensionality reduction technique, the generalized Principal Component Analysis (PCA). Moreover, we combine PCA with the Random Matrix Theory (RMT) approach, which allows us to single out only the components that are statistically significant (see e.g., Onatski 2010). More precisely, this procedure involves comparing the empirically estimated eigenvalues with the distribution of eigenvalues that is generated by a Gaussian random model with only spurious correlations.Footnote 2 Using the leading component or the GDP as univariate measures of well-being, we reconstruct two alternative synthetic series for all the indicators. By comparing the synthetic series with their original counterparts, we quantitatively measure the ability of GDP to summarize the variability of all the indicators. We apply this strategy to nine advanced OECD economies and our findings suggest that univariate measures, and GDP among them, are only imperfect proxies of well-being. With respect to the ability of GDP to approximate single indicators, substantial heterogeneity is found at the country level, with the possibility of poor performance especially over the demographic and social equity dimensions. Overall, our results confirm that one should rely upon multivariate composite indices of well-being, which are better suited to capturing the interactions between different indicators, including those pertaining to heterogeneous domains.

2 Methodology

We start from N time series of well-being indicators observed for T periods and all sampled at the same frequency \(\Delta t\). We denote the matrix of time-series by \(\varvec{\tilde{X}}(t)\) and their complex Hilbert transformation by \(\varvec{X}(t)\).Footnote 3
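As an illustration of this preprocessing step, the following is a minimal sketch that assumes the complexification is the usual analytic signal, i.e. each real series plus the imaginary unit times its Hilbert transform; the exact construction used in the paper is the one given in its footnote. The function names `standardize` and `analytic_signal` are hypothetical, and the Hilbert transform is computed with a plain FFT so that only NumPy is needed.

```python
import numpy as np

def standardize(x):
    """Demean each column of a real T x N array and scale it to unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

def analytic_signal(x):
    """Complexify each column of a real T x N array.

    Assumption: X(t) = x(t) + i * H[x](t), obtained by zeroing the
    negative-frequency half of the discrete Fourier spectrum.
    """
    x = np.asarray(x, dtype=float)
    T = x.shape[0]
    spectrum = np.fft.fft(x, axis=0)
    # Spectral weights: keep the zero frequency (and the Nyquist term when T
    # is even), double the positive frequencies, drop the negative ones.
    h = np.zeros(T)
    h[0] = 1.0
    if T % 2 == 0:
        h[T // 2] = 1.0
        h[1:T // 2] = 2.0
    else:
        h[1:(T + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * h[:, None], axis=0)
```

The real part of the output reproduces the input series, while the imaginary part carries the phase-shifted copy that lets the complex correlation matrix pick up lead-lag relationships.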

2.1 Dimensionality Reduction

According to the generalized PCA (Ng et al. 2001) the time series of the indicators can be expressed as:

$$\begin{aligned} \varvec{X}(t) = \varvec{A}(t) \varvec{V} \end{aligned}$$
(1)

where \(\varvec{A}(t)\) is a \(T \times N\) loading matrix and \(\varvec{V}\) is a \(N \times N\) matrix of eigenvectors.Footnote 4 These eigenvectors are associated with the N-dimensional vector of eigenvalues \(\varvec{\lambda }\), computed from the spectral decomposition:

$$\begin{aligned} \varvec{C} \varvec{V} = \varvec{\lambda } \varvec{V} \end{aligned}$$
(2)

where the correlation matrix \(\varvec{C}\) can be estimated by \(\varvec{\widehat{C}} = \frac{1}{T} \varvec{X}(t)' \varvec{X}(t)\).Footnote 5 The correlation matrix \(\varvec{\widehat{C}}\) is an \(N \times N\) positive semi-definite matrix and bears N non-negative and distinct eigenvalues \(\varvec{\lambda }\) with their associated eigenvectors \(\varvec{V}\). According to Principal Component Analysis (PCA, Vidal et al. 2016), each eigenvector defines a linear combination of the original series, i.e. a principal component, which explains a portion of the total variance of the data proportional to the magnitude of the associated eigenvalue. Thus, the empirical density function of the eigenvalues can be expressed as:

$$\begin{aligned} \widehat{\rho }(\varvec{\lambda }) = \frac{dn(\varvec{\lambda })}{d\varvec{\lambda }} \end{aligned}$$
(3)

where \(n(\varvec{\lambda })\) indicates the number of eigenvalues smaller than \(\varvec{\lambda }\).

To focus solely on principal components which are statistically significant one can compare the empirical density function \(\hat{\rho }(\varvec{\lambda })\) with a theoretical benchmark distribution of eigenvalues that would have been generated under a known null-hypothesis. The random matrix theory (RMT) provides a well-specified theoretical null-hypothesis for such a statistical significance test (Onatski 2010). But for the RMT to hold, it is required that no autocorrelation exists in the series and that the series are infinitely dimensional – in the sense that both \(N,T \rightarrow \infty \), with \(Q = \frac{T}{N}\) finite. To be free from these two tight restrictions, one can alternatively rely upon the less demanding rotational random shuffling (RRS) simulations which, in the limit, converge to the same theoretical distribution of RMT (Iyetomi et al. 2011; Aoyama et al. 2017; Kichikawa et al. 2020), represented by the Marchenko-Pastur distribution:

$$\begin{aligned} \rho (\varvec{\lambda }) = {\left\{ \begin{array}{ll} \frac{Q}{2\pi \varvec{\sigma }^2}\frac{\sqrt{\left( \lambda _{M} - \varvec{\lambda }\right) \left( \varvec{\lambda } - \lambda _{m}\right) }}{\varvec{\lambda }} &{} \text {if } ~ \lambda _{m} \le \varvec{\lambda } \le \lambda _{M} \\ 0 &{} \text {else} \end{array}\right. } \end{aligned}$$
(4)

where \(\lambda _{m} = \varvec{\sigma }^2 \frac{\left( 1-\sqrt{Q}\right) ^2}{Q}\) and \(\lambda _{M} = \varvec{\sigma }^2 \frac{\left( 1+\sqrt{Q}\right) ^2}{Q}\) represent the lower and upper bounds. Deviations between the empirical distribution \(\hat{\rho }(\varvec{\lambda })\) and the theoretical one \(\rho (\varvec{\lambda })\) indicate the presence of some statistically significant components which can summarize the co-movements between the empirical indicators. This is exactly the reason why PCA is considered a dimensionality reduction technique. In particular, the number of significant components equals the number of eigenvalues which exceed the theoretical (or RRS-simulated) upper bound \(\lambda _{M}\) (Laloux et al. 2000).
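The selection rule can be sketched as follows. This is an illustrative sketch rather than the authors' implementation, and it rests on several assumptions: the columns of the \(T \times N\) input are standardized, the correlation matrix is estimated as the conjugate transpose of the data times the data divided by T, and rotational random shuffling is implemented as an independent circular shift of each series. The function names, the number of draws and the quantile used for the simulated bound are hypothetical choices.

```python
import numpy as np

def marchenko_pastur_bounds(T, N, sigma2=1.0):
    """Lower and upper edges of the Marchenko-Pastur bulk for Q = T / N."""
    Q = T / N
    lam_min = sigma2 * (1.0 - np.sqrt(Q)) ** 2 / Q
    lam_max = sigma2 * (1.0 + np.sqrt(Q)) ** 2 / Q
    return lam_min, lam_max

def count_significant(X, sigma2=1.0):
    """Eigenvalues (descending) and how many exceed the theoretical upper edge."""
    T, N = X.shape
    C = X.conj().T @ X / T                      # N x N Hermitian matrix
    eigvals = np.linalg.eigvalsh(C)[::-1]       # eigvalsh returns ascending order
    _, lam_max = marchenko_pastur_bounds(T, N, sigma2)
    return eigvals, int(np.sum(eigvals > lam_max))

def rrs_upper_bound(X, n_draws=200, quantile=0.99, seed=0):
    """Simulated upper bound from rotational random shuffling.

    Each column is rotated by its own random offset (with wrap-around), which
    breaks cross-correlations but preserves each series' autocorrelation.
    The quantile of the largest eigenvalue is an assumed decision rule.
    """
    rng = np.random.default_rng(seed)
    T, N = X.shape
    top = np.empty(n_draws)
    for d in range(n_draws):
        shifts = rng.integers(0, T, size=N)
        Xs = np.column_stack([np.roll(X[:, j], shifts[j]) for j in range(N)])
        top[d] = np.linalg.eigvalsh(Xs.conj().T @ Xs / T)[-1]
    return float(np.quantile(top, quantile))
```

If the input is a complexified version of standardized series, its columns typically carry roughly twice the variance of the real series, so `sigma2` (or the column scaling) would need to be set accordingly before comparing eigenvalues with the bound.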

2.2 Construction of Synthetic Indicators

Once the significant principal components have been selected according to the above-described procedure, one can construct a synthetic indicator of the original time series (“synthetic PC” henceforth) as follows:

$$\begin{aligned} \varvec{\widehat{X}}_{J}(t) = \varvec{A}_{J}(t) \varvec{V}_{J} \end{aligned}$$
(5)

where the index J is an integer indicating the number of significant eigenvalues, \(\varvec{\widehat{X}}_{J}(t)\) is a \(T \times N\) matrix with the synthetic series, as generated using the J leading principal components (thus the index J), \(\varvec{V}_{J}\) is a \(J \times N\) matrix of the estimated complex eigenvectors associated with the J significant eigenvalues, and \(\varvec{A}_{J}(t)\) is a \(T \times J\) matrix with the associated loadings. The synthetic series will differ from the original ones, and they represent the indicators that one would observe if the noisy component of each of the original indicators were ignored.Footnote 6
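The reconstruction in Equation (5) can be sketched as below. This is a hedged illustration: it assumes the rows of \(\varvec{V}_{J}\) are the conjugate-transposed eigenvectors of the matrix formed by the conjugate transpose of the data times the data, divided by T, and that the loadings are obtained by projecting the data on those eigenvectors. The function name `synthetic_from_leading` is hypothetical.

```python
import numpy as np

def synthetic_from_leading(X, J=1):
    """Rebuild a T x N data matrix from its J leading principal components.

    Returns the synthetic series, the T x J loadings, the J x N eigenvector
    matrix and the share of total variance captured by the J components.
    """
    T, N = X.shape
    C = X.conj().T @ X / T                       # N x N Hermitian matrix
    eigvals, eigvecs = np.linalg.eigh(C)         # eigenvectors are the columns
    order = np.argsort(eigvals)[::-1][:J]        # indices of the J largest
    V_J = eigvecs[:, order].conj().T             # J x N, one eigenvector per row
    A_J = X @ V_J.conj().T                       # T x J projections (loadings)
    X_hat = A_J @ V_J                            # T x N synthetic series
    share = eigvals[order].sum() / eigvals.sum()
    return X_hat, A_J, V_J, share
```

With `J` equal to the number of series the original data are recovered exactly; with `J=1` the relative squared reconstruction error equals one minus the returned variance share, which is the quantity discussed as the absorption rate.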

Furthermore, one can evaluate the quality of GDP at summarizing information about well-being (as provided by the large set of the N original series) by assuming that the GDP is the leading component that summarizes well-being. Formally, this corresponds to assuming that GDP replaces the component loadings \(\varvec{A}_{J}(t)\), under the condition \(J=1\). With this assumption, one can obtain an alternative synthetic indicator (“synthetic GDP” henceforth) of the original time series as follows:

$$\begin{aligned} \varvec{\widehat{X}}_{G}(t) = \alpha \varvec{A}_{G}(t) \varvec{V}_{1} \end{aligned}$$
(6)

where \(\varvec{\widehat{X}}_{G}(t)\) is the \(T \times N\) matrix with the GDP-based synthetic series (thus the index G), \(\varvec{A}_{G}(t)\) is a \(T \times 1\) matrix with the Hilbert transform of the GDP series, \(\varvec{V}_{1}\) is the \(1 \times N\) eigenvector associated with the largest eigenvalue \(\lambda _{1}\) and \(\alpha \) is a rescaling factor measuring the deviation scalar from the dominant eigenvector.Footnote 7

Once the synthetic indicators \(\varvec{\widehat{X}}_{J}(t)\) and \(\varvec{\widehat{X}}_{G}(t)\) are available, one can evaluate the quality of the matching with the original indicators \(\varvec{X}(t)\). For that, we use the root mean squared error (RMSE) and the Hilbert correlation coefficient between the complexified series.Footnote 8 Other, more complicated alternatives are possible, for example dynamic time warping (DTW). However, it is not the aim of this paper to evaluate the quality of different similarity measures.
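A minimal sketch of the two similarity measures follows. The precise definitions are those in the paper's footnote; here it is assumed that the RMSE is computed on the real parts of the series and that the correlation between two complexified series is the normalized complex inner product of the demeaned series. Both function names are hypothetical.

```python
import numpy as np

def rmse(original, synthetic):
    """Root mean squared error between the real parts of two series."""
    diff = np.real(original) - np.real(synthetic)
    return float(np.sqrt(np.mean(diff ** 2)))

def complex_correlation(z1, z2):
    """Assumed complex correlation coefficient between two complex series.

    The modulus lies in [0, 1] and measures the strength of the co-movement;
    the argument is the average phase difference between the two series.
    """
    z1 = np.asarray(z1) - np.mean(z1)
    z2 = np.asarray(z2) - np.mean(z2)
    num = np.vdot(z2, z1)                        # sum of conj(z2) * z1
    den = np.sqrt(np.vdot(z1, z1).real * np.vdot(z2, z2).real)
    return num / den
```

A synthetic series that is a rescaled and phase-shifted copy of the original has a correlation of modulus one under this definition, so the RMSE is still needed to judge how closely the levels match.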

3 Empirical Application

We employ 39 different time-series indicators capturing economic, environmental, social equity and demographic dimensions.Footnote 9 Some of the indicators included in our analysis are also components of the OECD Better Life Index (OECD 2011) and of the UNDP (2022) report. But one of the main limitations of the Better Life Index (recognized also by its authors) is that it is a cross-sectional index, not comparable over time. Therefore, in our analysis we only include those variables that can be consistently observed over a sufficient number of periods (e.g., life expectancy at birth, infant mortality rates, ...), together with other variables excluded from the above-mentioned well-being metrics but possibly related to the production of goods and services (e.g., energy consumption, housing prices, gender pay gap, ...). This allows us to compute the correlation of the principal component with the main univariate indexes often taken as indicators of national development and well-being, also accounting for time delays.

All the indicators are sampled at annual frequency and cover the 1995–2015 period, forming a balanced panel dataset for each country. Over the cross-section dimension, our analysis is performed on nine different advanced OECD economies.Footnote 10 The panel dataset format and the variable measurement, together with the normalization, ensure perfect comparability between the sampled countries.

After having transformed the indicators into stationary series by means of the first difference transformation, we apply the generalized PCA and the RMT procedures to test for the statistical significance of the estimated principal components. For most countries (all but Great Britain) we find that only the largest eigenvalue exceeds the RMT upper bound. This implies that one can significantly summarize a certain fraction of the data variance by means of a single variable. In particular, the fraction of variance captured by the first principal component is called the absorption rate and is reported in Table 1 (second column in each of the four sub-tables). The leading principal component explains between 30% and 43% of the total variance provided by the original 39 series for the nine considered economies. Spain (ESP) displays the highest absorption rate (42.58%) whereas Germany (DEU) displays the lowest one (30.25%). In addition, Great Britain is the only country for which the leading component is not statistically significant.Footnote 11

Overall, measuring well-being only by means of the best linear univariate predictor, represented by the leading component, one would lose about 57% to 70% of the overall information embedded in the series of the original indicators. This is a relevant amount of the total variation and points toward the failure of any univariate measure to account for the complex and multifaceted relationships between all the different indicators pertaining to different domains.

Table 1 Fraction of variance explained by the dominant eigenvalue of the generalized PCA (absorption rate) and correlation between the four univariate indicators (reported in the table titles) with the principal component

The last three columns in each sub-table (see Table 1) report the correlation between the leading principal component estimated over the 39 indicators reported in Appendix A and a possible synthetic univariate indicator, taken among the most important economic variables also taken as “targets” by policymakers.Footnote 12 An interesting result that we obtain from the comparison of these correlations between different indicators is that for almost all countries GDP is the one that correlates the most with the leading principal component. Figure 1 conveys similar information by visually comparing the evolution of GDP, GNI, household income and expenditure in the US with the evolution of the leading principal component (leading PC henceforth) for the same economy. The correlation is outstandingly high for all indicators in the USA, with GDP recording the highest value (91%). Overall this suggests that, should one be confined to the usage of a single indicator to capture well-being, the idea of employing GDP might not be completely far-fetched, as a first order approximation. In fact, it is fair to affirm that GDP growth mimics the dynamics of the best univariate linear approximation, specifically designed to maximize the absorption rate. However, as already mentioned, even the leading PC accounts only for about one third of the total variance and therefore, whenever possible, one should drop the constraint of adopting a single univariate indicator as a measure of well-being. We will also see in what follows that the fraction of variance not captured by these univariate indicators is particularly relevant for some domains (e.g., social justice and demography), therefore making any univariate indicator inappropriate for summarizing well-being.

Fig. 1 Time series for the leading principal component in the United States (blue) together with the time series of the four different univariate indicators (red) reported in each subtitle (colour figure online)

Fig. 2 Selected original indicators (dashed black lines with squares) and synthetic indicators constructed starting from the leading PC (blue lines) and the GDP (red lines) for the USA

The remainder of the paper aims at evaluating the ability of GDP and the leading principal component to reconstruct the original indicators. In what follows we focus our analysis on the USA and on GDP. Extensive results for the other eight countries and for the other three variables (i.e., GNI, household income and expenditure) are presented in the on-line supplementary material.Footnote 13

To further quantify the loss of information when using a univariate measure of well-being we proceed with the construction of the synthetic indicators as described in Equations (5) and (6) (respectively “Synthetic PC” and “Synthetic GDP”) for all the 39 original variables of our sample. Results for four of these indicators, pertaining to different domains (economic, environmental, social and demographic), are presented in Fig. 2. In addition, in Table 2 we also report the Percentage Root Mean Squared Error (RMSE%) for each combination of original and synthetic indicators, the correlation coefficient between them, and the corresponding p-values. Figure 2 reveals that the two synthetic indicators (the red and the blue lines) display similar patterns in all the domains considered. This is coherent with the high degree of correlation previously reported in Table 1. However, their performance in tracking the representative indicators varies widely across the domains. For instance, the original series in the economic domain, as represented by the percentage change of industrial production in the manufacturing sector, is replicated with high precision (an error of about 6 to 9 percentage points). A similar qualitative result (but quantitatively worsened, cf. Table 2) is observed in the pollution domain, represented by the percent change of greenhouse gas emissions.Footnote 14 In contrast, the synthetic reconstructed indicators are not capable of tracking the patterns for the variables selected as representative of social equity (the percent variation in the gender wage gap, with an RMSE of about \(20.5\%\)) and demography (measured by the percent variation in life expectancy). This is clearly visible in the two bottom panels of Fig. 2. It can be explained by the fact that variables pertaining to these two domains are slow moving and therefore contribute less to the total variance of the dataset. Accordingly, the PCA assigns them lower weights in the loading matrix \(\varvec{A}(t)\), and neither GDP nor the leading PC is effective in predicting them. As a consequence, the correlation coefficients between the original series and the synthetic ones are also low and not significant (see Table 2), indicating the poor ability of the univariate indicators to capture variables in all the domains at the same time. Once again, these results point toward the failure of GDP and any univariate indicator to precisely map the well-being of a country.

Table 2 Comparison of original and synthetic indicators. For each indicator constructed with the leading PC or with the GDP growth, we report the Root Mean Squared Error in percentage terms with respect to the original indicator, the correlation between the series and the p-value of the correlation coefficient
Table 3 Cross country comparison of original and synthetic indicators. We report the Root Mean Squared Error in percentage terms with respect to the original indicator

Finally, Table 3 reports the percentage RMSE of the same four indicators for all countries in our sample. We find similar heterogeneity across indicators, with substantially larger RMSE for the social equity and demographic domains. Overall these results confirm that univariate measures of well-being are doomed to fail in their attempt to summarize the information stemming from a large set of different indicators pertaining to heterogeneous domains.

4 Conclusions

We can draw two main conclusions from our work. First, our results suggest that if the aim is to measure well-being, one cannot simply rely upon univariate measures. This is because even the best linear estimator, aimed at maximizing the explained variance of a large set of variables (i.e., the PCA), is only able to capture a small portion of the indicators’ total variance (about 30% to 43%). Overall, this suggests that univariate measures of well-being are doomed to fail and one should also rely upon multivariate composite indices of well-being (Bacchini et al. 2020; Kalimeris et al. 2020) and sustainability (Pinar et al. 2014; Luzzati and Gucciardi 2015). These measures are better suited to capturing the complex interactions between different indicators, including those pertaining to very heterogeneous domains. Second, we also find that among the univariate alternatives aimed at summarizing a multitude of dimensions related to well-being, GDP fares quite well with respect to other alternatives. In particular, it delivers a performance which is similar to the one provided by the leading principal component of the series (i.e. the best linear indicator). Furthermore, it has a lower prediction error vis-à-vis other univariate variables, such as the gross national income, the household income or the household expenditure. This supports the predominant use of GDP as the leading measure of the economic performance of a country (Malay 2019).

This work could be extended in several ways. First, one might enlarge the number of series in the sample, especially with respect to the social equity and environmental dimensions. This might lead to different estimates for the loadings \(\varvec{A}(t)\), assigning different weights to the single indicators when constructing the synthetic indicators based on principal component analysis. The effect of the inclusion of new variables is a priori unclear. Indeed, the variance explained by the first factor might increase or decrease, depending upon the degree of correlation between the new indicators and the ones already in our sample. Furthermore, when new indicators are included, the number of significant factors might also vary, forcing one to account also for the higher order principal components. Second, one might enrich the analysis by considering economies at different stages of the development process. In particular, this type of research could be useful to detect whether the usage of GDP, interpreted as a univariate indicator of well-being, is more or less appropriate in developed vs. developing economies. For both these extensions, however, the main difficulty lies in data availability, especially in relation to domains different from the economic one.