Data
To understand the structure and dimensionality of development, we rely on the WDI dataset, the primary World Bank collection of development indicators, compiled from officially recognized international sources. The WDIs comprise a total of 1549 variables with yearly data between 1960 and 2016 for 217 countries. As such, the WDI dataset represents the most current and accurate global development database available (The World Bank 2018b).
Even though the WDI dataset is the most comprehensive set of development indicators available, it contains many missing values. Only for the most developed countries is the dataset (nearly) complete. For many other countries, particularly low- and middle-income countries, many indicators are partly or completely missing. This is problematic, as most dimensionality reduction methods require a dataset without missing observations. To make our analyses possible, we therefore had to select a subset of indicators, countries and years with few missing observations and to fill in the remaining missing observations using gapfilling techniques (see next section). To avoid an arbitrary subset selection, a scoring approach was used (see Sect. 2.2) and the 1000 subsets with the highest scores were selected. These 1000 subsets contained a total of 621 variables and 182 countries and covered the years 1990 to 2016. The subsets cover almost all categories of variables. The categories, with the number of selected variables relative to the number of variables in the entire WDI dataset, are “Economic Policy & Debt” (120 out of 518), “Education” (73 out of 151), “Environment” (74 out of 138), “Financial Sector” (29 out of 54), “Gender” (1 out of 21), “Health” (123 out of 226), “Infrastructure” (19 out of 41), “Poverty” (0 out of 24), “Private Sector & Trade” (103 out of 168), “Public Sector” (31 out of 83), and “Social Protection & Labor” (48 out of 161). Jointly, these subsets are representative of the original dataset while avoiding large gaps.
Gapfilling
The dimensionality reduction approach we have chosen (see Sect. 2.3) relies on a full matrix of distances between the different country–year data points. However, given the large amount of data gaps this global distance matrix cannot be computed directly. In the following, we develop an approach to find subsets of the WDI database which we can gapfill and use for estimating distances among data points.
In order to choose subsets of the WDI database covering a wide range of WDIs, countries, and years while having as few missing values as possible, the following method was applied: a series of subsets was created from the full WDI dataset using a combination of thresholds for the maximum fraction of missing values for the WDIs, \(f_v\), and countries, \(f_c\), as well as a starting year, \(y_{\text {start}}\), and an ending year, \(y_{\text {end}}\). We assigned a score to each of the resulting subsets using a grid search over the parameters \(f_v, f_c \in (0.05, 0.15, \ldots , 0.65)\) and \(y_{\text {start}}, y_{\text {end}} \in (1960, 1961, \ldots , 2017), y_{\text {start}} < y_{\text {end}}\). This parameter space contains 80,997 points, each a different combination of missing value thresholds and starting and ending years. The 1000 subsets with the highest scores were finally chosen to build the global distance matrix. For an overview of the entire method, see Fig. 1.
Each subset was created from the full WDI dataset by choosing the consecutive years from \(y_{\text {start}}\) to \(y_{\text {end}}\); WDIs with a missing value fraction, \(p_v\), higher than the corresponding threshold were dropped \((p_v > f_v)\). Then, countries with a missing value fraction, \(p_c\), higher than the corresponding threshold were dropped as well \((p_c > f_c)\). The number of remaining countries, \(n_c\), and WDIs, \(n_v\), was recorded, and the resulting subsets were filtered to retain only those with more observations (the number of countries times the number of years) than variables, leaving a total of 77,610 subsets of the WDI for score calculation.
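To make this filtering step concrete, the following minimal Python sketch implements it under our own assumptions: the WDI data are held in a 3-D numpy array of shape countries × years × variables with NaN marking gaps, and `years` labels the second axis (the array layout and all names are illustrative, not part of the WDI tooling).

```python
import numpy as np

def make_subset(cube, years, y_start, y_end, f_v, f_c):
    """Extract one candidate subset for a given parameter combination."""
    # restrict to the consecutive years [y_start, y_end]
    sub = cube[:, (years >= y_start) & (years <= y_end), :]
    # drop WDIs whose missing value fraction p_v exceeds the threshold f_v
    p_v = np.isnan(sub).mean(axis=(0, 1))
    sub = sub[:, :, p_v <= f_v]
    # then drop countries whose missing value fraction p_c exceeds f_c
    p_c = np.isnan(sub).mean(axis=(1, 2))
    sub = sub[p_c <= f_c]
    n_c, n_y, n_v = sub.shape
    # keep only subsets with more observations (countries x years) than variables
    return sub if n_c * n_y > n_v else None
```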
To account for the different scales of the parameters, the values had to be rescaled: we calculated \(n'_v\) from \(n_v\) by linearly scaling the values across subsets to a minimum of 0 and a maximum of 1, and analogously for \(n'_c\), \(f'_c\), and \(f'_v\). The final score was then calculated as
$$\begin{aligned} \text {score} = \sqrt{n'_v n'_c} - \sqrt{f'_c f'_v}. \end{aligned}$$
The score is built from geometric means of the rescaled quantities. The geometric mean has the advantage over the arithmetic mean that it is very sensitive to single bad values. As we want to maximize the number of countries and WDIs chosen while keeping the fraction of missing values as small as possible, the final score is the difference between the two geometric means.
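A minimal sketch of the rescaling and scoring, assuming the four quantities have been collected into arrays over all candidate subsets (the toy values below are invented for illustration):

```python
import numpy as np

def rescale(x):
    """Linearly rescale values across subsets to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def score(n_v, n_c, f_v, f_c):
    """Geometric mean of subset sizes minus geometric mean of gap thresholds."""
    nv, nc, fv, fc = rescale(n_v), rescale(n_c), rescale(f_v), rescale(f_c)
    return np.sqrt(nv * nc) - np.sqrt(fc * fv)

# toy example: score four candidate subsets and keep the two best
s = score(n_v=[100, 300, 50, 620], n_c=[180, 90, 182, 120],
          f_v=[0.05, 0.35, 0.65, 0.25], f_c=[0.15, 0.45, 0.05, 0.25])
best = np.argsort(s)[::-1][:2]
```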
Finally, the subsetted WDI data matrices with the 1000 highest scores were selected and a gapfilling procedure using Probabilistic PCA (Stacklies et al. 2007) was performed on the centered and standardized (\(z\)-transformed) variables using the leading 20 dimensions.
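As an illustration of this kind of gapfilling, the sketch below uses a simplified EM-style iterative low-rank imputation; it is a stand-in for, not a reimplementation of, the Probabilistic PCA of Stacklies et al. (2007), and it assumes every variable has nonzero variance on its observed values.

```python
import numpy as np

def gapfill(X, n_dims=20, n_iter=200, tol=1e-6):
    """Iterative low-rank imputation on z-transformed variables (NaN = gap)."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    mu, sd = np.nanmean(X, axis=0), np.nanstd(X, axis=0)
    Z = (X - mu) / sd                       # center and standardize
    Z[miss] = 0.0                           # initialize gaps at the variable mean
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        recon = (U[:, :n_dims] * s[:n_dims]) @ Vt[:n_dims]
        delta = np.max(np.abs(Z[miss] - recon[miss])) if miss.any() else 0.0
        Z[miss] = recon[miss]               # update only the missing entries
        if delta < tol:                     # stop once the imputations converge
            break
    return Z * sd + mu                      # back-transform to original units
```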
Dimensionality Reduction
Dimensionality reduction describes a family of multivariate methods that find alternative representations of data by constructing linear or, in our case, nonlinear combinations of the original variables such that important properties are maintained in as few dimensions as possible. A plethora of algorithms is currently available for dimensionality reduction, both linear and nonlinear (Arenas-Garcia et al. 2013; Van Der Maaten et al. 2009; Kraemer et al. 2018), but PCA dominates in the applied sciences because of its ease of use and interpretability.
One method to find an embedding from a known distance matrix is “classical Multidimensional Scaling” (CMDS; Torgerson 1952). This method is equivalent to PCA if the distance matrix is computed from the observations using the Euclidean distance. CMDS finds coordinates in a reduced Euclidean space of dimension \(i\) minimizing
$$\begin{aligned} {\Vert \tau (D) - \tau (D_i)\Vert }_2, \end{aligned}$$
where \(D\) is the matrix of Euclidean distances between observations and \(D_i\) the matrix of Euclidean distances between the embedded points. \(\tau (D) = -\frac{1}{2} HSH\) is the “double centering operator”, with \(S = [D_{ij}^2]\), \(H = [\delta _{ij} - \frac{1}{n}]\), and \({\Vert X \Vert }_2 = \sqrt{\sum _{ij} X_{ij}^2}\) the \(L_2\)-norm. CMDS, and therefore PCA, tends to maintain the large-scale gradients of the data and cannot cope with nonlinear relations between the covariates.
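A minimal sketch of CMDS following the definitions above (all names are ours; when \(D\) holds Euclidean distances, the result coincides with PCA scores up to sign):

```python
import numpy as np

def cmds(D, n_dims=2):
    """Classical MDS: embed points given a symmetric distance matrix D."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix H
    B = -0.5 * H @ (D ** 2) @ H               # tau(D): double centering of S
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:n_dims]  # leading eigenpairs
    lam = np.clip(evals[order], 0, None)      # guard against tiny negative eigenvalues
    return evecs[:, order] * np.sqrt(lam)     # coordinates in n_dims dimensions
```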
“Isometric Feature Mapping” (Isomap; Tenenbaum et al. 2000) extends CMDS, but instead of Euclidean distances, it respects geodesic distances, i.e. the distances measured along a manifold of possibly lower dimensionality,
$$\begin{aligned} \Vert \tau (D_{\text {geo}}) - \tau (D_i) \Vert _2. \end{aligned}$$
Specifically, Isomap uses geodesic distances, \(D_{\text {geo}} = [d_{\text {geo}}(x_i, x_j)]\), which are the distances between two points measured along a \(k\)-nearest neighbor graph of the points sampled from the manifold.
Isomap is guaranteed to recover the structure of nonlinear manifolds whose intrinsic geometry is that of a convex region of Euclidean space (Tenenbaum et al. 2000). Isomap unfolds curved manifolds, which makes the method more efficient than PCA at reducing the number of necessary dimensions in the presence of nonlinearities.
To construct the geodesic distances, a graph is created by connecting each point to its \(k\) nearest neighbors and distances are measured along this graph. If the data samples the manifold well enough, then the distances along the graph will approximate the geodesic distances along the manifold. The value of \(k\) will determine the quality of the embedding and has to be tuned.
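The graph construction and shortest-path step might be sketched as follows, here using off-the-shelf scikit-learn and scipy routines rather than the authors' implementation; the sketch assumes the \(k\)-nearest-neighbor graph is connected.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def geodesic_distances(X, k):
    """Approximate geodesic distances along the k-nearest-neighbor graph."""
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    G = G.maximum(G.T)                      # symmetrize the neighborhood graph
    # shortest paths along the graph approximate distances on the manifold
    return shortest_path(G, method="D", directed=False)
```

Combining this with the CMDS sketch above, `cmds(geodesic_distances(X, k), n_dims)` yields a basic Isomap embedding.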
We applied Isomap to the 1000 previously generated subsets of the WDI database. To get an intuition for Isomap, we recommend the original publication (Tenenbaum et al. 2000), which contains an excellent didactic explanation of the method. To find the optimal value of \(k\) for each subset, \(k_i\), Isomap was first calculated with \(k_i = 5\) and the residual variance of the embedding for the first component was calculated (see below). This process was repeated, increasing \(k_i\) by 5 in each step, until the residual variance of the first component no longer decreased (Mahecha et al. 2007).
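The tuning loop just described might look as follows; this sketch reuses the `geodesic_distances` and `cmds` helpers from the sketches above and inlines the residual variance of Sect. 2.5 (the step size and cutoff `k_max` are our own illustrative choices).

```python
import numpy as np

def tune_k(X, step=5, k_max=100):
    """Grow k in steps of 5 until the 1-D residual variance stops decreasing."""
    best_k, prev_rv = None, np.inf
    for k in range(step, k_max + 1, step):
        D_geo = geodesic_distances(X, k)     # assumes a connected k-NN graph
        emb = cmds(D_geo, n_dims=1)          # first Isomap component
        d1 = np.abs(emb[:, 0][:, None] - emb[:, 0][None, :])
        iu = np.triu_indices(len(D_geo), k=1)
        rv = 1 - np.corrcoef(D_geo[iu], d1[iu])[0, 1] ** 2
        if rv >= prev_rv:                    # no further decrease: stop
            break
        best_k, prev_rv = k, rv
    return best_k
```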
Ensemble PCA and Ensemble Isometric Feature Mapping
An observation consists of a country and a year. To calculate a linear embedding (ensemble PCA) over the union of all countries, years, and variables chosen before, we used a Probabilistic PCA with \(d = 80\) dimensions to gapfill all the observations and variables occurring in the subsets of the WDI dataset and applied a standard PCA to the gapfilled dataset.
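Schematically, and reusing the `gapfill` sketch from the gapfilling section, this two-step procedure might read as follows, where `X_union` is a hypothetical placeholder for the observations × variables matrix over the union of all subsets (NaN marking gaps):

```python
import numpy as np

X_filled = gapfill(X_union, n_dims=80)            # gapfill with d = 80 dimensions
Z = (X_filled - X_filled.mean(0)) / X_filled.std(0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)  # standard PCA via SVD
pca_scores = U * s                                # ensemble PCA embedding
```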
We developed “Ensemble Isometric Feature Mapping” (e-Isomap) to produce the final nonlinear embedding based on the different gapfilled subsets of data. E-Isomap combines the \(m = 1000\) geodesic distance matrices created from the subsets of the previous step and constructs a global ensemble geodesic distance matrix, \(D^{*}_\text {geo}\), from the geodesic distance matrices of the \(m\) Isomaps.
Let the total set of observations be \(I = \{1, \ldots , n\}\) (each a country–year combination) and the observed variables \(V = \{1, \ldots , p\}\) (the WDIs). We first perform one Isomap \(i \in \{1, \ldots , m\}\) per subset of \(I\) and \(V\), denoted \(I_i\) and \(V_i\) respectively, where \(|V_i|\) is the number of variables used for Isomap \(i\). The geodesic distance matrix for Isomap \(i\) is \(D_{\text {geo},i} = {(d_{\text {geo}, i}(x_j, x_k))}_{j,k}\) with \(j, k \in I_i\). If a pair of observations \((x_j, x_k)\) does not occur in Isomap \(i\), its distance is treated as a missing value. First, the geodesic distance matrices are scaled element-wise to account for the different numbers of variables used,
$$\begin{aligned} d'_{\text {geo}, i}(x_j, x_k) = d_{\text {geo}, i}(x_j, x_k) \sqrt{\frac{|V|}{|V_i|} }, \end{aligned}$$
which are then combined into a single geodesic distance matrix \(D_{\text {geo}}^{*}\) by using the maximum distance value,
$$\begin{aligned} d_{\text {geo}}^{*}(x_j, x_k) = \max _i d'_{\text {geo}, i}(x_j, x_k). \end{aligned}$$
Missing values are ignored when taking the maximum; if all values are missing for a pair \((x_j, x_k)\), the distance is treated as infinite. Taking the maximum avoids short-circuiting distances and, as long as there are few missing values, provides an accurate approximation of the internal distances.
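The combination step might be sketched as below, under the assumption that each per-subset geodesic distance matrix has been aligned on the full observation set with NaN wherever a pair is absent from that subset (all names are ours):

```python
import numpy as np
import warnings

def ensemble_geodesic(D_list, v_sizes, p):
    """Combine per-subset geodesic distance matrices into D*_geo.

    D_list[i] is n x n with NaN for absent pairs; v_sizes[i] = |V_i|, p = |V|."""
    stack = np.stack([D * np.sqrt(p / v)          # rescale for |V_i| variables
                      for D, v in zip(D_list, v_sizes)])
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", RuntimeWarning)  # all-NaN pairs expected
        D_star = np.nanmax(stack, axis=0)         # maximum over the ensemble
    D_star[np.isnan(D_star)] = np.inf             # never-observed pairs: infinite
    return D_star
```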
The \(k\) nearest neighbor graph \(G\) is then constructed from the distance matrix, and each edge \(\{x_i, x_j\}\) is weighted by \(\frac{|x_i - x_j|}{\sqrt{M(i)M(j)}}\), where \(M(i)\) is the mean distance of \(x_i\) to its \(k\) nearest neighbors. This last step is called c-Isomap (de Silva and Tenenbaum 2003); it contracts sparsely sampled regions of the manifold and expands densely sampled regions, which proved to give a more evenly distributed embedding. Finally, the geodesic distances are calculated on \(G\) and classical scaling is performed to find the final embedding.
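A sketch of this final step, reusing the `cmds` helper from above; it assumes \(D^{*}_\text {geo}\) has a zero diagonal, at least \(k\) finite neighbors per point, and a connected resulting graph (again, names and the dense-matrix representation are illustrative choices).

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def c_isomap_embed(D_star, k, n_dims=2):
    """c-Isomap re-weighting of the k-NN graph of D*_geo, then CMDS."""
    n = len(D_star)
    nn = np.argsort(D_star, axis=1)[:, 1:k + 1]   # k nearest neighbors (skip self)
    M = np.take_along_axis(D_star, nn, axis=1).mean(axis=1)
    W = np.zeros((n, n))                          # 0 = no edge for csgraph
    rows = np.repeat(np.arange(n), k)
    cols = nn.ravel()
    # c-Isomap edge weights: distance scaled by local density terms M(i), M(j)
    W[rows, cols] = D_star[rows, cols] / np.sqrt(M[rows] * M[cols])
    W = np.maximum(W, W.T)                        # symmetrize the graph
    D_geo = shortest_path(W, directed=False)      # geodesic distances on G
    return cmds(D_geo, n_dims)                    # classical scaling (sketch above)
```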
Quality Measurement of an Embedding and Influence of Variables
The quality of an embedding is estimated by calculating the residual variance (Tenenbaum et al. 2000), computed as
$$\begin{aligned} \text {residual variance}_i = 1 - r^2 ({\hat{D}}, D_i) = 1 - \text {explained variance}_i, \end{aligned}$$
where \(D_i\) is the matrix of Euclidean distances of the first \(i\) embedded components and \({\hat{D}}\) is the distance matrix in the original space: the matrix of Euclidean distances for PCA and the matrix of geodesic distances for Isomap. Note that because \(D_i\) and \({\hat{D}}\) are symmetric, we only use one triangle of each matrix for the calculation of the residual variance. This notion of explained variance is different from the one usually used for PCA, which is derived from the eigenvalue spectrum, but the measure used here has the advantage that it gives comparable results for arbitrary embeddings, such as the HDI and Isomap.
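In code, the residual variance might be computed as follows (a minimal sketch; `emb` holds the first \(i\) embedding components as columns):

```python
import numpy as np

def residual_variance(D_hat, emb):
    """Residual variance 1 - r^2(D_hat, D_i) for the first i components."""
    diff = emb[:, None, :] - emb[None, :, :]
    D_i = np.sqrt((diff ** 2).sum(-1))            # distances in embedding space
    iu = np.triu_indices_from(D_hat, k=1)         # one triangle only (symmetry)
    r = np.corrcoef(D_hat[iu], D_i[iu])[0, 1]
    return 1.0 - r ** 2
```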
To assess the influence of single variables on the final e-Isomap dimensions, we calculated the distance correlation (dcor; Székely et al. 2007), a measure of dependence between variables that takes nonlinearities into account. Due to the strong nonlinearities in the dataset and the embedding method, a simple linear correlation would not provide sufficient information about the relationships between the variables and the embedding dimensions.
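For reference, a compact sketch of the sample distance correlation between two 1-D samples, following the double-centering construction of Székely et al. (2007); tested implementations also exist, e.g. the Python dcor package.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D samples of equal length."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[:, None]
    a = np.abs(x - x.T)                                 # pairwise distance matrices
    b = np.abs(y - y.T)
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()   # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = max((A * B).mean(), 0.0)                    # squared distance covariance
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))
```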