1 Introduction

Most studies that rank countries on the basis of their health status use life expectancy or the mortality rate as an indicator of the health status of a country, thereby implicitly assuming that health is a one-dimensional concept (cf. Charlton et al. 1983; Nolte and McKee 2003, 2008). However, this is not in line with the WHO definition of health, according to which health is “a state of complete physical, social and mental well-being, and not merely the absence of disease or infirmity. Health is a resource for everyday life, not the object of living, and is a positive concept emphasizing social and personal resources as well as physical capabilities” (WHO 1946). This definition suggests that health is a multi-faceted concept.

Nowadays, there is much information available on national health. How should all this information be combined? In other words, what is the appropriate conceptual framework for measuring health (Cutler et al. 1997)? What lessons can be learned from such a framework with respect to cross-country differences in health?

In our attempt to answer these questions, we applied factor analysis to various national health indicators for 171 countries over the period 2000–2005 to examine whether health has more than one dimension. Factor analysis is a well-suited instrument for identifying what different indicators of a latent construct (like health) have in common and for separating common factors from specific factors. We used the outcomes of the factor analysis to construct two new health measures. The first one refers to the health of individuals and the second captures the (quality of) health services.

Our measures differ substantially from indicators used in previous studies on health and also lead to different rankings of countries. As rankings are not that informative without further information, we analyzed the distance between each country and the sample mean. Differences between countries are much more pronounced for our measure on health services than for our measure on the health of individuals. Using cluster analysis, we classified the countries into six homogeneous groups. Health differs substantially across these clusters.

The remainder of the paper is structured as follows. The next section explains factor analysis, while in Sect. 3 this method is applied to various indicators of health. Sect. 4 presents our rankings and a cluster analysis, while Sect. 5 offers a discussion of some of our findings. The final section presents our conclusions.

2 Methods

2.1 Model

Most previous studies on health employed an arbitrarily chosen one-dimensional indicator of health. The question is whether these indicators represent all dimensions of health. Furthermore, most indicators of health contain measurement errors that may lead to biased estimates (Klitgaard and Fedderke 1995). This is especially the case for samples including developing countries. To come up with a better measure for health and to determine whether health has a multidimensional character, we employed a so-called Exploratory Factor Analysis (EFA). The first step in this analysis is to check whether the data are suitable for an EFA using the Kaiser–Meyer–Olkin measure of sampling adequacy, which tests whether the partial correlations among the variables are low. A test statistic above 0.6 indicates that the data are suitable for an EFA (Kaiser 1970). An alternative test is Bartlett’s test of sphericity, which checks whether the correlation matrix is an identity matrix, in which case the factor model is inappropriate (Lattin et al. 2003).
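As an illustration of the suitability check described above, the following is a minimal sketch of the overall Kaiser–Meyer–Olkin measure, computed as the sum of squared correlations divided by the sum of squared correlations plus squared partial correlations. The function names and the three-indicator correlation matrix are hypothetical and are not taken from the paper's data; the partial correlations are obtained from the inverse of the correlation matrix.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity matrix.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def kmo(corr):
    """Overall KMO measure of sampling adequacy from a correlation matrix."""
    q = invert(corr)
    n = len(corr)
    sum_r2 = 0.0  # squared correlations (off-diagonal)
    sum_p2 = 0.0  # squared partial correlations (off-diagonal)
    for i in range(n):
        for j in range(n):
            if i != j:
                partial = -q[i][j] / (q[i][i] * q[j][j]) ** 0.5
                sum_r2 += corr[i][j] ** 2
                sum_p2 += partial ** 2
    return sum_r2 / (sum_r2 + sum_p2)


# Hypothetical correlation matrix for three indicators (illustration only).
R = [[1.0, 0.6, 0.6],
     [0.6, 1.0, 0.6],
     [0.6, 0.6, 1.0]]
print(round(kmo(R), 3))
```

With the 0.6 rule of thumb cited above, a value of this size would count as adequate; low partial correlations push the measure towards one.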

The objective of an EFA is to identify what different indicators of a latent variable (like health) have in common and to separate common factors from specific factors. Following Wansbeek and Meijer (2000) and Lattin et al. (2003), the EFA model can be written as:

$$ x_{i} = \Updelta \xi_{i} + \varepsilon_{i} $$
(1)

where x_i is a vector containing the M indicators for observation i, i = 1, …, N (in our case the various indicators of health for country i), ∆ is a matrix of factor loadings of order M × k, and ξ_i is a vector of k latent variables with mean zero and positive definite covariance matrix. The random error term ε_i is assumed to be uncorrelated with the latent variables.Footnote 1 Under these assumptions, the covariance matrix of x_i is:

$$ \Upxi = \Updelta \Upphi \Updelta^{\prime } + \Upomega $$
(2)

where Ξ is the parameterised covariance matrix that can be decomposed into the covariance matrix of the factors Φ and the diagonal covariance matrix of error terms Ω. The model is estimated with the Maximum Likelihood (ML) method. By assuming that the factors and the disturbance term are normally distributed, it follows that the indicators are normally distributed. Apart from constants and a negative scale factor, the log-likelihood function can be written as:

$$ \ln L = \ln \left| \Upxi \right| + tr\left[ {S\Upxi^{ - 1} } \right] $$
(3)

where S represents the sample covariance matrix. Minimizing this fit function means choosing the values for the unknown parameters so that the implied covariance matrix comes as close as possible to the sample covariance matrix.
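For readers who want the intermediate step, Eq. 2 follows from Eq. 1 and the stated assumptions. The sketch below uses the same notation and additionally assumes that the error term has mean zero:

$$ E\left[ {x_{i} x_{i}^{\prime } } \right] = \Updelta E\left[ {\xi_{i} \xi_{i}^{\prime } } \right]\Updelta^{\prime } + \Updelta E\left[ {\xi_{i} \varepsilon_{i}^{\prime } } \right] + E\left[ {\varepsilon_{i} \xi_{i}^{\prime } } \right]\Updelta^{\prime } + E\left[ {\varepsilon_{i} \varepsilon_{i}^{\prime } } \right] = \Updelta \Upphi \Updelta^{\prime } + \Upomega $$

The two cross terms vanish because the errors are uncorrelated with the latent variables. Likewise, if the implied covariance matrix could reproduce S exactly, the fit function in Eq. 3 would attain its lower bound, ln|S| + M, since tr[SS⁻¹] equals the number of indicators M.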

The next step is to decide on the number of factors to represent health on the basis of the scree plot, which plots the number of factors against the eigenvalues of the covariance matrix of the indicators. In general, there are two ways of interpreting the graph. According to Kaiser’s Rule, only factors with an eigenvalue exceeding unity should be retained (Kaiser and Dickman 1959). An alternative way is to look for an ‘elbow’ in the scree plot, i.e., the point after which the remaining factors decline in approximately a linear fashion, and to retain only the factors above the elbow.
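The two reading rules for the scree plot can be sketched in a few lines. The example below is an illustration under stated assumptions: the function names are hypothetical, the indicators are taken to be standardized (so that the eigenvalues are those of a correlation matrix), and the eigenvalues come from an artificial matrix in which all indicators correlate equally, for which the eigenvalues are known in closed form.

```python
def equicorrelation_eigenvalues(p, r):
    """Eigenvalues of a p x p correlation matrix whose off-diagonal entries all equal r.

    One eigenvalue equals 1 + (p - 1) * r; the remaining p - 1 equal 1 - r.
    """
    return sorted([1 + (p - 1) * r] + [1 - r] * (p - 1), reverse=True)


def kaiser_count(eigenvalues):
    """Number of factors retained under Kaiser's rule (eigenvalue above one)."""
    return sum(1 for value in eigenvalues if value > 1.0)


def successive_drops(eigenvalues):
    """Drops between consecutive eigenvalues; a large drop followed by
    small, similar drops marks the 'elbow' of the scree plot."""
    return [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]


# Hypothetical example: five indicators, all pairwise correlations 0.5.
eigenvalues = equicorrelation_eigenvalues(5, 0.5)
print(eigenvalues)                    # [3.0, 0.5, 0.5, 0.5, 0.5]
print(kaiser_count(eigenvalues))      # 1
print(successive_drops(eigenvalues))  # [2.5, 0.0, 0.0, 0.0]
```

In this artificial case both rules agree on a single factor; with real data they can disagree, which is why the fit of competing models is compared as well.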

After deciding on the number of factors, it is possible that the factors of the (standardized) solution of the model are difficult to interpret. In that case, rotating the factor loadings may yield a solution that is easier to interpret because the matrix has a simpler structure. Ideally, each indicator is correlated with as few factors as possible. The rotation technique that we used to interpret the factors is the Oblimin rotation, which allows for correlation among the factors and minimizes the correlation of the columns of the factor loadings matrix. As a result, a typical indicator will have high factor loadings on one factor, while it has low loadings on the other factors (Harris and Kaiser 1964).

All countries received factor scores for the various dimensions (factors) identified. These factor scores were computed with the so-called Bartlett predictor, i.e., the best linear unbiased predictor of the factor scores:

$$ \xi_{i} = \Upphi \Updelta^{\prime } \theta^{ - 1} x_{i} $$
(4)

These factor scores were used as indicators of the health status of a country.

2.2 Data

The selection of indicators of health is based on two rules. First, data should be widely available for a large number of countries. Here we faced a trade-off, as some indicators were only available for a limited number of countries. Second, to aggregate the data from micro level to macro level, the data should be gathered in a consistent way across countries and over time periods. We used data from the World Development Indicators of the World Bank and from the Statistical Information System of the World Health Organization.

We grouped our data on the health of individuals in three broad categories. Our first category contains various indicators on lifetime. It is quite common to proxy the health status of a country by the population’s life expectancy or mortality rate. In this category, we also included the number of healthy years that a person has and the prevalence of children with malnutrition, measured by the share of children who are underweight. Our second category refers to the prevalence of various communicable diseases. These include diseases that are transmitted from person to person or through insect bites and that can be fatal. Most diseases in this category can be epidemic and may form a serious threat to the health status of a country, especially in developing countries. Finally, our third category includes various non-communicable diseases. These are not caused by transmission, but by incident or by lifestyle. These diseases are more common in industrialized countries.

We applied factor analysis to 27 national indicators of the health of individuals. Table 1 presents the indicators used and their sources.

Table 1 Indicators of the health of individuals

A different measure for the health status of a country is the quality of its health services. Therefore, we also applied factor analysis to 10 indicators of national health services. The indicators used and their sources are given in Table 2. Our first category includes indicators of the availability of health care. The more capacity there is, the earlier a patient will see a doctor and receive care. The second group of variables captures immunization. We argue that the immunization rate is a policy variable determined by the government (cf. Lake and Baum 2001).Footnote 2

Table 2 Indicators of health services

For both measures we used averages over the period 2000–2005 for a sample of 171 countries, giving 4,446 observations for the health of individuals and 1,710 observations for health services.Footnote 3 For some countries one or two indicators were not available, yielding 214 missing observations for the health of individuals and 83 for health services, which is in both cases less than 5%. In order not to lose valuable information, we applied the EM algorithm to impute the missing observations. The EM algorithm was suggested by Dempster et al. (1977) to solve maximum likelihood problems with missing data. It is an iterative method: the expectation step involves forming a log-likelihood function for the latent data as if they were observed and taking its expectation, while in the maximization step the resulting expected log-likelihood is maximized.

3 Results

The Kaiser–Meyer–Olkin measure of sampling adequacy and Bartlett’s test of sphericity indicated that our data could be used for an Exploratory Factor Analysis.

First, we analysed individual health. Because our data is measured on an interval or ratio scale and is normally distributed, Table 3 shows Pearson’s correlation coefficient. The results indicate that the correlations between the different indicators are often quite low, although generally significant. Therefore, we consider the different indicators of individual health as imperfect measures of health containing measurement errors (see also Pan American Health Organization 2001; Häkkinen and Joumard 2007; Klitgaard and Fedderke 1995).

Table 3 Correlation matrix: indicators of the health of individuals

To determine the number of factors to extract from the various indicators, we used the scree plot (see Fig. 1). According to the Kaiser rule, more than six factors should be identified. However, this is probably a so-called Heywood (1931) case, in which some estimates of the unique variances of the indicators are smaller than zero. If instead the elbow criterion is used, individual health can be represented as a one-dimensional construct. Both models were compared using a likelihood ratio test. In this case, the multiple-factor model does not fit the data significantly better than the one-factor model. The goodness-of-fit test statistic for the one-factor model, which is χ²(324) distributed, is 2795.91 and is highly significant (compared to a saturated model) at the five percent significance level, suggesting that the one-factor model is appropriate.
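The degrees of freedom quoted for the goodness-of-fit statistics can be checked against the standard count for a maximum likelihood factor model with p indicators and k factors, ((p − k)² − (p + k))/2. The helper below is a hypothetical illustration of that arithmetic, not part of the original analysis.

```python
def efa_degrees_of_freedom(n_indicators, n_factors):
    """Degrees of freedom of the ML goodness-of-fit test for a factor model.

    Computed as ((p - k)^2 - (p + k)) / 2 for p indicators and k factors.
    The numerator is always even, so integer division is exact.
    """
    p, k = n_indicators, n_factors
    return ((p - k) ** 2 - (p + k)) // 2


print(efa_degrees_of_freedom(27, 1))  # 324, one-factor model, health of individuals
print(efa_degrees_of_freedom(10, 1))  # 35, one-factor model, health services
```

Both values match the degrees of freedom reported in the text for the two one-factor models.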

Fig. 1 Scree plot of the eigenvalues and number of factors for indicators of health of individuals

Table 4 presents the factor loadings of the various national indicators of the health of individuals and the variance of the indicators explained by the first factor. More than 60% of the variance is explained by the first factor and about 40% of the total variance is unique. The one-factor model can explain about 89% of the total variance of the mortality rate below 5 years, but less than 33% of the age-standardized mortality of cancer.
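The link between a loading and the explained variance of an indicator can be made explicit. In a one-factor model with standardized indicators and a standardized factor, the share of an indicator's variance explained by the factor equals its squared loading, and the remainder is unique variance. The sketch below applies this to the two shares quoted above; the resulting loadings are implied values under these assumptions, not figures taken from Table 4, and the function names are hypothetical.

```python
from math import sqrt


def implied_loading(explained_share):
    """Absolute loading implied by the share of variance a single factor explains."""
    return sqrt(explained_share)


def uniqueness(loading):
    """Share of an indicator's variance not explained by the single factor."""
    return 1.0 - loading ** 2


# Shares quoted in the text: about 0.89 (under-five mortality) and 0.33 (cancer mortality).
for share in (0.89, 0.33):
    loading = implied_loading(share)
    print(round(loading, 3), round(uniqueness(loading), 2))
```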

Table 4 Factor matrix health of individuals

Next, we performed a factor analysis on the indicators of health services. Table 5 shows Pearson’s correlation coefficients. The results indicate that the correlations between the different indicators are often quite low, although generally significant. The scree plot is shown in Fig. 2. According to the Kaiser rule, two factors should be identified, while the elbow interpretation indicated only one factor. Both models were compared using a likelihood ratio test. The two-factor model does not fit the data significantly better than the one-factor model. The goodness-of-fit statistic of the one-factor model, which is χ²(35) distributed, is 438.98 and is highly significant at the five percent significance level, suggesting that the one-factor model is appropriate.

Table 5 Correlation matrix: indicators of health services

Fig. 2 Scree plot of the eigenvalues and factors of the indicators of health services

Table 6 presents the factor loadings of the various indicators and the variance of the indicators explained by the factor. About 60% of the variance is explained by the factor and about 40% of the total variance is unique.

Table 6 Factor matrix health services

4 Health Ranking and Cluster Analysis

We constructed new measures for the health of individuals and health services based on the factor scores as reported in Sect. 3. Table 10 in the “Appendix” shows the full list of the predicted factor scores and the implied ranking of the various countries.

The rankings lead to a number of conclusions. First, not surprisingly, western countries and Japan dominate the top of the rankings, while mostly African countries take the positions at the bottom. Second, in the ranking based on health services Cuba and Belarus score remarkably high. Third, the ranking differs substantially from the most recent ranking on health over almost the same period by Nolte and McKee (2008) for OECD countries (see Table 7). According to the results of Nolte and McKee (2008), France outranks all other countries in the OECD area. However, in our rankings France takes eighth place on the health of individuals and only 14th place on health services. Another example is Spain, which takes third place in the ranking of Nolte and McKee (2008), but 13th place in our ranking of health services.

Table 7 Health ranking of OECD countries

As rankings are not that informative without further information, Table 7 also presents the distance between each OECD country and the OECD mean.Footnote 4 This measure gives a much better impression about health differences between countries. The results show that there is a large difference between both health measures. While France scores about 2.5% higher in our measure on individual health, it scores about 11% below the mean on our health services measure. Nolte and McKee (2008) report that the United States scores about 27% below the mean. However, according to our measure of individual health, the United States scores only about 13% below the mean, while it scores above the mean according to our measure for health services. In general, Nolte and McKee (2008) report more dispersion compared to our measure on the health of individuals. However, the variance among the countries in our sample for our measure on health services is much higher than that of Nolte and McKee (2008). These results are confirmed if we take the standard deviation of the various measures divided by their mean.
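The two dispersion summaries used in this comparison, the percentage distance of each country from the group mean and the standard deviation divided by the mean, can be sketched as follows. The country labels and scores are hypothetical, the functions assume scores with a positive mean, and the population standard deviation is an assumption; the exact transformation used for Table 7 is not reproduced here.

```python
from statistics import mean, pstdev


def percent_distance_from_mean(scores):
    """Distance of each country's score from the group mean, in percent of the mean."""
    group_mean = mean(scores.values())
    return {country: 100.0 * (score - group_mean) / group_mean
            for country, score in scores.items()}


def coefficient_of_variation(scores):
    """Standard deviation divided by the mean (population standard deviation assumed)."""
    values = list(scores.values())
    return pstdev(values) / mean(values)


# Hypothetical scores for three countries (illustration only).
scores = {"A": 90.0, "B": 100.0, "C": 110.0}
print(percent_distance_from_mean(scores))
print(round(coefficient_of_variation(scores), 4))
```

Unlike a rank, the percentage distance shows how far apart two neighbouring countries actually are, and the coefficient of variation makes dispersion comparable across measures with different scales.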

Furthermore, if we expand our sample including not only the OECD countries, we find a similar, but even more pronounced, pattern. The data show that the differences between a country’s score and the sample mean are much higher for the measure for health services than they are for the measure for the health of individuals. The variance of the individual health measure is 1.1, while for the health services measure the variance is 2.4.

To sum up, our results indicate that there exist significant differences between our measures. The ranking based on the health of individuals is less dispersed than the ranking based on the quality of health services. This strengthens our conclusion that both measures are capturing different dimensions of a country’s health. So in contrast to Nolte and McKee (2008), we argue that cross-country comparisons of health should not be based on only one (arbitrarily chosen) variable.

To get a better view of health differences across countries, we categorized the countries in our sample on the basis of their similarities and differences using cluster analysis. Cluster analysis is recognized as a useful technique for this purpose and has been employed extensively in social and economic sciences (Punj and Stewart 1983; Hair et al. 1998).

For the cluster analysis we used our two health measures as identified by the factor analysis. We also included some additional health related variables: public health expenditure as a percentage of GDP, the percentage of the population having access to improved sanitation, the percentage of the population having access to improved water resources, and GDP per capita.Footnote 5

The first step is to detect outliers and check for multicollinearity. Outliers distort the true structure of the data and make the derived clusters unrepresentative of the population structure. To test whether an observation is an outlier we used the Mahalanobis D² (Hair et al. 1998). The Mahalanobis D² measures the distance of an observation from the centre of mass of the sample, scaled by the variances and covariances of the variables, so that it is expressed in standard deviation units. If the distance between the test point and the centre of mass is more than one standard deviation, it is highly probable that the test point does not belong to the set and can be classified as an outlier. The Mahalanobis D² measure indicated that less than 2% of the observations are outliers. A scatter matrix (not shown, but available on request) confirmed that our dataset contains only a limited number of outliers. As a robustness check, we estimated the cluster analysis with and without the outliers. However, the outliers did not affect our results and these observations were therefore not deleted.
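To make the outlier measure concrete, the sketch below computes the squared Mahalanobis distance of each observation from the sample centroid for the two-variable case, where the inverse of the covariance matrix has a simple closed form. The data and the function name are hypothetical, the sample covariance (divisor n − 1) is an assumption, and no cut-off is applied; the threshold used in the paper is not reproduced.

```python
from statistics import mean


def mahalanobis_d2(points):
    """Squared Mahalanobis distance of each 2-D point from the sample centroid."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d' S^{-1} d, using the closed-form inverse of a 2 x 2 matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical observations; the last point lies far from the others.
data = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6), (10, 1)]
d2 = mahalanobis_d2(data)
print([round(v, 2) for v in d2])
print(d2.index(max(d2)))  # position of the most distant observation
```

Because the distance is scaled by the covariance matrix, a point can be flagged for an unusual combination of values even when each value on its own looks unremarkable.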

Multicollinearity can also be a problem in cluster analysis because it distorts the weighting of variables in the different clusters. As a rule of thumb, we required that the correlation between any two variables should not exceed 0.8 (Green 2003). One pair of variables exceeded this threshold: the share of people having access to improved water and the share of people having access to improved sanitation (see Table 11 in the “Appendix”). We therefore dropped the latter variable.Footnote 6
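The 0.8 rule of thumb amounts to screening all pairwise correlations. A minimal sketch, with hypothetical variable names and made-up values rather than the paper's data:

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def collinear_pairs(data, threshold=0.8):
    """Variable pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical country-level values (illustration only).
variables = {
    "water": [40, 55, 70, 85, 99],
    "sanitation": [30, 50, 65, 80, 97],
    "health_spending": [2.0, 5.5, 1.7, 6.1, 3.0],
}
print(collinear_pairs(variables))
```

When a pair is flagged, one of the two variables is dropped so that the shared information is not counted twice in the distance measure.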

The next step is to determine inter-object similarity, which is based on the distance between the objects. As a proxy we used the squared Euclidean distance, which is the square of the length of a straight line drawn between two objects (Hair et al. 1998). A higher value denotes less similarity. Because all variables are measured on a different scale, we first standardized the data by computing for each variable the standard scores (also known as Z scores) by subtracting the mean and dividing by the standard deviation of each variable.
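The standardization and distance steps can be sketched as follows. The rows and function names are hypothetical, and the sample standard deviation (divisor n − 1) is an assumption, since the text does not specify which variant was used.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standard scores: subtract the mean and divide by the standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def standardize(rows):
    """Standardize each column of a list of equal-length rows."""
    columns = [z_scores(column) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]


def squared_euclidean(a, b):
    """Squared length of the straight line between two objects."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical countries measured on two very different scales.
rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
z = standardize(rows)
print(z)                           # [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
print(squared_euclidean(z[0], z[2]))  # 8.0
```

Without standardization the second variable would dominate the distance simply because it is measured in larger units.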

Next, we used Ward’s linkage method to cluster countries (Hair et al. 1998). Instead of joining the two closest clusters, this method joins the two clusters whose merger leads to the smallest increase in the within-cluster sum of squares. An advantage of this method compared to others (like single linkage or complete linkage) is that Ward’s method is not sensitive to small distortions in the data.
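Ward's criterion can be illustrated with a deliberately naive implementation. The sketch below is not the software routine used in the paper; it recomputes all merge costs at every step, which is only practical for small samples, and the points and function name are hypothetical. The cost of merging clusters A and B is the increase in the within-cluster sum of squares, n_A n_B / (n_A + n_B) times the squared distance between their centroids.

```python
def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering with Ward's criterion.

    Returns a sorted list of clusters, each a sorted list of point indices.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Increase in the within-cluster sum of squares if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * dist2

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the cheapest merger.
        _, i, j = min(
            ((merge_cost(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda item: item[0],
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)


# Hypothetical standardized observations forming three tight pairs.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
print(ward_clusters(points, 3))  # [[0, 1], [2, 3], [4, 5]]
```

Recording the merge cost at each step would give the agglomeration schedule that is used when choosing the number of clusters.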

There is no general rule on determining the number of clusters after the hierarchical clustering procedure. However, there are some rules of thumb. One of these rules is based on the so-called agglomeration coefficient. The agglomeration coefficient is the within-cluster sum of squares and measures the differences within a cluster. Joining two very different clusters results in a large agglomeration coefficient (or a large percentage change in the coefficient). One drawback of this method is that it has the tendency to indicate too few clusters (Hair et al. 1998). The agglomeration coefficients in Table 8 indicate that the largest percentage increase occurs if the number of clusters increases from one to two. After seven clusters, the agglomeration coefficient hardly changes.

Table 8 Optimal clusters solution tests

An alternative rule is to compute the Caliński-Harabasz pseudo-F-index or the Duda-Hart pseudo-T-square (Milligan and Cooper 1985). A large pseudo-F-index and a small pseudo-T-square indicate homogeneous clustering. The results in the second part of Table 8 show that the six-cluster solution has the largest Caliński-Harabasz pseudo-F-index (409.56). The smallest pseudo-T-square value is 19.99 for the five-cluster solution, but notice that the pseudo-T-square value for the six-cluster solution is also low (23.78).
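The pseudo-F-index can be computed directly from a cluster assignment as the between-cluster sum of squares divided by g − 1, over the within-cluster sum of squares divided by n − g, for n observations in g clusters. The sketch below uses hypothetical points and labels and is meant only to show the calculation, not to reproduce Table 8.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz pseudo-F index for a given cluster assignment."""
    n = len(points)
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    g = len(groups)
    if g < 2 or g >= n:
        raise ValueError("need at least two clusters and fewer clusters than points")

    def centroid(rows):
        return [sum(column) / len(rows) for column in zip(*rows)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    # Between-cluster sum of squares, weighted by cluster size.
    between = sum(len(rows) * sq_dist(centroid(rows), overall)
                  for rows in groups.values())
    # Within-cluster sum of squares around each cluster centroid.
    within = sum(sq_dist(point, centroid(rows))
                 for rows in groups.values() for point in rows)
    return (between / (g - 1)) / (within / (n - g))


# Hypothetical observations: two well-separated pairs.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(calinski_harabasz(points, [0, 0, 1, 1]))  # 50.0, compact and well separated
print(calinski_harabasz(points, [0, 1, 0, 1]))  # 0.08, poorly separated
```

Comparing the index across candidate solutions, as done in the text, favours the number of clusters at which it peaks.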

A more formal test on the number of clusters is given by the Mojena test statistics (Mojena 1977). Mojena test I assumes that the distances of the agglomeration schedule are normally distributed up to a certain step of the fusion process. At each step it is tested whether the distance increase belongs to the assumed normal distribution. Mojena test II verifies whether the distance in a certain step can be predicted with a regression line that is estimated using the distances from the previous steps. If the distance lies outside the 95% confidence interval, a significant increase in the distance is found and the respective step of the fusion process is used as the optimal number of clusters.

In the present analysis, the two Mojena tests give the same results. According to test statistic I, the significance threshold is exceeded at the step from seven to six clusters, whereas test statistic II suggests an optimal number of six clusters. This solution is in line with the results on the agglomeration coefficient, the Caliński-Harabasz pseudo-F-index, and the Duda-Hart pseudo-T-square. Therefore, we identified six clusters.

The six-cluster solution is also in line with the dendrogram. The dendrogram is a graphical representation of the results of a hierarchical procedure in which each object is arrayed on one axis and the other axis portrays the steps in the hierarchical procedure. The dendrogram shows how the clusters are combined in each step of the procedure until all are contained in a single cluster. (Because the dendrogram was too large to include in the paper, we only summarize it in Table 12 in the “Appendix”; the dendrogram itself is available upon request.) The dendrogram table indicates that the first cluster solution, based on the minimal distance, shows 171 clusters with only one country each, while the second cluster solution indicates that countries can be categorized in six clusters.

Finally, we profiled these six clusters. Table 9 shows the P-value of the F-test that the clusters differ significantly with respect to the health variables (P < 0.05). It is clear that the clusters differ significantly from one another. There are two clusters with poor health, i.e., cluster four and cluster two. In cluster four, on average less than fifty percent of the population has access to improved water facilities and the government is only spending about two percent of (low) GDP on health. Compared to cluster four, cluster two includes countries with a population that has somewhat better access to improved water facilities, a somewhat higher level of government health spending, while the average GDP per capita is about twice as high as GDP per capita in cluster four.

Table 9 Cluster characteristics

Clusters one and six have good and very good health outcomes. In these clusters almost the total population has access to improved water facilities and public health spending is more than five percent of GDP. Finally, the remaining two clusters are intermediate but differ in their health outcomes and income.

Table 9 shows that the clusters not only differ with respect to health, they also have different economic and demographic characteristics. Countries in clusters two and four are mostly countries with a low income, low school enrolment rate, and a high population growth. Countries in clusters one and six are high-income countries with a high school enrolment rate. Also the geographical dimension differs across clusters. African countries are mainly in clusters two and four, while most European countries can be found in clusters one and six. Table 13 in the “Appendix” shows the composition of the clusters.

5 Discussion

On the basis of factor analysis and cluster analysis, this paper tried to offer a better view on cross-country differences in health. Because health is not directly observable and there are many different health indicators available, we used factor analysis to examine the dimensions of health and to come up with better measures for health. Because rankings of countries based on these measures (or any other indicator) do not give information about distances between countries, we focused upon the difference between a country’s health vis-à-vis the sample mean.

However, like any study, the present study has weaknesses. The main weakness is the availability of the data. One limitation of studies on cross-country differences in health is the limited availability of indicators for a long-term period. Even though we included twenty-seven indicators of the health of individuals and ten indicators of health services, this may not suffice to fully capture the concept of health. Unfortunately, other indicators are only available for a small number of (mostly industrialized) countries or are not constructed in a consistent way. Due to this limitation, it is possible that when more indicators become available for a larger set of countries and longer periods, our two measures of health may turn out to be multi-dimensional instead of one-dimensional. In other words, different data could lead to different results and conclusions.

Furthermore, we aggregated the micro level health data to the macro level. Therefore, we cannot take into account the individual (respondent) differences in our cluster analysis. We can only relate the (macro) health outcomes to country averages.

Another problem in research on cross-country health differences is the quality of the data, especially for developing countries. Some variables for these countries show large and unrealistic swings and gaps. In addition, the dispersion of the data within a country cannot be addressed in this study because we focus on country-level data.

The final weakness is that our two one-dimensional health measures explain on average only between 60 and 70% of the total variance. This means that about one-third of the variance remains unexplained. However, extracting more factors did not give us more insights and made the results harder to interpret.

6 Conclusions

One of the major problems in the economic and social science literature is the measurement of latent constructs. This certainly holds true for cross-country analyses of health. Most previous studies that ranked countries on the basis of their health status used arbitrarily chosen indicators of the health status of a country (e.g., life expectancy or the mortality rate), thereby implicitly assuming that health is a one-dimensional concept. Furthermore, most indicators of health contain some measurement error, which may lead to biased estimates. To come up with better measures for health and to determine whether health has a multidimensional character, a so-called Exploratory Factor Analysis (EFA) was employed on various national health indicators for 171 countries over the period 2000–2005. We used the outcomes of the factor analysis to construct two new national health measures. The first one refers to the health of individuals and the second captures health services.

Our new health measures differ substantially from those reported in earlier studies ranking countries on the basis of their health status. As rankings are not that informative without further information, we focused upon the difference between a country’s health vis-à-vis the sample mean. We found that the cross-country variance of our measure for health services is much higher than that of our measure for the health of individuals.

Furthermore, we found that health depends mostly on geography and development. The dispersion of the two health measures within OECD countries is much lower than in the full sample of countries. This strengthens our conclusion that both measures capture different dimensions of health and that cross-country comparisons of health should not be based on only one (arbitrarily chosen) variable.

Further analysis showed that there are six clusters of countries, ranging from countries with very good health to very bad health. The clusters not only differ with respect to health, they also have different economic and demographic characteristics.