Abstract
Composite indices used in social science research often rely on principal components analysis (PCA) as a way to derive weights for component variables, which emphasizes the largest variations in the variables in a composite index. However, PCA may not work when the informative variations account for only a small share of the variance in the variables; also, the best weighting scheme may also depend on the use of a particular composite index. We consider partial least squares (PLS) as an alternative weighting scheme, which takes advantage of the relationship between outcome variables of interest and the variables in a composite index. In this paper, the Social Institutions and Gender Index (SIGI), a composite index produced by the OECD, is re-constructed using weights generated by PCA and PLS. Using the revised SIGIs and female education, fertility, child mortality, and corruption as outcome variables, we investigate the relationship between social institutions related to gender inequality and these development outcomes, controlling for relevant other determinants. We find that gender inequality in social institutions has a significant correlation with fertility and corruption regardless of the weighting procedure, while for female education and child mortality only the SIGIs based on PLS show significant results. Additionally, PLS brings benefits in terms of prediction compared to PCA for female education and child mortality. In our analysis of corruption, we consider not only the Corruption Perception Index (CPI) as our measure of corruption, but also create new reweighted CPIs again using PLS and PCA as weighting procedures. The CPI based on PCA shows a significant correlation with gender inequality, while the correlation is only marginally significant when using the PLS.
Similar content being viewed by others
Notes
We decided to use slightly different reference years than in Sect. 3, since a new standardization scheme is introduced in 2002 for the CPI data, which might undermine the comparability of the scalings.
References
Alesina, A., Devleeschauwer, A., Easterly, W., Kurlat, S., & Wacziarg, R. (2003). Fractionalization. Journal of Economic Growth, 8(2), 155–194.
Branisa, B., Klasen, S., & Ziegler, M. (2013). Gender inequality in social institutions and gendered development outcomes. World Development, 45, 252–268.
Branisa, B., & Ziegler, M. (2011). Reexamining the link between gender and corruption: The role of social institutions. In Proceedings of the German development economics conference, Berlin (Vol. 15).
Correlates of War 2 Project. (2003). Colonial/dependency contiguity data, v3.0. http://correlatesofwar.org/.
de Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory System, 18, 251–263.
Dreher, A. (2006). Does globalization affect growth? Evidence from a new index of globalization. Applied Economics, 38(10), 1091–1110.
Filmer, D., & Pritchett, L. H. (2001). Estimating wealth effects without expenditure data-or tears: An application to educational enrollments in states of India. Demography, 38(1), 115–132.
Freedom House. (2008). Freedom in the world 2008. http://www.freedomhouse.org.
Greenacre, M. (2010). Correspondence analysis in practice. Boca Raton: Chapman and Hall/CRC.
Helland, I. S. (1990). Partial least squares regression and statistical models. Scandinavian Journal of Statistics, 17(2), 97–114.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417–441.
Kolenikov, S., & Angeles, G. (2009). Socioeconomic status measurement with discrete proxy variables: Is principal component analysis a reliable answer? Review of Income and Wealth, 55(1), 128–165.
Krämer, N., & Sugiyama, M. (2011). The degrees of freedom of partial least squares regression. Journal of the American Statistical Association, 106(494), 697–705.
Maitra, S., & Yan, J. (2008). Principle component analysis and partial least squares: Two dimension reduction techniques for regression. Applying Multivariate Statistical Models, 79, 79–90.
Marshall, M. G. (2013). Polity IV project: Political regime characteristics and transitions, 1800–2012. http://www.systemicpeace.org/polity/polity4.htm.
Martens, H., & Martens, M. (2000). Modified Jack-knife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR). Food quality and preference, 11(1), 5–16.
Meulman, J. (2000). Optimal scaling methods for multivariate categorical data analysis (p. 12). Leiden: Leiden University.
Mevik, B.-H., & Cederkvist, H. R. (2004). Mean squared error of prediction (MSEP) estimates for principal component regression (PCR) and partial least squares regression (PLSR). Journal of Chemometrics, 18(9), 422–429.
Naes, T., & Martens, H. (1985). Comparison of prediction methods for multicollinear data. Communications in Statistics-Simulation and Computation, 14(3), 545–576.
Niitsuma, H., & Okada, T. (2005). Covariance and PCA for categorical variables. In T. B. Ho, D. Cheung, & H. Liu (Eds.), Advances in knowledge discovery and data mining, PAKDD 2005. Lecture notes in computer science (Vol. 3518). Berlin: Springer.
Puwakkatiya-Kankanamage, E. H., García-Muñoz, S., Biegler, L. T. (2014). An optimization-based undeflated PLS (OUPLS) method to handle missing data in the training set. Journal of Chemometrics, 28(7), 575–584.
Russolillo, G. (2009). Partial least squares methods for non-metric data. Ph.D. thesis, Università degli Studi di Napoli Federico II.
Rutstein, S. O., & Johnson, K. (2004). The DHS wealth index. ORC Macro, Measure DHS.
Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8(1), 3–15.
Sen, A. (1999). Development as freedom. Oxford: Oxford University Press.
Tenenhaus, M., & Young, F. W. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50(1), 91–119.
Transparency International. (2013). Corruption Perception Index. http://www.transparency.org/.
United Nations Development Programme. (1995). Human development report. New York: Oxford University Press.
Wold, H. (1966a). Estimation of principal components and related models by iterative least squares. In P. Krishnaiah (Ed.), Multiuariate analysis (pp. 391–420). New York: Academic Press.
Wold, H. (1966b). Nonlinear estimation by iterative least squares procedures. Research papers in statistics. New York: Wiley.
Wold, S., Martens, H., & Wold, H. (1983). The multivariate calibration problem in chemistry solved by the PLS method. Lecture Notes in Mathematics, 973, 286–293.
World Bank. (2008). World development indicators. http://data.worldbank.org/data-catalog/world-development-indicators.
World Bank. (2009). GenderStats. http://datatopics.worldbank.org/gender/.
Yoon, J., & Krivobokova, T. (2015). Treatments of non-metric variables in partial least squares and principal component analysis. In Courant Research Centre: Poverty, equity and growth−Discussion Papers, 172. http://www.uni-goettingen.de/en/67061.html.
Zwick, W. R., & Velicer, W. F. (1986). Comparison of five rules for determining the number of components to retain. Psychological Bulletin, 99(3), 432.
Author information
Authors and Affiliations
Corresponding author
Appendices
Appendix 1: Sample Selection of the Regressands
Figure 2 shows the estimated density of the kept and dropped observations of the regressands using kernel density estimation. We observe that the dropped and kept observations of child mortality and the CPI have nearly identical distribution. On the other hand, we observe slight differences for female education and fertility. However, we suspected that the differences could be because of the randomness in the non-parametric estimation, which is expected to be high considering small number of the dropped observations (33, 27, 27 and 39). Therefore, we performed the Welch Two Sample t-test, which didn’t deny the null hypothesis that the dropped and the kept observations have the same mean. The p values are 0.43, 0.27, 0.82 and 0.64 for female education, fertility, child mortality and the CPI respectively.
Appendix 2: Weights and Coefficients from the Fertility and CPI Regressions
Appendix 3: Reference Years of the Variables
See Table 10.
Appendix 4: Sensitivity Analysis for the CPI Variable Selection
This section provides a sensitivity analysis in terms of the variables included in the CPI. Variables containing more than 60% of missing observations were dropped in the main analysis in Sect. 4. The threshold was selected based on a subjective judgement of the authors, which is not free from errors. Therefore, the effects of different threshold choices on the analysis are investigated, wherein variables containing more than 20, 40, 60 and 80% of missing observations were dropped.
Table 11 shows the results of the linear regressions of the CPIs with different thresholds on the SIGIs along with other covariates. More gender inequality associates with more corruption for all regressions. The PCRs find statistically significant association between gender inequality and corruption for all thresholds, but the PLSRs find only marginally significant associations for the thresholds 40 and 60%. Note that the coefficients, R-Squares and estimated MSEPs are not comparable across columns, since outcome variables include different variables and have different weights. Table 12 show the coefficients in terms of the variables in the SIGI. The coefficients of the PCRs are generally similar to each other, while the PLSRs show large differences across columns. Some PLSR coefficients change the signs with different thresholds, for example Son preference 4 with threshold 40 and 60%. Statistical significance do not change much across columns both for the PLSRs and the PCRs. Table 13 shows the weights. The PCA weights do not change across columns, while the PLS weights show large differences with respect to varying thresholds in analogy to the findings on the PLSR coefficients. The weights for the CPIs are reported in Table 14. Note that variables containing more than 80% of missing values are not reported in the table. The CPIs include less variables with stricter thresholds, while the number of variables range from 9 to 18. The weights from the main analysis (60% threshold) shows small differences to the weights from the other thresholds. However, variables with positive weights drop out more with stricter thresholds, so that the majority of the PCA weights with the 20 and 40% thresholds and the PLS weights with the 20% threshold are negative. Considering that CU1999 with positive weights dominate over other variables for these cases, the weights are not rotated. But the negative weighted variables work against the interpretation of the CPI that high value represents low corruption. This problem is not too serious with less strict thresholds, as shown in the main analysis.
In conclusion, using different thresholds do not result in major changes to the findings in Sect. 4. Higher gender inequality associates with more corruption, whereby the association is significant based on the PCRs with different thresholds, and marginally or not significantly based on the PLSRs. Nevertheless, the PLS weights, the PLSR coefficients in terms of the variables in the SIGI, the CPI weights and the included variables in the CPIs do change notably with different thresholds.
Rights and permissions
About this article
Cite this article
Yoon, J., Klasen, S. An Application of Partial Least Squares to the Construction of the Social Institutions and Gender Index (SIGI) and the Corruption Perception Index (CPI). Soc Indic Res 138, 61–88 (2018). https://doi.org/10.1007/s11205-017-1655-8
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11205-017-1655-8