Abstract
Principal component analysis is one of the most popular machine learning and data mining techniques. With its origins in statistics, it is used in numerous applications. However, there appears to have been little systematic testing and assessment of principal component analysis for cases with erroneous and incomplete data. The purpose of this article is to propose multiple robust approaches for carrying out principal component analysis and, in particular, for estimating the relative importance of the principal components in explaining the data variability. Computational experiments first focus on carefully designed simulated tests in which the ground truth is known and can be used to assess the accuracy of the different methods. In addition, a practical application and evaluation of the methods on an educational data set is given.
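To make "relative importance of the principal components" concrete, the following is a minimal sketch of one way to compute it for data with missing values. It is not the authors' method: it uses a classical, non-robust covariance estimated from pairwise available values, and the function name `explained_variance_ratios` and the simulated data are assumptions made purely for illustration.

```python
import numpy as np


def explained_variance_ratios(X):
    """Share of total variance per principal component, NaNs allowed.

    Illustrative baseline only: a classical (non-robust) covariance
    estimated from pairwise available values.
    """
    X = np.asarray(X, dtype=float)
    # Center each column with the mean of its observed entries.
    centered = X - np.nanmean(X, axis=0)
    observed = ~np.isnan(centered)
    filled = np.where(observed, centered, 0.0)
    # Number of rows in which both variables of a pair are observed.
    counts = observed.T.astype(float) @ observed.astype(float)
    cov = (filled.T @ filled) / np.maximum(counts - 1.0, 1.0)
    # A pairwise covariance need not be positive semidefinite,
    # so negative eigenvalues are clipped to zero.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)[::-1]
    return eigvals / eigvals.sum()


# Simulated data with known column scales and about 10% missing entries.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.3])
X[rng.random(X.shape) < 0.1] = np.nan
ratios = explained_variance_ratios(X)
```

Because the column scales are known in the simulation, the estimated ratios can be compared against the ground truth, which mirrors the kind of simulated test the abstract describes. A robust variant would replace the mean and covariance with robust counterparts.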
Notes
1. Available at http://www.oecd.org/pisa/pisaproducts/.
Acknowledgments
The authors would like to thank Professor Tuomo Rossi for many helpful discussions on the contents of the paper.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Kärkkäinen, T., Saarela, M. (2015). Robust Principal Component Analysis of Data with Missing Values. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_10
DOI: https://doi.org/10.1007/978-3-319-21024-7_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21023-0
Online ISBN: 978-3-319-21024-7
eBook Packages: Computer Science, Computer Science (R0)