Abstract
We propose sparse variants of correspondence analysis (CA) for large contingency tables such as the document-term matrices used in text mining. By seeking many zero coefficients, sparse CA alleviates the difficulty of interpreting CA results when the table is large. Since CA is a doubly weighted PCA (for rows and columns) or a weighted generalized SVD, we adapt known sparse versions of these methods, with specific developments to obtain orthogonal solutions and to tune the sparseness parameters. We distinguish two cases depending on whether sparseness is required for both rows and columns or for only one of the two sets.
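To make the link between CA and a sparse SVD concrete, the following is a minimal sketch, not the authors' algorithm. It assumes the usual formulation of CA as an SVD of the matrix of standardized residuals, and it obtains a single sparse pair of singular vectors by alternating soft-thresholded power iterations in the spirit of a penalized matrix decomposition. The function names, the toy table `N`, and the thresholds `lam_u` and `lam_v` are all hypothetical; orthogonality between successive components and the tuning of the sparseness parameters are not handled here.

```python
import numpy as np

def standardized_residuals(N):
    """Return S = D_r^{-1/2} (P - r c^T) D_c^{-1/2} with row and column masses."""
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return S, r, c

def soft_threshold(x, lam):
    """Shrink each entry toward zero by lam; entries below lam become 0."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_rank_one(S, lam_u=0.0, lam_v=0.0, n_iter=100):
    """Rank-one sparse approximation of S (illustrative assumption only)."""
    U, _, Vt = np.linalg.svd(S, full_matrices=False)
    u, v = U[:, 0], Vt[0]                # start from the ordinary SVD
    for _ in range(n_iter):
        u_new = soft_threshold(S @ v, lam_u)
        if np.linalg.norm(u_new) == 0:   # threshold too large: keep last u
            break
        u = u_new / np.linalg.norm(u_new)
        v_new = soft_threshold(S.T @ u, lam_v)
        if np.linalg.norm(v_new) == 0:   # threshold too large: keep last v
            break
        v = v_new / np.linalg.norm(v_new)
    d = u @ S @ v                        # pseudo-singular value
    return u, d, v

# Toy 5 x 4 contingency table (made-up counts, for illustration only).
N = np.array([[20, 2, 1, 0],
              [18, 3, 0, 1],
              [1, 1, 15, 12],
              [0, 2, 14, 16],
              [5, 5, 5, 5]], dtype=float)

S, r, c = standardized_residuals(N)
u0, d0, v0 = sparse_rank_one(S)                          # no penalty: ordinary CA axis
u1, d1, v1 = sparse_rank_one(S, lam_u=0.05, lam_v=0.05)  # penalized version
print(np.round(v0, 3), np.round(v1, 3))
```

With both thresholds set to zero the sketch reduces to the first axis of ordinary CA; increasing them trades explained inertia for more zero coefficients. Setting only one of the two thresholds to a positive value corresponds to asking for sparseness on one set only.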
Notes
There are 45 presidents, but the speech data of presidents William Henry Harrison and James Garfield are missing.
Appendix: The 45 presidents of the United States
1. George Washington | 16. Abraham Lincoln | 31. Herbert Hoover
2. John Adams | 17. Andrew Johnson | 32. Franklin D. Roosevelt
3. Thomas Jefferson | 18. Ulysses S. Grant | 33. Harry S. Truman
4. James Madison | 19. Rutherford B. Hayes | 34. Dwight D. Eisenhower
5. James Monroe | 20. James A. Garfield | 35. John F. Kennedy
6. John Quincy Adams | 21. Chester A. Arthur | 36. Lyndon B. Johnson
7. Andrew Jackson | 22. Grover Cleveland | 37. Richard Nixon
8. Martin Van Buren | 23. Benjamin Harrison | 38. Gerald R. Ford
9. William H. Harrison | 24. Grover Cleveland | 39. Jimmy Carter
10. John Tyler | 25. William McKinley | 40. Ronald Reagan
11. James Knox Polk | 26. Theodore Roosevelt | 41. George H.W. Bush
12. Zachary Taylor | 27. William H. Taft | 42. William J. Clinton
13. Millard Fillmore | 28. Woodrow Wilson | 43. George W. Bush
14. Franklin Pierce | 29. Warren Harding | 44. Barack Obama
15. James Buchanan | 30. Calvin Coolidge | 45. Donald Trump
About this article
Cite this article
Liu, R., Niang, N., Saporta, G. et al. Sparse correspondence analysis for large contingency tables. Adv Data Anal Classif 17, 1037–1056 (2023). https://doi.org/10.1007/s11634-022-00531-5
DOI: https://doi.org/10.1007/s11634-022-00531-5
Keywords
- Contingency tables
- High-dimensional data
- Correspondence analysis
- Sparsity
- Textual data
- Penalized matrix decomposition