
Abstract

Dimensionality reduction is often a crucial step for the successful application of machine learning and data mining methods. One way to achieve said reduction is feature selection. Due to the impossibility of labelling many data sets, unsupervised approaches are frequently the only option. The column subset selection problem translates naturally to this purpose and has received considerable attention over the last few years, as it provides simple linear models for low-rank data reconstruction. Recently, it was empirically shown that an iterative algorithm, which can be implemented efficiently, provides better subsets than other state-of-the-art methods. In this paper, we describe this algorithm and provide a more in-depth analysis. We carry out numerous experiments to gain insights on its behaviour and derive a simple bound for the norm recovered by the resulting matrix. To the best of our knowledge, this is the first theoretical result of this kind for this algorithm.
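To make the objective concrete, here is a minimal NumPy sketch of the column subset selection criterion: choose k columns C of a data matrix A so that the residual ||A - C C^+ A||_F is small, where C^+ is the pseudoinverse of C. The function names `cssp_error` and `iterative_swap` are hypothetical, and the swap-based local search is only a naive illustration of what an iterative refinement of a column subset could look like. It is not the algorithm analysed in the paper and it makes no attempt at efficiency.

```python
import numpy as np


def cssp_error(A, cols):
    """Frobenius-norm residual ||A - C C^+ A||_F for C = A[:, cols]."""
    C = A[:, cols]
    # Least-squares coefficients X minimising ||C X - A||_F, i.e. X = C^+ A.
    X, *_ = np.linalg.lstsq(C, A, rcond=None)
    return np.linalg.norm(A - C @ X, "fro")


def iterative_swap(A, k, sweeps=5, seed=0):
    """Naive local search (illustrative assumption, not the paper's method).

    Start from k random columns; for each slot, try every unused column as
    a replacement and keep a swap only if it lowers the residual.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    cols = [int(c) for c in rng.choice(n, size=k, replace=False)]
    best = cssp_error(A, cols)
    for _ in range(sweeps):
        improved = False
        for slot in range(k):
            for j in range(n):
                if j in cols:
                    continue
                trial = cols.copy()
                trial[slot] = j
                err = cssp_error(A, trial)
                if err < best - 1e-12:
                    cols, best, improved = trial, err, True
        if not improved:
            break  # no single swap helps: a local optimum for this search
    return sorted(cols), best
```

For example, calling `iterative_swap(A, 3)` on a matrix of exact rank 3 should return three columns whose residual is numerically zero, since any three linearly independent columns then span the whole column space.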



Notes

  1. http://www.afarahat.com/code.

  2. https://github.com/brunez/IterFS.

  3. http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php.

  4. http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.

  5. http://www.sheffield.ac.uk/eee/research/iel/research/face.

  6. http://www.cs.nyu.edu/~roweis/data.html.

  7. http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html.

  8. www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/PublicDatasets.

  9. https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity.

  10. https://archive.ics.uci.edu/ml/datasets/BlogFeedback.

  11. https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD.

  12. http://vision.ucsd.edu/content/yale-face-database.


Acknowledgements

We would like to thank José Ramón Sánchez Couso for the valuable discussions he agreed to hold on the theoretical analysis. The research leading to these results has received funding from the European Union under the FP7 Grant Agreement No. 619633 (project ONTIC) and H2020 Grant Agreement No. 671625 (project CogNet).


Corresponding author

Correspondence to Bruno Ordozgoiti.


Cite this article

Ordozgoiti, B., Canaval, S.G. & Mozo, A. Iterative column subset selection. Knowl Inf Syst 54, 65–94 (2018). https://doi.org/10.1007/s10115-017-1115-4


  • DOI: https://doi.org/10.1007/s10115-017-1115-4
