Greedy column subset selection for large-scale data sets

  • Regular Paper
  • Knowledge and Information Systems

Abstract

In today’s information systems, the availability of massive amounts of data necessitates fast and accurate algorithms that summarize these data and represent them in a succinct format. One crucial problem in big data analytics is the selection of representative instances from large and massively distributed data, formally known as the Column Subset Selection problem. Solving this problem enables data analysts to gain insight into the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. This paper presents a fast and accurate greedy algorithm for large-scale column subset selection. The algorithm minimizes an objective function that measures the reconstruction error of the data matrix based on the subset of selected columns. The paper first presents a centralized greedy algorithm for column subset selection, which relies on a novel recursive formula for calculating the reconstruction error of the data matrix. The paper then presents a MapReduce algorithm that selects a few representative columns from a matrix whose columns are massively distributed across several commodity machines. The algorithm first learns a concise representation of all columns using random projection, and each machine then solves a generalized column subset selection problem in which a subset of columns is selected from the local sub-matrix such that the reconstruction error of the concise representation is minimized. The paper demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets.
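To make the greedy objective concrete, the following is a minimal NumPy sketch of the selection loop; the function name and the naive recomputation at every step are ours, not the paper's. Each iteration picks the column whose addition most reduces the Frobenius reconstruction error and then deflates the residual:

```python
import numpy as np

def greedy_css(A, k, eps=1e-12):
    """Greedy column subset selection: a naive illustrative sketch.

    Maintains the residual E = A - P_S A, where P_S is the projection
    onto the span of the columns selected so far.  Adding column i
    shrinks ||A - P_S A||_F^2 by ||E^T E[:, i]||^2 / ||E[:, i]||^2, so
    each step greedily picks the column with the largest reduction and
    then deflates the residual.
    """
    E = np.array(A, dtype=float)
    selected = []
    for _ in range(k):
        col_norms = np.einsum('ij,ij->j', E, E)       # ||E[:, i]||^2 for all i
        cross = E.T @ E                               # cross[p, i] = E[:, p] . E[:, i]
        gains = (cross ** 2).sum(axis=0) / np.maximum(col_norms, eps)
        gains[selected] = -np.inf                     # never reselect a column
        i = int(np.argmax(gains))
        selected.append(i)
        u = E[:, i] / np.sqrt(col_norms[i] + eps)     # unit vector along the new column
        E -= np.outer(u, u @ E)                       # project that direction out
    return selected
```

Recomputing \(E^TE\) from scratch makes each iteration cost \(O(mn^2)\) for an \(m \times n\) matrix; the recursive formula presented in the paper updates these quantities incrementally, which is what makes the greedy criterion practical at scale.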


Notes

  1. \(\Vert A \Vert _{F}^{2} = \mathrm{trace}\left( A^TA \right) \).

  2. The in-memory summation can also be replaced by a MapReduce combiner [15]; the sketch following these notes illustrates this summation pattern.

  3. The data sets Reuters-21578, MNIST-4K, PIE-20 and YaleB-38 are available in MAT format at: http://www.cad.zju.edu.cn/home/dengcai/Data/data.html. PIE-20 is a subset of PIE-32x32 with the images of the first 20 persons.

  4. http://trec.nist.gov.

  5. http://yann.lecun.com/exdb/mnist.

  6. The CSS algorithm of Boutsidis et al. [3] was not included in the comparison as its implementation is not available.

  7. Revision: 5.13.4.7.

  8. In the implemented code, the efficient recursive formulas in Section 4 of [31] are used to update the QR decomposition and to evaluate the swapping criterion.

  9. In [4] (a newer version of [6]), Boutsidis et al. suggested using the SRRQR algorithm [31, Algorithm 4] for the deterministic phase. Although the SRRQR algorithm achieves the theoretical guarantee presented in [6], the MATLAB qr function is used in the conducted experiments, as it is much faster and achieves comparable accuracy on the evaluated data sets.

  10. For the MNIST-4K data set, the range of \(l/n\) values is smaller since the rank of the matrix is very low (i.e., less than the number of pixels).

  11. The qr and SRRQR methods both depend on the MATLAB qr function. For the document data sets, the MATLAB qr function takes much longer than the other methods, so their run times are not reported in the figures.

    Fig. 1 The relative accuracy measures and run times of different column-based low-rank approximations \(\tilde{A}_{\mathcal {S}}\) for the Reuters-21578, Reviews and LA1 data sets

    Fig. 2 The relative accuracy measures and run times of different column-based low-rank approximations \(\tilde{A}_{\mathcal {S}}\) for the MNIST-4K, PIE-20 and YaleB-38 data sets

  12. The MATLAB functions anova1 and multcompare were used.

    Table 2 The relative accuracy of the best performing CSS methods for the Reuters-21578, Reviews and LA1 data sets
    Table 3 The relative accuracy of the best performing CSS methods for the MNIST-4K, PIE-20 and YaleB-38 data sets
  13. Amazon Elastic Compute Cloud (EC2): http://aws.amazon.com/ec2.

  14. Mahout is an Apache project that implements machine learning algorithms on Hadoop. See http://mahout.apache.org/.
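To illustrate the distributed summary described in the abstract and the summation pattern mentioned in note 2, here is a minimal sequential simulation of computing the concise representation; the function and variable names are ours, and Gaussian entries are used for simplicity, though sparser projections (e.g., [1, 46]) also apply:

```python
import numpy as np

def concise_representation(column_blocks, c, seed=0):
    """Random-projection summary B = A @ Omega for a column-distributed A.

    Machine j holds a block A_j of columns; Omega is a random matrix
    with one row block Omega_j per machine, so B = sum_j A_j @ Omega_j.
    The local products correspond to the map phase, and the summation
    to the reduce phase (or an in-memory summation / combiner, as in
    note 2).  A seeded generator stands in for the shared randomness
    the machines would have to agree on.
    """
    rng = np.random.default_rng(seed)
    B = None
    for A_j in column_blocks:                  # "map": one local product per machine
        Omega_j = rng.standard_normal((A_j.shape[1], c))
        part = A_j @ Omega_j
        B = part if B is None else B + part    # "reduce": sum the partial results
    return B
```

Each machine can then run a generalized variant of the greedy loop sketched after the abstract, scoring its local columns by how much of the residual of \(B\) (rather than of \(A\)) they explain, so that the selected columns minimize the reconstruction error of the concise representation.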

References

  1. Achlioptas D (2003) Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J Comput Syst Sci 66(4):671–687

  2. Bischof C, Quintana-Ortí G (1998) Computing rank-revealing QR factorizations of dense matrices. ACM Trans Math Softw 24(2):226–253

  3. Boutsidis C, Drineas P, Magdon-Ismail M (2011) Near optimal column-based matrix reconstruction. In: Proceedings of the 52nd annual IEEE symposium on foundations of computer science (FOCS’11), pp 305–314

  4. Boutsidis C, Mahoney MW, Drineas P (2008a) An improved approximation algorithm for the column subset selection problem, CoRR abs/0812.4293

  5. Boutsidis C, Mahoney MW, Drineas P (2008b) Unsupervised feature selection for principal components analysis. In: Li Y, Liu B, Sarawagi S (eds) Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’08). ACM, New York, pp 61–69

  6. Boutsidis C, Mahoney MW, Drineas P (2009) An improved approximation algorithm for the column subset selection problem. In: Proceedings of the 20th annual ACM-SIAM symposium on discrete algorithms (SODA’09), pp 968–977

  7. Boutsidis C, Sun J, Anerousis N (2008) Clustered subset selection and its applications on it service metrics. In: Proceedings of the 17th ACM conference on information and knowledge management (CIKM’08), pp 599–608

  8. Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’10). ACM, New York, NY, pp 333–342

  9. Çivril A, Magdon-Ismail M (2008) Deterministic sparse column based matrix reconstruction via greedy approximation of SVD. In: Proceedings of the 19th international symposium on algorithms and computation (ISAAC’08). Springer, New York, pp 414–423

  10. Çivril A, Magdon-Ismail M (2012) Column subset selection via sparse approximation of SVD. Theoret Comput Sci 421:1–14

  11. Chan T (1987) Rank revealing QR factorizations. Linear Algebra Appl 88:67–82

  12. Chen W-Y, Song Y, Bai H, Lin C-J, Chang E (2011) Parallel spectral clustering in distributed systems. IEEE Trans Pattern Anal Mach Intell 33(3):568–586

  13. Cui Y, Dy J (2008) Orthogonal principal feature selection. In: The sparse optimization and variable selection workshop at the international conference on machine learning (ICML)

  14. Dasgupta S, Gupta A (2003) An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct Algorithms 22(1):60–65

  15. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

  16. Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inform Sci Technol 41(6):391–407

  17. Deshpande A, Rademacher L (2010) Efficient volume sampling for row/column subset selection. In: Proceedings of the 51st annual IEEE symposium on foundations of computer science (FOCS’10), pp 329–338

  18. Deshpande A, Rademacher L, Vempala S, Wang G (2006a) Matrix approximation and projective clustering via volume sampling. Theory Comput 2(1):225–247

  19. Deshpande A, Rademacher L, Vempala S, Wang G (2006b) Matrix approximation and projective clustering via volume sampling. In: Proceedings of the 17th annual ACM-SIAM symposium on discrete algorithms (SODA’06). ACM, New York, NY, pp 1117–1126

  20. Drineas P, Frieze A, Kannan R, Vempala S, Vinay V (2004) Clustering large graphs via the singular value decomposition. Mach Learn 56(1–3):9–33

  21. Drineas P, Kannan R, Mahoney M (2007) Fast Monte Carlo algorithms for matrices II: computing a low-rank approximation to a matrix. SIAM J Comput 36(1):158–183

  22. Drineas P, Mahoney M, Muthukrishnan S (2006) Subspace sampling and relative-error matrix approximation: column-based methods. Approximation, randomization, and combinatorial optimization. Algorithms and techniques. Springer, Berlin, pp 316–326

  23. Elgohary A, Farahat AK, Kamel MS, Karray F (2013) Embed and conquer: scalable embeddings for kernel k-means on MapReduce, CoRR abs/1311.2334

  24. Elsayed T, Lin J, Oard DW (2008) Pairwise document similarity in large collections with MapReduce. In: Proceedings of the 46th annual meeting of the association for computational linguistics on human language technologies: short papers (HLT’08), pp 265–268

  25. Ene A, Im S, Moseley B (2011) Fast clustering using MapReduce. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’11), pp 681–689

  26. Farahat A, Elgohary A, Ghodsi A, Kamel M (2013) Distributed column subset selection on MapReduce. In: Proceedings of the 13th IEEE international conference on data mining (ICDM’13), pp 171–180

  27. Farahat AK, Ghodsi A, Kamel MS (2011) An efficient greedy method for unsupervised feature selection. In: Proceedings of the 11th IEEE international conference on data mining (ICDM’11), pp 161–170

  28. Farahat AK, Ghodsi A, Kamel MS (2013) Efficient greedy feature selection for unsupervised learning. Knowl Inf Syst 35(2):285–310

  29. Frieze A, Kannan R, Vempala S (1998) Fast Monte-Carlo algorithms for finding low-rank approximations. In: Proceedings of the 39th annual IEEE symposium on foundations of computer science (FOCS’98), pp 370–378

  30. Golub G, Van Loan C (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore

  31. Gu M, Eisenstat SC (1996) Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J Sci Comput 17(4):848–869

  32. Guruswami V, Sinop AK (2012) Optimal column-based low-rank matrix reconstruction. In: Proceedings of the 23rd annual ACM-SIAM symposium on discrete algorithms (SODA’12), pp 1207–1214

  33. Halko N, Martinsson P-G, Shkolnisky Y, Tygert M (2011) An algorithm for the principal component analysis of large data sets. SIAM J Sci Comput 33(5):2580–2594

  34. He X, Cai D, Niyogi P (2005) Laplacian score for feature selection. In: Advances in neural information processing systems 18 (NIPS’05). MIT Press, Cambridge, MA

  35. He X, Yan S, Hu Y, Niyogi P, Zhang H (2005) Face recognition using Laplacianfaces. IEEE Trans Pattern Anal Mach Intell 27(3):328–340

  36. Hogg RV, Ledolter J (1987) Engineering statistics, vol 358. MacMillan, New York

  37. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall Inc, Upper Saddle River, NJ

  38. Jolliffe I (2002) Principal component analysis, 2nd edn. Springer, New York

  39. Kang U, Tsourakakis C, Appel A, Faloutsos C, Leskovec J (2008) HADI: fast diameter estimation and mining in massive graphs with Hadoop. Technical report CMU-ML-08-117

  40. Karloff H, Suri S, Vassilvitskii S (2010) A model of computation for MapReduce. In: Proceedings of the 21st annual ACM-SIAM symposium on discrete algorithms (SODA’10), pp 938–948

  41. Karypis G (2003) CLUTO—a clustering toolkit. Technical report #02-017, Department of Computer Science, University of Minnesota

  42. Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. Technical report, Department of Mathematics and Informatics, Technische Hogeschool, Delft, The Netherlands

  43. Lee K, Ho J, Kriegman D (2005) Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans Pattern Anal Mach Intell 27(5):684–698

  44. Lewis D (1999) Reuters-21578 text categorization test collection distribution 1.0

  45. Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397

  46. Li P, Hastie TJ, Church KW (2006) Very sparse random projections. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’06), pp 287–296

  47. Lu Y, Cohen I, Zhou X, Tian Q (2007) Feature selection using principal feature analysis. In: Proceedings of the 15th international conference on multimedia. ACM, New York, NY, pp 301–304

  48. Lütkepohl H (1996) Handbook of matrices. Wiley, New Jersey

  49. Masaeli M, Yan Y, Cui Y, Fung, G, Dy J (2010) Convex principal feature selection. In: Proceedings of SIAM international conference on data mining (SDM), pp 619–628

  50. Meng X, Mahoney M (2013) Robust regression on MapReduce. In: Proceedings of the 30th international conference on machine learning (ICML’13), pp 888–896

  51. Mitra P, Murthy C, Pal S (2002) Unsupervised feature selection using feature similarity. IEEE Trans Pattern Anal Mach Intell 24(3):301–312

  52. Pan C (2000) On the existence and computation of rank-revealing LU factorizations. Linear Algebra Appl 316(1):199–222

  53. Sim T, Baker S, Bsat M (2003) The CMU pose, illumination, and expression database. IEEE Trans Pattern Anal Mach Intell 25(12):1615–1618

  54. Singh S, Kubica J, Larsen S, Sorokina D (2009) Parallel large scale feature selection for logistic regression. In: Proceedings of the SIAM international conference on data mining (SDM), pp 1171–1182

  55. Torralba A, Fergus R, Freeman W (2008) 80 Million tiny images: a large data set for nonparametric object and scene recognition. IEEE Trans Pattern Anal Mach Intell 30(11):1958–1970

  56. White T (2009) Hadoop: the definitive guide, 1st edn. O’Reilly Media Inc, Sebastopol

  57. Wolf L, Shashua A (2005) Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight-based approach. J Mach Learn Res 6:1855–1887

  58. Xiang J, Guo C, Aboulnaga A (2013) Scalable maximum clique computation using MapReduce. In: Proceedings of the 29th IEEE international conference on data engineering (ICDE’13), pp 74–85

  59. Zhao Z, Liu H (2007) Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th international conference on machine learning (ICML’07). ACM, New York, NY, pp 1151–1157


Author information


Correspondence to Ahmed K. Farahat.

Additional information

A preliminary version of this paper appeared as [26].

This work was completed while the second author was at the University of Waterloo.


About this article


Cite this article

Farahat, A.K., Elgohary, A., Ghodsi, A. et al. Greedy column subset selection for large-scale data sets. Knowl Inf Syst 45, 1–34 (2015). https://doi.org/10.1007/s10115-014-0801-8

