Feature Ranking and Selection for Big Data Sets

Ordozgoiti, Bruno; Canaval, Sandra Gómez; Mozo, Alberto

doi:10.1007/978-3-319-44066-8_14

Part of the book series: Communications in Computer and Information Science (CCIS, volume 637)

Included in the following conference series:

East European Conference on Advances in Databases and Information Systems

Abstract

The availability of big data sets has led to the successful application of machine learning and data mining to problems that were previously unsolved. The use of these techniques, though, is rarely straightforward. High dimensionality is often one of the main obstacles that must be overcome before learning an adequate model or drawing useful conclusions from large amounts of data. Rank-revealing matrix factorizations can help address this problem by permuting the columns of the input data so that linearly dependent, and thus redundant, columns are moved to the right. These factorizations, however, are designed to operate in a centralized fashion, requiring the input data to be loaded into main memory, which makes them inapplicable to large data sets. In this paper we prove that data sets consisting of a huge number of rows can be easily transformed into a compact square matrix that preserves the permutation yielded by rank-revealing QR factorizations. This leads to a simple algorithm for running these factorizations on big data sets regardless of their number of rows. The nature of the transformation also makes it possible to deal with high-dimensional data with a controlled loss of precision. We offer experimental results showing that our method can provide improvements for the k-means algorithm, both in clustering results and in running time.
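
The abstract does not say which square matrix the authors construct, so the following is only an illustrative sketch under a stated assumption: it uses the Gram matrix G = AᵀA, which has one row and one column per feature no matter how many rows A has, and which can be accumulated one chunk of rows at a time. The greedy pivot order of QR with column pivoting depends only on column norms and inner products, all of which are entries of G, so a pivoted Cholesky pass over G yields the same order in exact arithmetic (ties aside). The names `gram_matrix` and `rank_columns`, the tolerance `tol`, and the toy data are hypothetical.

```python
import math
from typing import Iterable, List, Sequence

Row = Sequence[float]


def gram_matrix(chunks: Iterable[Sequence[Row]], n_cols: int) -> List[List[float]]:
    """Accumulate G = A^T A over chunks of rows; the result is n_cols x n_cols."""
    gram = [[0.0] * n_cols for _ in range(n_cols)]
    for chunk in chunks:          # each chunk could live on a different worker
        for row in chunk:
            for i in range(n_cols):
                for j in range(n_cols):
                    gram[i][j] += row[i] * row[j]
    return gram


def rank_columns(gram: List[List[float]], tol: float = 1e-10) -> List[int]:
    """Greedy column order obtained by pivoted Cholesky on the Gram matrix.

    Assumption: this mirrors QR with column pivoting on the original data,
    since both pick, at each step, the column with the largest residual norm.
    Columns whose residual falls below the tolerance are appended last.
    """
    n = len(gram)
    resid = [gram[j][j] for j in range(n)]   # squared residual column norms
    threshold = tol * max(resid, default=0.0)
    factors: List[List[float]] = []          # rows of the triangular factor
    order: List[int] = []
    remaining = list(range(n))
    while remaining:
        pivot = max(remaining, key=lambda j: resid[j])
        if resid[pivot] <= threshold:
            break                            # the rest is numerically dependent
        scale = math.sqrt(resid[pivot])
        new_row = [0.0] * n
        for j in remaining:
            dot = sum(f[pivot] * f[j] for f in factors)
            new_row[j] = (gram[pivot][j] - dot) / scale
        factors.append(new_row)
        order.append(pivot)
        remaining.remove(pivot)
        for j in remaining:                  # downdate the residual norms
            resid[j] -= new_row[j] ** 2
    return order + remaining


# Toy data: column 2 is exactly half of column 1, so it is redundant.
rows = [[1.0, 4.0, 2.0], [0.0, 4.0, 2.0], [1.0, 0.0, 0.0]]
chunks = [rows[:2], rows[2:]]                # pretend the rows arrive in two parts
gram = gram_matrix(chunks, n_cols=3)
order = rank_columns(gram)
print(order)  # [1, 0, 2]: the redundant column is moved to the right
```

Forming the Gram matrix squares the condition number of the data, so a numerically careful implementation might prefer a different reduction; the sketch only shows why the row count need not limit the ranking step.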

Acknowledgements

The research leading to these results has received funding from the European Union under the FP7 grant agreement n. 619633 (project ONTIC) and H2020 grant agreement n. 671625 (project CogNet).

Author information

Authors and Affiliations

Department of Computer Systems, University College of Computer Science, Universidad Politécnica de Madrid, Crta. de Valencia km. 7, 28031, Madrid, Spain
Bruno Ordozgoiti, Sandra Gómez Canaval & Alberto Mozo

Corresponding author

Correspondence to Bruno Ordozgoiti.

Editor information

Editors and Affiliations

Faculty of Sciences, University of Novi Sad, Novi Sad, Serbia
Mirjana Ivanović
Christian-Albrechts-Universität Kiel, Kiel, Germany
Bernhard Thalheim
University of Genoa, Genoa, Italy
Barbara Catania
Software Competence Center Hagenberg GmbH, Hagenberg, Austria
Klaus-Dieter Schewe
Riga Technical University, Riga, Latvia
Mārīte Kirikova
VSB-Technical University Ostrava, Ostrava, Czech Republic
Petr Šaloun
Georgia College and State University, Milledgeville, Georgia, USA
Ajantha Dahanayake
Politecnico di Torino, Torino, Italy
Tania Cerquitelli
Politecnico di Torino, Torino, Italy
Elena Baralis
EURECOM, Biot Sophia Antipolis cedex, France
Pietro Michiardi

About this paper

Cite this paper

Ordozgoiti, B., Canaval, S.G., Mozo, A. (2016). Feature Ranking and Selection for Big Data Sets. In: Ivanović, M., et al. New Trends in Databases and Information Systems. ADBIS 2016. Communications in Computer and Information Science, vol 637. Springer, Cham. https://doi.org/10.1007/978-3-319-44066-8_14

DOI: https://doi.org/10.1007/978-3-319-44066-8_14
Published: 14 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44065-1
Online ISBN: 978-3-319-44066-8
eBook Packages: Computer Science, Computer Science (R0)