A Scalable Feature Selection Algorithm for Large Datasets – Quick Branch & Bound Iterative (QBB-I)

Nedungadi, Prema; Remya, M. S.

doi:10.1007/978-3-319-07353-8_15


Conference paper

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 27)

Abstract

Feature selection algorithms aim to find an optimal subset of relevant features in the data both effectively and efficiently. As the number of features and the size of the data increase, new methods are needed that reduce complexity while maintaining the quality of the selected features. We review popular feature selection algorithms that use the consistency measure, such as the Las Vegas Filter (LVF), which is based on probabilistic search, and Automatic Branch and Bound (ABB), which is based on complete search. The hybrid Quick Branch and Bound (QBB) algorithm first runs LVF to find a smaller subset of valid features and then runs ABB on the reduced feature set. QBB is reasonably fast, robust, and handles interdependent features, but it does not work well with large data. In this paper, we propose an enhanced QBB algorithm called QBB Iterative (QBB-I). QBB-I partitions the dataset into two and runs QBB on the first partition to find a candidate feature subset. This subset is tested against the second partition using the consistency measure; any inconsistent rows are added to the first partition, and the process is repeated until the optimal feature set is found. Our tests on the ASSISTments intelligent tutoring dataset, with over 150,000 log records, and on other standard datasets show that QBB-I is significantly more efficient than QBB while selecting the same subset of features.
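The iterative loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration: the consistency measure is taken to be the inconsistency rate commonly used with LVF and ABB (rows outside the majority class of their feature-value pattern, divided by the total number of rows); the QBB step is replaced by a plain exhaustive search, named smallest_consistent_subset here, rather than LVF followed by ABB; and a held-out row is treated as inconsistent when its projected pattern occurs with more than one class label anywhere in the data. All function names and the toy dataset are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations, product


def inconsistency_rate(rows, labels, features):
    """Share of rows that fall outside the majority class of their pattern."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[f] for f in features)][label] += 1
    clashes = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return clashes / len(rows) if rows else 0.0


def smallest_consistent_subset(rows, labels, n_features):
    """Stand-in for QBB (assumption): exhaustive search, smallest subsets first."""
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            if inconsistency_rate(rows, labels, subset) == 0.0:
                return subset
    return tuple(range(n_features))  # data is inconsistent even on all features


def qbb_i(rows, labels, split):
    """Select features on rows[:split]; grow that partition until none clash."""
    n_features = len(rows[0])
    work = list(range(split))             # first partition (row indices)
    held = list(range(split, len(rows)))  # second partition (row indices)
    while True:
        subset = smallest_consistent_subset(
            [rows[i] for i in work], [labels[i] for i in work], n_features)
        # Class labels seen for each projected pattern (assumed test procedure).
        classes = defaultdict(set)
        for row, label in zip(rows, labels):
            classes[tuple(row[f] for f in subset)].add(label)
        bad = [i for i in held
               if len(classes[tuple(rows[i][f] for f in subset)]) > 1]
        if not bad:
            return subset
        work.extend(bad)                  # move inconsistent rows across
        held = [i for i in held if i not in set(bad)]


# Toy data: the label is feature 0 XOR feature 2; feature 1 is irrelevant.
rows = list(product([0, 1], repeat=3))
labels = [r[0] ^ r[2] for r in rows]
selected = qbb_i(rows, labels, split=3)
print(selected)
```

On this toy data the first pass, which sees only three rows, picks a subset that is too small. The test against the held-out rows exposes the clashes, those rows are moved into the first partition, and the second pass returns the same subset that a search over the full data would return. The loop always ends because each pass either returns or shrinks the second partition.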



Author information

Authors and Affiliations

Amrita Create, Coimbatore, India
Prema Nedungadi
Department of Computer Science, Amrita Vishwa Vidyapeetham, Coimbatore, India
M. S. Remya


Corresponding author

Correspondence to Prema Nedungadi.

Editor information

Editors and Affiliations

Indian Statistical Institute, Machine Intelligence Unit, Kolkata, India
Malay Kumar Kundu
Dept. of Computer Science and Engineering, National Institute of Technology Rourkela, Rourkela, India
Durga Prasad Mohapatra
Dept. of Electronics and Tele-Communication Engineering, Jadavpur University Artificial Intelligence Laboratory, Kolkata, India
Amit Konar
Dept. of Computer Science and Engineering, St. Thomas' College of Engineering & Technology, Kidderpore, West Bengal, India
Aruna Chakraborty

About this paper

Cite this paper

Nedungadi, P., Remya, M.S. (2014). A Scalable Feature Selection Algorithm for Large Datasets – Quick Branch & Bound Iterative (QBB-I). In: Kumar Kundu, M., Mohapatra, D., Konar, A., Chakraborty, A. (eds) Advanced Computing, Networking and Informatics - Volume 1. Smart Innovation, Systems and Technologies, vol 27. Springer, Cham. https://doi.org/10.1007/978-3-319-07353-8_15


DOI: https://doi.org/10.1007/978-3-319-07353-8_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07352-1
Online ISBN: 978-3-319-07353-8
eBook Packages: Engineering, Engineering (R0)
