Abstract
Support Vector Machine (SVM) is a well-known classification technique that has achieved excellent performance in many nonlinear and high-dimensional pattern recognition tasks. However, the high time complexity of training an SVM model makes it difficult to apply to large-scale data sets. One of the most promising solutions is to reduce the training data used to establish the optimal classification hyperplane by selecting the relevant support vectors, which are the only instances that determine the classification rule. Instance selection is therefore an efficient pre-processing technique for reducing the computational complexity and storage requirements of the learning process. In this manuscript, taking the geometric distribution of the data sets into account, we propose a Half Shell Extraction (HSE) algorithm, which falls into the condensation category of instance selection methods. Moreover, a fuzzy distance metric based on locality-sensitive hashing is employed to accelerate the instance selection process. An experimental study involving a variety of data sets compares the proposed algorithm with five competitive algorithms; the results show that the proposed algorithm consistently outperforms the others in terms of accuracy, reduction capability and runtime.
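The abstract mentions accelerating instance selection with locality-sensitive hashing. A minimal sketch of p-stable (Gaussian-projection) LSH, the general scheme that underlies such acceleration, is shown below; the function name and parameters are illustrative only, not the authors' implementation:

```python
import numpy as np

def lsh_hash(X, n_hashes=8, w=4.0, seed=0):
    """p-stable LSH: h(x) = floor((a . x + b) / w) for random Gaussian a.

    Points that are close in Euclidean distance tend to receive the
    same hash code, so candidate neighbours can be found by comparing
    hash tuples instead of computing all pairwise distances.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    a = rng.standard_normal((d, n_hashes))      # random projection directions
    b = rng.uniform(0.0, w, size=n_hashes)      # random offsets in [0, w)
    return np.floor((X @ a + b) / w).astype(int)

# Toy use: points sharing a full hash tuple are likely close neighbours.
X = np.array([[0.0, 0.0], [0.1, 0.05], [5.0, 5.0]])
codes = lsh_hash(X)
```

In an instance-selection setting, such codes would let a condensation method restrict distance computations to points falling in the same hash bucket, which is where the runtime savings come from.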
Data Availability
The data sets analysed during the current study are available in the UCI repository (http://archive.ics.uci.edu/ml/index.php) and the KEEL repository (http://keel.es/).
Acknowledgments
The authors would like to thank the anonymous referees for their valuable comments and suggestions, which helped us improve this paper.
Funding
This work was supported by the Fundamental Research Funds for the Central Universities under grant SWU117051.
About this article
Cite this article
Zhang, J., Liu, C. Fast instance selection method for SVM training based on fuzzy distance metric. Appl Intell 53, 18109–18124 (2023). https://doi.org/10.1007/s10489-022-04447-7