Abstract
Selecting a proper classifier for a given data set is challenging. The critical problem of classifier selection is how to extract features from data sets. This paper proposes a new method for extracting features from a data set. Our method not only preserves the geometrical structure of a data set but also characterizes the decision boundary of classification problems. Specifically, the extracted feature can recover a data set that has the same Euclidean geometrical structure as the original data set. We present an efficient algorithm to compute the similarity between data set features. We theoretically analyze how the similarity between our features affects the performance of the support vector machine, a well-known classifier. The empirical results show that our method is effective in finding suitable classifiers.
Data availability
The datasets analysed during the current study are available in the UCI machine learning repository, https://archive.ics.uci.edu/ml/index.php.
Notes
For simplicity, we omit the class labels.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant 61602308 and the Interdisciplinary Innovation Team of Shenzhen University.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendices
Appendix A: Proof of Theorem 2
Proof
Let \(h(\textbf{x}) = \textbf{w}^\top \textbf{x}\), \(h^\prime (\textbf{x}^\prime ) = {\textbf{w}^\prime }^\top \textbf{x}^\prime \) and \(\Delta \textbf{w} = \textbf{w} - \textbf{w}^\prime \). Then we have
Since \(\textbf{w}\) and \(\textbf{w}^\prime \) are the minimizers of the associated SVM problems (12), for all \(t \in [0,1]\), we have
and
Summing (A2) and (A3), we obtain
Using the convexity of hinge loss, we have
and
Combining (A4), (A5) and (A6) and taking the limit \(t \rightarrow 0\) leads to
The second inequality follows from the Lipschitz continuity of hinge loss. We write \(\textbf{w}\) in terms of the dual variables \(\alpha _i\), namely \(\textbf{w} = \sum _{i=1}^n \alpha _i \textbf{x}_i\). Note that \(0 \leqslant \alpha _i \leqslant C\) and \(\textbf{x}_i = \textbf{G}^{1/2} \textbf{e}_i\). Then we obtain
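The displayed bound that follows at this point in the published version is not reproduced above. A plausible form of it, using only the stated facts \(0 \leqslant \alpha _i \leqslant C\) and \(\textbf{x}_i = \textbf{G}^{1/2} \textbf{e}_i\) (a sketch, not necessarily the exact inequality in the article), is
\[ \Vert \textbf{w}\Vert _2 = \Big \Vert \sum _{i=1}^n \alpha _i \textbf{G}^{1/2}\textbf{e}_i \Big \Vert _2 = \Vert \textbf{G}^{1/2}\boldsymbol{\alpha }\Vert _2 \leqslant \Vert \textbf{G}^{1/2}\Vert _2\, \Vert \boldsymbol{\alpha }\Vert _2 \leqslant C\sqrt{n}\, \Vert \textbf{G}^{1/2}\Vert _2, \]
where \(\boldsymbol{\alpha } = (\alpha _1,\cdots ,\alpha _n)^\top \).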
Analogously, we have
Thus, (A7) can be rewritten as
Here we use the inequality \(\Vert \textbf{G}^{1/2}-{\textbf{G}^\prime }^{1/2}\Vert _2 \leqslant \Vert \textbf{G}-\textbf{G}^\prime \Vert ^{1/2}_2\).
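This matrix square-root perturbation bound holds for all symmetric positive semidefinite \(\textbf{G}\) and \(\textbf{G}^\prime \); a short justification (added here for the reader's convenience, not part of the original appendix) runs as follows. Let \(\Delta = \textbf{G}^{1/2}-{\textbf{G}^\prime }^{1/2}\) and let \(\lambda \) be an eigenvalue of \(\Delta \) of largest modulus with unit eigenvector \(\textbf{v}\), so \(\Vert \Delta \Vert _2 = |\lambda |\); assume \(\lambda \geqslant 0\) (otherwise exchange \(\textbf{G}\) and \(\textbf{G}^\prime \)). From the identity \(\textbf{G}-\textbf{G}^\prime = \textbf{G}^{1/2}\Delta + \Delta {\textbf{G}^\prime }^{1/2}\),
\[ \textbf{v}^\top (\textbf{G}-\textbf{G}^\prime )\textbf{v} = \lambda \, \textbf{v}^\top (\textbf{G}^{1/2}+{\textbf{G}^\prime }^{1/2})\textbf{v} \geqslant \lambda ^2, \]
because \(\textbf{v}^\top \textbf{G}^{1/2}\textbf{v} = \textbf{v}^\top {\textbf{G}^\prime }^{1/2}\textbf{v} + \lambda \geqslant \lambda \) and \(\textbf{v}^\top {\textbf{G}^\prime }^{1/2}\textbf{v} \geqslant 0\). Hence \(\lambda ^2 \leqslant \Vert \textbf{G}-\textbf{G}^\prime \Vert _2\), which is the stated inequality.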
We also have
and
Substituting (A9)–(A12) into (A1) yields
The last inequality results from \((a+b)^{1/2}\leqslant \sqrt{a}+\sqrt{b}\) for \(a, b \geqslant 0\). \(\square \)
Appendix B: Incomplete Cholesky decomposition (ICD) algorithm
The ICD algorithm is given in Algorithm 2; a reference sketch in Python follows.
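Algorithm 2 itself is not reproduced here. Below is a minimal NumPy sketch of ICD with symmetric pivoting, consistent with the pivoting used in Appendix C (at every step the largest remaining diagonal element of the Schur complement is chosen as the pivot). The function name incomplete_cholesky and its interface are illustrative assumptions, not the authors' code; with piv the accumulated permutation, it factors \(\textbf{G}\) so that \(\textbf{P}\textbf{G}\textbf{P}^\top \approx \textbf{L}\textbf{L}^\top \).

import numpy as np

def incomplete_cholesky(G, rank, tol=1e-10):
    """Incomplete Cholesky with symmetric pivoting.

    Returns (L, piv) such that G[np.ix_(piv, piv)] is approximated by
    L @ L.T, where L has at most `rank` columns.
    """
    G = np.array(G, dtype=float, copy=True)
    n = G.shape[0]
    piv = np.arange(n)
    d = np.diag(G).copy()              # diagonal of the current Schur complement
    L = np.zeros((n, rank))
    for s in range(rank):
        p = s + int(np.argmax(d[s:]))  # pivot: largest remaining diagonal element
        if d[p] <= tol:                # numerically rank deficient: stop early
            return L[:, :s], piv
        # symmetric swap of positions s and p
        piv[[s, p]] = piv[[p, s]]
        d[[s, p]] = d[[p, s]]
        L[[s, p], :s] = L[[p, s], :s]
        G[[s, p], :] = G[[p, s], :]
        G[:, [s, p]] = G[:, [p, s]]
        # compute column s of the Cholesky factor
        L[s, s] = np.sqrt(d[s])
        L[s + 1:, s] = (G[s + 1:, s] - L[s + 1:, :s] @ L[s, :s]) / L[s, s]
        d[s + 1:] -= L[s + 1:, s] ** 2
    return L, piv

For a symmetric positive semidefinite \(\textbf{G}\) and rank equal to \(n\), the factorization is exact up to the stopping tolerance.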
Appendix C: Proof of Theorem 3
Proof
The second and the last statements can be inferred from the first and the third ones. Thus, we only need to prove (i) and (iii). Since \(\textbf{G}\) and \(\textbf{G}^\prime \) are isomorphic, there exists a bijection \(\pi _0: \{1,2,\cdots ,n\}\rightarrow \{1,2,\cdots ,n\}\) such that
We will argue by mathematical induction.
1) For \(s = 1\), let \(p_1 = \arg \max _{1 \leqslant i \leqslant n} \textbf{G}(i,i)\) and \(q_1 = \arg \max _{1 \leqslant i \leqslant n} \textbf{G}^\prime (i,i)\). Since the largest diagonal elements of \(\textbf{G}\) and \(\textbf{G}^\prime \) are unique, we have \(q_1 = \pi _0(p_1)\). We construct bijections
and
If \(p_1 = 1\) (\(q_1 = 1\)), then \(\omega _1\) (\(\omega ^\prime _1\)) is an identity mapping. Define a composite function \(\pi _1: \{1,2,\cdots ,n\}\rightarrow \{1,2,\cdots ,n\}\):
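The displayed definition is not reproduced above; from the way \(\pi _1\) is used in (C15) and (C16) below, it is presumably \(\pi _1 = \omega ^\prime _1\circ \pi _0\circ \omega _1\), which fixes the first index:
\[ \pi _1(1) = \omega ^\prime _1\big (\pi _0(\omega _1(1))\big ) = \omega ^\prime _1(\pi _0(p_1)) = \omega ^\prime _1(q_1) = 1 . \]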
Then
The first diagonal element of \(\textbf{A}^{(1)}\) is
The remaining elements of the first column of \(\textbf{A}^{(1)}\) are
In a similar way,
Note that \(\textbf{A}^{(1)}(1,1) = \textbf{B}^{(1)}(1,1)\), \(\omega ^\prime _1\circ \omega ^\prime _1(i) = i\) and \(\textbf{G}(i,p_1) = \textbf{G}^\prime (\pi _0(i),\pi _0(p_1))\). Then we obtain
Let \(\textbf{P}^{(1)} = \textbf{P}[1,p_1]\) and \(\textbf{Q}^{(1)} = \textbf{P}[1,q_1]\), and denote \(\textbf{G}^{(1)} = \textbf{P}^{(1)}\textbf{G}\textbf{P}^{(1)}\) and \({\textbf{G}^\prime }^{(1)} = \textbf{Q}^{(1)}\textbf{G}^\prime \textbf{Q}^{(1)}\). Then
and
where the third equality results from (C14). Combining (C15) and (C16) yields \(\textbf{G}^{(1)}(i,j) = {\textbf{G}^\prime }^{(1)}(\pi _1(i),\pi _1(j))\). Therefore, the theorem holds for \(s=1\).
2) Assume that the theorem holds for \(s = m\). There exists a bijection \(\pi _m:\{1,2,\cdots ,n\}\rightarrow \{1,2,\cdots ,n\}\) such that \(\pi _m(i) = i\) (\(1\leqslant i\leqslant m\)) and for every i,
There also exist permutation matrices \(\textbf{P}^{(m)}\) and \(\textbf{Q}^{(m)}\) such that
where \(\textbf{G}^{(m)} = \textbf{P}^{(m)} \textbf{G} \textbf{P}^{(m)}\) and \({\textbf{G}^\prime }^{(m)} = \textbf{Q}^{(m)} \textbf{G}^\prime \textbf{Q}^{(m)}\).
3) We will show that the theorem holds for \(s = m+1\). The \((m+1)\)th to the nth diagonal elements of \(\textbf{A}^{(m)}\) are
Note that \(\pi _m(i) = i\) (\(1\leqslant i\leqslant m\)) and \(\pi _m\) is a bijection. Let \(p_{m+1} = \arg \max _{m+1\leqslant i\leqslant n}\textbf{A}^{(m)}(i,i)\) and \(q_{m+1} = \arg \max _{m+1\leqslant i\leqslant n} \textbf{B}^{(m)}(i,i)\). We have \(q_{m+1} = \pi _m(p_{m+1})\) due to the uniqueness of the largest diagonal element at each step of symmetric pivoting. We construct bijections
and
Define a composite function \(\pi _{m+1}:\{1,2,\cdots ,n\}\rightarrow \{1,2,\cdots ,n\}\):
For \(1 \leqslant i \leqslant m\),
We also have
Thus, \(\pi _{m+1}(i) = i \ (1\leqslant i\leqslant m+1)\).
Let \(\textbf{P}^{(m+1)} = \textbf{P}[m+1,p_{m+1}] \textbf{P}^{(m)}\) and \(\textbf{Q}^{(m+1)} = \textbf{P}[m+1,q_{m+1}] \textbf{Q}^{(m)}\), and denote \(\textbf{G}^{(m+1)} = \textbf{P}^{(m+1)} \textbf{G} \textbf{P}^{(m+1)} \) and \({\textbf{G}^\prime }^{(m+1)} = \textbf{Q}^{(m+1)} {\textbf{G}^\prime } \textbf{Q}^{(m+1)}\). Then
and
where the last two equalities follow from (C19) and (C18), respectively. Combining (C20) and (C21) yields
For \(1 \leqslant i \leqslant n\) and \(1 \leqslant j \leqslant m\),
where the third and the fourth equalities result from (C17) and (C19), respectively. For the \((m+1)\)th diagonal element of \(\textbf{A}^{(m+1)}\),
and for \(m + 2 \leqslant i \leqslant n\),
Thus, for every i we obtain
The proof is complete by mathematical induction.
\(\square \)
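As an illustrative numerical check (not part of the paper), the conclusion of Theorem 3 can be exercised by reusing the incomplete_cholesky sketch from Appendix B: running pivoted ICD on a Gram matrix \(\textbf{G}\) and on an isomorphic copy \(\textbf{G}^\prime \) should, assuming the largest diagonal element is unique at every step, produce identical triangular factors.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
G = X @ X.T                         # Gram matrix of 8 points
perm = rng.permutation(8)           # an isomorphism pi_0
Gp = G[np.ix_(perm, perm)]          # isomorphic Gram matrix G'

L1, piv1 = incomplete_cholesky(G, rank=8)
L2, piv2 = incomplete_cholesky(Gp, rank=8)

# With unique pivots, symmetric pivoting visits the same points in the
# same order in both runs, so the factors coincide.
assert np.allclose(L1, L2, atol=1e-8)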