Abstract
This paper proposes a method for finding a discriminative linear transformation that enhances the data's degree of conformance to the compactness hypothesis and its inverse. The problem formulation relies on inter-observation distances only, and the resulting transformation is shown to improve the performance of non-parametric and non-linear classifiers on benchmark and real-world data sets. The proposed approach is suitable for both binary and multiple-category classification problems, and can be applied as a dimensionality reduction technique; in the latter case, the number of necessary discriminative dimensions can be determined exactly. Also considered is a kernel-based extension of the proposed discriminant analysis method, which overcomes the linearity assumption imposed on the sought discriminative transformation by the initial formulation. This enhancement allows the proposed method to be applied to non-linear classification problems and has the additional benefit of being able to accommodate indefinite kernels.
Notes
Here and in several other places we will use the shorthand \(\prod_{i < j}^{N_X}\) to denote the double product \(\prod_{i=1}^{N_X} \prod_{j=i+1}^{N_X}.\)
For instance, the DDA formulation applies naturally to cases where there is no strict class separability, whereas the ACM method fails because the version space becomes an empty set.
Similar notation will be used further on: a bar over a variable name signifies that the variable either depends on, or is itself, a supporting point.
A more detailed analysis may show that resorting to the Taylor series approximation can violate the majorization requirements in the strict sense. However, the empirical evidence suggests otherwise (see Sect. 7), confirming the technique as the preferred alternative.
The elements g ij of matrix G not affected by the first two rules of (23) are assumed to have been initially set to zero.
A word of caution is in order regarding the choice k = 1, which corresponds to an ill-posed combinatorial problem [6].
References
Arkadev A, Braverman E (1966) Computers and pattern recognition. Thompson, Washington, DC
Bahlmann C, Haasdonk B, Burkhardt H (2002) On-line handwriting recognition with support vector machines—a kernel approach. In: Eighth International Workshop on Frontiers in Handwriting Recognition. Ontario, Canada
Bartlett P (1997) For valid generalization, the size of the weights is more important than the size of the network. Adv Neural Inform Process Syst 9:134–140
Bertero M, Boccacci P (1998) Introduction to inverse problems in imaging. Institute of Physics Publishing
Blake CL, Merz CJ (1998) UCI repository of machine learning databases
Borg I, Groenen PJF (1997) Modern multidimensional scaling. Springer, New York
Bressan M, Vitrià J (2003) Nonparametric discriminant analysis and nearest neighbor classification. Pattern Recogn Lett 24(15):2743–2749
Chow GC (1960) Tests of equality between sets of coefficients in two linear regressions. Econometrica 28(3)
Commandeur J, Groenen PJF, Meulman J (1999) A distance-based variety of non-linear multivariate data analysis, including weights for objects and variables. Psychometrika 64(2):169–186
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge
DeCoste D, Schölkopf B (2002) Training invariant support vector machines. Mach Learn 46(1–3):161–190
Dietterich TG (2000) Ensemble methods in machine learning. In: Kittler J, Roli F (eds) First international workshop on multiple classifier systems. Springer, Heidelberg, pp 1–15
Dinkelbach W (1967) On nonlinear fractional programming. Manage Sci A(13):492–498
Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
Dunn D, Higgins WE, Wakeley J (1994) Texture segmentation using 2-D Gabor elementary functions. IEEE Trans Pattern Anal Mach Intell 16(2):130–149
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7:179–188
Fix E, Hodges J (1951) Discriminatory analysis: nonparametric discrimination: consistency properties. Technical Report 4, USAF School of Aviation Medicine
Fletcher R (1987) Practical methods of optimization. Wiley, Chichester
Fogel I, Sagi D (1989) Gabor filters as texture discriminator. Biol Cybern 61:103–113
Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Academic, New York
Fukunaga K, Mantock J (1983) Nonparametric discriminant analysis. IEEE Trans Pattern Anal Mach Intell 5(6):671–678
Gentle J (1998) Numerical linear algebra for applications in statistics. Springer, Berlin
Haasdonk B (2005) Feature space interpretation of SVMs with indefinite kernels. IEEE Trans Pattern Anal Mach Intell 27(4):482–492
Haasdonk B, Keysers D (2002) Tangent distance kernels for support vector machines. In: Proceedings of the 16th ICPR, pp 864–868
Haasdonk B, Bahlmann C (2004) Learning with distance substitution kernels. In: 26th Pattern Recognition Symposium of the German Association for Pattern Recognition (DAGM 2004). Springer, Tübingen, Germany
Han J, Ma K-K (2007) Rotation-invariant and scale-invariant Gabor features for texture image retrieval. Image Vis Comput 25(9):1474–1481
Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell 18(6):607–616
Heiser W (1995) Convergent computation by iterative majorization: theory and applications in multidimensional data analysis. Recent advances in descriptive multivariate analysis, pp. 157–189
Huber P (1964) Robust estimation of a location parameter. Ann Math Stat 35:73–101
Kiers HAL (1990) Majorization as a tool for optimizing a class of matrix functions. Psychometrika 55:417–428
Krogh A, Hertz JA (1992) A simple weight decay can improve generalization. In: Moody JE, Hanson SJ, Lippmann RP (eds) Advances in neural information processing systems, Vol 4. Morgan Kaufmann, San Francisco, pp 950–957
Lawrence S, Giles C (2000) Overfitting and neural networks: conjugate gradient and backpropagation. In: Proceedings of the IEEE international conference on neural networks. IEEE Press, pp 114–119
De Leeuw J (1993) Fitting distances by least squares. Technical Report 130, Interdivisional Program in Statistics, UCLA, Los Angeles
Leibe B, Schiele B (2003) Analyzing appearance and contour based methods for object categorization. In: International conference on computer vision and pattern recognition (CVPR’03). Madison, WI, pp 409–415
Li Y, Shapiro LG (2004) Object recognition for content-based image retrieval. In: Lecture Notes in Computer Science. Springer, Heidelberg
Lin H-T, Lin C-J (2003) A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan. Available at http://www.csie.ntu.edu.tw/~cjlin/papers/tanh.pdf
Mary X (2003) Sous-espaces hilbertiens, sous-dualités et applications. PhD thesis, Institut National des Sciences Appliquées de Rouen (INSA Rouen), ASI-PSI
Masip D, Kuncheva LI, Vitrià J (2005) An ensemble-based method for linear feature extraction for two-class problems. Pattern Anal Appl 8(3):227–237
Mitchell T (1997) Machine learning. McGraw-Hill, New York
Moré JJ, Sorensen DC (1983) Computing a trust region step. SIAM J Sci Stat Comput 4(3):553–572
Moreno PJ, Ho PP, Vasconcelos N (2004) A Kullback–Leibler divergence based kernel for SVM classification in multimedia applications. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems, Vol 16. MIT, Cambridge
Nesterov Y, Nemirovskii A (1994) Interior point polynomial methods in convex programming: theory and applications. Society for Industrial and Applied Mathematics, Philadelphia
Ong CS, Smola AJ, Williamson RC (2002) Hyperkernels. In: Neural information processing systems, Vol 15. MIT, Cambridge
Ong CS, Mary X, Canu S, Smola AJ (2004) Learning with non-positive kernels. In: ICML ’04: Proceedings of the twenty-first international conference on Machine learning. ACM
Paredes R, Vidal E (2000) A class-dependent weighted dissimilarity measure for nearest neighbor classification problems. Pattern Recogn Lett 21(12):1027–1036
Rojas M, Santos S, Sorensen D (2000) A new matrix-free algorithm for the large-scale trust-region subproblem. SIAM J Optim 11(3):611–646
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323–2326
Schölkopf B (2001) The kernel trick for distances. In: Leen TK, Dietterich TG, Tresp V (eds) Advances in neural information processing systems, Vol 13. MIT, Cambridge, pp 301–307
Shen L, Bai L (2006) A review on gabor wavelets for face recognition. Pattern Anal Appl 9(2–3):273–292
Smith JR, Chang S-F (1996) Tools and techniques for color image retrieval. In: Storage and Retrieval for Image and Video Databases (SPIE), pp 426–437
Squire D. McG, Müller W, Müller H, Raki J (1999) Content-based query of image databases, inspirations from text retrieval: inverted files, frequency-based weights and relevance feedback. In: The 11th Scandinavian Conference on Image Analysis. Kangerlussuaq, Greenland, pp 143–149
Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290:2319–2323
Theodoridis S, Koutroumbas K (1999) Pattern recognition. Academic, London
Torkkola K, Campbell W (2000) Mutual information in learning feature transformations. In: Proceedings 17th international conference on machine learning, pp 1015–1022
Trafalis TB, Malyscheff AM (2002) An analytic center machine. Mach Learn 46(1–3):203–223
van Deun K, Groenen PJF (2003) Majorization algorithms for inspecting circles, ellipses, squares, rectangles, and rhombi. Technical report, Econometric Institute Report EI 2003-35
Vapnik VN (1995) The nature of statistical learning theory. Springer, New York
Vapnik VN (1998) Statistical learning theory. Wiley, New-York
Watanabe H, Yamaguchi T, Katagiri S (1997) Discriminative metric design for robust pattern recognition. IEEE Trans Signal Process 45(11):2655–2661
Webb A (1995) Multidimensional scaling by iterative majorization using radial basis functions. Pattern Recogn 28(5):753–759
Weiss S, Kulikowski C (1991) Computer systems that learn. Morgan Kaufmann, San Francisco
Zhou X, Huang T (2001) Comparing discriminating transformations and SVM for learning during multimedia retrieval. In: Proceedings of the 9th ACM international conference on multimedia. Ottawa, Canada, pp 137–146
Zhou X, Huang T (2001) Small sample learning during multimedia retrieval using BiasMap. In: IEEE computer vision and pattern recognition (CVPR’01), Hawaii
Appendix
This section focuses on the intuition behind the definitions of design matrices R and G specified in (14) and (22). The derivations listed here are mostly based on those developed for the SMACOF multi-dimensional scaling algorithm [6].
Let us consider matrix R, which is used in calculating the majorizing expression of S W (T), represented by a weighted sum of within-distances. In the derivations that follow, we will assume all weights equal to unity, and show afterwards how this assumption can easily be corrected for. We thus begin by rewriting a squared within-distance in vector form:

\(d_{ij}^2(X') = \left(x'_i - x'_j\right)\left(x'_i - x'_j\right)^{\top}, \qquad (50)\)

where x ′ i and x ′ j denote rows i and j of matrix X′ = XT, representing the corresponding observations transformed by T. Noticing that \(x'_i - x'_j = (e_i - e_j)^{\top} X'\), (50) becomes:

\(d_{ij}^2(X') = (e_i - e_j)^{\top} X' X'^{\top} (e_i - e_j) = \operatorname{tr}\left(X'^{\top} A_{ij} X'\right), \qquad A_{ij} = (e_i - e_j)(e_i - e_j)^{\top}, \qquad (51)\)
where A ij is a square symmetric matrix whose elements are all zeros, except for the four indexed by the combinations of i and j, which equal 1 (diagonal) or −1 (off-diagonal). For instance, A 13 for i = 1, j = 3 and N X = 3 has the following form:

\(A_{13} = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 1 \end{pmatrix}\)
Taking into account (51), the sum of the squared within-distances can be expressed as:

\(\sum_{i < j}^{N_X} d_{ij}^2(X') = \operatorname{tr}\left(X'^{\top} V X'\right),\)

where \(V=\sum_{i < j}^{N_X} A_{ij},\) for which there exists an easy computational shortcut. Namely, V is obtained by placing −1 in all off-diagonal entries of the matrix, while the diagonal elements are calculated as the negated sums of the corresponding off-diagonal values in their rows or columns. That is:

\(v_{ij} = -1 \;\; (i \neq j), \qquad v_{ii} = -\sum_{j \neq i} v_{ij} = N_X - 1.\)
For instance, returning to our previous N X = 3 example, this technique produces:

\(V = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{pmatrix}\)
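The shortcut can be checked numerically. The sketch below (plain Python, illustrative only; it is not part of the paper's implementation, and uses zero-based indices in place of the paper's one-based ones) builds V both as the direct double sum of the A ij matrices and via the shortcut, and confirms the two agree:

```python
def a_matrix(i, j, n):
    # A_ij: all zeros except a_ii = a_jj = 1 and a_ij = a_ji = -1
    a = [[0] * n for _ in range(n)]
    a[i][i] = a[j][j] = 1
    a[i][j] = a[j][i] = -1
    return a

def v_direct(n):
    # V as the double sum of A_ij over all pairs i < j
    v = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            aij = a_matrix(i, j, n)
            for r in range(n):
                for c in range(n):
                    v[r][c] += aij[r][c]
    return v

def v_shortcut(n):
    # Shortcut: -1 in every off-diagonal entry; each diagonal entry is the
    # negated sum of the off-diagonal entries in its row, i.e. n - 1
    return [[n - 1 if r == c else -1 for c in range(n)] for r in range(n)]

# The N_X = 3 example from the text
assert v_direct(3) == v_shortcut(3) == [[2, -1, -1], [-1, 2, -1], [-1, -1, 2]]
```

The equality holds for any N X, since every index participates in exactly N X − 1 pairs.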
It is not difficult to see that the same result carries over to the case of non-unitary weights associated with each distance, the only difference being that instead of −1, the off-diagonal elements of V hold the negated values of the corresponding weights. This is exactly how the matrix formulation of \(\mu_{S_W}(T,{\bar T})\) in (15) and the design matrix R in (14) are obtained.
In order to derive the formulation of matrix G, as specified for the majorizer of −S B (T) based on the Taylor series expansion (23), we rewrite (22) using the same technique as in (51), arriving at:

\(d^2(x'_i, y'_j) = \operatorname{tr}\left(Z'^{\top} C_{ij} Z'\right), \qquad (55)\)

where Z′ stacks the transformed observations of both classes, and C ij = (e i −e N_X+j ) (e i −e N_X+j )T is a between-class analog of matrix A ij . From (55), it is apparent that the same type of computational shortcut used above to obtain V may be exploited here too. Indeed, matrix \(F=\sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} C_{ij}\) can be quickly constructed by placing −1 in the off-diagonal elements that correspond to the index locations of the between-distances, and then summing with negation to obtain the diagonal entries. An illustration of the technique for N X = 2, N Y = 3:

\(F = \begin{pmatrix} 3 & 0 & -1 & -1 & -1 \\ 0 & 3 & -1 & -1 & -1 \\ -1 & -1 & 2 & 0 & 0 \\ -1 & -1 & 0 & 2 & 0 \\ -1 & -1 & 0 & 0 & 2 \end{pmatrix}\)
The example above assumes unitary weights. Again, the extension to the non-unitary weight formulation is trivial: it involves pre-multiplying the off-diagonal entries by the appropriate quantities, which in the case of G are the reciprocals of the squares of the corresponding distances, as shown in (23).
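The between-class shortcut admits the same kind of numerical check as the within-class one. The sketch below (plain Python, illustrative only, zero-based indices) builds F both as the double sum of the C ij matrices and via the shortcut:

```python
def c_matrix(i, j, nx, ny):
    # C_ij = (e_i - e_{nx+j})(e_i - e_{nx+j})^T, an (nx+ny) x (nx+ny) matrix
    n = nx + ny
    e = [0] * n
    e[i] = 1          # observation i from class X
    e[nx + j] = -1    # observation j from class Y
    return [[e[r] * e[c] for c in range(n)] for r in range(n)]

def f_direct(nx, ny):
    # F as the double sum of C_ij over i = 0..nx-1, j = 0..ny-1
    n = nx + ny
    f = [[0] * n for _ in range(n)]
    for i in range(nx):
        for j in range(ny):
            cij = c_matrix(i, j, nx, ny)
            for r in range(n):
                for c in range(n):
                    f[r][c] += cij[r][c]
    return f

def f_shortcut(nx, ny):
    # Shortcut: -1 at every between-class index pair; diagonal entries are
    # the negated sums of the off-diagonal entries in their rows
    n = nx + ny
    f = [[0] * n for _ in range(n)]
    for i in range(nx):
        for j in range(ny):
            f[i][nx + j] = f[nx + j][i] = -1
    for r in range(n):
        f[r][r] = -sum(f[r][c] for c in range(n) if c != r)
    return f

assert f_direct(2, 3) == f_shortcut(2, 3)  # the N_X = 2, N_Y = 3 case from the text
```

Each X-row of F carries N Y on its diagonal and each Y-row carries N X, since that is how many between-distances the corresponding observation participates in.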
Kosinov, S., Pun, T. Distance-based discriminant analysis method and its applications. Pattern Anal Applic 11, 227–246 (2008). https://doi.org/10.1007/s10044-007-0082-x