
Distance-based discriminant analysis method and its applications

Pattern Analysis and Applications (Theoretical Advances)

Abstract

This paper proposes a method of finding a discriminative linear transformation that enhances the data’s degree of conformance to the compactness hypothesis and its inverse. The problem formulation relies on inter-observation distances only, which is shown to improve non-parametric and non-linear classifier performance on benchmark and real-world data sets. The proposed approach is suitable for both binary and multiple-category classification problems, and can be applied as a dimensionality reduction technique. In the latter case, the number of necessary discriminative dimensions can be determined exactly. Also considered is a kernel-based extension of the proposed discriminant analysis method which overcomes the linearity assumption of the sought discriminative transformation imposed by the initial formulation. This enhancement allows the proposed method to be applied to non-linear classification problems and has an additional benefit of being able to accommodate indefinite kernels.


Notes

  1. Here and in several other places we will use the shorthand \(\prod_{i < j}^{N_X}\) to designate the double product \(\prod_{i=1}^{N_X} \prod_{j=i+1}^{N_X}.\)

  2. For instance, unlike the ACM technique, which fails when the version space becomes an empty set, the DDA formulation applies naturally to cases where there is no strict class separability.

  3. Similar notation will be used further on: a bar over a variable name signifies that the variable either depends on or is itself a supporting point.

  4. A more detailed analysis might show that resorting to the Taylor series approximation can break conformance to the majorization requirements in the strict sense. However, the empirical evidence indicates otherwise (see Section 7), confirming the technique as the preferred alternative.

  5. The elements \(g_{ij}\) of matrix G not affected by the first two rules of (23) are assumed to have been initially set to zero.

  6. A word of caution is in order regarding the choice of k = 1, which corresponds to an ill-posed combinatorial problem [6].

References

  1. Arkadev A, Braverman E (1966) Computers and pattern recognition. Thompson, Washington, DC

  2. Bahlmann C, Haasdonk B, Burkhardt H (2002) On-line handwriting recognition with support vector machines—a kernel approach. In: Eighth International Workshop on Frontiers in Handwriting Recognition. Ontario, Canada

  3. Bartlett P (1997) For valid generalization, the size of the weights is more important than the size of the network. Adv Neural Inform Process Syst 9:134–140

  4. Bertero M, Boccacci P (1998) Introduction to inverse problems in imaging. Institute of Physics Publishing

  5. Blake CL, Merz CJ (1998) UCI repository of machine learning databases

  6. Borg I, Groenen PJF (1997) Modern multidimensional scaling. Springer, New York

  7. Bressan M, Vitrià J (2003) Nonparametric discriminant analysis and nearest neighbor classification. Pattern Recogn Lett 24(15):2743–2749

  8. Chow GC (1960) Tests of equality between sets of coefficients in two linear regressions. Econometrica 28(3)

  9. Commandeur J, Groenen PJF, Meulman J (1999) A distance-based variety of non-linear multivariate data analysis, including weights for objects and variables. Psychometrika 64(2):169–186

  10. Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge

  11. DeCoste D, Schölkopf B (2002) Training invariant support vector machines. Mach Learn 46(1–3):161–190

  12. Dietterich TG (2000) Ensemble methods in machine learning. In: Kittler J, Roli F (eds) First international workshop on multiple classifier systems. Springer, Heidelberg, pp 1–15

  13. Dinkelbach W (1967) On nonlinear fractional programming. Manage Sci A(13):492–498

  14. Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York

  15. Dunn D, Higgins WE, Wakeley J (1994) Texture segmentation using 2-D Gabor elementary functions. IEEE Trans Pattern Anal Mach Intell 16(2):130–149

  16. Fisher RA (1936) The use of multiple measures in taxonomic problems. Ann Eugenics 7:179–188

  17. Fix E, Hodges J (1951) Discriminatory analysis: nonparametric discrimination: consistency properties. Technical Report 4, USAF School of Aviation Medicine

  18. Fletcher R (1987) Practical methods of optimization. Wiley, Chichester

  19. Fogel I, Sagi D (1989) Gabor filters as texture discriminator. Cybernetics 61:103–113

  20. Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Academic, New York

  21. Fukunaga K, Mantock J (1983) Nonparametric discriminant analysis. IEEE Trans Pattern Anal Mach Intell 5(6):671–678

  22. Gentle J (1998) Numerical linear algebra for applications in statistics. Springer, Berlin

  23. Haasdonk B (2005) Feature space interpretation of SVMs with indefinite kernels. IEEE Trans Pattern Anal Mach Intell 27(4):482–492

  24. Haasdonk B, Keysers D (2002) Tangent distance kernels for support vector machines. In: Proceedings of the 16th ICPR, pp 864–868

  25. Haasdonk B, Bahlmann C (2004) Learning with distance substitution kernels. In: 26th Pattern Recognition Symposium of the German Association for Pattern Recognition (DAGM 2004). Springer, Tübingen, Germany

  26. Han J, Ma K-K (2007) Rotation-invariant and scale-invariant Gabor features for texture image retrieval. Image Vis Comput 25(9):1474–1481

  27. Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell 18(6):607–616

  28. Heiser W (1995) Convergent computation by iterative majorization: theory and applications in multidimensional data analysis. Recent advances in descriptive multivariate analysis, pp. 157–189

  29. Huber P (1964) Robust estimation of a location parameter. Ann Math Stat 35:73–101

  30. Kiers HAL (1990) Majorization as a tool for optimizing a class of matrix functions. Psychometrika 55:417–428

  31. Krogh A, Hertz JA (1992) A simple weight decay can improve generalization. In: Moody JE, Hanson SJ, Lippmann RP (eds) Advances in neural information processing systems, Vol 4. Morgan Kaufmann, San Francisco, pp 950–957

  32. Lawrence S, Giles C (2000) Overfitting and neural networks: conjugate gradient and backpropagation. In: Proceedings of the IEEE international conference on neural networks. IEEE Press, pp 114–119

  33. De Leeuw J (1993) Fitting distances by least squares. Technical Report 130, Interdivisional Program in Statistics. UCLA, Los Angeles

  34. Leibe B, Schiele B (2003) Analyzing appearance and contour based methods for object categorization. In: International conference on computer vision and pattern recognition (CVPR’03). Madison, WI, pp 409–415

  35. Li Y, Shapiro LG (2004) Object recognition for content-based image retrieval. In: Lecture Notes in Computer Science. Springer, Heidelberg

  36. Lin H-T, Lin C-J (2003) A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan. Available at http://www.csie.ntu.edu.tw/~cjlin/papers/tanh.pdf

  37. Mary X (2003) Sous-espaces hilbertiens, sous-dualités et applications. PhD thesis, Institut national des sciences appliquees de rouen - Insa rouen, ASI-PSI

  38. Masip D, Kuncheva LI, Vitrià J (2005) An ensemble-based method for linear feature extraction for two-class problems. Pattern Anal Appl 8(3):227–237

  39. Mitchell T (1997) Machine learning. McGraw-Hill, New York

  40. Moré JJ, Sorensen DC (1983) Computing a trust region step. SIAM J Sci Stat Comput 4(3):553–572

  41. Moreno PJ, Ho PP, Vasconcelos N (2004) A Kullback–Leibler divergence based kernel for SVM classification in multimedia applications. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems, Vol 16. MIT, Cambridge

  42. Nesterov Y, Nemirovskii A (1994) Interior Point Polynomial Methods in Convex Programming: Theory and Applications. Society for Industrial and Applied Mathematics, Philadelphia

  43. Ong CS, Smola AJ, Williamson RC (2002) Hyperkernels. In: Neural information processing systems, Vol 15. MIT, Cambridge

  44. Ong CS, Mary X, Canu S, Smola AJ (2004) Learning with non-positive kernels. In: ICML ’04: Proceedings of the twenty-first international conference on Machine learning. ACM

  45. Paredes R, Vidal E (2000) A class-dependent weighted dissimilarity measure for nearest neighbor classification problems. Pattern Recogn Lett 21(12):1027–1036

  46. Rojas M, Santos S, Sorensen D (2000) A new matrix-free algorithm for the large-scale trust-region subproblem. SIAM J Optim 11(3):611–646

  47. Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323–2326

  48. Schölkopf B (2001) The kernel trick for distances. In: Leen TK, Dietterich TG, Tresp V (eds) Advances in neural information processing systems, Vol 13. MIT, Cambridge, pp 301–307

  49. Shen L, Bai L (2006) A review on Gabor wavelets for face recognition. Pattern Anal Appl 9(2–3):273–292

  50. Smith JR, Chang S-F (1996) Tools and techniques for color image retrieval. In: Storage and Retrieval for Image and Video Databases (SPIE), pp 426–437

  51. Squire D. McG, Müller W, Müller H, Raki J (1999) Content-based query of image databases, inspirations from text retrieval: inverted files, frequency-based weights and relevance feedback. In: The 11th Scandinavian Conference on Image Analysis. Kangerlussuaq, Greenland, pp 143–149

  52. Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290:2319–2323

  53. Theodoridis S, Koutroumbas K (1999) Pattern recognition. Academic, London

  54. Torkkola K, Campbell W (2000) Mutual information in learning feature transformations. In: Proceedings 17th international conference on machine learning, pp 1015–1022

  55. Trafalis TB, Malyscheff AM (2002) An analytic center machine. Mach Learn 46(1–3):203–223

  56. van Deun K, Groenen PJF (2003) Majorization algorithms for inspecting circles, ellipses, squares, rectangles, and rhombi. Technical report, Econometric Institute Report EI 2003-35

  57. Vapnik VN (1995) The nature of statistical learning theory. Springer, New York

  58. Vapnik VN (1998) Statistical learning theory. Wiley, New-York

  59. Watanabe H, Yamaguchi T, Katagiri S (1997) Discriminative metric design for robust pattern recognition. IEEE Trans Signal Process 45(11):2655–2661

  60. Webb A (1995) Multidimensional scaling by iterative majorization using radial basis functions. Pattern Recogn 28(5):753–759

  61. Weiss S, Kulikowski C (1991) Computer systems that learn. Morgan Kaufmann, San Francisco

  62. Zhou X, Huang T (2001) Comparing discriminating transformations and SVM for learning during multimedia retrieval. In: Proceedings of the 9th ACM international conference on multimedia. Ottawa, Canada, pp 137–146

  63. Zhou X, Huang T (2001) Small sample learning during multimedia retrieval using BiasMap. In: IEEE computer vision and pattern recognition (CVPR’01), Hawaii

Author information

Corresponding author

Correspondence to Serhiy Kosinov.

Appendix

This section focuses on the intuition behind the definitions of design matrices R and G specified in (14) and (22). The derivations listed here are mostly based on those developed for the SMACOF multi-dimensional scaling algorithm [6].

Let us consider matrix R, which is used in the calculation of the majorizing expression of \(S_W(T)\), the weighted sum of within-distances. In the derivations that follow, we will assume all weights to be equal to unity, and show afterwards how this assumption can easily be corrected for. We thus begin by rewriting a squared within-distance in vector form:

$$ \left(d_{ij}^W(T)\right)^2 = \sum_{a=1}^m (x_{ia}^{\prime}-x_{ja}^{\prime})^2 = ({\mathbf {x}}_i^{\prime} - {{\mathbf{x}}}_j^{\prime})({{\mathbf{x}}}_i^{\prime} - {{\mathbf{x}}}_j^{\prime})^{\rm T}, $$
(50)

where \({\mathbf{x}}_i^{\prime}\) and \({\mathbf{x}}_j^{\prime}\) denote rows i and j of matrix \(X^{\prime} = XT\), representing the corresponding observations transformed by T. Noticing that \({\mathbf{x}}_i^{\prime} - {\mathbf{x}}_j^{\prime} = (e_i-e_j)^{\rm T} X^{\prime}\), (50) becomes:

$$ \begin{aligned} \left(d_{ij}^W(T)\right)^2 &= (e_i-e_j)^{\rm T} X^{\prime} X^{\prime {\rm T}} (e_i-e_j)\\ &= {{\mathbf{tr}}} \left(X^{\prime {\rm T}} (e_i-e_j) (e_i-e_j)^{\rm T} X^{\prime}\right)\\ & = {{\mathbf{tr}}} \left(X^{\prime {\rm T}} A_{ij} X^{\prime} \right), \end{aligned} $$
(51)

where \(A_{ij}\) is a square symmetric matrix whose elements are all zeros, except for the four entries indexed by the combinations of i and j, which are either 1 (diagonal) or −1 (off-diagonal). For instance, \(A_{13}\) for i = 1, j = 3 and \(N_X = 3\) has the following form:

$$ A_{13} = \left[{\begin{array}{*{20}c} 1 & 0 & -1\\ 0 & 0 & 0\\ -1 & 0 & 1\\ \end{array}}\right]. $$
(52)
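As a quick check of (50)–(52), the following NumPy sketch (an illustration added here, not part of the original derivation; the data matrix is hypothetical) constructs \(A_{ij}\) from the unit vectors \(e_i, e_j\) and verifies that \({\mathbf{tr}}(X^{\prime{\rm T}} A_{ij} X^{\prime})\) reproduces the squared within-distance:

```python
import numpy as np

# Hypothetical transformed data X' = XT with N_X = 3 observations in 2 dimensions.
Xp = np.array([[0.0, 1.0],
               [2.0, 0.5],
               [1.0, 3.0]])
N_X = Xp.shape[0]

def A(i, j, n):
    """A_ij = (e_i - e_j)(e_i - e_j)^T, as in (51)-(52); i, j are 0-based here."""
    e = np.zeros(n)
    e[i], e[j] = 1.0, -1.0
    return np.outer(e, e)

print(A(0, 2, N_X))                            # reproduces A_13 of (52)

lhs = np.sum((Xp[0] - Xp[2]) ** 2)             # squared within-distance, (50)
rhs = np.trace(Xp.T @ A(0, 2, N_X) @ Xp)       # trace form, (51)
assert np.isclose(lhs, rhs)
```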

Taking into account (51), the sum of the squared within-distances can be expressed as:

$$ \begin{aligned} \sum_{i < j}^{N_X} \left(d_{ij}^W(T)\right)^2 &= \sum_{i < j}^{N_X} {{\mathbf{tr}}} \left(X^{\prime {\rm T}} A_{ij} X^{\prime}\right)\\ & = {{\mathbf{tr}}}\left(X^{\prime {\rm T}} V X^{\prime}\right)\\ &= {{\mathbf{tr}}} \left(T^{\rm T} X^{\rm T} V X T\right), \end{aligned} $$
(53)

where \(V=\sum_{i < j}^{N_X} A_{ij},\) for which there exists an easy computational shortcut. Namely, V is obtained by placing −1 in all off-diagonal entries of the matrix, while each diagonal element is calculated as the negated sum of the off-diagonal values in its row (or column). That is:

$$ v_{ij} = \left\{{\begin{array}{ll} -1,& \hbox{if }\, i \ne j;\\ -\sum\limits_{k=1,k \ne i}^{N_X} v_{ik} = N_X-1, & \hbox{if}\,i = j;\\ \end{array}} \right. $$
(54)

For instance, coming back to our previous \(N_X = 3\) example, this technique produces:

$$ \begin{aligned} V &= \sum_{i < j}^{N_X=3} A_{ij}\\ &= \left(\left[{\begin{array}{*{20}c} 1 &-1 & 0\\ -1 & 1 & 0\\ 0 & 0 & 0\\ \end{array}}\right] + \left[ {\begin{array}{*{20}c} 1 & 0 & -1\\ 0 & 0 & 0\\ -1 & 0 & 1\\ \end{array}} \right] + \left[{\begin{array}{*{20}c} 0 & 0 & 0\\ 0 & 1 &-1\\ 0 &-1 & 1\\ \end{array}} \right]\right)\\ & = \left[{\begin{array}{*{20}c} 2 & -1 & -1\\ -1 & 2 & -1\\ -1 & -1 &2\\ \end{array}} \right]. \end{aligned} $$
(55)
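The shortcut (54)–(55) is easy to reproduce numerically. The sketch below is illustrative only: the function name build_V and the random test data are hypothetical additions. It builds V by the stated rule and checks identity (53):

```python
import numpy as np

def build_V(n, weights=None):
    """Construct V = sum_{i<j} A_ij via the shortcut (54): -1 (or the negated
    weight) in every off-diagonal entry, and the negated off-diagonal row sums
    on the diagonal."""
    W = np.ones((n, n)) if weights is None else np.asarray(weights, dtype=float)
    V = -W                                  # new array; off-diagonal entries -w_ij
    np.fill_diagonal(V, 0.0)
    np.fill_diagonal(V, -V.sum(axis=1))     # diagonal: negated row sums
    return V

print(build_V(3))   # [[ 2. -1. -1.] [-1.  2. -1.] [-1. -1.  2.]], as in (55)

# Verify (53): the sum of squared within-distances equals tr(X'^T V X').
Xp = np.random.default_rng(0).normal(size=(3, 2))   # hypothetical transformed data
direct = sum(np.sum((Xp[i] - Xp[j]) ** 2) for i in range(3) for j in range(i + 1, 3))
assert np.isclose(direct, np.trace(Xp.T @ build_V(3) @ Xp))
```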

It is not difficult to see that the same result applies to the case of non-unitary weights associated with each distance, the only difference being that instead of −1, the off-diagonal elements of V take the negated values of the corresponding weights. This is exactly how the matrix formulation of \(\mu_{S_W}(T,{\bar T}),\) (15), and design matrix R, (14), are obtained:

$$ \begin{aligned} \mu_{S_W}(T,{\bar T}) &= \sum_{i < j}^{N_X} \frac{{\bar w}_{ij}\cdot\left(d^W_{ij}(T)\right)^2}{2\Psi\left(d^W_{ij}({\bar T})\right)} + K_1\\ &= \sum_{i < j}^{N_X} \frac{{\bar w}_{ij}} {\Psi\left(d^W_{ij}({\bar T})\right)} \left[\frac{1}{2} {{\mathbf{tr}}} \left(T^{\rm T} X^{\rm T} A_{ij} X T \right) + K_1^{\prime} \right]\\ & = \frac{1}{2}{{\mathbf{tr}}} \left(T^{\rm T} X^{\rm T} \sum_{i < j}^{N_X} \frac{{\bar w}_{ij}} {\Psi\left(d^W_{ij}({\bar T}) \right)} A_{ij} X T \right) + K_1\\ & =\frac{1}{2}{{\mathbf{tr}}} \left(T^{\rm T} X^{\rm T} R X T \right) + K_1\\ \end{aligned} $$
(56)
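A corresponding sketch for R follows. Its details are hypothetical: the function name build_R, its argument names, and the placeholder psi (standing in for the function Ψ of the main text, which is not reproduced in this appendix) are assumptions made purely for illustration; only the assembly rule itself follows (14)/(56).

```python
import numpy as np

def build_R(Xbar_prime, W=None, psi=lambda d: np.maximum(d, 1e-12)):
    """Assemble R as in (14)/(56): off-diagonal entries -w_ij / Psi(d_ij(Tbar)),
    diagonal entries equal to the negated off-diagonal row sums.
    `psi` is a placeholder for the Psi of the main text (here merely a floor
    that avoids division by zero)."""
    n = Xbar_prime.shape[0]
    diffs = Xbar_prime[:, None, :] - Xbar_prime[None, :, :]
    D = np.sqrt((diffs ** 2).sum(axis=-1))          # within-distances d_ij(Tbar)
    W = np.ones((n, n)) if W is None else np.asarray(W, dtype=float)
    C = W / psi(D)                                  # per-pair coefficients
    np.fill_diagonal(C, 0.0)                        # only i != j pairs contribute
    R = -C
    np.fill_diagonal(R, C.sum(axis=1))              # negated off-diagonal row sums
    return R

# Usage (hypothetical): R = build_R(X @ T_bar), with X the data matrix and
# T_bar the current supporting point.
```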

In order to derive the formulation of matrix G, as specified for the majorizer of \(-S_B(T)\) based on the Taylor series expansion (23), we rewrite (22) using the same techniques as in (51), arriving at:

$$ -d_{ij}^B(T) \le - \frac{{{\mathbf{tr}}}\left(T^{\rm T} Z^{\rm T} C_{ij} Z {\bar T} \right) }{d^B_{ij}({\bar T})}, $$
(57)

where \(C_{ij} = (e_i - e_{N_X+j})(e_i - e_{N_X+j})^{\rm T}\) is the between-class analog of matrix \(A_{ij}\). From (55), it is apparent that the same type of computational shortcut used above to obtain V can be exploited here as well. Indeed, matrix \(F=\sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} C_{ij}\) can be quickly constructed by placing −1 in the off-diagonal elements that correspond to the index locations of the between-distances, and subsequently summing with negation to obtain the diagonal entries. An illustration of the technique for \(N_X = 2\), \(N_Y = 3\) is shown below:

$$ \begin{aligned} F &= \sum_{i=1}^{N_X=2} \sum_{j=1}^{N_Y=3} C_{ij}\\ &=\left[{\begin{array}{*{20}c} 3 & 0 & -1 & -1 & -1\\ 0 & 3 & -1 & -1 & -1\\ -1 &-1 & 2 & 0 & 0 \\ -1 & -1 & 0 & 2 & 0\\ -1 & -1 & 0 & 0 & 2\\ \end{array}} \right]. \end{aligned} $$
(58)

This is the case of unitary weights. Again, the extension to the non-unitary weight formulation is straightforward: the off-diagonal entries are pre-multiplied by the appropriate quantities, which in the case of G are the reciprocals of the squares of the corresponding distances, as shown in (23).
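To make the construction concrete, the sketch below (again illustrative: build_F and its arguments are hypothetical names, and the weighted call for G merely follows the reciprocal-squared-distance rule quoted above from (23)) assembles F by the stated shortcut and reproduces the 5 × 5 matrix of (58) for \(N_X = 2\), \(N_Y = 3\):

```python
import numpy as np

def build_F(n_x, n_y, weights=None):
    """Construct F = sum_{i,j} C_ij as in (58): -1 (or the negated weight) in
    the off-diagonal cells that pair an X observation with a Y observation,
    and the negated off-diagonal row sums on the diagonal."""
    n = n_x + n_y
    W = np.ones((n_x, n_y)) if weights is None else np.asarray(weights, dtype=float)
    F = np.zeros((n, n))
    F[:n_x, n_x:] = -W                      # X-to-Y between-distance locations
    F[n_x:, :n_x] = -W.T                    # symmetric counterpart
    np.fill_diagonal(F, -F.sum(axis=1))     # diagonal: negated off-diagonal row sums
    return F

print(build_F(2, 3))                        # reproduces the 5x5 matrix of (58)

# Hypothetical weighted variant for G, following the rule stated above:
# G = build_F(n_x, n_y, weights=1.0 / d_between**2)
# where d_between is an (n_x, n_y) array of between-distances d^B_ij(Tbar).
```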

About this article

Cite this article

Kosinov, S., Pun, T. Distance-based discriminant analysis method and its applications. Pattern Anal Applic 11, 227–246 (2008). https://doi.org/10.1007/s10044-007-0082-x
