Abstract
This paper proposes a method for finding a discriminative linear transformation that enhances the data's degree of conformance to the compactness hypothesis and its inverse. The problem formulation relies on inter-observation distances only, and the resulting transformation is shown to improve the performance of non-parametric and non-linear classifiers on benchmark and real-world data sets. The proposed approach is suitable for both binary and multiple-category classification problems, and can be applied as a dimensionality reduction technique; in the latter case, the number of necessary discriminative dimensions can be determined exactly. Also considered is a kernel-based extension of the proposed discriminant analysis method, which overcomes the linearity assumption imposed on the sought discriminative transformation by the initial formulation. This enhancement allows the proposed method to be applied to non-linear classification problems and has the additional benefit of being able to accommodate indefinite kernels.
Notes
Here and in several other places we will use the shorthand \(\prod_{i < j}^{N_X}\) to denote the double product \(\prod_{i=1}^{N_X} \prod_{j=i+1}^{N_X}.\)
For instance, the DDA formulation applies naturally to cases where there is no strict class separability, whereas the ACM method fails because the version space becomes an empty set.
Similar notation will be used further on: a bar over a variable name signifies that the variable either depends on, or is itself, a supporting point.
A more detailed analysis may show that resorting to the Taylor series approximation can violate the majorization requirements in the strict sense. However, the empirical evidence suggests otherwise (see Sect. 7), confirming the technique as the preferred alternative.
The elements g ij of matrix G not affected by the first two rules of (23) are assumed to have been initially set to zero.
A word of caution is in order regarding the choice k = 1, which corresponds to an ill-posed combinatorial problem [6].
References
Arkadev A, Braverman E (1966) Computers and pattern recognition. Thompson, Washington, DC
Bahlmann C, Haasdonk B, Burkhardt H (2002) On-line handwriting recognition with support vector machines—a kernel approach. In: Eighth International Workshop on Frontiers in Handwriting Recognition. Ontario, Canada
Bartlett P (1997) For valid generalization, the size of the weights is more important than the size of the network. Adv Neural Inform Process Syst 9:134–140
Bertero M, Boccacci P (1998) Introduction to inverse problems in imaging. Institute of Physics Publishing
Blake CL, Merz CJ (1998) UCI repository of machine learning databases
Borg I, Groenen PJF (1997) Modern multidimensional scaling. Springer, New York
Bressan M, Vitrià J (2003) Nonparametric discriminant analysis and nearest neighbor classification. Pattern Recogn Lett 24(15):2743–2749
Chow GC (1960) Tests of equality between sets of coefficients in two linear regressions. Econometrica 28(3)
Commandeur J, Groenen PJF, Meulman J (1999) A distance-based variety of non-linear multivariate data analysis, including weights for objects and variables. Psychometrika 64(2):169–186
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge
DeCoste D, Schölkopf B (2002) Training invariant support vector machines. Mach Learn 46(1–3):161–190
Dietterich TG (2000) Ensemble methods in machine learning. In: Kittler J, Roli F (eds) First international workshop on multiple classifier systems. Springer, Heidelberg, pp 1–15
Dinkelbach W (1967) On nonlinear fractional programming. Manage Sci A(13):492–498
Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
Dunn D, Higgins WE, Wakeley J (1994) Texture segmentation using 2-D Gabor elementary functions. IEEE Trans Pattern Anal Mach Intell 16(2):130–149
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7:179–188
Fix E, Hodges J (1951) Discriminatory analysis: nonparametric discrimination: consistency properties. Technical Report 4, USAF School of Aviation Medicine
Fletcher R (1987) Practical methods of optimization. Wiley, Chichester
Fogel I, Sagi D (1989) Gabor filters as texture discriminator. Biol Cybern 61:103–113
Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Academic, New York
Fukunaga K, Mantock J (1983) Nonparametric discriminant analysis. IEEE Trans Pattern Anal Mach Intell 5(6):671–678
Gentle J (1998) Numerical linear algebra for applications in statistics. Springer, Berlin
Haasdonk B (2005) Feature space interpretation of SVMs with indefinite kernels. IEEE Trans Pattern Anal Mach Intell 27(4):482–492
Haasdonk B, Keysers D (2002) Tangent distance kernels for support vector machines. In: Proceedings of the 16th ICPR, pp 864–868
Haasdonk B, Bahlmann C (2004) Learning with distance substitution kernels. In: 26th Pattern Recognition Symposium of the German Association for Pattern Recognition (DAGM 2004). Springer, Tübingen, Germany
Han J, Ma K-K (2007) Rotation-invariant and scale-invariant Gabor features for texture image retrieval. Image Vis Comput 25(9):1474–1481
Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell 18(6):607–616
Heiser W (1995) Convergent computation by iterative majorization: theory and applications in multidimensional data analysis. Recent advances in descriptive multivariate analysis, pp. 157–189
Huber P (1964) Robust estimation of a location parameter. Ann Math Stat 35:73–101
Kiers HAL (1990) Majorization as a tool for optimizing a class of matrix functions. Psychometrika 55:417–428
Krogh A, Hertz JA (1992) A simple weight decay can improve generalization. In: Moody JE, Hanson SJ, Lippmann RP (eds) Advances in neural information processing systems, Vol 4. Morgan Kaufmann, San Francisco, pp 950–957
Lawrence S, Giles C (2000) Overfitting and neural networks: conjugate gradient and backpropagation. In: Proceedings of the IEEE international conference on neural networks. IEEE Press, pp 114–119
De Leeuw J (1993) Fitting distances by least squares. Technical Report 130, Interdivisional Program in Statistics, UCLA, Los Angeles
Leibe B, Schiele B (2003) Analyzing appearance and contour based methods for object categorization. In: International conference on computer vision and pattern recognition (CVPR’03). Madison, WI, pp 409–415
Li Y, Shapiro LG (2004) Object recognition for content-based image retrieval. In: Lecture Notes in Computer Science. Springer, Heidelberg
Lin H-T, Lin C-J (2003) A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan. Available at http://www.csie.ntu.edu.tw/~cjlin/papers/tanh.pdf
Mary X (2003) Sous-espaces hilbertiens, sous-dualités et applications. PhD thesis, Institut National des Sciences Appliquées de Rouen (INSA Rouen), ASI-PSI
Masip D, Kuncheva LI, Vitrià J (2005) An ensemble-based method for linear feature extraction for two-class problems. Pattern Anal Appl 8(3):227–237
Mitchell T (1997) Machine learning. McGraw-Hill, New York
Moré JJ, Sorensen DC (1983) Computing a trust region step. SIAM J Sci Stat Comput 4(3):553–572
Moreno PJ, Ho PP, Vasconcelos N (2004) A Kullback–Leibler divergence based kernel for SVM classification in multimedia applications. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems, Vol 16. MIT, Cambridge
Nesterov Y, Nemirovskii A (1994) Interior point polynomial methods in convex programming: theory and applications. Society for Industrial and Applied Mathematics, Philadelphia
Ong CS, Smola AJ, Williamson RC (2002) Hyperkernels. In: Neural information processing systems, Vol 15. MIT, Cambridge
Ong CS, Mary X, Canu S, Smola AJ (2004) Learning with non-positive kernels. In: ICML ’04: Proceedings of the twenty-first international conference on Machine learning. ACM
Paredes R, Vidal E (2000) A class-dependent weighted dissimilarity measure for nearest neighbor classification problems. Pattern Recogn Lett 21(12):1027–1036
Rojas M, Santos S, Sorensen D (2000) A new matrix-free algorithm for the large-scale trust-region subproblem. SIAM J Optim 11(3):611–646
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323–2326
Schölkopf B (2001) The kernel trick for distances. In: Leen TK, Dietterich TG, Tresp V (eds) Advances in neural information processing systems, Vol 13. MIT, Cambridge, pp 301–307
Shen L, Bai L (2006) A review on gabor wavelets for face recognition. Pattern Anal Appl 9(2–3):273–292
Smith JR, Chang S-F (1996) Tools and techniques for color image retrieval. In: Storage and Retrieval for Image and Video Databases (SPIE), pp 426–437
Squire D. McG, Müller W, Müller H, Raki J (1999) Content-based query of image databases, inspirations from text retrieval: inverted files, frequency-based weights and relevance feedback. In: The 11th Scandinavian Conference on Image Analysis. Kangerlussuaq, Greenland, pp 143–149
Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290:2319–2323
Theodoridis S, Koutroumbas K (1999) Pattern recognition. Academic, London
Torkkola K, Campbell W (2000) Mutual information in learning feature transformations. In: Proceedings 17th international conference on machine learning, pp 1015–1022
Trafalis TB, Malyscheff AM (2002) An analytic center machine. Mach Learn 46(1–3):203–223
van Deun K, Groenen PJF (2003) Majorization algorithms for inspecting circles, ellipses, squares, rectangles, and rhombi. Technical report, Econometric Institute Report EI 2003-35
Vapnik VN (1995) The nature of statistical learning theory. Springer, New York
Vapnik VN (1998) Statistical learning theory. Wiley, New-York
Watanabe H, Yamaguchi T, Katagiri S (1997) Discriminative metric design for robust pattern recognition. IEEE Trans Signal Process 45(11):2655–2661
Webb A (1995) Multidimensional scaling by iterative majorization using radial basis functions. Pattern Recogn 28(5):753–759
Weiss S, Kulikowski C (1991) Computer systems that learn. Morgan Kaufmann, San Francisco
Zhou X, Huang T (2001) Comparing discriminating transformations and SVM for learning during multimedia retrieval. In: Proceedings of the 9th ACM international conference on multimedia. Ottawa, Canada, pp 137–146
Zhou X, Huang T (2001) Small sample learning during multimedia retrieval using BiasMap. In: IEEE computer vision and pattern recognition (CVPR’01), Hawaii
Appendix
This section focuses on the intuition behind the definitions of design matrices R and G specified in (14) and (22). The derivations listed here are mostly based on those developed for the SMACOF multi-dimensional scaling algorithm [6].
Let us consider matrix R, which is used in calculating the majorizing expression of S W (T), represented by a weighted sum of within-distances. In the derivations that follow, we will assume all weights equal to unity, and show afterwards how this assumption can easily be corrected for. We thus begin by rewriting a squared within-distance in vector form:

\(d_{ij}^2(X') = \left(x'_i - x'_j\right)\left(x'_i - x'_j\right)^{\top}, \qquad (50)\)

where x ′ i and x ′ j denote rows i and j of matrix X′ = XT, representing the corresponding observations transformed by T. Noticing that \(x'_i - x'_j = (e_i - e_j)^{\top} X'\), (50) becomes:

\(d_{ij}^2(X') = (e_i - e_j)^{\top} X' X'^{\top} (e_i - e_j) = \operatorname{tr}\left(X'^{\top} A_{ij} X'\right), \qquad A_{ij} = (e_i - e_j)(e_i - e_j)^{\top}, \qquad (51)\)
where A ij is a square symmetric matrix whose elements are all zeros, except for the four indexed by the combinations of i and j, which equal 1 (diagonal) or −1 (off-diagonal). For instance, A 13 for i = 1, j = 3 and N X = 3 has the following form:

\(A_{13} = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 1 \end{pmatrix}\)
Taking into account (51), the sum of the squared within-distances can be expressed as:

\(\sum_{i < j}^{N_X} d_{ij}^2(X') = \operatorname{tr}\left(X'^{\top} V X'\right),\)

where \(V=\sum_{i < j}^{N_X} A_{ij},\) for which there exists an easy computational shortcut. Namely, V is obtained by placing −1 in all off-diagonal entries of the matrix, while the diagonal elements are calculated as the negated sums of the corresponding off-diagonal values in their rows or columns. That is:

\(v_{ij} = -1 \;\; (i \neq j), \qquad v_{ii} = -\sum_{j \neq i} v_{ij} = N_X - 1.\)
For instance, returning to our previous N X = 3 example, this technique produces:

\(V = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{pmatrix}\)
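The shortcut can be checked numerically. The sketch below (plain Python, illustrative only; it is not part of the paper's implementation, and uses zero-based indices in place of the paper's one-based ones) builds V both as the direct double sum of the A ij matrices and via the shortcut, and confirms the two agree:

```python
def a_matrix(i, j, n):
    # A_ij: all zeros except a_ii = a_jj = 1 and a_ij = a_ji = -1
    a = [[0] * n for _ in range(n)]
    a[i][i] = a[j][j] = 1
    a[i][j] = a[j][i] = -1
    return a

def v_direct(n):
    # V as the double sum of A_ij over all pairs i < j
    v = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            aij = a_matrix(i, j, n)
            for r in range(n):
                for c in range(n):
                    v[r][c] += aij[r][c]
    return v

def v_shortcut(n):
    # Shortcut: -1 in every off-diagonal entry; each diagonal entry is the
    # negated sum of the off-diagonal entries in its row, i.e. n - 1
    return [[n - 1 if r == c else -1 for c in range(n)] for r in range(n)]

# The N_X = 3 example from the text
assert v_direct(3) == v_shortcut(3) == [[2, -1, -1], [-1, 2, -1], [-1, -1, 2]]
```

The equality holds for any N X, since every index participates in exactly N X − 1 pairs.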
It is not difficult to see that the same result carries over to the case of non-unitary weights associated with each distance, the only difference being that instead of −1, the off-diagonal elements of V hold the negated values of the corresponding weights. This is exactly how the matrix formulation of \(\mu_{S_W}(T,{\bar T})\) in (15) and the design matrix R in (14) are obtained.
In order to derive the formulation of matrix G, as specified for the majorizer of −S B (T) based on the Taylor series expansion (23), we rewrite (22) using the same technique as in (51), arriving at:

\(d^2(x'_i, y'_j) = \operatorname{tr}\left(Z'^{\top} C_{ij} Z'\right), \qquad (55)\)

where Z′ stacks the transformed observations of both classes, and C ij = (e i −e N_X+j ) (e i −e N_X+j )T is a between-class analog of matrix A ij . From (55), it is apparent that the same type of computational shortcut used above to obtain V may be exploited here too. Indeed, matrix \(F=\sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} C_{ij}\) can be quickly constructed by placing −1 in the off-diagonal elements that correspond to the index locations of the between-distances, and then summing with negation to obtain the diagonal entries. An illustration of the technique for N X = 2, N Y = 3:

\(F = \begin{pmatrix} 3 & 0 & -1 & -1 & -1 \\ 0 & 3 & -1 & -1 & -1 \\ -1 & -1 & 2 & 0 & 0 \\ -1 & -1 & 0 & 2 & 0 \\ -1 & -1 & 0 & 0 & 2 \end{pmatrix}\)
The example above assumes unitary weights. Again, the extension to the non-unitary weight formulation is trivial: it involves pre-multiplying the off-diagonal entries by the appropriate quantities, which in the case of G are the reciprocals of the squares of the corresponding distances, as shown in (23).
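The between-class shortcut admits the same kind of numerical check as the within-class one. The sketch below (plain Python, illustrative only, zero-based indices) builds F both as the double sum of the C ij matrices and via the shortcut:

```python
def c_matrix(i, j, nx, ny):
    # C_ij = (e_i - e_{nx+j})(e_i - e_{nx+j})^T, an (nx+ny) x (nx+ny) matrix
    n = nx + ny
    e = [0] * n
    e[i] = 1          # observation i from class X
    e[nx + j] = -1    # observation j from class Y
    return [[e[r] * e[c] for c in range(n)] for r in range(n)]

def f_direct(nx, ny):
    # F as the double sum of C_ij over i = 0..nx-1, j = 0..ny-1
    n = nx + ny
    f = [[0] * n for _ in range(n)]
    for i in range(nx):
        for j in range(ny):
            cij = c_matrix(i, j, nx, ny)
            for r in range(n):
                for c in range(n):
                    f[r][c] += cij[r][c]
    return f

def f_shortcut(nx, ny):
    # Shortcut: -1 at every between-class index pair; diagonal entries are
    # the negated sums of the off-diagonal entries in their rows
    n = nx + ny
    f = [[0] * n for _ in range(n)]
    for i in range(nx):
        for j in range(ny):
            f[i][nx + j] = f[nx + j][i] = -1
    for r in range(n):
        f[r][r] = -sum(f[r][c] for c in range(n) if c != r)
    return f

assert f_direct(2, 3) == f_shortcut(2, 3)  # the N_X = 2, N_Y = 3 case from the text
```

Each X-row of F carries N Y on its diagonal and each Y-row carries N X, since that is how many between-distances the corresponding observation participates in.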
Kosinov, S., Pun, T. Distance-based discriminant analysis method and its applications. Pattern Anal Applic 11, 227–246 (2008). https://doi.org/10.1007/s10044-007-0082-x