
Model-based classification with dissimilarities: a maximum likelihood approach

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

Most classification problems concern applications where the objects lie in a Euclidean space, but in some situations only dissimilarities between objects are known. We are concerned with supervised classification from an observed dissimilarity table, the task being to classify new, unobserved or implicit objects (known only through their dissimilarities to previously classified objects forming the training data set) into predefined classes. This work concentrates on developing model-based classifiers for dissimilarities which take into account the measurement error with respect to Euclidean distance. Basically, it is assumed that the unobserved objects are unknown parameters to estimate in a Euclidean space, and that the observed dissimilarity table is a random, Gaussian-type perturbation of their Euclidean distances. Allowing the distribution of these perturbations to vary across pairs of classes in the population leads to more flexible classification methods than the usual algorithms. Model parameters are estimated from the training data set via the maximum likelihood (ML) method, and allocation is done by assigning a new implicit object to the group in the population, and the position in the Euclidean space, maximizing the conditional group likelihood with the estimated parameters. This point of view can be expected to be useful in classifying dissimilarity tables that are no longer Euclidean due to measurement error or instabilities of various types. Two possible structures are postulated for the error, resulting in two different model-based classifiers. First results on real and simulated data sets show interesting behavior of the two proposed algorithms, and the respective effects of the dissimilarity type and of the data intrinsic dimension are investigated. For these latter two aspects, one of the constructed classifiers appears to be very promising. Interestingly, the data intrinsic dimension seems to have a much less adverse effect on our classifiers than initially feared, at least for small to moderate dimensions.


References

  1. Balachander T, Kothari R (1999) Introducing locality and softness in subspace classification. Pattern Anal Appl 2(1):53–58

  2. Borg I, Groenen PJF (1997) Modern multidimensional scaling. Theory and applications. Springer Series in Statistics. Springer, New York

  3. Bottigli U, Golosio B, Masala GL, Oliva P, Stumbo S, Cascio D, Fauci F, Magro R, Raso G, Vasile M, Bellotti R, De Carlo F, Tangaro S, De Mitri I, De Nunzio G, Quarta M, Preite Martinez A, Tata A, Cerello P, Cheran SC, Lopez Torres E (2006) Dissimilarity application in digitized mammographic images classification. J Syst Cybern Inf 4(3):18–22

  4. Bozdogan H (1993) Choosing the number of component clusters in the mixture-model using a new informational complexity criterion of the inverse-Fisher information matrix. In: Opitz O, Lausen B, Klar R (eds) Studies in classification, data analysis, and knowledge organization. Springer, Heidelberg, pp 40–54

  5. Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth, Belmont

  6. Celeux G (2003) Analyse discriminante. In: Govaert G (ed) Analyse des données. Lavoisier, Paris, pp 201–234

  7. Chang C-C, Lin C-J (2003) LIBSVM: a library for support vector machines. Technical Report, National Taiwan University, Taipei, Taiwan. http://www.csie.ntu.edu.tw/~cjlin/libsvm

  8. Dickinson PJ, Bunke H, Dadej A, Kraetzl M (2004) Object-based image content characterisation for semantic-level image similarity calculation. Pattern Anal Appl 7(3):243–254

  9. Dimitriadou E, Hornik K, Leisch F, Meyer D (2006) e1071: misc functions of the Department of Statistics (e1071). R package, version 1.5–16, TU Wien, Vienna, Austria

  10. Duin RPW, Pekalska E, Paclík P, Tax DMJ (2004) The dissimilarity representation, a basis for domain based pattern recognition? Representations in pattern recognition, IAPR Workshop, Cambridge, pp 43–56

  11. Fournier J, Cord M, Philipp-Foliguet S (2001) RETIN: a content-based image indexing and retrieval system. Pattern Anal Appl 4(2–3):153–173

  12. Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Computer Science and Scientific Computing Series. Academic Press Inc, Boston

  13. Glunt W, Hayden TL, Liu W-M (1991) The embedding problem for predistance matrices. Bull Math Biol 53:769–796

  14. Guérin-Dugué A, Celeux G (2001) Discriminant analysis on dissimilarity data: a new fast Gaussian like algorithm. AISTATS 2001, Florida

  15. Guérin-Dugué A, Oliva A (2000) Classification of scene photographs from local orientation features. Preprint

  16. Guttman L (1968) A general nonmetric technique for finding the smallest coordinate space for a configuration of points. Psychometrika 33:469–506

  17. Haasdonk B, Bahlmann C (2004) Learning with distance substitution kernels. In: Proceedings of 26th DAGM symposium (Tübingen, Germany). Springer, Berlin, pp 220–227

  18. Harol A, Pekalska E, Verzakov S, Duin RPW (2006) Augmented embedding of dissimilarity data into (pseudo-)Euclidean spaces. Joint IAPR international workshops on statistical and structural pattern recognition (Hong Kong, China). Lect Notes Comp Sci 4109:613–621

  19. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Data mining, inference and prediction. Springer Series in Statistics, Springer, New York. http://www-stat.stanford.edu/~tibs/ElemStatLearn

  20. Heiser WJ, de Leeuw J (1986) SMACOF-I. Technical Report UG-86-02 Department of Data Theory, University of Leiden, Leiden, The Netherlands

  21. Higham NJ (2002) Accuracy and stability of numerical algorithms, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia

  22. Kearsley AJ, Tapia RA, Trosset MW (1998) The solution of the metric STRESS and SSTRESS problems in multidimensional scaling using Newton’s method. Comput Stat 13(3):369–396

  23. Le Cun Y, Boser B, Denker J, Henderson D, Howard R, Hubbard W, Jackel L (1990) Handwritten digit recognition with a back-propagation network. In: Touretzky D (ed) Advances in neural information processing systems, vol 2. Morgan Kaufman, Denver

  24. de Leeuw J (1988) Convergence of the majorization method for multidimensional scaling. J Classification 5:163–180

  25. Lozano M, Sotoca JM, Sánchez JS, Pla F, Pekalska E, Duin RPW (2006) Experimental study on prototype optimisation algorithms for prototype-based classification in vector spaces. Pattern Recogn 39:1827–1838

  26. Malone SW, Tarazaga P, Trosset MW (2002) Better initial configurations for metric multidimensional scaling. Comput Stat Data Anal 41:143–156

  27. Malone SW, Trosset MW (2000) Optimal dilations for metric multidimensional scaling. In: 2000 Proceedings of the statistical computing section and section on statistical graphics. American Statistical Association, Alexandria

  28. Martins A, Figueiredo M, Aguiar P (2007) Kernels and similarity measures for text classification. In: 6th Conference on telecommunications—ConfTele’2007, Peniche, Portugal

  29. Masala GL (2006) Pattern recognition techniques applied to biomedical patterns. Int J Biomed Sci 1(1):47–55

  30. Orozco M, García ME, Duin RPW, Castellanos CG (2006) Dissimilarity-based classification of seismic signals at Nevado del Ruiz Volcano. Earth Sci Res J 10(2):57–65

  31. Paclík P, Duin RPW (2003) Dissimilarity-based classification of spectra: computational issues. Real-Time Imaging 9:237–244

  32. Pekalska E, Duin RPW (2000) Classifiers for dissimilarity-based pattern recognition. In: Sanfeliu A, Villanueva JJ, Vanrell M, Alquezar R, Jain AK (eds) Proceedings of the 15th international conference on pattern recognition (Barcelona, Spain), vol 2: Pattern recognition and neural networks. IEEE Computer Society Press, Los Alamitos, pp 12–16

  33. Pekalska E, Duin RPW (2000) Classification on dissimilarity data: a first look. In: Van Vliet LJ, Heinjnsdijk JWJ, Kielman T, Knijnenburg PMW (eds) Proceedings of the annual conference of the advanced school for computing and imaging (Lommel, Belgium), Pattern recognition and neural networks. IEEE Computer Society Press, Los Alamitos, pp 221–228

  34. Pekalska E, Duin RPW (2002) Dissimilarity representations allow for building good classifiers. Pattern Recogn Lett 23(8):943–956

  35. Pekalska E, Duin RPW (2006) Dissimilarity-based classification with vectorial representations. In: Proceedings of the international conference on pattern recognition (Hong Kong), 3:137–140

  36. Pekalska E, Duin RPW, Paclík P (2006) Prototype selection for dissimilarity-based classifiers. Pattern Recogn 39(2):189–208

  37. Pekalska E, Paclík P, Duin RPW (2002) A generalized kernel approach to dissimilarity-based classification. J Mach Learn Res Spec Issue Kernel Methods 2(2):175–211

  38. The R Development Core Team (2007) R: a language and environment for statistical computing. Reference index, version 2.5.0. R Foundation for Statistical Computing

  39. Ramsay JO (1982) Some statistical approaches to multidimensional scaling data. J Roy Stat Soc Ser A 145:285–312

  40. Srisuk S, Petrou M, Kurutach W, Kadyrov A (2005) A face authentication system using the trace transform. Pattern Anal Appl 8(1–2):50–61

  41. Tolba AS, Abu-Rezq AN (1998) Arabic glove-talk (AGT): a communication aid for vocally impaired. Pattern Anal Appl 1(4):218–230

  42. Torgerson WS (1952) Multidimensional scaling: I. Theory and method. Psychometrika 17:401–419

  43. Young G, Householder AS (1938) Discussion of a set of points in terms of their mutual distances. Psychometrika 3:19–22

Acknowledgments

The authors wish to thank Gilles Celeux and Jean-Michel Marin, who introduced them to this research theme and provided invaluable insight throughout, with many fruitful discussions and useful critiques that helped shape the proposed solutions. Thanks also go to Elizabeth Gassiat, who patiently read and commented on an early draft of this work, and to Anne Guérin-Dugué, who kindly provided the Jeffreys and proteins data. Not least, the first author wishes to express warm thanks to Professors D. Dacunha-Castelle and H. Gwét for having encouraged him to engage in this research, and to Professor Jean Coursol for having introduced him to the classification and data mining areas and to the R Statistical Computing System through a course given at the Ecole Polytechnique of Yaoundé in June 2004. He is also indebted to Patrick Jakubowicz, Yves Misiti and Marc Lavielle for assistance with computing facilities at Orsay.

Author information

Corresponding author

Correspondence to Eugène-Patrice Ndong Nguéma.

Additional information

This work was started when the authors were at the Université de Paris-Sud at Orsay, the first author as a visiting researcher through a scholarship of the French Cooperation, and the second in a postdoctoral position funded by the Institut National de Recherche en Informatique et en Automatique (INRIA).

Appendix

1.1 The computation of an initial approximation \({\widehat{U}}_{k,0}\) to minimize \({\widehat{H}_{k}(U)}\)

Since we intend to compute \(U = {\widehat{U}}_{k}\), the point at which the function \({\widehat{H}_{k}(U)}\) defined by (44) reaches its minimum value, this minimizer should intuitively satisfy, as closely as possible, the approximate equalities:

$$ \|U - \widehat{X}_{i}\| \approx d_{i}, \quad \hbox{for}\; i = 1,\ldots,n, $$

which implies, by squaring, expanding the squared Euclidean norms of the differences and rearranging:

$$ 2 \langle \widehat{X}_{i}, U \rangle - \|U\|^{2} \approx \|\widehat{X}_{i}\|^{2} - d_{i}^{2} \quad (i = 1,\ldots,n), $$

or

$$ 2 \widehat{x}_{i1} u_{1} + \cdots + 2 \widehat{x}_{ip} u_{p} - u_{p+1} \approx \|\widehat{X}_{i}\|^{2} - d_{i}^{2} \quad (i = 1,\ldots,n), $$
(49)

where

$$ U = (u_{1},\ldots,u_{p})^{\rm T}, \quad u_{p+1} = \|U\|^{2}, $$

and

$$ \widehat{X}_{i} = (\widehat{x}_{i1}, \ldots, \widehat{x}_{ip}) \quad (i = 1,\ldots,n). $$

Now, (49) can be regarded as an approximate linear system of n equations in the p + 1 scalar unknowns \(u_{1},\ldots,u_{p+1}\). The unknown \(u_{p+1}\) introduces a nonlinear constraint: \(u_{p+1} = u_{1}^{2} + \cdots + u_{p}^{2}\). This constraint can be eliminated by suppressing an arbitrarily chosen equation from (49) and subtracting it from the remaining n − 1 equations, yielding an approximate linear system of n − 1 equations in the p unknowns \(u_{1},\ldots,u_{p}\), which one then solves by least squares. Instead, we chose the simpler strategy of directly solving (49) by least squares and merely discarding the estimate obtained for \(u_{p+1}\).
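For readability, the system (49) can be written in matrix form (the names A and b below are introduced here purely for convenience and are not taken from the paper):

$$ A \begin{pmatrix} u_{1} \\ \vdots \\ u_{p} \\ u_{p+1} \end{pmatrix} \approx b, \qquad A = \begin{pmatrix} 2\widehat{x}_{11} & \cdots & 2\widehat{x}_{1p} & -1 \\ \vdots & & \vdots & \vdots \\ 2\widehat{x}_{n1} & \cdots & 2\widehat{x}_{np} & -1 \end{pmatrix}, \qquad b_{i} = \|\widehat{X}_{i}\|^{2} - d_{i}^{2}. $$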

In any event, whichever of the two strategies outlined above is chosen to approximately solve (49), it is important to note that the matrix of the least squares system is entirely determined by the estimated learning configuration \({\widehat{{\mathbf{X}}}}\). Thus, its QR factorization can be computed once and for all after the Learning Phase, and reused throughout the Prediction Phase to quickly solve (49) for each new explicit observation d to classify.
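As a concrete illustration of this two-phase organization, here is a minimal NumPy sketch; it assumes the estimated learning configuration is available as an n × p array, and all function and variable names are illustrative rather than taken from the authors' implementation.

```python
import numpy as np

def precompute_ls_factorization(X_hat):
    # Matrix A of system (49): row i is [2 * X_hat_i, -1]. It depends
    # only on the estimated learning configuration, so its (reduced)
    # QR factorization is computed once, after the Learning Phase.
    n, p = X_hat.shape
    A = np.hstack([2.0 * X_hat, -np.ones((n, 1))])
    Q, R = np.linalg.qr(A)
    sq_norms = np.sum(X_hat ** 2, axis=1)   # ||X_hat_i||^2, also reusable
    return Q, R, sq_norms

def initial_position(Q, R, sq_norms, d):
    # Least squares solution of (49) for one new implicit object, given
    # the vector d of its n dissimilarities to the training objects;
    # the estimate of u_{p+1} = ||U||^2 is simply discarded.
    b = sq_norms - d ** 2
    u = np.linalg.solve(R, Q.T @ b)
    return u[:-1]
```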

However, since n ≫ p, there are obviously far more economical ways to approximately solve (49). For instance (see [22, 26]), one could retain just p + 1 equations among the n in (49) and solve a (hopefully) Cramer system for \(u_{1},\ldots,u_{p+1}\) (or for \(u_{1},\ldots,u_{p}\), by first suppressing one of the p + 1 equations and subtracting it from the p others). The reason we chose not to follow these cheaper paths is twofold:

  1. One would have to devise a cheap criterion for selecting the p + 1 equations among the initial n in (49). The two simplest choices, taking the first p + 1 equations or picking them at random, may in certain cases have undesirable effects on the numerical stability of the resulting system and/or on the convergence of the numerical minimization algorithm started from the obtained initial approximation of \({\widehat{U}}_{k}\). On the other hand, more sophisticated selection criteria may imply a significant computational overhead with no appreciable gain over using all the equations. By contrast, since n ≫ p, using all the equations almost certainly guarantees that we solve, in the least squares sense, a linear system of full rank p.

  2. By using all the equations in (49) to compute an initial approximation \({\widehat{U}}_{k,0}\) to \({\widehat{U}}_{k}\), we increase our chances of obtaining a good one, so as to guarantee fast convergence of the numerical nonlinear minimizer used to minimize the function \({\widehat{H}_{k}(U)}\) defined by (44).

However, pushing this latter argument further, one may legitimately argue that, given the structure of the function \({\widehat{H}_{k}(U)}\), weighted least squares would be the right strategy for approximately solving the suggested linear systems to obtain \({\widehat{U}}_{k,0}\). This is correct, but it would entail a serious increase in computational cost, since the weights would then vary with each implicit object U and each possible group label in {1,...,G}.
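To make the cost argument concrete, an illustrative (and deliberately generic) weighted least squares formulation would be

$$ \min_{u_{1},\ldots,u_{p+1}} \; \sum_{i=1}^{n} w_{i}(U,k) \left( 2 \langle \widehat{X}_{i}, U \rangle - u_{p+1} - \|\widehat{X}_{i}\|^{2} + d_{i}^{2} \right)^{2}, $$

where the exact form of the weights \(w_{i}(U,k)\) would follow from the error model of group k and is not reproduced here. Since these weights change with every candidate implicit object and every group label, the weighted system matrix changes at every evaluation, so no factorization can be precomputed and reused across predictions, unlike in the unweighted case above.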

1.2 Contribution and Originality

Traditional classification problems concern applications with objects lying in a Euclidean space, but, in certain situations, only some type of pairwise dissimilarity measure between objects is available. Practical applications are now numerous (e.g., [3, 8, 11, 28–31, 40, 41]). Moreover, dissimilarity measures can be of great interest for analyzing proximity between curves or objects in high-dimensional spaces, or, more generally, between objects with a complicated intrinsic structure.

None of the existing algorithms for dissimilarity data classification is based on standard principles of statistical inference. Moreover, apart from [37], they do not take measurement error in the dissimilarities into account. They are therefore better suited to classifying dissimilarity tables that result from exactly computed pairwise distances between objects originally given through attributes in a Euclidean space. They are seldom recommended for coarser or noisier dissimilarity types, whereas real-world dissimilarity data quite often fall into that category.

In the work presented here, we develop a new approach based on the purely statistical viewpoint of MDS [39]. Although interesting alternatives exist, MDS remains today the leading mathematical methodology for handling dissimilarity data. We are concerned with classification analysis from a table of such observed data coming from an otherwise non-observable population of objects. The main goal is to assign a new object to one of the a priori groups in the population, using as sole information its dissimilarities to previously classified objects, which thus form the Training Data Set. Basically, our approach assumes a probability model in which the observed dissimilarities are Euclidean distances perturbed by random Gaussian errors. Two possible probability models are actually investigated, differing in the structure they postulate for the perturbation of the Euclidean distances that led to the observed dissimilarities. In each of these models, the unobserved objects are regarded as unknown parameters lying in a Euclidean space. Estimating these parameters in the statistical sense is thus equivalent to positioning the unobserved objects in the Euclidean space given their respective pairwise dissimilarities, which is the traditional main concern of MDS. Such a model-based approach then has the advantage of simultaneously estimating object positions and group labels. Since the dimension p of the Euclidean space is unknown, it serves as a tuning parameter to be estimated from the data by cross-validation and the "within one standard error of the minimum error, towards model parsimony" rule (see the sketch below).
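The following is a minimal sketch of that selection rule, assuming the per-dimension cross-validation errors and their standard errors have already been computed; the function name and the toy numbers in the usage comment are illustrative only.

```python
import numpy as np

def select_dimension_one_se(dims, cv_errors, cv_se):
    # "Within one standard error of the minimum error" rule: locate the
    # minimum cross-validated error over the candidate dimensions, then
    # keep the smallest dimension whose error does not exceed that
    # minimum plus one standard error (model parsimony).
    cv_errors = np.asarray(cv_errors, dtype=float)
    cv_se = np.asarray(cv_se, dtype=float)
    best = int(np.argmin(cv_errors))
    threshold = cv_errors[best] + cv_se[best]
    eligible = np.flatnonzero(cv_errors <= threshold)
    return dims[int(eligible[0])]   # dims assumed sorted in increasing order

# Hypothetical usage with made-up cross-validation results for p = 1, ..., 5:
# select_dimension_one_se([1, 2, 3, 4, 5],
#                         [0.31, 0.22, 0.20, 0.21, 0.20],
#                         [0.03, 0.02, 0.02, 0.02, 0.02])   # -> 2
```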

Each of the two probabilistic models so postulated for dissimilarity data allows us to derive a general-purpose classifier for such data. The two constructed classifiers are nicknamed M1.BC and M2.BC, respectively. The latter (M2.BC) exhibits high flexibility in adapting to the dissimilarity type at hand in the data. Its classification performance on some classical data sets appears comparable to that of some of the best classifiers already available for dissimilarity data.

About this article

Cite this article

Ndong Nguéma, EP., Saint-Pierre, G. Model-based classification with dissimilarities: a maximum likelihood approach. Pattern Anal Applic 11, 281–298 (2008). https://doi.org/10.1007/s10044-008-0105-2
