Data Mining and Knowledge Discovery, Volume 14, Issue 1, pp 1–23

An efficient approach to external cluster assessment with an application to martian topography



Automated tools for knowledge discovery are frequently invoked in databases where objects already group into some known (i.e., external) classification scheme. In the context of unsupervised learning or clustering, such tools delve inside large databases looking for alternative classification schemes that are meaningful and novel. An assessment of the information gained with new clusters can be effected by looking at the degree of separation between each new cluster and its most similar class. Our approach models each cluster and class as a multivariate Gaussian distribution and estimates their degree of separation through an information theoretic measure (i.e., through relative entropy or Kullback–Leibler distance). The inherently large computational cost of this step is alleviated by first projecting all data over the single dimension that best separates both distributions (using Fisher’s Linear Discriminant). We test our algorithm on a dataset of Martian surfaces using the traditional division into geological units as external classes and the new, hydrology-inspired, automatically performed division as novel clusters. We find the new partitioning constitutes a formally meaningful classification that deviates substantially from the traditional classification.
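The pipeline described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`fisher_direction`, `kl_gaussian_1d`, `separation`) and the synthetic data are assumptions introduced here, and the paper may use a different variant of relative entropy (for example a symmetrised one). The sketch projects a cluster and a class onto Fisher's linear discriminant direction, fits a one-dimensional Gaussian to each projection, and evaluates the closed-form Kullback–Leibler divergence between the two.

```python
import numpy as np


def fisher_direction(a, b):
    """Unit vector w proportional to S_w^{-1} (mean_a - mean_b).

    a, b: arrays of shape (n_samples, n_features) with different means.
    """
    mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
    # Within-class scatter: sum of the two unnormalised scatter matrices.
    scatter_a = np.atleast_2d(np.cov(a, rowvar=False, bias=True)) * len(a)
    scatter_b = np.atleast_2d(np.cov(b, rowvar=False, bias=True)) * len(b)
    w = np.linalg.solve(scatter_a + scatter_b, mean_a - mean_b)
    return w / np.linalg.norm(w)


def kl_gaussian_1d(mu_p, var_p, mu_q, var_q):
    """KL(p || q) in nats for univariate Gaussians p and q."""
    return 0.5 * (np.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)


def separation(cluster, cls):
    """Relative entropy between cluster and class after 1-D projection."""
    w = fisher_direction(cluster, cls)
    p, q = cluster @ w, cls @ w  # project both samples onto w
    return kl_gaussian_1d(p.mean(), p.var(), q.mean(), q.var())


# Synthetic stand-ins for one external class and two candidate clusters.
rng = np.random.default_rng(0)
external_class = rng.normal(0.0, 1.0, size=(500, 3))
near_cluster = rng.normal(0.2, 1.0, size=(500, 3))
far_cluster = rng.normal(3.0, 1.0, size=(500, 3))

print(separation(near_cluster, external_class))
print(separation(far_cluster, external_class))
```

To follow the abstract's idea of comparing each cluster with its most similar class, one would evaluate `separation` against every class and keep the smallest value. Working in one dimension means only scalar means and variances are needed for the divergence, which is where the computational saving comes from.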


External cluster validation, Multivariate Gaussian distributions, Martian topography





Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Department of Computer Science, University of Houston, Houston, USA
  2. Lunar and Planetary Institute, Houston, USA
