A Robust Clustering Method and Visualization Tool Based on Data Depth

  • Conference paper
Statistical Data Analysis Based on the L1-Norm and Related Methods

Abstract

We present a robust clustering method based on a modified Weiszfeld algorithm for the multivariate median and its associated data depth. The multivariate medians represent the clusters, while the induced relative L1-depths are used to identify outliers and to select the number of clusters. We develop a cluster validation and visualization tool based on the within-cluster data depths and on the data depths with respect to competing clusters. We apply our method to high-dimensional gene expression data and to several simulated data sets. The method successfully identifies the number of clusters in noisy data sets and generates accurate cluster assignments.
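The two building blocks named in the abstract, the multivariate (L1) median and the L1 data depth, can be illustrated with a minimal sketch. It assumes unit weights on all observations and uses a commonly stated form of the modified Weiszfeld update, in which an iterate that lands exactly on a data point is handled by a correction term instead of a division by zero. The function names `l1_median` and `l1_depth`, the tolerances, and the coincidence threshold are illustrative assumptions, not taken from the paper, and the clustering and relative-depth steps are not reproduced here.

```python
import numpy as np

_EPS = 1e-12  # assumed threshold for "iterate coincides with a data point"


def l1_median(X, tol=1e-8, max_iter=500):
    """Approximate the L1 (spatial) median of the rows of X, unit weights."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    y = X.mean(axis=0)  # starting value: the coordinate-wise mean
    for _ in range(max_iter):
        diff = X - y
        dist = np.linalg.norm(diff, axis=1)
        away = dist > _EPS  # points that do not coincide with y
        if not away.any():
            return y  # every observation sits at y
        w = 1.0 / dist[away]
        # Plain Weiszfeld step computed from the non-coincident points.
        T = (w[:, None] * X[away]).sum(axis=0) / w.sum()
        # Sum of unit vectors from y toward the non-coincident points.
        R = (w[:, None] * diff[away]).sum(axis=0)
        r = np.linalg.norm(R)
        eta = n - int(away.sum())  # number of observations located at y
        if eta == 0:
            gamma = 0.0  # ordinary Weiszfeld update
        elif r > 0:
            gamma = min(1.0, eta / r)
        else:
            gamma = 1.0  # y already satisfies the median condition
        y_new = (1.0 - gamma) * T + gamma * y
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y


def l1_depth(y, X):
    """L1 depth of y relative to the rows of X: 1 - max(0, ||e_bar|| - f)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    diff = y - X
    dist = np.linalg.norm(diff, axis=1)
    away = dist > _EPS
    # Average of unit vectors pointing from each observation toward y.
    e_bar = (diff[away] / dist[away, None]).sum(axis=0) / len(X)
    f = 1.0 - away.mean()  # fraction of observations located at y
    return 1.0 - max(0.0, float(np.linalg.norm(e_bar)) - f)
```

Under this definition a point at the L1 median has depth 1, while a point far from the bulk of the data has depth close to the fraction of observations sitting at that point, which is what makes the depth usable as an outlier score.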

Copyright information

© 2002 Springer Basel AG

About this paper

Cite this paper

Jörnsten, R., Vardi, Y., Zhang, CH. (2002). A Robust Clustering Method and Visualization Tool Based on Data Depth. In: Dodge, Y. (eds) Statistical Data Analysis Based on the L1-Norm and Related Methods. Statistics for Industry and Technology. Birkhäuser, Basel. https://doi.org/10.1007/978-3-0348-8201-9_29

  • DOI: https://doi.org/10.1007/978-3-0348-8201-9_29

  • Publisher Name: Birkhäuser, Basel

  • Print ISBN: 978-3-0348-9472-2

  • Online ISBN: 978-3-0348-8201-9

  • eBook Packages: Springer Book Archive
