A Robust Clustering Method and Visualization Tool Based on Data Depth

Jörnsten, Rebecka; Vardi, Yehuda; Zhang, Cun-Hui

doi:10.1007/978-3-0348-8201-9_29

Part of the book series: Statistics for Industry and Technology (SIT)

Abstract

We present a robust clustering method based on a modified Weiszfeld algorithm for the multivariate median and its associated data depth. The multivariate medians represent the clusters, while the induced relative L₁-depths are used to identify outliers and to select the number of clusters. We develop a cluster validation and visualization tool based on the within-cluster data depths and the cluster data depths with respect to competing clusters. We apply our method to high-dimensional gene expression data and to several simulated data sets. Our method successfully identifies the number of clusters in noisy data sets and generates accurate cluster assignments.
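
As an illustration of the two building blocks named in the abstract, the following is a minimal sketch, not the chapter's implementation. It assumes the modified Weiszfeld update and the L₁-depth in the form given by Vardi and Zhang (Proceedings of the National Academy of Sciences, 2000), with unit weights on all observations. The function names `l1_median` and `l1_depth`, the starting point (the sample mean), and the tolerances are choices made here for the example only.

```python
import numpy as np

def l1_median(X, tol=1e-8, max_iter=500):
    """Multivariate L1-median via a modified Weiszfeld iteration (unit weights assumed)."""
    X = np.asarray(X, dtype=float)
    y = X.mean(axis=0)                      # assumed starting point
    for _ in range(max_iter):
        d = np.linalg.norm(X - y, axis=1)
        away = d > 1e-12                    # observations not coinciding with y
        if not away.any():
            return y                        # every observation equals y
        w = 1.0 / d[away]
        T = (w[:, None] * X[away]).sum(axis=0) / w.sum()   # plain Weiszfeld step
        eta = np.count_nonzero(~away)       # multiplicity of y among the data
        if eta == 0:
            y_new = T
        else:
            # Modification: damp the step when y sits on a data point.
            R = (w[:, None] * (X[away] - y)).sum(axis=0)
            r = np.linalg.norm(R)
            if r <= eta:
                return y                    # y already satisfies the median condition
            gamma = eta / r
            y_new = (1.0 - gamma) * T + gamma * y
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

def l1_depth(y, X):
    """L1 data depth of y relative to the rows of X (unit weights assumed)."""
    X = np.asarray(X, dtype=float)
    diff = np.asarray(y, dtype=float) - X
    d = np.linalg.norm(diff, axis=1)
    away = d > 1e-12
    e_bar = (diff[away] / d[away, None]).sum(axis=0) / len(X)  # mean unit vector
    f = np.count_nonzero(~away) / len(X)    # fraction of data located at y
    return 1.0 - max(0.0, np.linalg.norm(e_bar) - f)

if __name__ == "__main__":
    # Toy data: four points of a square plus one gross outlier.
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]])
    m = l1_median(X)
    print("mean:     ", X.mean(axis=0))
    print("L1-median:", m)
    print("depths:   ", [round(l1_depth(x, X), 3) for x in X])
```

In a clustering setting along the lines of the abstract, one would presumably compute `l1_median` per cluster and compare each observation's depth within its own cluster to its depth relative to competing clusters; how those depths are combined for validation and for choosing the number of clusters is described in the chapter itself and is not reproduced here.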

Author information

Authors and Affiliations

Department of Statistics, Rutgers University, Piscataway, NJ, 08854, USA
Rebecka Jörnsten, Yehuda Vardi & Cun-Hui Zhang

Editor information

Editors and Affiliations

Statistics Group, University of Neuchâtel, P.O. Box 805, CH-2002, Neuchâtel, Switzerland
Yadolah Dodge (Prof. of Statistics and Operations Research)

Cite this paper

Jörnsten, R., Vardi, Y., Zhang, C.-H. (2002). A Robust Clustering Method and Visualization Tool Based on Data Depth. In: Dodge, Y. (ed.) Statistical Data Analysis Based on the L₁-Norm and Related Methods. Statistics for Industry and Technology. Birkhäuser, Basel. https://doi.org/10.1007/978-3-0348-8201-9_29

DOI: https://doi.org/10.1007/978-3-0348-8201-9_29
Publisher Name: Birkhäuser, Basel
Print ISBN: 978-3-0348-9472-2
Online ISBN: 978-3-0348-8201-9
eBook Packages: Springer Book Archive
