Clustering dependent observations with copula functions

Regular Article

Abstract

This paper deals with the problem of clustering dependent observations according to their underlying complex generating process. Di Lascio and Giannerini (Journal of Classification 29(1):50–75, 2012) introduced the CoClust, a clustering algorithm based on copula functions that achieves this task but carries a high computational burden. Moreover, the CoClust automatically allocates every observation to a cluster; thus, it cannot discard potentially irrelevant observations. In this paper, we introduce an improved version of the CoClust that overcomes both issues and performs better in many respects. By means of a Monte Carlo study, we investigate the features of the new algorithm and show that it consistently outperforms the original CoClust. The validity of our proposal is also supported by applications to real data sets of human breast tumor samples, for which the algorithm provides a meaningful biological interpretation. The new algorithm is implemented and made available through an updated version of the R package CoClust.

Keywords

Copula function; Multivariate dependence structure; Clustering; Biological tumor sample

Mathematics Subject Classification

62H30; 62H20; 62P10

References

  1. Brechmann E, Schepsmeier U (2013) Modeling dependence with c- and d-vine copulas: the R package CDVine. J Stat Softw 52(3):1–27
  2. Cherubini U, Luciano E, Vecchiato W (2004) Copula methods in finance. Wiley, Chichester
  3. Clarke K (2007) A simple distribution-free test for non-nested model selection. Polit Anal 15:347–363
  4. Di Lascio FML, Giannerini S (2012) A copula-based algorithm for discovering patterns of dependent observations. J Classif 29(1):50–75
  5. Di Lascio FML, Giannerini S (2015) CoClust. R package version 0.3-1
  6. Di Lascio FML, Giannerini S, Reale A (2015) Exploring copulas for the imputation of complex dependent data. Stat Methods Appl 24(1):159–175
  7. Dortet-Bernadet JL, Wicker N (2008) Model-based clustering on the unit sphere with an illustration using gene expression profiles. Biostatistics 9(1):66–80
  8. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95:14863–14868
  9. Fraley C, Raftery A (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41(8):578–588
  10. Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg A, Dougherty E, Kononen J, Bubendorf L, Fehrle W, Pittaluga S, Gruvberger S, Loman N, Johannsson O, Olsson H, Sauter G (2001) Gene-expression profiles in hereditary breast cancer. N Engl J Med 344(8):539–548
  11. Joe H, Xu J (1996) The estimation method of inference functions for margins for multivariate models. Technical Report 166, Department of Statistics, University of British Columbia
  12. Nelsen RB (2006) Introduction to copulas. Springer, New York
  13. Roverato A, Di Lascio FML (2011) Wilks’ \(\lambda\) dissimilarity measures for gene clustering: an approach based on the identification of transcription modules. Biometrics 67(4):1236–1248
  14. Sklar A (1959) Fonctions de répartition à n dimensions et leurs marges. Publ Inst Stat Univ Paris 8:229–231
  15. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E, Golub T (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96(6):2907–2912
  16. Trivedi PK, Zimmer DM (2005) Copula modeling: an introduction for practitioners. Found Trends Econom 1:1–111
  17. Vuong Q (1989) Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57:307–333
  18. Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10):977–987
  19. Zimmer DM, Trivedi PK (2006) Using trivariate copulas to model sample selection and treatment effects: application to family health care demand. J Bus Econ Stat 24:63–76

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Faculty of Economics and Management, University of Bozen-Bolzano, Bolzano, Italy
  2. Department of Statistical Sciences, University of Bologna, Bologna, Italy
