Convex clustering for binary data

  • Hosik Choi
  • Seokho Lee
Regular Article

Abstract

We present a new clustering algorithm for multivariate binary data. The algorithm is based on a convex relaxation of hierarchical clustering, obtained by taking the binomial likelihood as the natural distribution for binary data and formulating convex clustering with a pairwise penalty on the cluster prototypes. Under convex clustering, we show that the typical \(\ell _1\) pairwise fused penalty results in ineffective cluster formation. To improve clustering performance and select the relevant clustering variables, we propose penalized maximum likelihood estimation with an \(\ell _2\) fused penalty on the fusion parameters and an \(\ell _1\) penalty on the loading matrix. We provide an efficient algorithm that solves the optimization problem using the majorization-minimization algorithm and the alternating direction method of multipliers. Numerical studies confirm its good performance, and a real data analysis demonstrates the practical usefulness of the proposed method.
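
To make the role of the pairwise fused penalty concrete, the following is a minimal sketch of a penalized objective of the kind the abstract describes. It rests on several assumptions that are not taken from the article: each observation gets its own prototype vector of logits (rows of `Theta`), the likelihood is an elementwise Bernoulli likelihood, and the loading matrix with its \(\ell _1\) penalty is omitted entirely. The function names `bernoulli_nll`, `fused_penalty`, and `objective` are hypothetical, and the code only evaluates the objective; it does not implement the majorization-minimization or ADMM solver.

```python
import math
from itertools import combinations


def bernoulli_nll(X, Theta):
    """Negative Bernoulli log-likelihood; Theta holds logits, same shape as X."""
    total = 0.0
    for x_row, t_row in zip(X, Theta):
        for x, t in zip(x_row, t_row):
            # log(1 + exp(t)) evaluated in a numerically stable way
            softplus = max(t, 0.0) + math.log1p(math.exp(-abs(t)))
            total += softplus - x * t
    return total


def fused_penalty(Theta, norm="l2"):
    """Sum over all pairs of rows of the l1 or l2 norm of their difference."""
    total = 0.0
    for a, b in combinations(Theta, 2):
        diffs = [u - v for u, v in zip(a, b)]
        if norm == "l2":
            total += math.sqrt(sum(d * d for d in diffs))
        else:
            total += sum(abs(d) for d in diffs)
    return total


def objective(X, Theta, lam, norm="l2"):
    """Simplified penalized objective (assumed form, loading matrix omitted)."""
    return bernoulli_nll(X, Theta) + lam * fused_penalty(Theta, norm)
```

In this simplified form, two observations fall in the same cluster when their prototype rows coincide. The \(\ell _2\) version penalizes the whole difference vector of a pair at once, whereas the \(\ell _1\) version is a sum of separate coordinatewise terms.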

Keywords

  • Binary data
  • Convex clustering
  • Dimension reduction
  • Fused penalty

Mathematics Subject Classification

62H30 

Notes

Acknowledgements

Hosik Choi was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2017R1D1A1B05028565). Seokho Lee was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2017R1D1A1B04030695).

Supplementary material

11634_2018_350_MOESM1_ESM.pdf (105 kb)
Supplementary material 1 (pdf 104 KB)

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Kyonggi University, Suwon, Republic of Korea
  2. Hankuk University of Foreign Studies, Seoul, Republic of Korea
