Binary matrix factorization for analyzing gene expression data

  • Zhong-Yuan Zhang
  • Tao Li
  • Chris Ding
  • Xian-Wen Ren
  • Xiang-Sun Zhang


The advent of microarray technology enables us to monitor an entire genome on a single chip using a systematic approach. Clustering, a widely used data mining approach, has been used to discover phenotypes from raw expression data. However, traditional clustering algorithms are limited because they cannot identify the substructures of samples and features hidden in the data. Unlike clustering, biclustering is a new methodology for discovering genes that are highly related to a subset of samples. Several biclustering models and methods have been presented and used for tumor clinical diagnosis and pathological research. In this paper, we present a new biclustering model using Binary Matrix Factorization (BMF). BMF is a new variant rooted in non-negative matrix factorization (NMF). We begin by proving a new boundedness property of NMF. We then present two different algorithms that implement the model, together with a comparison of the two. We show that the microarray data biclustering problem can be formulated as a BMF problem and can be solved effectively using our proposed algorithms. Unlike algorithms based on a greedy strategy, our proposed algorithms for BMF are more likely to find the global optima. Experimental results on synthetic and real datasets demonstrate the advantages of BMF over existing biclustering methods. Besides the attractive clustering performance, BMF can generate sparse results (i.e., the number of genes/features involved in each biclustering structure is very small relative to the total number of genes/features), in accordance with common practice in molecular biology.
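To make the idea of factorizing a binary matrix into binary factors concrete, the following is a minimal sketch under stated assumptions. The abstract does not describe the two proposed algorithms, so this is not the authors' method: it runs standard multiplicative-update NMF, rescales the factors, and thresholds them at 0.5. The function names `nmf` and `binarize`, the threshold, the rescaling rule, and the toy matrix are all illustrative assumptions.

```python
import numpy as np


def nmf(X, k, n_iter=500, seed=0, eps=1e-9):
    """Plain NMF with multiplicative updates for the squared error ||X - WH||^2."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def binarize(W, H, threshold=0.5, eps=1e-9):
    """Rescale so each column of W has maximum 1, then threshold both factors.

    The rescaling leaves the product W @ H unchanged. Both the rescaling rule
    and the 0.5 threshold are assumptions made for this illustration.
    """
    scale = np.maximum(W.max(axis=0), eps)
    W_scaled = W / scale
    H_scaled = H * scale[:, None]
    return (W_scaled > threshold).astype(int), (H_scaled > threshold).astype(int)


# Hypothetical toy data: 6 "genes" x 5 "samples" with two planted biclusters.
X = np.zeros((6, 5))
X[:3, :2] = 1
X[3:, 2:] = 1

W, H = nmf(X, k=2)
Wb, Hb = binarize(W, H)

# Number of entries where the binary reconstruction disagrees with X.
mismatches = int(np.abs(X - Wb @ Hb).sum())
print("binary W:\n", Wb)
print("binary H:\n", Hb)
print("mismatched entries:", mismatches)
```

In this sketch, bicluster k is read off as the rows where column k of `Wb` equals 1 together with the columns where row k of `Hb` equals 1.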


Keywords: Biclustering; Non-negative matrix factorization; Boundedness property of NMF; Binary matrix



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Zhong-Yuan Zhang (1)
  • Tao Li (2, corresponding author)
  • Chris Ding (3)
  • Xian-Wen Ren (4)
  • Xiang-Sun Zhang (4)

  1. School of Statistics, Central University of Finance and Economics, Beijing, People’s Republic of China
  2. School of Computing and Information Sciences, Florida International University, Miami, USA
  3. Department of Computer Science and Engineering, University of Texas, Arlington, USA
  4. Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, People’s Republic of China
