FastStep: Scalable Boolean Matrix Decomposition

  • Miguel Araujo
  • Pedro Ribeiro
  • Christos Faloutsos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9651)


Matrix decomposition methods are applied to a wide range of tasks, such as data denoising, dimensionality reduction, co-clustering and community detection. However, when the input is boolean, common methods either do not scale or do not provide a boolean reconstruction, which results in high reconstruction error and low interpretability of the decomposition. We propose a novel step decomposition of boolean matrices into non-negative factors with boolean reconstruction. By formulating the problem using threshold operators and suitably relaxing it, we provide a scalable algorithm that can be applied to boolean matrices with millions of non-zero entries. We show that our method achieves significantly lower reconstruction error than standard state-of-the-art algorithms. We also show that the decomposition remains interpretable by analyzing communities in a flights dataset (where the matrix is interpreted as a graph in which nodes are airports) and in a movie-ratings dataset with 10 million non-zeros.
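The idea of a boolean reconstruction obtained by thresholding a product of non-negative factors can be illustrated with a minimal sketch. This is not the FastStep algorithm itself: the function names `step_reconstruct` and `reconstruction_error`, the threshold value `tau`, the "greater than or equal" convention, and the toy factors are all assumptions chosen for illustration, and the factors are written by hand rather than learned.

```python
from typing import List

Matrix = List[List[float]]


def step_reconstruct(A: Matrix, B: Matrix, tau: float) -> List[List[int]]:
    """Threshold the product A * B^T: cell (i, j) is 1 iff sum_k A[i][k] * B[j][k] >= tau.

    A is n x r and B is m x r, both non-negative. The threshold tau and the
    ">=" convention are assumptions made for this sketch.
    """
    return [
        [1 if sum(a * b for a, b in zip(row_a, row_b)) >= tau else 0 for row_b in B]
        for row_a in A
    ]


def reconstruction_error(M: List[List[int]], R: List[List[int]]) -> int:
    """Count the cells where the boolean reconstruction R disagrees with M."""
    return sum(m != r for row_m, row_r in zip(M, R) for m, r in zip(row_m, row_r))


# Toy boolean matrix made of two overlapping blocks (hypothetical data).
M = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]

# Hand-picked non-negative factors with rank r = 2 (not learned).
A = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
B = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

R = step_reconstruct(A, B, tau=1.0)
print(R, reconstruction_error(M, R))
```

In this toy case the real-valued product has the value 2 in the cell where the two blocks overlap, so a plain non-negative factorization would be penalized there, while the thresholded reconstruction maps every value at or above `tau` to 1 and matches `M` exactly.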



Partially funded by the ERDF through the COMPETE 2020 Program and by FCT within project POCI-01-0145-FEDER-006961 and through the CMU|Portugal Program under Grant SFRH/BD/52362/2013. Based upon work supported by the National Science Foundation under Grants No. CNS-1314632 and IIS-1408924, and by a Google Focused Research Award. Any opinions, findings, conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of the funding parties.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Miguel Araujo (1, 2)
  • Pedro Ribeiro (1)
  • Christos Faloutsos (2)
  1. Cracs/INESC-TEC, University of Porto, Porto, Portugal
  2. Computer Science Department, Carnegie Mellon University, Pittsburgh, USA
