FastStep: Scalable Boolean Matrix Decomposition

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9651)

Abstract

Matrix decomposition methods are applied to a wide range of tasks, such as data denoising, dimensionality reduction, co-clustering and community detection. However, when the input is boolean, common methods either do not scale or do not provide a boolean reconstruction, which results in high reconstruction error and low interpretability of the decomposition. We propose a novel step decomposition of boolean matrices into non-negative factors with boolean reconstruction. By formulating the problem using threshold operators and suitably relaxing it, we provide a scalable algorithm that can be applied to boolean matrices with millions of non-zero entries. We show that our method achieves significantly lower reconstruction error than standard state-of-the-art algorithms. We also show that the decomposition remains interpretable by analyzing communities in a flights dataset (where the matrix is interpreted as a graph in which nodes are airports) and in a movie-ratings dataset with 10 million non-zeros.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Miguel Araujo (1, 2)
  • Pedro Ribeiro (1)
  • Christos Faloutsos (2)

  1. Cracs/INESC-TEC, University of Porto, Porto, Portugal
  2. Computer Science Department, Carnegie Mellon University, Pittsburgh, USA
