FastStep: Scalable Boolean Matrix Decomposition
Matrix decomposition methods are applied to a wide range of tasks, such as data denoising, dimensionality reduction, co-clustering and community detection. However, when the input is boolean, common methods either do not scale or do not provide a boolean reconstruction, which results in high reconstruction error and low interpretability of the decomposition. We propose a novel step decomposition of boolean matrices into non-negative factors with boolean reconstruction. By formulating the problem using threshold operators and suitably relaxing it, we obtain a scalable algorithm that can be applied to boolean matrices with millions of non-zero entries. We show that our method achieves significantly lower reconstruction error than standard state-of-the-art algorithms. We also show that the decomposition remains interpretable by analyzing communities in a flights dataset (where the matrix is interpreted as a graph in which nodes are airports) and in a movie-ratings dataset with 10 million non-zeros.
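To make the idea of a boolean reconstruction from non-negative factors concrete, the following is a minimal sketch and not the authors' implementation. It assumes the reconstruction sets entry (i, j) to 1 when the inner product of row i of one factor and row j of the other reaches a threshold `tau`. The function names, the toy matrix, the value of `tau`, and the logistic-style surrogate in `relaxed_loss` are all illustrative assumptions; the abstract does not specify the relaxation actually used.

```python
import math

def step_reconstruct(A, B, tau):
    """Boolean reconstruction: entry (i, j) is 1 when the inner product
    of row i of A and row j of B is at least tau (assumed threshold)."""
    return [[1 if sum(a * b for a, b in zip(row_a, row_b)) >= tau else 0
             for row_b in B]
            for row_a in A]

def reconstruction_error(M, R):
    """Number of entries where the boolean matrices M and R differ."""
    return sum(m != r
               for row_m, row_r in zip(M, R)
               for m, r in zip(row_m, row_r))

def relaxed_loss(M, A, B, tau):
    """Smooth surrogate for the step function (an illustrative assumption):
    a logistic-style penalty that is small when the inner product lies on
    the correct side of tau and grows as it moves to the wrong side."""
    total = 0.0
    for i, row_a in enumerate(A):
        for j, row_b in enumerate(B):
            x = sum(a * b for a, b in zip(row_a, row_b))
            sign = 1.0 if M[i][j] == 1 else -1.0
            total += math.log1p(math.exp(-sign * (x - tau)))
    return total

# Toy boolean matrix with two disjoint blocks (hypothetical data).
M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

# Hand-picked non-negative factors, one column per block.
A = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
B = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
tau = 3.0

R = step_reconstruct(A, B, tau)
error = reconstruction_error(M, R)

# All-zero factors reconstruct the empty matrix and miss every 1 in M.
A_zero = [[0.0, 0.0] for _ in range(4)]
error_zero = reconstruction_error(M, step_reconstruct(A_zero, B, tau))
```

In this toy case the factors stay real-valued and non-negative while the thresholded product is boolean and matches `M` exactly; the smooth surrogate gives a differentiable quantity that could be minimized with gradient-based methods, which is one plausible reading of the "relaxation" mentioned above.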
Partially funded by the ERDF through the COMPETE 2020 Program and by FCT within project POCI-01-0145-FEDER-006961 and through the CMU|Portugal Program under Grant SFRH/BD/52362/2013. Based upon work supported by the National Science Foundation under Grants No. CNS-1314632 and IIS-1408924, and by a Google Focused Research Award. Any opinions, findings, conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of the funding parties.