The VLDB Journal, Volume 25, Issue 4, pp 519–544

Mining billion-scale tensors: algorithms and discoveries

  • Inah Jeon
  • Evangelos E. Papalexakis
  • Christos Faloutsos
  • Lee Sael
  • U. Kang
Regular Paper

Abstract

How can we analyze large-scale real-world data with various attributes? Many real-world datasets with multiple attributes (e.g., network traffic logs, web data, social networks, knowledge bases, and sensor streams) are represented as multi-dimensional arrays, called tensors. Tensor decompositions are widely used to analyze such data in many data mining applications: detecting malicious attackers in network traffic logs (with source IP, destination IP, port number, timestamp), finding telemarketers in a phone call history (with sender, receiver, date), and identifying interesting concepts in a knowledge base (with subject, object, relation). However, current tensor decomposition methods do not scale to large, sparse real-world tensors with millions of rows, columns, and ‘fibers.’ In this paper, we propose HaTen2, a distributed method for large-scale tensor decompositions that runs on the MapReduce framework. Our careful design and implementation of HaTen2 dramatically reduce the size of intermediate data and the number of jobs, which makes it more scalable than the state-of-the-art method. Thanks to HaTen2, we analyze big real-world sparse tensors that cannot be handled by the current state of the art, and discover hidden concepts.
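To make the setting concrete, the following is a minimal single-machine sketch, not HaTen2's actual job design. It stores a sparse 3-mode tensor in coordinate format and computes a matricized-tensor-times-Khatri-Rao product (MTTKRP), a kernel commonly used when fitting PARAFAC-style decompositions, as a map step followed by a reduce-by-key step. The toy call log, the factor matrices `B` and `C`, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy call log: (sender, receiver, day) -> number of calls.
# Only nonzero entries are stored (coordinate format).
calls = {
    (0, 1, 0): 2.0,
    (0, 2, 1): 1.0,
    (1, 2, 0): 3.0,
    (2, 0, 1): 4.0,
}

def map_entry(index, value, B, C):
    """Map step: one nonzero emits (sender key, partial rank-R vector)."""
    i, j, k = index
    return i, [value * b * c for b, c in zip(B[j], C[k])]

def reduce_by_key(pairs, rank):
    """Reduce step: sum the partial vectors that share a key."""
    out = defaultdict(lambda: [0.0] * rank)
    for key, vec in pairs:
        acc = out[key]
        for r, v in enumerate(vec):
            acc[r] += v
    return dict(out)

def mttkrp_mode1(tensor, B, C, rank):
    """M[i][r] = sum over nonzeros (i, j, k) of x_ijk * B[j][r] * C[k][r]."""
    pairs = (map_entry(idx, val, B, C) for idx, val in tensor.items())
    return reduce_by_key(pairs, rank)

# Illustrative rank-2 factor matrices for receivers (B) and days (C).
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = [[1.0, 2.0], [3.0, 1.0]]
M = mttkrp_mode1(calls, B, C, rank=2)
```

Because the work is proportional to the number of nonzeros and each map output is keyed by a single mode index, the same pattern can in principle be spread across machines; how intermediate data and job counts are kept small at scale is the design question the paper addresses.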

Keywords

Tensor, Distributed computing, Big data, MapReduce, Hadoop

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Inah Jeon (1)
  • Evangelos E. Papalexakis (2)
  • Christos Faloutsos (2)
  • Lee Sael (3)
  • U. Kang (4)

  1. LG Electronics, Seoul, Korea
  2. Computer Science Department and iLab, CMU, Pittsburgh, USA
  3. Department of Computer Science, SUNY, Incheon, Korea
  4. Department of Computer Science and Engineering, Seoul National University, Seoul, Korea
