Large Graph Mining: Recent Developments, Challenges and Potential Solutions

Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 138)

Abstract

With the recent growth of graph-based data, large graph processing is becoming increasingly important. To explore and extract knowledge from such data, graph mining methods such as community detection are a necessity. Although graph mining is a relatively recent development in the data mining domain, it has been studied extensively in different areas (biology, social networks, telecommunications and the Internet). Legacy graph processing tools rely mainly on the computational capacity of a single machine, which cannot process large graphs with billions of nodes. The main challenge for new tools and frameworks therefore lies in developing new paradigms that are scalable, efficient and flexible. In this paper, we review the new paradigms of large graph processing and their applications to graph mining, using the distributed, shared-nothing approach that Internet players use for large data. The paper is organized as a walk through different industrial needs in terms of graph mining and the existing solutions that address them. Finally, we present a set of open research questions linked to several new business requirements, such as the graph data warehouse.

Keywords

  • Data Mining
  • Large graphs
  • Distributed Processing
  • Business Intelligence

  • DOI: 10.1007/978-3-642-36318-4_5
  • Chapter length: 22 pages
Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Skhiri, S., Jouili, S. (2013). Large Graph Mining: Recent Developments, Challenges and Potential Solutions. In: Aufaure, MA., Zimányi, E. (eds) Business Intelligence. eBISS 2012. Lecture Notes in Business Information Processing, vol 138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36318-4_5

  • DOI: https://doi.org/10.1007/978-3-642-36318-4_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36317-7

  • Online ISBN: 978-3-642-36318-4

  • eBook Packages: Computer Science (R0)