Abstract
With the recent growth of graph-based data, large graph processing is becoming increasingly important. To explore and extract knowledge from such data, graph mining methods such as community detection are a necessity. Although graph mining is a relatively recent development in the data mining domain, it has been studied extensively in different areas (biology, social networks, telecommunications and the Internet). Legacy graph processing tools mainly rely on the computational capacity of a single machine, which cannot process large graphs with billions of nodes. Therefore, the main challenge for new tools and frameworks lies in the development of new paradigms that are scalable, efficient and flexible. In this paper, we review the new paradigms of large graph processing and their applications to the graph mining domain, using the distributed, shared-nothing approach that Internet players apply to large data. The paper is organized as a walk through different industrial needs in terms of graph mining, covering the existing solutions along the way. Finally, we present a set of open research questions linked to several new business requirements, such as the graph data warehouse.
Keywords
- Data Mining
- Large graphs
- Distributed Processing
- Business Intelligence
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Skhiri, S., Jouili, S. (2013). Large Graph Mining: Recent Developments, Challenges and Potential Solutions. In: Aufaure, MA., Zimányi, E. (eds) Business Intelligence. eBISS 2012. Lecture Notes in Business Information Processing, vol 138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36318-4_5
DOI: https://doi.org/10.1007/978-3-642-36318-4_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36317-7
Online ISBN: 978-3-642-36318-4
eBook Packages: Computer Science, Computer Science (R0)