Distributed and Parallel Databases, Volume 35, Issue 1, pp 23–53

Distributed block formation and layout for disk-based management of large-scale graphs

  • Abdurrahman Yaşar
  • Buğra Gedik
  • Hakan Ferhatosmanoğlu

Abstract

We are witnessing an enormous growth in social networks as well as in the volume of data generated by them. An important portion of this data is in the form of graphs. In recent years, several graph processing and management systems have emerged to handle large-scale graphs. The primary goal of these systems is to run graph algorithms and queries in an efficient and scalable manner. Unlike relational data, graphs are semi-structured in nature. Thus, storing and accessing graph data using secondary storage requires new solutions that can provide locality of access for graph processing workloads. In this work, we propose a scalable block formation and layout technique for graphs, which aims at reducing the I/O cost of disk-based graph processing algorithms. To achieve this, we designed a scalable MapReduce-style method called ICBL, which can divide the graph into a series of disk blocks that contain sub-graphs with high locality. Furthermore, ICBL can order the resulting blocks on disk to further reduce non-local accesses. We experimentally evaluated ICBL to showcase its scalability and layout quality, as well as the effectiveness of its automatic parameter tuning. We deployed the graph layouts generated by ICBL on the Neo4j open source graph database management system (http://www.neo4j.org/, 2015). Our results show that the layout generated by ICBL reduces query running times on Neo4j by more than 2× compared to the default layout.

Keywords

Graph management systems · Locality · Layout · Large scale graphs · Database management · Distributed systems

References

  1. Aggarwal, C., Wang, H.: Graph data management and mining. In: Aggarwal, C. (ed.) A Survey of Algorithms and Applications. Springer, Berlin (2010)
  2. Akyurek, S., Salem, K.: Adaptive block rearrangement. ACM Trans. Comput. Syst. 13(2), 89–121 (1995). doi: 10.1145/201045.201046
  3. Bhadkamkar, M., Guerra, J., Useche, L., Burnett, S., Liptak, J., Rangaswami, R., Hristidis, V.: BORG: block-reorganization for self-optimizing storage systems. In: Proceedings of the 7th Conference on File and Storage Technologies, pp. 183–196 (2009)
  4. Boldi, P., Vigna, S.: The WebGraph framework I: compression techniques. In: Proceedings of the Thirteenth International World Wide Web Conference (WWW 2004), pp. 595–601 (2004)
  5. Boldi, P., Rosa, M., Santini, M., Vigna, S.: Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks. In: Proceedings of the 20th International Conference on World Wide Web (2011)
  6. Chakrabarti, D., Zhan, Y., Faloutsos, C.: R-MAT: a recursive model for graph mining. In: Fourth SIAM International Conference on Data Mining (2004)
  7. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Symposium on Operating System Design and Implementation (OSDI), pp. 137–150 (2004)
  8. Dominguez-Sal, D., Martinez-Bazan, N., Muntes-Mulero, V., Baleta, P., Larriba-Pey, J.: A discussion on the design of graph database benchmarks. In: Nambiar, R., Poess, M. (eds.) Performance Evaluation, Measurement and Characterization of Complex Systems. Springer, Berlin (2011)
  9. Fortunato, S.: Community detection in graphs. Phys. Rep. 483(3–5), 75–174 (2009)
  10. Gedik, B., Bordawekar, R.: Disk-based management of interaction graphs. IEEE Trans. Knowl. Data Eng. 26(11), 2689–2702 (2014)
  11. Giraph: Apache Giraph. http://www.giraph.apache.org/. Accessed June 2015
  12. Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C.: PowerGraph: distributed graph-parallel computation on natural graphs. In: Symposium on Operating System Design and Implementation (OSDI), pp. 17–30 (2012)
  13. Han, W.S., Lee, S., Park, K., Lee, J.H., Kim, M.S., Kim, J., Yu, H.: TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 77–85 (2013)
  14. Hoque, I., Gupta, I.: Disk layout techniques for online social network data. IEEE Comput. 16(3), 24–36 (2012)
  15. Kang, U., Tong, H., Sun, J., Lin, C.Y., Faloutsos, C.: GBASE: a scalable and general graph management system. In: ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 1091–1099 (2011)
  16. Karypis, G., Kumar, V.: Multilevel graph partitioning schemes. In: International Conference on Parallel Processing (ICPP), pp. 113–122 (1995)
  17. Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a news media? In: WWW’10: Proceedings of the 19th International Conference on World Wide Web, pp. 591–600 (2010)
  18. Kyrola, A., Blelloch, G., Guestrin, C.: GraphChi: large-scale graph computation on just a PC. In: Symposium on Operating System Design and Implementation (OSDI), pp. 31–46 (2012)
  19. Lasalle, D., Karypis, G.: Multi-threaded graph partitioning. In: Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pp. 225–236 (2013)
  20. Leskovec, J., Krevl, A.: SNAP datasets: Stanford large network dataset collection (2015). http://www.snap.stanford.edu/data
  21. Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., Hellerstein, J.M.: Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc. VLDB Endow. 5(8), 716–727 (2012). doi: 10.14778/2212351.2212354
  22. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pp. 281–297 (1967)
  23. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: ACM International Conference on Management of Data (SIGMOD), pp. 135–146 (2010)
  24. Mondal, J., Deshpande, A.: Managing large dynamic graphs efficiently. In: ACM International Conference on Management of Data (SIGMOD), pp. 145–156 (2012)
  25. Nanavati, A.A., Siva, G., Das, G., Chakraborty, D., Dasgupta, K., Mukherjea, S., Joshi, A.: On the structural properties of massive telecom call graphs: findings and implications. In: ACM International Conference on Information and Knowledge Management (CIKM), pp. 435–444 (2006)
  26. Neo4j: Neo4j open source graph database (2015). http://www.neo4j.org/
  27. Newman, M.: Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46(5), 323–351 (2005). doi: 10.1080/00107510500052444
  28. Nodine, M.H., Goodrich, M.T., Vitter, J.S.: Blocking for external graph searching. Algorithmica 16(2), 181–214 (1996)
  29. Prabhakaran, V., Wu, M., Weng, X., McSherry, F., Zhou, L., Haridasan, M.: Managing large graphs on multi-cores with graph awareness. In: Proceedings of the 2012 USENIX Conference on Annual Technical Conference, pp. 4–4 (2012)
  30. Rajaraman, A., Ullman, J.D.: Data mining. In: Mining of Massive Datasets, pp. 1–17. Cambridge University Press, Cambridge (2011)
  31. Shao, B., Wang, H., Li, Y.: Trinity: a distributed graph engine on a memory cloud. In: ACM International Conference on Management of Data (SIGMOD) (2013)
  32. Siek, J.G., Lee, L.Q., Lumsdaine, A.: Boost Graph Library. The User Guide and Reference Manual. Addison-Wesley, Boston (2002)
  33. Simmhan, Y., Kumbhare, A., Wickramaarachchi, C., et al.: Goffish: a sub-graph centric framework for large-scale graph analytics. In: European Conference on Parallel Processing (Euro-Par), pp. 451–462 (2015)
  34. Steinhaus, R.: G-Store: a storage manager for graph data. Master’s Thesis, University of Oxford (2011)
  35. Tian, Y., Balmin, A., Corsten, S.A., Tatikonda, S., McPherson, J.: From think like a vertex to think like a graph. Proc. Very Large Databases Conf. 7(3), 193–204 (2013)
  36. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(6684), 409–410 (1998)
  37. Xie, W., Wang, G., Bindel, D., Demers, A., Gehrke, J.: Fast iterative graph computation with block updates. Proc. Very Large Databases Conf. 6(14), 2014–2025 (2013). doi: 10.14778/2556549.2556581
  38. Xin, R.S., Gonzalez, J.E., Franklin, M.J., Stoica, I.: Graphx: a resilient distributed graph system on spark. In: First International Workshop on Graph Data Management Experiences and Systems, pp. 2:1–2:6 (2013)
  39. Yan, D., Cheng, J., Lu, Y., Ng, W.: Blogel: a block-centric framework for distributed computation on real-world graphs. Proc. Very Large Databases Conf. 7(14), 1981–1992 (2014)

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Abdurrahman Yaşar (1)
  • Buğra Gedik (2)
  • Hakan Ferhatosmanoğlu (2)

  1. College of Computing, Georgia Institute of Technology, Atlanta, USA
  2. Department of Computer Engineering, Bilkent University, Ankara, Turkey
