Distributed block formation and layout for disk-based management of large-scale graphs

Abstract

We are witnessing an enormous growth in social networks as well as in the volume of data generated by them. An important portion of this data is in the form of graphs. In recent years, several graph processing and management systems have emerged to handle large-scale graphs. The primary goal of these systems is to run graph algorithms and queries in an efficient and scalable manner. Unlike relational data, graphs are semi-structured in nature. Thus, storing and accessing graph data using secondary storage requires new solutions that can provide locality of access for graph processing workloads. In this work, we propose a scalable block formation and layout technique for graphs, which aims at reducing the I/O cost of disk-based graph processing algorithms. To achieve this, we designed a scalable MapReduce-style method called ICBL, which can divide the graph into a series of disk blocks that contain sub-graphs with high locality. Furthermore, ICBL can order the resulting blocks on disk to further reduce non-local accesses. We experimentally evaluated ICBL to showcase its scalability and layout quality, as well as the effectiveness of its automatic parameter tuning. We deployed the graph layouts generated by ICBL on the Neo4j open source graph database management system (http://www.neo4j.org/, 2015). Our results show that the layout generated by ICBL reduces query running times over Neo4j by more than \(2\times\) compared to the default layout.
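
To make the two goals named in the abstract concrete (blocks whose sub-graphs have high locality, and a block order that reduces non-local accesses), the following is a minimal sketch of how one might score a candidate layout. It is not the ICBL method and not the paper's evaluation metric: the function names `intra_block_fraction` and `mean_block_distance`, the toy graph, and both example layouts are assumptions made purely for illustration.

```python
# Illustrative only: two simple, assumed quality scores for a block layout.
# Neither function is taken from the ICBL paper.

def intra_block_fraction(edges, block_of):
    """Fraction of edges whose two endpoints are stored in the same block."""
    if not edges:
        return 1.0
    same = sum(1 for u, v in edges if block_of[u] == block_of[v])
    return same / len(edges)


def mean_block_distance(edges, block_of, position):
    """Average gap, in block positions on disk, between the endpoints of an edge."""
    if not edges:
        return 0.0
    total = sum(abs(position[block_of[u]] - position[block_of[v]]) for u, v in edges)
    return total / len(edges)


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Two hypothetical layouts with two blocks of three vertices each.
clustered = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # one triangle per block
scattered = {0: 0, 2: 0, 4: 0, 1: 1, 3: 1, 5: 1}  # triangles split across blocks

# Block id -> position of that block on disk (hypothetical order).
position = {0: 0, 1: 1}

for name, layout in (("clustered", clustered), ("scattered", scattered)):
    print(name,
          round(intra_block_fraction(edges, layout), 3),
          round(mean_block_distance(edges, layout, position), 3))
```

Under these assumed scores, a traversal over the clustered layout would cross a block boundary only on the bridge edge, whereas the scattered layout crosses one on most edges, which is the kind of difference a locality-aware layout aims to exploit.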

Notes

  1. The acronym ICBL is formed by the initial letters of the four solution stages.

References

  1. Aggarwal, C., Wang, H.: Graph data management and mining. In: Aggarwal, C. (ed.) A Survey of Algorithms and Applications. Springer, Berlin (2010)

  2. Akyurek, S., Salem, K.: Adaptive block rearrangement. ACM Trans. Comput. Syst. 13(2), 89–121 (1995). doi:10.1145/201045.201046

  3. Bhadkamkar, M., Guerra, J., Useche, L., Burnett, S., Liptak, J., Rangaswami, R., Hristidis, V.: BORG: block-reorganization for self-optimizing storage systems. In: Proceedings of the 7th Conference on File and Storage Technologies, pp. 183–196 (2009)

  4. Boldi, P., Vigna, S.: The WebGraph framework I: compression techniques. In: Proceedings of the Thirteenth International World Wide Web Conference (WWW 2004), pp. 595–601 (2004)

  5. Boldi, P., Rosa, M., Santini, M., Vigna, S.: Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks. In: Proceedings of the 20th International Conference on World Wide Web (2011)

  6. Chakrabarti, D., Zhan, Y., Faloutsos, C.: R-MAT: a recursive model for graph mining. In: Fourth SIAM International Conference on Data Mining (2004)

  7. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Symposium on Operating System Design and Implementation (OSDI), pp. 137–150 (2004)

  8. Dominguez-Sal, D., Martinez-Bazan, N., Muntes-Mulero, V., Baleta, P., Larriba-Pey, J.: A discussion on the design of graph database benchmarks. In: Nambiar, R., Poess, M. (eds.) Performance Evaluation, Measurement and Characterization of Complex Systems. Springer, Berlin (2011)

  9. Fortunato, S.: Community detection in graphs. Phys. Rep. 483(3–5), 75–174 (2009)

  10. Gedik, B., Bordawekar, R.: Disk-based management of interaction graphs. IEEE Trans. Knowl. Data Eng. 26(11), 2689–2702 (2014)

  11. Giraph: Apache Giraph. http://www.giraph.apache.org/. Accessed June 2015

  12. Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C.: PowerGraph: distributed graph-parallel computation on natural graphs. In: Symposium on Operating System Design and Implementation (OSDI), pp. 17–30 (2012)

  13. Han, W.S., Lee, S., Park, K., Lee, J.H., Kim, M.S., Kim, J., Yu, H.: TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 77–85 (2013)

  14. Hoque, I., Gupta, I.: Disk layout techniques for online social network data. IEEE Comput. 16(3), 24–36 (2012)

  15. Kang, U., Tong, H., Sun, J., Lin, C.Y., Faloutsos, C.: GBASE: a scalable and general graph management system. In: ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 1091–1099 (2011)

  16. Karypis, G., Kumar, V.: Multilevel graph partitioning schemes. In: International Conference on Parallel Processing (ICPP), pp. 113–122 (1995)

  17. Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a news media? In: WWW’10: Proceedings of the 19th International Conference on World Wide Web, pp. 591–600 (2010)

  18. Kyrola, A., Blelloch, G., Guestrin, C.: GraphChi: large-scale graph computation on just a PC. In: Symposium on Operating System Design and Implementation (OSDI), pp. 31–46 (2012)

  19. Lasalle, D., Karypis, G.: Multi-threaded graph partitioning. In: Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pp. 225–236 (2013)

  20. Leskovec, J., Krevl, A.: SNAP datasets: Stanford large network dataset collection (2015). http://www.snap.stanford.edu/data

  21. Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., Hellerstein, J.M.: Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc. VLDB Endow. 5(8), 716–727 (2012). doi:10.14778/2212351.2212354

  22. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pp. 281–297 (1967)

  23. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: ACM International Conference on Management of Data (SIGMOD), pp. 135–146 (2010)

  24. Mondal, J., Deshpande, A.: Managing large dynamic graphs efficiently. In: ACM International Conference on Management of Data (SIGMOD), pp. 145–156 (2012)

  25. Nanavati, A.A., Siva, G., Das, G., Chakraborty, D., Dasgupta, K., Mukherjea, S., Joshi, A.: On the structural properties of massive telecom call graphs: findings and implications. In: ACM International Conference on Information and Knowledge Management (CIKM), pp. 435–444 (2006)

  26. Neo4j: Neo4j open source graph database (2015). http://www.neo4j.org/

  27. Newman, M.: Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46(5), 323–351 (2005). doi:10.1080/00107510500052444

  28. Nodine, M.H., Goodrich, M.T., Vitter, J.S.: Blocking for external graph searching. Algorithmica 16(2), 181–214 (1996)

  29. Prabhakaran, V., Wu, M., Weng, X., McSherry, F., Zhou, L., Haridasan, M.: Managing large graphs on multi-cores with graph awareness. In: Proceedings of the 2012 USENIX Conference on Annual Technical Conference, pp. 4–4 (2012)

  30. Rajaraman, A., Ullman, J.D.: Data mining. In: Mining of Massive Datasets, pp. 1–17. Cambridge University Press, Cambridge (2011)

  31. Shao, B., Wang, H., Li, Y.: Trinity: a distributed graph engine on a memory cloud. In: ACM International Conference on Management of Data (SIGMOD) (2013)

  32. Siek, J.G., Lee, L.Q., Lumsdaine, A.: Boost Graph Library. The User Guide and Reference Manual. Addison-Wesley, Boston (2002)

  33. Simmhan, Y., Kumbhare, A., Wickramaarachchi, C., et al.: Goffish: a sub-graph centric framework for large-scale graph analytics. In: European Conference on Parallel Processing (Euro-Par), pp. 451–462 (2015)

  34. Steinhaus, R.: G-Store: a storage manager for graph data. Master’s Thesis, University of Oxford (2011)

  35. Tian, Y., Balmin, A., Corsten, S.A., Tatikonda, S., McPherson, J.: From think like a vertex to think like a graph. Proc. Very Large Databases Conf. 7(3), 193–204 (2013)

  36. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(6684), 440–442 (1998)

  37. Xie, W., Wang, G., Bindel, D., Demers, A., Gehrke, J.: Fast iterative graph computation with block updates. Proc. Very Large Databases Conf. 6(14), 2014–2025 (2013). doi:10.14778/2556549.2556581

  38. Xin, R.S., Gonzalez, J.E., Franklin, M.J., Stoica, I.: GraphX: a resilient distributed graph system on Spark. In: First International Workshop on Graph Data Management Experiences and Systems, pp. 2:1–2:6 (2013)

  39. Yan, D., Cheng, J., Lu, Y., Ng, W.: Blogel: a block-centric framework for distributed computation on real-world graphs. Proc. Very Large Databases Conf. 7(14), 1981–1992 (2014)

Author information

Correspondence to Abdurrahman Yaşar.

About this article

Cite this article

Yaşar, A., Gedik, B. & Ferhatosmanoğlu, H. Distributed block formation and layout for disk-based management of large-scale graphs. Distrib Parallel Databases 35, 23–53 (2017). https://doi.org/10.1007/s10619-017-7191-3

Keywords

  • Graph management systems
  • Locality
  • Layout
  • Large scale graphs
  • Database management
  • Distributed systems