Abstract
Dissatisfaction with relational databases for large-scale graph processing has motivated a new class of graph databases that offer fast graph processing but sacrifice the ability to express basic relational idioms. However, we hypothesize that the performance benefits stem from implementation details rather than from a fundamental limitation of the relational model. To evaluate this hypothesis, we are exploring code generation to produce fast in-memory algorithms and data structures for graph patterns that are inaccessible to conventional relational optimizers.
In this paper, we present preliminary results for this approach on path-counting queries, which include triangle counting as a special case. We compile Datalog queries into main-memory pipelined hash-join plans in C++, and show that the resulting programs easily outperform PostgreSQL on real graphs with different degrees of skew. We then produce analogous parallel programs for Grappa, a runtime system for distributed memory architectures. Grappa is a good target for building a parallel query system because its shared memory programming model and communication mechanisms provide productivity and performance when building communication-intensive applications. Our experiments suggest that Grappa programs using hash joins have competitive performance with queries executed on a commercial parallel database. We find preliminary evidence that a code generation approach simplifies the design of a query engine for graph analysis and improves performance over conventional relational databases.
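As a rough illustration of what a pipelined hash-join plan for a path-counting query might look like in C++, consider the sketch below. It is not the code the compiler described here emits. The rule forms in the comments, the `Edge` and `Index` types, and the function names `build_index`, `count_two_paths`, and `count_triangles` are all assumptions made for illustration. The sketch builds one hash index on the edge relation and probes it inside nested loops, so no intermediate join result is materialized.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical in-memory representation of the edge relation E(src, dst).
using Edge = std::pair<std::int64_t, std::int64_t>;
using Index = std::unordered_map<std::int64_t, std::vector<std::int64_t>>;

// Build phase of the hash join: index the edge relation on its source column.
Index build_index(const std::vector<Edge>& edges) {
    Index index;
    for (const Edge& e : edges) {
        index[e.first].push_back(e.second);
    }
    return index;
}

// Assumed rule: TwoPath(x, z) :- E(x, y), E(y, z).
// Counts directed two-hop paths by probing the index with each edge's target.
std::uint64_t count_two_paths(const std::vector<Edge>& edges) {
    const Index index = build_index(edges);
    std::uint64_t count = 0;
    for (const Edge& e : edges) {
        auto it = index.find(e.second);
        if (it != index.end()) {
            count += it->second.size();
        }
    }
    return count;
}

// Assumed rule: Triangle(x, y, z) :- E(x, y), E(y, z), E(z, x).
// Each probe feeds the next loop directly (pipelining). A directed
// triangle is counted once per rotation, so three times in total.
std::uint64_t count_triangles(const std::vector<Edge>& edges) {
    const Index index = build_index(edges);
    std::uint64_t count = 0;
    for (const Edge& e : edges) {
        const std::int64_t x = e.first;
        auto yz = index.find(e.second);
        if (yz == index.end()) continue;
        for (std::int64_t z : yz->second) {
            auto zx = index.find(z);
            if (zx == index.end()) continue;
            for (std::int64_t w : zx->second) {
                if (w == x) ++count;  // closing edge E(z, x) exists
            }
        }
    }
    return count;
}
```

Calling `count_triangles` on the directed cycle 1→2, 2→3, 3→1 visits the same triangle from each of its three starting edges. A real plan would likely deduplicate or impose an ordering on the vertices, which this sketch leaves out.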
Notes
1. All queries considered in this paper can be expressed with a single Datalog rule.
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Myers, B., Hyrkas, J., Halperin, D., Howe, B. (2015). Compiled Plans for In-Memory Path-Counting Queries. In: Jagatheesan, A., Levandoski, J., Neumann, T., Pavlo, A. (eds) In Memory Data Management and Analysis. IMDM 2013, IMDM 2014. Lecture Notes in Computer Science, vol 8921. Springer, Cham. https://doi.org/10.1007/978-3-319-13960-9_3
DOI: https://doi.org/10.1007/978-3-319-13960-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13959-3
Online ISBN: 978-3-319-13960-9
eBook Packages: Computer Science (R0)