The ubiquity of large graphs and surprising challenges of graph processing: extended survey

Sahu, Siddhartha; Mhedhbi, Amine; Salihoglu, Semih; Lin, Jimmy; Özsu, M. Tamer

doi:10.1007/s00778-019-00548-x

Special Issue Paper
Published: 29 June 2019

The VLDB Journal, Volume 29, pages 595–618 (2020)

Siddhartha Sahu ORCID: orcid.org/0000-0003-1174-5115¹,
Amine Mhedhbi¹,
Semih Salihoglu¹,
Jimmy Lin¹ &
M. Tamer Özsu¹

Abstract

Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We performed an extensive study that consisted of an online survey of 89 users, a review of the mailing lists, source repositories, and white papers of a large suite of graph software products, and in-person interviews with 6 users and 2 developers of these products. Our online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software users use; and (iv) the major challenges users face when processing their graphs. We describe the participants’ responses to our questions highlighting common patterns and challenges. Based on our interviews and survey of the rest of our sources, we were able to answer some new questions that were raised by participants’ responses to our online survey and understand the specific applications that use graph data and software. Our study revealed surprising facts about graph processing in practice. In particular, real-world graphs represent a very diverse range of entities and are often very large, scalability and visualization are undeniably the most pressing challenges faced by participants, and data integration, recommendations, and fraud detection are very popular applications supported by existing graph software. We hope these findings can guide future research.

Notes

The linear algebra software we considered, e.g., BLAS [17] and MATLAB [63], either did not have a public mailing list or their lists were inactive.
For each conference, we initially surveyed one year selected between 2014 and 2016 and later extended the survey to include the years 2017 and 2018. Note that OSDI and SOSP are held in alternating years.
Six entities that the participants mentioned did not fall under any of our 7 categories, which we list for completeness: call records, computers, cars, houses, time slots, and specialties.
Some participants selected multiple graph sizes and multiple entities, so we cannot perform a direct match of which graph size corresponds to which entity. The entities we list here are taken from the participants who selected a single graph size and entity, so we can directly match the size of the graph to the entity.
In the publications, link prediction referred to problems that predict a missing edge in a graph or data on an existing edge. Influence maximization referred to finding influential vertices in a graph, e.g., those that can bring more vertices to the graph. We did not provide detailed explanations about the problems to the participants.
This feature is called composition and is supported in SPARQL but not in the languages of some graph database systems.
This functionality is supported in RDF engines but not supported in some graph database systems.
Note that the use of the term “knowledge graph” vs other terms such as “property graph” or simply “graph” is slightly arbitrary. We found that our interviewees referred to any data stored in RDF stores as knowledge graphs. We also found that interviewees referred to graphs that represent abstract things, e.g., keyword topics or concepts, as knowledge graphs even if they were not stored in an RDF system.

References

Cyber Threat Intelligence. https://bitnine.net/agensgraph-usecase-cyber-threat-intelligence-en
Personalized Education Service. https://bitnine.net/agensgraph-usecase-personalized-education-service-en
Aggarwal, C.C., Wang, H.: Graph Data Management and Mining: A Survey of Algorithms and Applications, pp. 13–68. Springer, Berlin (2010)
Alexa. https://en.wikipedia.org/wiki/Amazon_Alexa
AliGenie. https://en.wikipedia.org/wiki/AliGenie
AllegroGraph. https://franz.com/agraph/allegrograph
Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified Stress Testing of RDF Data Management Systems. In: ISWC (2014)
Amer-Yahia, S., Pei, J. (eds.): PVLDB, Volume 11 (2017–2018). http://www.vldb.org/pvldb/vol11.html
Angles, R., Arenas, M., Barceló, P., Boncz, P.A., Fletcher, G.H.L., Gutierrez, C., Lindaaker, T., Paradies, M., Plantikow, S., Sequeda, J.F., van Rest, O., Voigt, H.: G-CORE: a core for future graph query languages. In: Proceedings of International Conference on Management of Data (2018)
Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J.L., Vrgoc, D.: Foundations of modern query languages for graph databases. ACM Comput. Surv. 50(5), 68 (2017)
AnzoGraph. https://www.cambridgesemantics.com/product/anzograph
Arpaci-Dusseau, A.C., Voelker, G. (eds.): Proceedings of the Symposium on Operating Systems Design and Implementation. USENIX Association (2018). https://www.usenix.org/conference/osdi18
ArangoDB. https://www.arangodb.com
AboutYou Data-Driven Personalization with ArangoDB. https://www.arangodb.com/why-arangodb/case-studies/aboutyou-data-driven-personalization-with-arangodb
Balcan, M., Weinberger, K.Q. (eds.): Proceedings of the International Conference on Machine Learning. JMLR.org (2016). http://jmlr.org/proceedings/papers/v48
Batarfi, O., Shawi, R.E., Fayoumi, A.G., Nouri, R., Beheshti, S.M.R., Barnawi, A., Sakr, S.: Large scale graph processing systems: survey and an experimental evaluation. Cluster Comput. 18(3), 1189–1213 (2015)
Basic Linear Algebra Subprograms. http://www.netlib.org/blas
Boncz, P., Salem, K. (eds.): PVLDB, Volume 10 (2016–2017). http://www.vldb.org/pvldb/vol10.html
Bridgeman, S., Tamassia, R.: A User Study in Similarity Measures for Graph Drawing, pp. 19–30. Springer, Berlin (2001)
Cayley. https://cayley.io
Ching, A., Edunov, S., Kabiljo, M., Logothetis, D., Muthukrishnan, S.: One trillion edges: graph processing at facebook-scale. PVLDB 8(12), 1804–1815 (2015)
Click Farm. https://en.wikipedia.org/wiki/Click_farm
Conceptual Graphs. http://conceptualgraphs.org
Cui, W., Qu, H.: A Survey on Graph Visualization. PhD Qualifying Exam Report, Computer Science Department, Hong Kong University of Science and Technology (2007)
Cytoscape. http://www.cytoscape.org
DGraph. https://dgraph.io
DTD and XSD XML Schemas. https://www.w3.org/standards/xml/schema
Dy, J.G., Krause, A. (eds.): Proceedings of the International Conference on Machine Learning. JMLR.org (2018). http://jmlr.org/proceedings/papers/v80/
Elasticsearch X-Pack Graph. https://www.elastic.co/products/x-pack/graph
Apache Flink. https://flink.apache.org
Apache Flink User Survey 2016. https://github.com/dataArtisans/flink-user-survey-2016
FullContact. https://www.fullcontact.com
Gephi. https://gephi.org
Apache Giraph. https://giraph.apache.org
Graph for Scala. http://www.scala-graph.org
Graph 500 Benchmarks. http://graph500.org
GraphStream. http://graphstream-project.org
Graph-tool. https://graph-tool.skewed.de
Graphviz. https://graphviz.readthedocs.io
Apache Spark GraphX. https://spark.apache.org/graphx
Apache TinkerPop. https://tinkerpop.apache.org
Group, W.: Common format for exchange of solved load flow data. IEEE Trans. Power App. Syst. 92(6), 1916–1925 (1973)
GQL Standard. https://www.gqlstandards.org
Haase, P., Broekstra, J., Eberhart, A., Volz, R.: A Comparison of RDF Query Languages, pp. 502–517. Springer, Berlin (2004)
Herman, I., Melançon, G., Marshall, M.S.: Graph visualization and navigation in information visualization: a survey. IEEE Trans. Vis. Comput. Graph. 6(1), 24–43 (2000)
Holten, D., van Wijk, J.J.: A User Study on Visualizing Directed Edges in Graphs. In: Proceedings of International Conference on Human Factors in Computing Systems (2009)
Holzschuher, F., Peinl, R.: Performance of graph query languages: comparison of Cypher, Gremlin and Native Access in Neo4j. In: Proceedings of the Joint EDBT/ICDT Workshops (2013)
Jagadish, H.V., Zhou, A. (eds.): PVLDB, Vol. 7 (2013–2014). http://www.vldb.org/pvldb/vol7.html
JanusGraph. http://janusgraph.org
Jayaram, N., Khan, A., Li, C., Yan, X., Elmasri, R.: Querying knowledge graphs by example entity tuples. In: Proceedings of International Conference on Data Engineering (2016)
JDBC. http://www.oracle.com/technetwork/java/overview-141217.html
Apache Jena. https://jena.apache.org
Katifori, A., Halatsis, C., Lepouras, G., Vassilakis, C., Giannopoulou, E.: Ontology visualization methods: a survey. ACM Comput. Surv. 39(4), 10 (2007)
Proceedings of the International Conference on Knowledge Discovery and Data Mining. ACM (2015). http://dl.acm.org/citation.cfm?id=2783258
Proceedings of the International Conference on Knowledge Discovery and Data Mining. ACM (2017). http://dl.acm.org/citation.cfm?id=3097983
Proceedings of the International Conference on Knowledge Discovery and Data Mining. ACM (2018). http://dl.acm.org/citation.cfm?id=3219819
Keeton, K., Roscoe, T. (eds.): Proceedings of the Symposium on Operating Systems Design and Implementation. USENIX Association (2016). https://www.usenix.org/conference/osdi16
Knowledge Graph at Siemens. https://youtu.be/9pmQXua9LWA?t=1109
LDBC Benchmarks. http://ldbcouncil.org/benchmarks
Letunic, I., Bork, P.: Interactive tree of life: an online tool for phylogenetic tree display and annotation. Bioinformatics 23(1), 127–128 (2006)
Lu, Y., Cheng, J., Yan, D., Wu, H.: Large-scale Distributed graph computing systems: an experimental evaluation. PVLDB 8(3), 281–292 (2014)
Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: Proceedings of International Conference on Management of Data (2010)
MATLAB. https://www.mathworks.com
Mattson, T., Bader, D.A., Berry, J.W., Buluç, A., Dongarra, J.J., Faloutsos, C., Feo, J., Gilbert, J.R., Gonzalez, J., Hendrickson, B., Kepner, J., Leiserson, C.E., Lumsdaine, A., Padua, D.A., Poole, S., Reinhardt, S.P., Stonebraker, M., Wallach, S., Yoo, A.: Standards for graph algorithm primitives. In: Proceedings of High Performance Extreme Computing Conference (2013)
Neo4j. https://neo4j.com
Detect Fraud in Real Time with Graph Databases. https://neo4j.com/whitepapers/fraud-detection-graph-databases
The 2016 State of the Graph Report. https://neo4j.com/resources/2016-state-of-the-graph
NetworKit. https://networkit.iti.kit.edu
NetworkX. https://networkx.github.io
GraphDB by Ontotext. https://www.ontotext.com/products/graphdb
OpenBEL. http://openbel.org
openCypher. http://www.opencypher.org
OrientDB. https://orientdb.com
Pienta, R., Tamersoy, A., Endert, A., Navathe, S., Tong, H., Chau, D.H.: VISAGE: Interactive visual graph querying. In: Proceedings of International Working Conference on Advanced Visual Interfaces (2016)
Precup, D., Teh, Y.W. (eds.): Proceedings of the International Conference on Machine Learning. JMLR.org (2017). http://jmlr.org/proceedings/papers/v70
Qiu, X., Cen, W., Qian, Z., Peng, Y., Zhang, Y., Lin, X., Zhou, J.: Real-time constrained cycle detection in large dynamic graphs. PVLDB 11(12), 1876–1888 (2018)
Rath, M., Akehurst, D., Borowski, C., Mäder, P.: Are graph query languages applicable for requirements traceability analysis? In: Proceedings of International Conference on Requirements Engineering: Foundation for Software Quality (2017)
van Rest, O., Hong, S., Kim, J., Meng, X., Chafi, H.: PGQL: a property graph query language. In: Proceedings of Graph Data Management Experiences and Systems (2016)
Rodriguez, M.A.: The Gremlin Graph Traversal Machine and Language. CoRR arXiv:1508.03843 (2015)
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press (2016). https://dl.acm.org/citation.cfm?id=3014904
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press (2017). https://dl.acm.org/citation.cfm?id=3126908
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press (2018). https://dl.acm.org/citation.cfm?id=3291656
Sharma, A., Jiang, J., Bommannavar, P., Larson, B., Lin, J.: GraphJet: real-time content recommendations at twitter. PVLDB 9(13), 1281–1292 (2016)
SNAP: Stanford Network Analysis Project. https://snap.stanford.edu
Proceedings of the Symposium on Cloud Computing. ACM (2015). http://dl.acm.org/citation.cfm?id=2806777
Proceedings of the Symposium on Cloud Computing. ACM (2017). http://dl.acm.org/citation.cfm?id=3127479
Proceedings of the Symposium on Cloud Computing. ACM (2018). http://dl.acm.org/citation.cfm?id=3267809
Proceedings of the Symposium on Operating Systems Principles. ACM (2017). http://dl.acm.org/citation.cfm?id=3132747
Apache Spark—Preparing for the Next Wave of Reactive Big Data. https://info.lightbend.com/white-paper-spark-survey-trends-adoption-report-register.html
Sparksee. http://www.sparsity-technologies.com
Stardog. https://www.stardog.com
State Grid. http://www.sgcc.com.cn/ywlm/index.shtml
TigerGraph. https://www.tigergraph.com
The TPC-C Benchmark. http://www.tpc.org/tpcc
Vehlow, C., Beck, F., Weiskopf, D.: Visualizing group structures in graphs: a survey. Computer Graphics Forum 36(6), 201–225 (2017)
OpenLink Virtuoso. https://virtuoso.openlinksw.com
Wang, C., Tao, J.: Graphs in scientific visualization: a survey. Computer Graphics Forum 36(1), 263–287 (2017)
Zhao, Y., Yuan, C., Liu, G., Grinberg, I.: Graph-based preconditioning conjugate gradient algorithm for “N-1” contingency analysis. In: IEEE Power Energy Society General Meeting (2018)

Acknowledgements

We are grateful to Chen Zou for helping us in using online survey tools and drafting an early version of this survey. We are also grateful to Nafisa Anzum, Jeremy Chen, Pranjal Gupta, Chathura Kankanamge, and Shahid Khaliq for their valuable comments on the survey and help in categorizing the academic publications, user emails, and issues. We would like to thank our online survey participants and our interviewees: Brad Bebee, Scott Brave, Jordan Crombie, Luna Dong, William Hayes, Thomas Hubauer, Peter Lawrence, Stephen Ruppert, Chen Yuan, with a special thanks to Zhengping Qian for the many hours of follow-up discussions after our interview. Finally, we would like to thank the anonymous reviewers for their valuable comments. This research was partially supported by multiple Discovery Grants from the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Author information

Authors and Affiliations

University of Waterloo, Waterloo, Canada
Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin & M. Tamer Özsu

Corresponding author

Correspondence to Siddhartha Sahu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (xlsx 83 KB)

Appendices

Choices of graph computations

One way to ask this question is to include a short-answer question that asks “What queries and graph computations do you perform on your graphs?”. However, the terms graph queries and computations are very general and we thought this version of the question could be under-specified. We also knew that participants respond less to short-answer questions, so instead we first asked a multiple-choice question followed by a short-answer question for computations that may not have appeared in the first question as a choice.

In a multiple-choice question, it is very challenging to provide a list of graph queries and computations from which participants can select, as there is no consensus on what constitutes a graph computation, let alone a reasonable taxonomy of graph computations. We decided to select a list of graph queries and computations that appeared in the publications of six conferences, as described in Sect. 2.2. We use the term graph computation here to refer to a query, a problem, or an algorithm.

For each of the 90 papers, we identified a graph computation if (i) it was directly studied in the paper; or (ii) for papers describing a software system, it was used to evaluate that system. We used our best judgment to group under a single category the computations that were variants of each other or appeared under different names. For example, we categorized motif finding, subgraph finding, and subgraph matching as subgraph matching. When reviewing papers studying linear algebra operations, e.g., a matrix–vector multiplication, for solving a graph problem such as BFS traversal, we identified the graph problem and not the linear algebra operation as a computation.

Finally, for each identified and categorized computation, we counted the number of papers that study it and selected the ones that appeared in at least 2 papers. In the end, we provided the participants with the 13 choices that are shown in Table 10.
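
The selection procedure described above (map variant names to one category, count the papers per category, keep the categories that appear in at least 2 papers) can be sketched as follows. This is a minimal illustration and not the tooling used in the study: the `ALIASES` map, the function name `select_computations`, and the sample `papers` input are hypothetical. Only the subgraph-matching grouping and the threshold of 2 papers are taken from the text.

```python
from collections import Counter

# Hypothetical alias map: variant name -> canonical category.
# Only the subgraph-matching grouping comes from the text.
ALIASES = {
    "motif finding": "subgraph matching",
    "subgraph finding": "subgraph matching",
}


def select_computations(papers, min_papers=2):
    """Return {computation: paper_count} for computations in >= min_papers papers."""
    counts = Counter()
    for computations in papers.values():
        # Normalize names; the set makes a paper count once per category.
        canonical = {ALIASES.get(c.lower(), c.lower()) for c in computations}
        counts.update(canonical)
    return {c: n for c, n in counts.items() if n >= min_papers}


# Made-up input for illustration: paper id -> computations identified in it.
papers = {
    "paper-1": ["Motif finding", "PageRank"],
    "paper-2": ["Subgraph matching", "Subgraph finding"],
    "paper-3": ["BFS"],
}
selected = select_computations(papers)
print(selected)
```

Counting a set per paper, instead of the raw list, keeps a paper that mentions two variants of the same computation from being counted twice.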

Choices of machine learning computations

Similar to graph computation, machine learning computation is a very general term. Instead of providing a list of ad hoc computations as choices, we reviewed each machine learning computation that appeared in the 90 graph papers we had selected. Specifically, the list of machine learning computations we identified included the following: (i) high-level classes of machine learning techniques, such as clustering, classification, and regression; (ii) specific algorithms and techniques, such as stochastic gradient descent and alternating least squares that can be used as part of multiple higher-level techniques; and (iii) problems that are commonly solved using a machine learning technique, such as community detection, link prediction, and recommendations. We then selected the computations, i.e., high-level techniques, specific techniques, or problems, that appeared in at least 2 papers. As in the graph computations question, we used our best judgment to identify and categorize similar computations under the same name.

Storage in multiple formats

We asked the 33 participants who said that they store their data in multiple formats a short-answer question about which formats they use. Out of the 33 participants, 25 responded. Their responses contained explicit data storage formats as well as the internal formats of different software. Table 19 shows the number of responses we received for the main formats. The combination of a relational database format and a graph database format was the most popular. Other combinations varied significantly, examples of which include HBase and Hive, GraphML and CSV, and XML and triplestore.

Table 19 Data storage formats

Table 20 Graph sizes in user emails and issues

Table 21 Challenges found in user emails and issues

Other tables from the survey

Table 20 shows the sizes of graphs we found in user emails and issues. Table 21 shows the number of emails and issues we identified for each specific challenge we discussed in Sect. 3.4.2. Table 22 shows the total number of emails and issues we reviewed for each software product from January to September of 2017. The table also shows the number of commits in the code repositories of each software product during the same period.

Table 22 Number of reviewed emails and issues, and the code commits in the repositories of each software product

Other applications from interviews

Large-Scale Data Integration for Analysis of Turbines: Our interviewee from Siemens described an application, built on a knowledge graph, that Siemens engineers use to analyze different properties of the gas turbines Siemens produces. The application emphasized the advantage of using graphs to integrate different sources of corporate data, in this case mainly about where turbines are installed, measurements from the installed turbines’ sensors, and information about maintenance activity on the turbines. The knowledge graph is stored in an RDF engine and engineers ask queries, such as “What is the mean time failure of turbines with coating loss?” through a visual interface where they navigate the different types of nodes in the knowledge graph and express aggregations. The visually expressed queries get translated to SPARQL queries.

Contact Deduplication: One of our interviews was with two engineers from a company called FullContact that manages contact information about individuals by integrating public and manually curated information, which is sold to other businesses. A graph with over 10 billion edges and 4 billion vertices models this contact information as follows: nodes represent different pieces of information, such as addresses and phone numbers, and edges between nodes indicate the likelihood that the information belongs to the same individual. The company uses GraphX to run a connected component-like algorithm to find the contacts that are likely to belong to the same individual.
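
To make the idea of a connected component-like grouping concrete, here is a minimal single-machine sketch using union-find. It is not the interviewees' GraphX job. The text does not say how the edge likelihoods are used, so the `threshold` parameter, the function name `group_contacts`, and the sample edges are all assumptions made for illustration.

```python
def find(parent, x):
    """Return the representative of x, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def group_contacts(edges, threshold=0.5):
    """Group nodes linked by edges whose likelihood is at least `threshold`.

    `threshold` is an assumed cutoff, not something stated in the survey.
    """
    parent = {}
    for u, v, likelihood in edges:
        parent.setdefault(u, u)
        parent.setdefault(v, v)
        if likelihood >= threshold:
            root_u, root_v = find(parent, u), find(parent, v)
            if root_u != root_v:
                parent[root_u] = root_v  # merge the two components
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    # Sort for a deterministic result.
    return sorted(groups.values(), key=lambda g: sorted(g))


# Made-up edges: (piece of information, piece of information, likelihood).
edges = [
    ("addr:1 Main St", "phone:555-0100", 0.9),
    ("phone:555-0100", "email:a@example.com", 0.8),
    ("email:a@example.com", "addr:9 Elm Rd", 0.2),
    ("addr:9 Elm Rd", "phone:555-0199", 0.7),
]
contacts = group_contacts(edges)
print(contacts)
```

With the assumed cutoff of 0.5, the low-likelihood edge is ignored and the five pieces of information fall into two groups; lowering the cutoff below 0.2 merges them into one.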

Other applications using knowledge graphs: One of our interviewees was a consultant to a chemical company specializing in agricultural chemicals. The company has a knowledge graph of over 30 billion edges on pesticides, seeds, and chemicals that is stored in a commercial RDF system. This graph is used by many applications, for example, to track the evolution of seeds and to power internal wiki pages and tools used by analysts. Another interviewee was the founder of a start-up that works on tools that can be used by biologists to publish biological knowledge. Our interviewee described examples of knowledge graphs that model the cellular activity in the context of different species’ different tissues. Triples included facts about which genes transcribe which protein and which proteins’ presence decreases other proteins’ presence, etc. [71]. The example applications were broadly similar to those described by our interviewee from the chemical company: they power wikis and Web sites that biologists use to analyze these interactions.

About this article

Cite this article

Sahu, S., Mhedhbi, A., Salihoglu, S. et al. The ubiquity of large graphs and surprising challenges of graph processing: extended survey. The VLDB Journal 29, 595–618 (2020). https://doi.org/10.1007/s00778-019-00548-x

Received: 21 January 2019
Revised: 09 May 2019
Accepted: 13 June 2019
Published: 29 June 2019
Issue Date: May 2020
DOI: https://doi.org/10.1007/s00778-019-00548-x
