The ubiquity of large graphs and surprising challenges of graph processing: extended survey

Abstract

Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We performed an extensive study that consisted of an online survey of 89 users, a review of the mailing lists, source repositories, and white papers of a large suite of graph software products, and in-person interviews with 6 users and 2 developers of these products. Our online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software users use; and (iv) the major challenges users face when processing their graphs. We describe the participants’ responses to our questions highlighting common patterns and challenges. Based on our interviews and survey of the rest of our sources, we were able to answer some new questions that were raised by participants’ responses to our online survey and understand the specific applications that use graph data and software. Our study revealed surprising facts about graph processing in practice. In particular, real-world graphs represent a very diverse range of entities and are often very large, scalability and visualization are undeniably the most pressing challenges faced by participants, and data integration, recommendations, and fraud detection are very popular applications supported by existing graph software. We hope these findings can guide future research.



Notes

  1. 1.

    The linear algebra software we considered, e.g., BLAS [17] and MATLAB [63], either did not have public mailing lists or had inactive ones.

  2. 2.

    For each conference, we initially surveyed one year selected between 2014 and 2016 and later extended the survey to include the years 2017 and 2018. Note that OSDI and SOSP are held in alternating years.

  3. 3.

    Six entities that the participants mentioned did not fall under any of our 7 categories, which we list for completeness: call records, computers, cars, houses, time slots, and specialties.

  4. 4.

    Some participants selected multiple graph sizes and multiple entities, so we cannot perform a direct match of which graph size corresponds to which entity. The entities we list here are taken from the participants who selected a single graph size and entity, so we can directly match the size of the graph to the entity.

  5. 5.

    In the publications, link prediction referred to problems that predict a missing edge in a graph or data on an existing edge. Influence maximization referred to finding influential vertices in a graph, e.g., those that can bring more vertices to the graph. We did not provide detailed explanations about the problems to the participants.

  6. 6.

    This feature is called composition and is supported in SPARQL but not in the languages of some graph database systems.

  7. 7.

    This functionality is supported in RDF engines but not supported in some graph database systems.

  8. 8.

    Note that the use of the term “knowledge graph” vs other terms such as “property graph” or simply “graph” is slightly arbitrary. We found our interviewees referring to any data stored in RDF stores as knowledge graphs. We also found that interviewees referred to graphs that represent abstract things, e.g., keyword topics or concepts, as knowledge graphs even if they were not stored in an RDF system.

References

  1. 1.

    Cyber Threat Intelligence. https://bitnine.net/agensgraph-usecase-cyber-threat-intelligence-en

  2. 2.

    Personalized Education Service. https://bitnine.net/agensgraph-usecase-personalized-education-service-en

  3. 3.

    Aggarwal, C.C., Wang, H.: Graph Data Management and Mining: A Survey of Algorithms and Applications, pp. 13–68. Springer, Berlin (2010)


  4. 4.

    Alexa. https://en.wikipedia.org/wiki/Amazon_Alexa

  5. 5.

    AliGenie. https://en.wikipedia.org/wiki/AliGenie

  6. 6.

    AllegroGraph. https://franz.com/agraph/allegrograph

  7. 7.

    Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified Stress Testing of RDF Data Management Systems. In: ISWC (2014)

  8. 8.

    Amer-Yahia, S., Pei, J. (eds.): PVLDB, Volume 11 (2017–2018). http://www.vldb.org/pvldb/vol11.html

  9. 9.

    Angles, R., Arenas, M., Barceló, P., Boncz, P.A., Fletcher, G.H.L., Gutierrez, C., Lindaaker, T., Paradies, M., Plantikow, S., Sequeda, J.F., van Rest, O., Voigt, H.: G-CORE: a core for future graph query languages. In: Proceedings of International Conference on Management of Data (2018)

  10. 10.

    Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J.L., Vrgoc, D.: Foundations of modern query languages for graph databases. ACM Comput. Surv. 50(5), 68 (2017)


  11. 11.

    AnzoGraph. https://www.cambridgesemantics.com/product/anzograph

  12. 12.

    Arpaci-Dusseau, A.C., Voelker, G. (eds.): Proceedings of the Symposium on Operating Systems Design and Implementation. USENIX Association (2018). https://www.usenix.org/conference/osdi18

  13. 13.

    ArangoDB. https://www.arangodb.com

  14. 14.

    AboutYou Data-Driven Personalization with ArangoDB. https://www.arangodb.com/why-arangodb/case-studies/aboutyou-data-driven-personalization-with-arangodb

  15. 15.

    Balcan, M., Weinberger, K.Q. (eds.): Proceedings of the International Conference on Machine Learning. JMLR.org (2016). http://jmlr.org/proceedings/papers/v48

  16. 16.

    Batarfi, O., Shawi, R.E., Fayoumi, A.G., Nouri, R., Beheshti, S.M.R., Barnawi, A., Sakr, S.: Large scale graph processing systems: survey and an experimental evaluation. Cluster Comput. 18(3), 1189–1213 (2015)


  17. 17.

    Basic Linear Algebra Subprograms. http://www.netlib.org/blas

  18. 18.

    Boncz, P., Salem, K. (eds.): PVLDB, Volume 10 (2016–2017). http://www.vldb.org/pvldb/vol10.html

  19. 19.

    Bridgeman, S., Tamassia, R.: A User Study in Similarity Measures for Graph Drawing, pp. 19–30. Springer, Berlin (2001)


  20. 20.

    Cayley. https://cayley.io

  21. 21.

    Ching, A., Edunov, S., Kabiljo, M., Logothetis, D., Muthukrishnan, S.: One trillion edges: graph processing at facebook-scale. PVLDB 8(12), 1804–1815 (2015)


  22. 22.

    Click Farm. https://en.wikipedia.org/wiki/Click_farm

  23. 23.

    Conceptual Graphs. http://conceptualgraphs.org

  24. 24.

    Cui, W., Qu, H.: A Survey on Graph Visualization. PhD Qualifying Exam Report, Computer Science Department, Hong Kong University of Science and Technology (2007)

  25. 25.

    Cytoscape. http://www.cytoscape.org

  26. 26.

    DGraph. https://dgraph.io

  27. 27.

    DTD and XSD XML Schemas. https://www.w3.org/standards/xml/schema

  28. 28.

    Dy, J.G., Krause, A. (eds.): Proceedings of the International Conference on Machine Learning. JMLR.org (2018). http://jmlr.org/proceedings/papers/v80/

  29. 29.

    Elasticsearch X-Pack Graph. https://www.elastic.co/products/x-pack/graph

  30. 30.

    Apache Flink. https://flink.apache.org

  31. 31.

    Apache Flink User Survey 2016. https://github.com/dataArtisans/flink-user-survey-2016

  32. 32.

    FullContact. https://www.fullcontact.com

  33. 33.

    Gephi. https://gephi.org

  34. 34.

    Apache Giraph. https://giraph.apache.org

  35. 35.

    Graph for Scala. http://www.scala-graph.org

  36. 36.

    Graph 500 Benchmarks. http://graph500.org

  37. 37.

    GraphStream. http://graphstream-project.org

  38. 38.

    Graph-tool. https://graph-tool.skewed.de

  39. 39.

    Graphviz. https://graphviz.readthedocs.io

  40. 40.

    Apache Spark GraphX. https://spark.apache.org/graphx

  41. 41.

    Apache TinkerPop. https://tinkerpop.apache.org

  42. 42.

    Group, W.: Common format for exchange of solved load flow data. IEEE Trans. Power App. Syst. 92(6), 1916–1925 (1973)


  43. 43.

    GQL Standard. https://www.gqlstandards.org

  44. 44.

    Haase, P., Broekstra, J., Eberhart, A., Volz, R.: A Comparison of RDF Query Languages, pp. 502–517. Springer, Berlin (2004)


  45. 45.

    Herman, I., Melançon, G., Marshall, M.S.: Graph visualization and navigation in information visualization: a survey. IEEE Trans. Vis. Comput. Graph. 6(1), 24–43 (2000)


  46. 46.

    Holten, D., van Wijk, J.J.: A User Study on Visualizing Directed Edges in Graphs. In: Proceedings of International Conference on Human Factors in Computing Systems (2009)

  47. 47.

    Holzschuher, F., Peinl, R.: Performance of graph query languages: comparison of Cypher, Gremlin and Native Access in Neo4j. In: Proceedings of the Joint EDBT/ICDT Workshops (2013)

  48. 48.

    Jagadish, H.V., Zhou, A. (eds.): PVLDB, Volume 7 (2013–2014). http://www.vldb.org/pvldb/vol7.html

  49. 49.

    JanusGraph. http://janusgraph.org

  50. 50.

    Jayaram, N., Khan, A., Li, C., Yan, X., Elmasri, R.: Querying knowledge graphs by example entity tuples. In: Proceedings of International Conference on Data Engineering (2016)

  51. 51.

    JDBC. http://www.oracle.com/technetwork/java/overview-141217.html

  52. 52.

    Apache Jena. https://jena.apache.org

  53. 53.

    Katifori, A., Halatsis, C., Lepouras, G., Vassilakis, C., Giannopoulou, E.: Ontology visualization methods: a survey. ACM Comput. Surv. 39(4), 10 (2007)


  54. 54.

    Proceedings of the International Conference on Knowledge Discovery and Data Mining. ACM (2015). http://dl.acm.org/citation.cfm?id=2783258

  55. 55.

    Proceedings of the International Conference on Knowledge Discovery and Data Mining. ACM (2017). http://dl.acm.org/citation.cfm?id=3097983

  56. 56.

    Proceedings of the International Conference on Knowledge Discovery and Data Mining. ACM (2018). http://dl.acm.org/citation.cfm?id=3219819

  57. 57.

    Keeton, K., Roscoe, T. (eds.): Proceedings of the Symposium on Operating Systems Design and Implementation. USENIX Association (2016). https://www.usenix.org/conference/osdi16

  58. 58.

    Knowledge Graph at Siemens. https://youtu.be/9pmQXua9LWA?t=1109

  59. 59.

    LDBC Benchmarks. http://ldbcouncil.org/benchmarks

  60. 60.

    Letunic, I., Bork, P.: Interactive tree of life: an online tool for phylogenetic tree display and annotation. Bioinformatics 23(1), 127–128 (2006)


  61. 61.

    Lu, Y., Cheng, J., Yan, D., Wu, H.: Large-scale Distributed graph computing systems: an experimental evaluation. PVLDB 8(3), 281–292 (2014)


  62. 62.

    Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: Proceedings of International Conference on Management of Data (2010)

  63. 63.

    MATLAB. https://www.mathworks.com

  64. 64.

    Mattson, T., Bader, D.A., Berry, J.W., Buluç, A., Dongarra, J.J., Faloutsos, C., Feo, J., Gilbert, J.R., Gonzalez, J., Hendrickson, B., Kepner, J., Leiserson, C.E., Lumsdaine, A., Padua, D.A., Poole, S., Reinhardt, S.P., Stonebraker, M., Wallach, S., Yoo, A.: Standards for graph algorithm primitives. In: Proceedings of High Performance Extreme Computing Conference (2013)

  65. 65.

    Neo4j. https://neo4j.com

  66. 66.

    Detect Fraud in Real Time with Graph Databases. https://neo4j.com/whitepapers/fraud-detection-graph-databases

  67. 67.

    The 2016 State of the Graph Report. https://neo4j.com/resources/2016-state-of-the-graph

  68. 68.

    NetworKit. https://networkit.iti.kit.edu

  69. 69.

    NetworkX. https://networkx.github.io

  70. 70.

    GraphDB by Ontotext. https://www.ontotext.com/products/graphdb

  71. 71.

    OpenBEL. http://openbel.org

  72. 72.

    openCypher. http://www.opencypher.org

  73. 73.

    OrientDB. https://orientdb.com

  74. 74.

    Pienta, R., Tamersoy, A., Endert, A., Navathe, S., Tong, H., Chau, D.H.: VISAGE: Interactive visual graph querying. In: Proceedings of International Working Conference on Advanced Visual Interfaces (2016)

  75. 75.

    Precup, D., Teh, Y.W. (eds.): Proceedings of the International Conference on Machine Learning. JMLR.org (2017). http://jmlr.org/proceedings/papers/v70

  76. 76.

    Qiu, X., Cen, W., Qian, Z., Peng, Y., Zhang, Y., Lin, X., Zhou, J.: Real-time constrained cycle detection in large dynamic graphs. PVLDB 11(12), 1876–1888 (2018)


  77. 77.

    Rath, M., Akehurst, D., Borowski, C., Mäder, P.: Are graph query languages applicable for requirements traceability analysis? In: Proceedings of International Conference on Requirements Engineering: Foundation for Software Quality (2017)

  78. 78.

    van Rest, O., Hong, S., Kim, J., Meng, X., Chafi, H.: PGQL: a property graph query language. In: Proceedings of Graph Data Management Experiences and Systems (2016)

  79. 79.

    Rodriguez, M.A.: The Gremlin Graph Traversal Machine and Language. CoRR arXiv:1508.03843 (2015)

  80. 80.

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press (2016). https://dl.acm.org/citation.cfm?id=3014904

  81. 81.

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press (2017). https://dl.acm.org/citation.cfm?id=3126908

  82. 82.

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press (2018). https://dl.acm.org/citation.cfm?id=3291656

  83. 83.

    Sharma, A., Jiang, J., Bommannavar, P., Larson, B., Lin, J.: GraphJet: real-time content recommendations at twitter. PVLDB 9(13), 1281–1292 (2016)


  84. 84.

    SNAP: Stanford Network Analysis Project. https://snap.stanford.edu

  85. 85.

    Proceedings of the Symposium on Cloud Computing. ACM (2015). http://dl.acm.org/citation.cfm?id=2806777

  86. 86.

    Proceedings of the Symposium on Cloud Computing. ACM (2017). http://dl.acm.org/citation.cfm?id=3127479

  87. 87.

    Proceedings of the Symposium on Cloud Computing. ACM (2018). http://dl.acm.org/citation.cfm?id=3267809

  88. 88.

    Proceedings of the Symposium on Operating Systems Principles. ACM (2017). http://dl.acm.org/citation.cfm?id=3132747

  89. 89.

    Apache Spark—Preparing for the Next Wave of Reactive Big Data. https://info.lightbend.com/white-paper-spark-survey-trends-adoption-report-register.html

  90. 90.

    Sparksee. http://www.sparsity-technologies.com

  91. 91.

    Stardog. https://www.stardog.com

  92. 92.

    State Grid. http://www.sgcc.com.cn/ywlm/index.shtml

  93. 93.

    TigerGraph. https://www.tigergraph.com

  94. 94.

    The TPC-C Benchmark. http://www.tpc.org/tpcc

  95. 95.

    Vehlow, C., Beck, F., Weiskopf, D.: Visualizing group structures in graphs: a survey. Computer Graphics Forum 36(6), 201–225 (2017)


  96. 96.

    OpenLink Virtuoso. https://virtuoso.openlinksw.com

  97. 97.

    Wang, C., Tao, J.: Graphs in scientific visualization: a survey. Computer Graphics Forum 36(1), 263–287 (2017)


  98. 98.

    Zhao, Y., Yuan, C., Liu, G., Grinberg, I.: Graph-based preconditioning conjugate gradient algorithm for “N-1” contingency analysis. In: IEEE Power Energy Society General Meeting (2018)


Acknowledgements

We are grateful to Chen Zou for helping us use online survey tools and for drafting an early version of this survey. We are also grateful to Nafisa Anzum, Jeremy Chen, Pranjal Gupta, Chathura Kankanamge, and Shahid Khaliq for their valuable comments on the survey and help in categorizing the academic publications, user emails, and issues. We would like to thank our online survey participants and our interviewees: Brad Bebee, Scott Brave, Jordan Crombie, Luna Dong, William Hayes, Thomas Hubauer, Peter Lawrence, Stephen Ruppert, and Chen Yuan, with special thanks to Zhengping Qian for the many hours of follow-up discussions after our interview. Finally, we would like to thank the anonymous reviewers for their valuable comments. This research was partially supported by multiple Discovery Grants from the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Author information

Corresponding author

Correspondence to Siddhartha Sahu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (xlsx 83 KB)

Appendices

Choices of graph computations

One way to ask this question is to include a short-answer question that asks “What queries and graph computations do you perform on your graphs?”. However, the terms graph queries and computations are very general and we thought this version of the question could be under-specified. We also knew that participants respond less to short-answer questions, so instead we first asked a multiple-choice question followed by a short-answer question for computations that may not have appeared in the first question as a choice.

In a multiple-choice question, it is very challenging to provide a list of graph queries and computations from which participants can select, as there is no consensus on what constitutes a graph computation, let alone a reasonable taxonomy of graph computations. We decided to select a list of graph queries and computations that appeared in the publications of six conferences, as described in Sect. 2.2. We use the term graph computation here to refer to a query, a problem, or an algorithm.

For each of the 90 papers, we identified a graph computation if (i) it was directly studied in the paper; or (ii) for papers describing a software system, it was used to evaluate the software. We used our best judgment to group computations that were variants of each other, or that appeared under different names, into a single category. For example, we categorized motif finding, subgraph finding, and subgraph matching as subgraph matching. When reviewing papers that study linear algebra operations, e.g., a matrix–vector multiplication, for solving a graph problem such as BFS traversal, we identified the graph problem and not the linear algebra operation as a computation.

Finally, for each identified and categorized computation, we counted the number of papers that study it and selected the ones that appeared in at least 2 papers. In the end, we provided the participants with the 13 choices that are shown in Table 10.
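The selection procedure described above can be sketched in a few lines of Python. This is only an illustration and not the tooling used in the study: the alias table, the `select_choices` helper, and the toy paper data are hypothetical. Only the grouping of the subgraph matching variants and the threshold of 2 papers come from the text.

```python
from collections import Counter

# Hypothetical alias table mapping variant names to one category.
# Only the subgraph matching grouping is taken from the text.
ALIASES = {
    "motif finding": "subgraph matching",
    "subgraph finding": "subgraph matching",
    "subgraph matching": "subgraph matching",
}


def select_choices(papers, min_papers=2):
    """Return the computations that appear in at least `min_papers` papers.

    `papers` maps a paper identifier to the computation names found in it.
    """
    counts = Counter()
    for computations in papers.values():
        # Normalize names, then deduplicate so that each paper
        # counts at most once per category.
        categories = {ALIASES.get(name, name) for name in computations}
        counts.update(categories)
    return sorted(name for name, n in counts.items() if n >= min_papers)


# Toy input, invented for illustration only.
toy_papers = {
    "paper-a": ["motif finding", "bfs traversal"],
    "paper-b": ["subgraph matching"],
    "paper-c": ["subgraph finding", "motif finding", "pagerank"],
}
print(select_choices(toy_papers))
```

Deduplicating per paper before counting matters here: a paper that mentions two variants of the same computation should still count as a single paper for that category.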

Choices of machine learning computations

Similar to graph computation, machine learning computation is a very general term. Instead of providing a list of ad hoc computations as choices, we reviewed each machine learning computation that appeared in the 90 graph papers we had selected. Specifically, the list of machine learning computations we identified included the following: (i) high-level classes of machine learning techniques, such as clustering, classification, and regression; (ii) specific algorithms and techniques, such as stochastic gradient descent and alternating least squares that can be used as part of multiple higher-level techniques; and (iii) problems that are commonly solved using a machine learning technique, such as community detection, link prediction, and recommendations. We then selected the computations, i.e., high-level techniques, specific techniques, or problems, that appeared in at least 2 papers. As in the graph computations question, we used our best judgment to identify and categorize similar computations under the same name.

Storage in multiple formats

In a short-answer question, we asked the 33 participants who said that they store their data in multiple formats which formats they use. Out of the 33 participants, 25 responded. Their responses contained explicit data storage formats as well as the internal formats of different software. Table 19 shows the number of responses we received for the main formats. The most popular combination was a relational database format together with a graph database format. Other combinations varied significantly; examples include HBase and Hive, GraphML and CSV, and XML and triplestore.

Table 19 Data storage formats
Table 20 Graph sizes in user emails and issues
Table 21 Challenges found in user emails and issues

Other tables from the survey

Table 20 shows the sizes of graphs we found in user emails and issues. Table 21 shows the number of emails and issues we identified for each specific challenge we discussed in Sect. 3.4.2. Table 22 shows the total number of emails and issues we reviewed for each software product from January to September of 2017. The table also shows the number of commits in the code repositories of each software product during the same period.

Table 22 Number of reviewed emails and issues, and the code commits in the repositories of each software product

Other applications from interviews

Large-Scale Data Integration for Analysis of Turbines: Our interviewee from Siemens described an application that Siemens engineers use to analyze different properties of gas turbines Siemens produces using a knowledge graph. The application emphasized the advantage of using graphs to integrate different sources of corporate data, in this case mainly about where turbines are installed, measurements from the installed turbines’ sensors, and information about maintenance activity on the turbines. The knowledge graph is stored in an RDF engine and engineers ask queries, such as “What is the mean time failure of turbines with coating loss?” through a visual interface where they navigate the different types of nodes in the knowledge graph and express aggregations. The visually expressed queries get translated to SPARQL queries.

Contact Deduplication: One of our interviews was with two engineers from a company called FullContact that manages contact information about individuals by integrating public and manually curated information, which is sold to other businesses. A graph with over 10 B edges and 4 B vertices models this contact information as follows: nodes represent different pieces of information, such as addresses and phone numbers, and edges between nodes indicate the likelihood that the information belongs to the same individual. The company uses GraphX to run a connected component-like algorithm to find the contacts that are likely to belong to the same individual.
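As a rough illustration of what a connected component-like deduplication step might look like, the sketch below groups pieces of contact information with a union-find structure. It is not FullContact's algorithm and does not use GraphX: the `group_contacts` helper, the likelihood threshold, and the toy edges are all assumptions made for this example.

```python
def group_contacts(edges, threshold=0.5):
    """Group nodes linked by edges whose likelihood is at least `threshold`.

    `edges` is an iterable of (node_u, node_v, likelihood) triples.
    The threshold is an assumption of this sketch, not a value from the survey.
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v, likelihood in edges:
        root_u, root_v = find(u), find(v)
        # Merge the two components only when the edge is likely enough.
        if likelihood >= threshold and root_u != root_v:
            parent[root_u] = root_v

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Toy edges, invented for illustration only.
toy_edges = [
    ("phone:555-0100", "addr:1 Main St", 0.9),
    ("addr:1 Main St", "email:a@example.com", 0.8),
    ("email:a@example.com", "phone:555-0199", 0.1),
]
for group in group_contacts(toy_edges):
    print(group)
```

Each resulting group is a candidate set of records belonging to one individual. A single-machine union-find like this would not scale to a graph with billions of edges, which is presumably why a distributed system is used in practice.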

Other applications using knowledge graphs: One of our interviewees was a consultant to a chemical company specializing in agricultural chemicals. The company has a knowledge graph of over 30 billion edges on pesticides, seeds, and chemicals, which is stored in a commercial RDF system. This graph is used by many applications, for example to track the evolution of seeds and to power internal wiki pages and tools used by analysts. Another interviewee was the founder of a start-up that works on tools that biologists can use to publish biological knowledge. Our interviewee described examples of knowledge graphs that model cellular activity in the context of different tissues of different species. Triples included facts about which genes transcribe which proteins and which proteins’ presence decreases other proteins’ presence, etc. [71]. The example applications were broadly similar to those described by our interviewee from the chemical company: they power wikis and Web sites that biologists use to analyze these interactions.


About this article


Cite this article

Sahu, S., Mhedhbi, A., Salihoglu, S. et al. The ubiquity of large graphs and surprising challenges of graph processing: extended survey. The VLDB Journal 29, 595–618 (2020). https://doi.org/10.1007/s00778-019-00548-x


Keywords

  • User survey
  • Graph processing
  • Graph databases
  • RDF systems