The ubiquity of large graphs and surprising challenges of graph processing: extended survey


Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We performed an extensive study that consisted of an online survey of 89 users, a review of the mailing lists, source repositories, and white papers of a large suite of graph software products, and in-person interviews with 6 users and 2 developers of these products. Our online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software users use; and (iv) the major challenges users face when processing their graphs. We describe the participants’ responses to our questions, highlighting common patterns and challenges. Based on our interviews and our review of the remaining sources, we were able to answer some new questions that were raised by participants’ responses to our online survey and to understand the specific applications that use graph data and software. Our study revealed surprising facts about graph processing in practice. In particular: real-world graphs represent a very diverse range of entities and are often very large; scalability and visualization are undeniably the most pressing challenges faced by participants; and data integration, recommendations, and fraud detection are very popular applications supported by existing graph software. We hope these findings can guide future research.




  1. The linear algebra software we considered, e.g., BLAS [17] and MATLAB [63], either did not have a public mailing list or their lists were inactive.

  2. For each conference, we initially surveyed one year selected between 2014 and 2016 and later extended the survey to include the years 2017 and 2018. Note that OSDI and SOSP are held in alternating years.

  3. Six entities that the participants mentioned did not fall under any of our 7 categories, which we list for completeness: call records, computers, cars, houses, time slots, and specialties.

  4. Some participants selected multiple graph sizes and multiple entities, so we cannot perform a direct match of which graph size corresponds to which entity. The entities we list here are taken from the participants who selected a single graph size and entity, so we can directly match the size of the graph to the entity.

  5. In the publications, link prediction referred to problems that predict a missing edge in a graph or data on an existing edge. Influence maximization referred to finding influential vertices in a graph, e.g., those that can bring more vertices to the graph. We did not provide detailed explanations about the problems to the participants.

  6. This feature is called composition and is supported in SPARQL but not in the languages of some graph database systems.

  7. This functionality is supported in RDF engines but not supported in some graph database systems.

  8. Note that the use of the term “knowledge graph” vs. other terms such as “property graph” or simply “graph” is slightly arbitrary. We found our interviewees referring to any data stored in RDF stores as knowledge graphs. We also found that interviewees referred to graphs that represent abstract things, e.g., keyword topics or concepts, as knowledge graphs even if they were not stored in an RDF system.


  1. Cyber Threat Intelligence.
  2. Personalized Education Service.
  3. Aggarwal, C.C., Wang, H.: Graph Data Management and Mining: A Survey of Algorithms and Applications, pp. 13–68. Springer, Berlin (2010)
  4.
  5.
  6.
  7. Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified Stress Testing of RDF Data Management Systems. In: ISWC (2014)
  8. Amer-Yahia, S., Pei, J. (eds.): PVLDB, Volume 11 (2017–2018).
  9. Angles, R., Arenas, M., Barceló, P., Boncz, P.A., Fletcher, G.H.L., Gutierrez, C., Lindaaker, T., Paradies, M., Plantikow, S., Sequeda, J.F., van Rest, O., Voigt, H.: G-CORE: a core for future graph query languages. In: Proceedings of International Conference on Management of Data (2018)
  10. Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J.L., Vrgoc, D.: Foundations of modern query languages for graph databases. ACM Comput. Surv. 50(5), 68 (2017)
  11.
  12. Arpaci-Dusseau, A.C., Voelker, G. (eds.): Proceedings of the Symposium on Operating Systems Design and Implementation. USENIX Association (2018).
  13.
  14. AboutYou Data-Driven Personalization with ArangoDB.
  15. Balcan, M., Weinberger, K.Q. (eds.): Proceedings of the International Conference on Machine Learning. (2016).
  16. Batarfi, O., Shawi, R.E., Fayoumi, A.G., Nouri, R., Beheshti, S.M.R., Barnawi, A., Sakr, S.: Large scale graph processing systems: survey and an experimental evaluation. Cluster Comput. 18(3), 1189–1213 (2015)
  17. Basic Linear Algebra Subprograms.
  18. Boncz, P., Salem, K. (eds.): PVLDB, Volume 10 (2016–2017).
  19. Bridgeman, S., Tamassia, R.: A User Study in Similarity Measures for Graph Drawing, pp. 19–30. Springer, Berlin (2001)
  20.
  21. Ching, A., Edunov, S., Kabiljo, M., Logothetis, D., Muthukrishnan, S.: One trillion edges: graph processing at facebook-scale. PVLDB 8(12), 1804–1815 (2015)
  22. Click Farm.
  23. Conceptual Graphs.
  24. Cui, W., Qu, H.: A Survey on Graph Visualization. PhD Qualifying Exam Report, Computer Science Department, Hong Kong University of Science and Technology (2007)
  25.
  26.
  27. DTD and XSD XML Schemas.
  28. Dy, J.G., Krause, A. (eds.): Proceedings of the International Conference on Machine Learning. (2018).
  29. Elasticsearch X-Pack Graph.
  30. Apache Flink.
  31. Apache Flink User Survey 2016.
  32.
  33.
  34. Apache Giraph.
  35. Graph for Scala.
  36. Graph 500 Benchmarks.
  37.
  38.
  39.
  40. Apache Spark GraphX.
  41. Apache TinkerPop.
  42. Group, W.: Common format for exchange of solved load flow data. IEEE Trans. Power App. Syst. 92(6), 1916–1925 (1973)
  43. GQL Standard.
  44. Haase, P., Broekstra, J., Eberhart, A., Volz, R.: A Comparison of RDF Query Languages, pp. 502–517. Springer, Berlin (2004)
  45. Herman, I., Melançon, G., Marshall, M.S.: Graph visualization and navigation in information visualization: a survey. IEEE Trans. Vis. Comput. Graph. 6(1), 24–43 (2000)
  46. Holten, D., van Wijk, J.J.: A User Study on Visualizing Directed Edges in Graphs. In: Proceedings of International Conference on Human Factors in Computing Systems (2009)
  47. Holzschuher, F., Peinl, R.: Performance of graph query languages: comparison of Cypher, Gremlin and Native Access in Neo4j. In: Proceedings of the Joint EDBT/ICDT Workshops (2013)
  48. Jagadish, H.V., Zhou, A. (eds.): PVLDB, Vol. 7 (2013–2014).
  49.
  50. Jayaram, N., Khan, A., Li, C., Yan, X., Elmasri, R.: Querying knowledge graphs by example entity tuples. In: Proceedings of International Conference on Data Engineering (2016)
  51.
  52. Apache Jena.
  53. Katifori, A., Halatsis, C., Lepouras, G., Vassilakis, C., Giannopoulou, E.: Ontology visualization methods: a survey. ACM Comput. Surv. 39(4), 10 (2007)
  54. Proceedings of the International Conference on Knowledge Discovery and Data Mining. ACM (2015).
  55. Proceedings of the International Conference on Knowledge Discovery and Data Mining. ACM (2017).
  56. Proceedings of the International Conference on Knowledge Discovery and Data Mining. ACM (2018).
  57. Keeton, K., Roscoe, T. (eds.): Proceedings of the Symposium on Operating Systems Design and Implementation. USENIX Association (2016).
  58. Knowledge Graph at Siemens.
  59. LDBC Benchmarks.
  60. Letunic, I., Bork, P.: Interactive tree of life: an online tool for phylogenetic tree display and annotation. Bioinformatics 23(1), 127–128 (2006)
  61. Lu, Y., Cheng, J., Yan, D., Wu, H.: Large-scale Distributed graph computing systems: an experimental evaluation. PVLDB 8(3), 281–292 (2014)
  62. Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: Proceedings of International Conference on Management of Data (2010)
  63.
  64. Mattson, T., Bader, D.A., Berry, J.W., Buluç, A., Dongarra, J.J., Faloutsos, C., Feo, J., Gilbert, J.R., Gonzalez, J., Hendrickson, B., Kepner, J., Leiserson, C.E., Lumsdaine, A., Padua, D.A., Poole, S., Reinhardt, S.P., Stonebraker, M., Wallach, S., Yoo, A.: Standards for graph algorithm primitives. In: Proceedings of High Performance Extreme Computing Conference (2013)
  65.
  66. Detect Fraud in Real Time with Graph Databases.
  67. The 2016 State of the Graph Report.
  68.
  69.
  70. GraphDB by Ontotext.
  71.
  72.
  73.
  74. Pienta, R., Tamersoy, A., Endert, A., Navathe, S., Tong, H., Chau, D.H.: VISAGE: Interactive visual graph querying. In: Proceedings of International Working Conference on Advanced Visual Interfaces (2016)
  75. Precup, D., Teh, Y.W. (eds.): Proceedings of the International Conference on Machine Learning. (2017).
  76. Qiu, X., Cen, W., Qian, Z., Peng, Y., Zhang, Y., Lin, X., Zhou, J.: Real-time constrained cycle detection in large dynamic graphs. PVLDB 11(12), 1876–1888 (2018)
  77. Rath, M., Akehurst, D., Borowski, C., Mäder, P.: Are graph query languages applicable for requirements traceability analysis? In: Proceedings of International Conference on Requirements Engineering: Foundation for Software Quality (2017)
  78. van Rest, O., Hong, S., Kim, J., Meng, X., Chafi, H.: PGQL: a property graph query language. In: Proceedings of Graph Data Management Experiences and Systems (2016)
  79. Rodriguez, M.A.: The Gremlin Graph Traversal Machine and Language. CoRR arXiv:1508.03843 (2015)
  80. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press (2016).
  81. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press (2017).
  82. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press (2018).
  83. Sharma, A., Jiang, J., Bommannavar, P., Larson, B., Lin, J.: GraphJet: real-time content recommendations at twitter. PVLDB 9(13), 1281–1292 (2016)
  84. SNAP: Stanford Network Analysis Project.
  85. Proceedings of the Symposium on Cloud Computing. ACM (2015).
  86. Proceedings of the Symposium on Cloud Computing. ACM (2017).
  87. Proceedings of the Symposium on Cloud Computing. ACM (2018).
  88. Proceedings of the Symposium on Operating Systems Principles. ACM (2017).
  89. Apache Spark—Preparing for the Next Wave of Reactive Big Data.
  90.
  91.
  92. State Grid.
  93.
  94. The TPC-C Benchmark.
  95. Vehlow, C., Beck, F., Weiskopf, D.: Visualizing group structures in graphs: a survey. Computer Graphics Forum 36(6), 201–225 (2017)
  96. OpenLink Virtuoso.
  97. Wang, C., Tao, J.: Graphs in scientific visualization: a survey. Computer Graphics Forum 36(1), 263–287 (2017)
  98. Zhao, Y., Yuan, C., Liu, G., Grinberg, I.: Graph-based preconditioning conjugate gradient algorithm for “N-1” contingency analysis. In: IEEE Power Energy Society General Meeting (2018)



We are grateful to Chen Zou for helping us use online survey tools and draft an early version of this survey. We are also grateful to Nafisa Anzum, Jeremy Chen, Pranjal Gupta, Chathura Kankanamge, and Shahid Khaliq for their valuable comments on the survey and help in categorizing the academic publications, user emails, and issues. We would like to thank our online survey participants and our interviewees: Brad Bebee, Scott Brave, Jordan Crombie, Luna Dong, William Hayes, Thomas Hubauer, Peter Lawrence, Stephen Ruppert, Chen Yuan, with a special thanks to Zhengping Qian for the many hours of follow-up discussions after our interview. Finally, we would like to thank the anonymous reviewers for their valuable comments. This research was partially supported by multiple Discovery Grants from the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Author information



Corresponding author

Correspondence to Siddhartha Sahu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Choices of graph computations

One way to ask this question is to include a short-answer question such as “What queries and graph computations do you perform on your graphs?” However, the terms graph queries and computations are very general, and we thought this version of the question could be under-specified. We also knew that participants respond less often to short-answer questions, so instead we first asked a multiple-choice question, followed by a short-answer question for computations that may not have appeared in the first question as a choice.

In a multiple-choice question, it is very challenging to provide a list of graph queries and computations from which participants can select, as there is no consensus on what constitutes a graph computation, let alone a reasonable taxonomy of graph computations. We decided to select a list of graph queries and computations that appeared in the publications of six conferences, as described in Sect. 2.2. We use the term graph computation here to refer to a query, a problem, or an algorithm.

For each of the 90 papers, we identified a graph computation if (i) it was directly studied in the paper; or (ii) for papers describing software, it was used to evaluate the software. We used our best judgment to group computations that were variants of each other or that appeared under different names into a single category. For example, we categorized motif finding, subgraph finding, and subgraph matching as subgraph matching. When reviewing papers that study linear algebra operations, e.g., matrix–vector multiplication, for solving a graph problem such as BFS traversal, we identified the graph problem and not the linear algebra operation as the computation.
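To illustrate why a linear algebra operation and a graph problem can describe the same computation, the following minimal sketch expresses BFS as repeated Boolean matrix–vector products over an adjacency matrix. It is an illustration only and is not taken from any of the surveyed papers; the function name `bfs_levels` and the dense list-of-lists matrix are assumptions made for brevity.

```python
def bfs_levels(adjacency, source):
    """Return the BFS level of every vertex (-1 if unreachable).

    adjacency[u][v] is truthy when there is an edge u -> v.
    Each iteration is one Boolean matrix-vector product:
    reached[v] = OR over u of (frontier[u] AND adjacency[u][v]).
    """
    n = len(adjacency)
    levels = [-1] * n
    levels[source] = 0
    frontier = [False] * n
    frontier[source] = True
    depth = 0
    while any(frontier):
        depth += 1
        # Boolean matrix-vector product of the frontier with the matrix.
        reached = [
            any(frontier[u] and adjacency[u][v] for u in range(n))
            for v in range(n)
        ]
        # Keep only vertices that have not been assigned a level yet.
        frontier = [bool(reached[v]) and levels[v] == -1 for v in range(n)]
        for v in range(n):
            if frontier[v]:
                levels[v] = depth
    return levels
```

In our categorization, a paper that optimizes the product inside the loop would still be counted under BFS traversal rather than under matrix–vector multiplication.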

Finally, for each identified and categorized computation, we counted the number of papers that study it and selected the ones that appeared in at least 2 papers. In the end, we provided the participants with the 13 choices that are shown in Table 10.
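The selection procedure described above can be sketched as follows. This is a hedged illustration, not the script we used: the paper identifiers, the `ALIASES` map (which only encodes the subgraph matching example given above), and the function name `select_choices` are all hypothetical.

```python
from collections import Counter

# Hypothetical alias map: variants are folded into one category name.
ALIASES = {
    "motif finding": "subgraph matching",
    "subgraph finding": "subgraph matching",
}


def select_choices(papers, min_papers=2):
    """Return the computations that appear in at least `min_papers` papers.

    `papers` maps a paper identifier to the list of computation names
    identified in that paper.
    """
    counts = Counter()
    for computations in papers.values():
        # Normalize names, then deduplicate so a paper counts at most once
        # per category even if it mentions several variants.
        canonical = {ALIASES.get(c.lower(), c.lower()) for c in computations}
        counts.update(canonical)
    return sorted(name for name, k in counts.items() if k >= min_papers)
```

Deduplicating per paper before counting matters: a paper that mentions both motif finding and subgraph finding should contribute one vote to subgraph matching, not two.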

Choices of machine learning computations

Similar to graph computation, machine learning computation is a very general term. Instead of providing a list of ad hoc computations as choices, we reviewed each machine learning computation that appeared in the 90 graph papers we had selected. Specifically, the list of machine learning computations we identified included the following: (i) high-level classes of machine learning techniques, such as clustering, classification, and regression; (ii) specific algorithms and techniques, such as stochastic gradient descent and alternating least squares that can be used as part of multiple higher-level techniques; and (iii) problems that are commonly solved using a machine learning technique, such as community detection, link prediction, and recommendations. We then selected the computations, i.e., high-level techniques, specific techniques, or problems, that appeared in at least 2 papers. As in the graph computations question, we used our best judgment to identify and categorize similar computations under the same name.

Storage in multiple formats

In a short-answer question, we asked the 33 participants who said that they store their data in multiple formats which formats they use. Out of the 33 participants, 25 responded. Their responses contained explicit data storage formats as well as the internal formats of different software. Table 19 shows the number of responses we received for the main formats. The combination of a relational database format and a graph database format was the most popular. Other combinations varied significantly; examples include HBase and Hive, GraphML and CSV, and XML and triplestore.

Table 19 Data storage formats
Table 20 Graph sizes in user emails and issues
Table 21 Challenges found in user emails and issues

Other tables from the survey

Table 20 shows the sizes of graphs we found in user emails and issues. Table 21 shows the number of emails and issues we identified for each specific challenge we discussed in Sect. 3.4.2. Table 22 shows the total number of emails and issues we reviewed for each software product from January to September of 2017. The table also shows the number of commits in the code repositories of each software product during the same period.

Table 22 Number of reviewed emails and issues, and the code commits in the repositories of each software product

Other applications from interviews

Large-Scale Data Integration for Analysis of Turbines: Our interviewee from Siemens described an application in which Siemens engineers use a knowledge graph to analyze different properties of the gas turbines that Siemens produces. The application emphasized the advantage of using graphs to integrate different sources of corporate data, in this case mainly about where turbines are installed, measurements from the installed turbines’ sensors, and information about maintenance activity on the turbines. The knowledge graph is stored in an RDF engine, and engineers ask queries, such as “What is the mean time to failure of turbines with coating loss?”, through a visual interface where they navigate the different types of nodes in the knowledge graph and express aggregations. The visually expressed queries get translated to SPARQL queries.
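As a rough illustration of what such a translation could produce, the sketch below builds a SPARQL aggregation query from a single user selection. We did not see the actual schema or translator, so everything here is an assumption: the `ex:` namespace, the class `ex:Turbine`, the properties `ex:hasDefect` and `ex:hoursToFailure`, and the function name are hypothetical.

```python
def build_mean_time_to_failure_query(defect: str) -> str:
    """Build a SPARQL 1.1 query averaging time to failure for one defect.

    `defect` is the local name of a (hypothetical) defect resource,
    e.g. "CoatingLoss".
    """
    return f"""PREFIX ex: <http://example.org/turbine#>
SELECT (AVG(?hours) AS ?meanHoursToFailure)
WHERE {{
  ?turbine a ex:Turbine ;
           ex:hasDefect ex:{defect} ;
           ex:hoursToFailure ?hours .
}}"""


if __name__ == "__main__":
    print(build_mean_time_to_failure_query("CoatingLoss"))
```

The point of the sketch is only that navigating node types corresponds to the triple patterns in the `WHERE` clause, while the aggregation chosen in the interface corresponds to the `AVG` expression.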

Contact Deduplication: One of our interviews was with two engineers from a company called FullContact that manages contact information about individuals by integrating public and manually curated information, which is sold to other businesses. A graph with over 10 billion edges and 4 billion vertices models this contact information as follows: nodes represent different pieces of information, such as addresses and phone numbers, and edges between nodes indicate the likelihood that the information belongs to the same individual. The company uses GraphX to run a connected component-like algorithm to find the contacts that are likely to belong to the same individual.
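The idea of a connected component-like grouping over likelihood edges can be sketched on a single machine with a union-find structure. This is not the company's GraphX implementation, which we did not see; the likelihood threshold, the function name `group_contacts`, and the integer node identifiers are assumptions for illustration.

```python
def group_contacts(num_nodes, scored_edges, threshold=0.5):
    """Group nodes connected by edges whose likelihood is >= threshold.

    `scored_edges` is an iterable of (u, v, likelihood) tuples, where u and v
    are node ids in range(num_nodes). Returns a sorted list of groups.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, likelihood in scored_edges:
        if likelihood >= threshold:  # assumed cut-off, not from the interview
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_v] = root_u

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())
```

For example, with edges `(0, 1, 0.9)`, `(1, 2, 0.8)`, and `(3, 4, 0.2)` over five nodes, the first three pieces of information end up in one group, while nodes 3 and 4 stay separate because their edge falls below the assumed threshold.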

Other applications using knowledge graphs: One of our interviewees was a consultant to a chemical company specializing in agricultural chemicals. The company has a knowledge graph of over 30 billion edges on pesticides, seeds, and chemicals that is stored in a commercial RDF system. This graph is used by many applications, for example to track the evolution of seeds and to power internal wiki pages and tools used by analysts. Another interviewee was the founder of a start-up that works on tools that biologists can use to publish biological knowledge. Our interviewee described examples of knowledge graphs that model the cellular activity in the context of different species’ different tissues. Triples included facts about which genes transcribe which protein and which proteins’ presence decreases other proteins’ presence, etc. [71]. The example applications were broadly similar to those described by our interviewee from the chemical company: they power wikis and Web sites that biologists use to analyze these interactions.


About this article


Cite this article

Sahu, S., Mhedhbi, A., Salihoglu, S. et al. The ubiquity of large graphs and surprising challenges of graph processing: extended survey. The VLDB Journal 29, 595–618 (2020).



  • User survey
  • Graph processing
  • Graph databases
  • RDF systems