HiNode: an asymptotically space-optimal storage model for historical queries on graphs

Abstract

Most modern networks are perpetually evolving and can be modeled by graph data structures. By collecting and indexing the state of a graph at various time instances, we can query its entire history and thus gain insight into its fundamental features and attributes. This calls for graph history storage and indexing solutions that support application queries efficiently while coping with the increased space requirements. To this end, we advocate a purely vertex-centric storage model that is asymptotically space-optimal and more space-efficient than any other proposal to date. In addition to space efficiency, the model’s purely vertex-centric approach shows great promise with respect to the efficiency and functionality of update and query operations. Furthermore, we make a qualitative comparison with other general methods for graph history storage, identifying the pros and cons of our approach. Finally, we implement our technique and incorporate it in the \(G^*\) parallel graph processing system, conduct a thorough experimental evaluation, and show time and space improvements of up to an order of magnitude compared to \(G^*\).


Notes

  1. Given a query point \(p \in \mathcal {R}\) and a set of N intervals on the real line, a stabbing query returns all intervals that overlap p.

  2. For \(G^*\), TGI and our solution (and to a lesser extent for the other methods), one could indeed describe the complexity w.r.t. a variety of parameters and provide a more detailed description. However, doing so would not permit a direct comparison between the methods and would thus defeat the very purpose for which this table is provided.

  3. The source code is available at https://github.com/hinodeauthors/hinode.

  4. Dataset: undirected Barabási-Albert graph; starting vertices = 1M, edges per newly inserted vertex = 5, vertex insertions per snapshot = 2K, snapshots = 100.

  5. Dataset: undirected Barabási-Albert graph; starting vertices = 1M, edges per newly inserted vertex = 5, vertex insertions per snapshot = 20K, snapshots = 100.

  6. When querying 40% of the sequence for the two-hop neighborhood of the vertex with the largest degree, \(G^*\) was unable to finish because it ran out of memory.

  7. We would like to thank an anonymous reviewer for pointing out this issue.
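
The stabbing query defined in note 1 can be illustrated with a minimal sketch. It uses a naive linear scan over closed intervals rather than an interval tree; the function name `stabbing_query` and the sample intervals are assumptions made only for illustration.

```python
from typing import List, Tuple

# A closed interval [start, end] on the real line.
Interval = Tuple[float, float]


def stabbing_query(intervals: List[Interval], p: float) -> List[Interval]:
    """Return every closed interval [s, e] that contains the point p.

    Naive O(N) scan; an interval tree would answer the same query
    without inspecting every interval.
    """
    return [(s, e) for (s, e) in intervals if s <= p <= e]


# Hypothetical sample data.
sample_intervals = [(1, 4), (3, 8), (6, 9), (10, 12)]
print(stabbing_query(sample_intervals, 3.5))
```

The scan only shows what the query returns; it says nothing about the cost bounds that an index structure provides.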


Author information

Corresponding author

Correspondence to Andreas Kosmatopoulos.

Appendix: The WriteAttribute cases

We analyze the two possible cases in WriteAttribute. In the first case, the field f does not have any values associated with it in the time interval \([t_s,t_e]\). In that case, we insert a quadruple \((f,\{\ell _1, \ell _2, \ldots \},t_s,t_e)\) in \({\mathcal {I}}_{v}\) and store a record \((\{\ell _1, \ell _2, \ldots \},t_s,t_e)\) in f’s respective B-tree \(A_v^f\).
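
The first case can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the paper: a plain list stands in for the interval index \({\mathcal {I}}_{v}\), a per-field sorted list stands in for the B-tree \(A_v^f\), and the names `VertexHistory` and `write_attribute_first_case` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

Quadruple = Tuple[str, FrozenSet[str], int, int]  # (f, values, t_s, t_e)
Record = Tuple[FrozenSet[str], int, int]          # (values, t_s, t_e)


class VertexHistory:
    """Stand-in for the per-vertex structures (assumed names and layout)."""

    def __init__(self) -> None:
        # Plain list standing in for the interval index I_v.
        self.interval_index: List[Quadruple] = []
        # One time-ordered list per field, standing in for the B-tree A_v^f.
        self.attribute_trees: Dict[str, List[Record]] = defaultdict(list)

    def write_attribute_first_case(
        self, field: str, values: Iterable[str], ts: int, te: int
    ) -> None:
        """Handle the case where `field` has no values within [ts, te]."""
        for f, _, s, e in self.interval_index:
            if f == field and s <= te and ts <= e:
                raise ValueError("field already has values in [ts, te]")
        vals = frozenset(values)
        # Quadruple (f, {l1, l2, ...}, ts, te) goes into the interval index.
        self.interval_index.append((field, vals, ts, te))
        # Record ({l1, l2, ...}, ts, te) goes into the field's own structure.
        records = self.attribute_trees[field]
        records.append((vals, ts, te))
        records.sort(key=lambda r: (r[1], r[2]))  # keep ordered by time


history = VertexHistory()
history.write_attribute_first_case("color", {"red"}, 1, 5)
print(history.interval_index)
```

The overlap check here is a linear scan; with the actual structures that test would go through the interval index itself.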

In the second case, the field f has values associated with it in the time interval \([t_s,t_e]\), i.e., there exist (up to) two intervals \([t'_s,t'_e]\) and \([t''_s,t''_e]\) in the data structure, such that either (a) \(t'_s<t_s<t'_e<t_e\), (b) \(t_s<t'_s<t_e<t'_e\), (c) \(t'_s<t_s<t_e<t'_e\) or (d) \(t'_s<t_s<(t'_e=t''_s)<t_e<t''_e\) is true (Fig. 12). In that case, we search \({\mathcal {I}}_{v}\) for the interval \([t'_s,t'_e]\) corresponding to the field f (and for \([t''_s,t''_e]\) if it exists) by simulating an insertion of this interval in \({\mathcal {I}}_{v}\). Let \(v_{t'}\) be the node of \({\mathcal {I}}_{v}\) in which the interval \([t'_s,t'_e]\) is to be stored. After locating the at most three lists in which it is to be stored, we search these lists based on the endpoints of \([t'_s,t'_e]\). If more than one such interval exists, we use the identifier of \([t'_s,t'_e]\) to search among them and locate this interval. The same procedure is applied for \([t''_s,t''_e]\).
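
The four configurations (a) to (d) can be told apart by comparing endpoints directly. The sketch below is illustrative only; the function name `classify_overlap` and its argument layout are assumptions, and any configuration not listed in the text (for example, coinciding endpoints) is rejected rather than guessed at.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # (start, end)


def classify_overlap(
    new: Interval, first: Interval, second: Optional[Interval] = None
) -> str:
    """Return 'a', 'b', 'c' or 'd' for the configurations listed in the text.

    new    -- the interval [t_s, t_e] being written
    first  -- the existing interval [t'_s, t'_e]
    second -- the existing interval [t''_s, t''_e], only used in case (d)
    """
    ts, te = new
    ps, pe = first
    if second is not None:
        ss, se = second
        # (d): t'_s < t_s < (t'_e = t''_s) < t_e < t''_e
        if ps < ts < pe and pe == ss and ss < te < se:
            return "d"
        raise ValueError("configuration not covered by case (d)")
    if ps < ts < pe < te:   # (a): t'_s < t_s < t'_e < t_e
        return "a"
    if ts < ps < te < pe:   # (b): t_s < t'_s < t_e < t'_e
        return "b"
    if ps < ts < te < pe:   # (c): t'_s < t_s < t_e < t'_e
        return "c"
    raise ValueError("configuration not covered by cases (a)-(c)")


print(classify_overlap((5, 12), (2, 8)))
```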

Fig. 12 Cases of existing intervals for the field f

Afterwards, we perform a series of interval insertions and deletions in \({\mathcal {I}}_{v}\) and the corresponding \(A_v^f\) B-tree depending on the subcases presented below (the resulting intervals end up with the appropriate set of values based on their original intervals):

Subcase (a):

Deletion of \([t'_s,t'_e]\) followed by the insertion of \([t'_s,t_s)\), \([t_s,t'_e)\) and \([t'_e,t_e]\)

Subcase (b):

Deletion of \([t'_s,t'_e]\) followed by the insertion of \([t_s,t'_s)\), \([t'_s,t_e)\) and \([t_e,t'_e]\)

Subcase (c):

Deletion of \([t'_s,t'_e]\) followed by the insertion of \([t'_s,t_s)\), \([t_s,t_e)\) and \([t_e,t'_e]\)

Subcase (d):

Deletion of \([t'_s,t'_e]\) and \([t''_s,t''_e]\) followed by the insertion of \([t'_s,t_s)\), \([t_s,t'_e)\), \([t''_s,t_e)\) and \([t_e,t''_e]\)
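
The deletions and insertions of the four subcases can be written out as a small planning function. This is a sketch under assumptions: the name `split_plan` and the tuple encoding `(start, end, right_closed)` for the resulting pieces are hypothetical, and the assignment of value sets to each piece is omitted because the text only states that it follows from the original intervals.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]     # existing closed interval (start, end)
Piece = Tuple[int, int, bool]  # (start, end, right_closed); False means [start, end)


def split_plan(
    case: str, new: Interval, first: Interval, second: Optional[Interval] = None
) -> Tuple[List[Interval], List[Piece]]:
    """Return (intervals to delete, pieces to insert) for subcases a-d."""
    ts, te = new
    ps, pe = first
    if case == "a":  # [t'_s,t_s), [t_s,t'_e), [t'_e,t_e]
        return [first], [(ps, ts, False), (ts, pe, False), (pe, te, True)]
    if case == "b":  # [t_s,t'_s), [t'_s,t_e), [t_e,t'_e]
        return [first], [(ts, ps, False), (ps, te, False), (te, pe, True)]
    if case == "c":  # [t'_s,t_s), [t_s,t_e), [t_e,t'_e]
        return [first], [(ps, ts, False), (ts, te, False), (te, pe, True)]
    if case == "d":  # [t'_s,t_s), [t_s,t'_e), [t''_s,t_e), [t_e,t''_e]
        if second is None:
            raise ValueError("subcase (d) needs the second interval")
        ss, se = second
        return [first, second], [
            (ps, ts, False),
            (ts, pe, False),
            (ss, te, False),
            (te, se, True),
        ]
    raise ValueError(f"unknown subcase: {case!r}")


deletions, insertions = split_plan("a", (5, 12), (2, 8))
print(deletions, insertions)
```

In every subcase the inserted pieces are contiguous and together cover the union of the new interval and the deleted ones, which is a quick sanity check on the endpoint bookkeeping.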

Cite this article

Kosmatopoulos, A., Tsichlas, K., Gounaris, A. et al. HiNode: an asymptotically space-optimal storage model for historical queries on graphs. Distrib Parallel Databases 35, 249–285 (2017). https://doi.org/10.1007/s10619-017-7207-z

Keywords

  • Historical queries
  • Evolving graphs
  • Indexing
  • Space efficiency