OrpheusDB: bolt-on versioning for relational databases (extended version)

  • Special Issue Paper
The VLDB Journal

Abstract

Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across dataset versions. We introduce OrpheusDB, a dataset version control system that “bolts on” versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database “for free.” We develop and evaluate multiple data models for representing versioned data, as well as a lightweight partitioning scheme, LyreSplit, to further optimize the models for reduced query latencies. With LyreSplit, OrpheusDB is on average \(10^3\times \) faster in finding effective (and better) partitionings than competing approaches, while also reducing the latency of version retrieval by up to \(20\times \) relative to schemes without partitioning. LyreSplit can be applied in an online fashion as new versions are added, alongside an intelligent migration scheme that reduces migration time by \(10\times \) on average.

Notes

  1. Orpheus is a musician and poet from ancient Greek mythology with the ability to raise the dead with his music, much like OrpheusDB has the ability to retrieve old (“dead”) dataset versions on demand.

  2. We also tried alternative join methods, and the findings were unchanged; we discuss this further in Sect. 4.1. We also tried adding a secondary index on vlist for split-by-vlist, which reduced the checkout time but increased the commit time even further.

  3. Table 3 shows the commit and checkout times for split-by-rid-vid without an index on vid for the versioning table. When an index is built on vid for the versioning table, the checkout time for split-by-rid-vid drops to 69.382 s, while the commit time rises to 21.235 s.

  4. A lyre was the musical instrument of choice for Orpheus.

  5. In each iteration r, the topological sorting algorithm finds the set of vertices \(V'\) with in-degree equal to 0, removes \(V'\), and updates the in-degrees of the remaining vertices. Each removed vertex is assigned the iteration number as its level: \(l(v_i) = r, \forall v_i\in V'\).

  6. If the version graph is a DAG instead, we first transform it into a version tree as discussed in Sect. 5.1.

  7. If attributes differ in size, we can simply replace “the number of attributes” with “the number of bytes” throughout the algorithm.

  8. PostgreSQL version 9.5 added the ability to dynamically adjust the number of buckets for hash joins.

  9. Shingles are calculated as signatures of each partition using a min-hashing-based technique.
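
The level assignment described in Note 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OrpheusDB implementation: the function name `topological_levels` is hypothetical, the version graph is assumed to be given as a list of vertices plus (parent, child) edge pairs, and iterations are assumed to be numbered from 1, since the note does not state the starting value.

```python
from collections import defaultdict


def topological_levels(vertices, edges):
    """Return {vertex: level}, where level is the iteration r in which
    the vertex had in-degree 0 and was removed (r assumed to start at 1)."""
    in_degree = {v: 0 for v in vertices}
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
        in_degree[child] += 1

    level = {}
    # V' for the first iteration: every vertex with in-degree 0.
    frontier = [v for v in vertices if in_degree[v] == 0]
    r = 1
    while frontier:
        next_frontier = []
        for v in frontier:
            level[v] = r  # l(v_i) = r for all v_i in V'
            # Removing v lowers the in-degree of each of its children.
            for c in children[v]:
                in_degree[c] -= 1
                if in_degree[c] == 0:
                    next_frontier.append(c)
        frontier = next_frontier
        r += 1

    if len(level) != len(in_degree):
        raise ValueError("graph contains a cycle; no topological order exists")
    return level


if __name__ == "__main__":
    # Hypothetical version graph: v1 branches into v2 and v3, which merge into v4.
    print(topological_levels(
        ["v1", "v2", "v3", "v4"],
        [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v3", "v4")],
    ))
```

A vertex whose in-degree reaches 0 during iteration r is only collected for iteration r + 1, which matches the note's order of operations: find \(V'\), remove it, then update the in-degrees.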

Acknowledgements

We thank the anonymous reviewers for their valuable feedback. We acknowledge support from ISTC for Big Data; Grants IIS-1513407, IIS-1633755, and IIS-1652750, awarded by the National Science Foundation; Grant 1U54GM114838, awarded by NIGMS, and Grant 3U54EB020406-02S1, awarded by NIBIB, through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov); and funds from Adobe, Google, and the Siebel Energy Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies and organizations.

Author information

Corresponding author

Correspondence to Silu Huang.

About this article

Cite this article

Huang, S., Xu, L., Liu, J. et al. OrpheusDB: bolt-on versioning for relational databases (extended version). The VLDB Journal 29, 509–538 (2020). https://doi.org/10.1007/s00778-019-00594-5

  • DOI: https://doi.org/10.1007/s00778-019-00594-5
