Abstract
Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across dataset versions. We introduce OrpheusDB, a dataset version control system that “bolts on” versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database “for free.” We develop and evaluate multiple data models for representing versioned data, as well as a lightweight partitioning scheme, LyreSplit, to further optimize the models for reduced query latencies. With LyreSplit, OrpheusDB is on average \(10^3\times \) faster in finding effective (and better) partitionings than competing approaches, while also reducing the latency of version retrieval by up to \(20\times \) relative to schemes without partitioning. LyreSplit can be applied in an online fashion as new versions are added, alongside an intelligent migration scheme that reduces migration time by \(10\times \) on average.
Similar content being viewed by others
Notes
Orpheus is a musician and poet from ancient Greek mythology with the ability to raise the dead with his music, much like OrpheusDB has the ability to retrieve old (“dead”) dataset versions on demand.
We also tried alternative join methods—the findings were unchanged; we will discuss this further in Sect. 4.1. We also tried using an additional secondary index for vlist for split-by-vlist which reduced the time for checkout but increased the time for commit even further.
Table 3 shows the commit and checkout time for split-by-rid-vid without building an index on vid for the versioning table. When built an index on vid for the versioning table, the checkout time for split-by-rid-vid is reduced to 69.382 s, while the commit time is increased to 21.235 s.
A lyre was the musical instrument of choice for Orpheus.
In each iteration r, topological sorting algorithm finds vertices \(V'\) with in-degree equals 0, removes \(V'\), and updates in-degree of other vertices. \(l(v_i) = r, \forall v_i\in V'\).
If the version graph is a DAG instead, we first transform it into a version tree as discussed in Sect. 5.1.
If each attribute is of different size, we can simply replace “the number of attributes” with “the number of bytes” in the whole algorithm.
PostgreSQL ’s version 9.5 added the feature of dynamically adjusting the number of buckets for hash-join.
Shingles are calculated as signatures of each partition based on a min-hashing based technique.
References
For big-data scientists, ‘janitor work’ is key hurdle to insights. http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0
Bhardwaj, A., Bhattacherjee, S., Chavan, A., Deshpande, A., Elmore, A.J., Madden, S., Parameswaran, A.G.: Datahub: collaborative data science & dataset version management at scale. In: CIDR (2015)
Consortium, G.O., et al.: Gene ontology consortium: going forward. Nucleic Acids Res. 43(D1), D1049–D1056 (2015)
Szklarczyk, D., et al.: The string database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 39(1), D561–D568. https://academic.oup.com/nar/article/39/suppl_1/D561/2509054
Maddox, M., Goehring, D., Elmore, A.J., Madden, S., Parameswaran, A., Deshpande, A.: Decibel: the relational dataset branching system. Proc. VLDB Endow. 9(9), 624–635 (2016)
Bhattacherjee, S., Chavan, A., Huang, S., Deshpande, A., Parameswaran, A.: Principles of dataset versioning: exploring the recreation/storage tradeoff. Proc. VLDB Endow. 8(12), 1346–1357 (2015)
Tansel, A.U., Clifford, J., Gadia, S., Jajodia, S., Segev, A., Snodgrass, R.: Temporal databases: theory, design, and implementation. Benjamin-Cummings Publishing Co., Inc (1993)
Jensen, C.S., Snodgrass, R.T.: Temporal data management. IEEE Trans. Knowl. Data Eng. 11(1), 36–44 (1999)
Ozsoyoglu, G., Snodgrass, R.T.: Temporal and real-time databases: a survey. IEEE Trans. Knowl. Data Eng. 7(4), 513–532 (1995)
Kulkarni, K., Michels, J.-E.: Temporal features in sql: 2011. ACM Sigmod Record 41(3), 34–43 (2012)
Huang, S., Xu, L., Liu, J., Elmore, A.J., Parameswaran, A.: O rpheus db: bolt-on versioning for relational databases. Proc. VLDB Endow. 10(10), 1130–1141 (2017)
Xu, L., Huang, S., Hui, S., Elmore, A., Parameswaran, A.: (2017) OrpheusDB : A lightweight approach to relational dataset versioning. Technical Report
Miller, R.J., Hernández, M.A., Haas, L.M., Yan, L.-L., Ho, C.H., Fagin, R., Popa, L.: The clio project: managing heterogeneity. SIgMOD Record 30(1), 78–83 (2001)
Fisher, K., Walker, D., Zhu, K.Q., White, P.: From dirt to shovels: fully automatic tool generation from ad hoc data. In: ACM SIGPLAN Notices, vol. 43, pp. 421–434. ACM (2008)
PostgreSQL 9.5 intarray. https://www.postgresql.org/docs/current/static/intarray.html
DB2 9.7 array. https://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.sql.ref.doc/doc/r0050497.html
Add array data type in MySql 7.1. https://dev.mysql.com/worklog/task/?id=2081
Buneman, P., Khanna, S., Tajima, K., Tan, W.-C.: Archiving scientific data. ACM Trans. Database Syst. (TODS) 29(1), 2–42 (2004)
Jain, S., Moritz, D., Halperin, D., Howe, B., Lazowska, E.: Sqlshare: Results from a multi-year sql-as-a-service experiment. In: Proceedings of the 2016 International Conference on Management of Data, pp. 281–293. ACM (2016)
De Castro, C., Grandi, F., Scalas, M.R.: (1995) On schema versioning in temporal databases. In: Recent Advances in Temporal Databases, pp. 272–291. Springer
De Castro, C., Grandi, F., Scalas, M.R.: Schema versioning for multitemporal relational databases. Inf. Syst. 22(5), 249–290 (1997)
Moon, H.J., Curino, C.A., Deutsch, A., Hou, C.-Y., Zaniolo, C.: Managing and querying transaction-time databases under schema evolution. Proc. VLDB Endow. 1(1), 882–895 (2008)
Moon, H.J., Curino, C.A., Zaniolo, C.: Scalable architecture and query optimization fortransaction-time dbs with evolving schemas. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 207–218. ACM (2010)
Quamar, A., Deshpande, A., Lin, J.: (2014) Nscale: neighborhood-centric large-scale graph analytics in the cloud. VLDB J. 1–26
Wang, S., Dinh, T.T.A., Lin, Q., Xie, Z., Zhang, M., Cai, Q., Chen, G., Fu, W., Ooi, B.C., Ruan, P.: (2018) Forkbase: an efficient storage engine for blockchain and forkable applications. arXiv preprint arXiv:1802.04949
Chavan, A., Huang, S., Deshpande, A., Elmore, A., Madden, S., Parameswaran, A.: Towards a unified query language for provenance and versioning. In: 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15) (2015)
Arab, B., Gawlick, D., Radhakrishnan, V., Guo, H., Glavic, B.: A generic provenance middleware for queries, updates, and transactions. In: 6th \(\{\)USENIX\(\}\) Workshop on the Theory and Practice of Provenance (TaPP 2014) (2014)
Curino, C.A., Moon, H.J., Zaniolo, C.: Graceful database schema evolution: the prism workbench. Proc. VLDB Endow. 1(1), 761–772 (2008)
Ahmad, Y., Kennedy, O., Koch, C., Nikolic, M.: Dbtoaster: higher-order delta processing for dynamic, frequently fresh views. Proc. VLDB Endow. 5(10), 968–979 (2012)
Ahn, I., Snodgrass, R.: Performance evaluation of a temporal database management system. In: ACM SIGMOD Record, vol. 15, pp. 96–107. ACM (1986)
Snodgrass, R., Ahn, I.: A taxonomy of time databases. ACM Sigmod Record 14(4), 236–246 (1985)
Snodgrass, R.T., Ahn, I., Ariav, G., Batory, D.S., Clifford, J., Dyreson, C.E., Elmasri, R., Grandi, F., Jensen, C.S., Käfer, W., et al.: Tsql2 language specification. Sigmod Record 23(1), 65–86 (1994)
Torp, K., Jensen, C.S., Snodgrass, R.T.: Stratum approaches to temporal DBMS implementation. In: Database Engineering and Applications Symposium, 1998. Proceedings. IDEAS’98. International, pp. 4–13. IEEE (1998)
Chen, C.X., Kong, J., Zaniolo, C.: Design and implementation of a temporal extension of sql. In: 19th International Conference on Data Engineering, 2003. Proceedings, pp. 689–691. IEEE (2003)
Saracco, C.M., Nicola, M., Gandhi, L.: A matter of time: temporal data management in db2 for z. IBM Corporation, New York (2010)
Al-Kateb, M., Ghazal, A., Crolotte, A., Bhashyam, R., Chimanchode, J.,. Pakala, S.P.: Temporal query processing in teradata. In: Proceedings of the 16th International Conference on Extending Database Technology, pp. 573–578. ACM (2013)
Kaufmann, M., Manjili, A.A., Vagenas, P., Fischer, P.M., Kossmann, D., Färber, F., May, N.: Timeline index: a unified data structure for processing queries on temporal data in sap hana. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 1173–1184. ACM (2013)
Färber, F., May, N., Lehner, W., Große, P., Müller, I., Rauhe, H., Dees, J.: The sap hana database-an architecture overview. IEEE Data Eng. Bull. 35(1), 28–33 (2012)
Kaufmann, M., Fischer, P.M., May, N., Kossmann, D.: Benchmarking bitemporal database systems: ready for the future or stuck in the past? In: EDBT, pp. 738–749 (2014)
Salzberg, B., Tsotras, V.J.: Comparison of access methods for time-evolving data. ACM Comput. Surv. (CSUR) 31(2), 158–221 (1999)
Lee, J.W., Loaiza, J., Stewart, M.J., Hu, W.-M., Bridge Jr, W.H.: Flashback database, Feb. 20 2007. US Patent 7,181,476
Gao, D., Jensen, S., Snodgrass, T., Soo, D.: Join operations in temporal databases. VLDB J. Int. J. Very Large Data Bases 14(1), 2–29 (2005)
Landau, G.M., Schmidt, J.P., Tsotras, V.J.: Historical queries along multiple lines of time evolution. VLDB J. Int. J. Very Large Data Bases 4(4), 703–726 (1995)
Salzberg, B.J., Lomet, D.B.: Branched and Temporal Index Structures. Northeastern University, College of Computer Science (1995)
Lanka, S., Mays, E.: Fully Persistent B+-trees, vol. 20. ACM, New York (1991)
Jiang, L., Salzberg, B., Lomet, D.B., García, M.B.: (2000) The bt-tree: a branched and temporal access method. In: VLDB, pp. 451–460
Liquibase. http://www.liquibase.org/
Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)
Liu, D.-R., Shekhar, S.: Partitioning similarity graphs: a framework for declustering problems. Inf. Syst. 21(6), 475–496 (1996)
Feige, U., Peleg, D., Kortsarz, G.: The dense k-subgraph problem. Algorithmica 29(3), 410–421 (2001)
Karypis, G., Kumar, V.: Multilevel k-way hypergraph partitioning. VLSI Des. 11(3), 285–300 (2000)
Kumar, K.A., Quamar, A., Deshpande, A., Khuller, S.: Sword: workload-aware data placement and replica selection for cloud data management systems. VLDB J. 23(6), 845–870 (2014)
Acknowledgements
We thank the anonymous reviewers for their valuable feedback. We acknowledge support from ISTC for Big Data, Grant IIS-1513407, IIS-1633755, and IIS-1652750, awarded by the National Science Foundation, Grant 1U54GM114838 awarded by NIGMS and 3U54EB020406-02S1 awarded by NIBIB through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov), and funds from Adobe, Google, and the Siebel Energy Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies and organizations.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Huang, S., Xu, L., Liu, J. et al. \(\varvec{\textsc {Orpheus}}\)DB: bolt-on versioning for relational databases (extended version). The VLDB Journal 29, 509–538 (2020). https://doi.org/10.1007/s00778-019-00594-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-019-00594-5