OrpheusDB: bolt-on versioning for relational databases (extended version)

Huang, Silu; Xu, Liqi; Liu, Jialin; Elmore, Aaron J.; Parameswaran, Aditya

doi:10.1007/s00778-019-00594-5

Special Issue Paper
Published: 20 December 2019

Volume 29, pages 509–538 (2020)

The VLDB Journal

Silu Huang ORCID: orcid.org/0000-0002-5291-0167¹,
Liqi Xu²,
Jialin Liu²,
Aaron J. Elmore³ &
Aditya Parameswaran⁴

Abstract

Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across dataset versions. We introduce OrpheusDB, a dataset version control system that “bolts on” versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database “for free.” We develop and evaluate multiple data models for representing versioned data, as well as a lightweight partitioning scheme, LyreSplit, to further optimize the models for reduced query latencies. With LyreSplit, OrpheusDB is on average $10^3\times$ faster in finding effective (and better) partitionings than competing approaches, while also reducing the latency of version retrieval by up to $20\times$ relative to schemes without partitioning. LyreSplit can be applied in an online fashion as new versions are added, alongside an intelligent migration scheme that reduces migration time by $10\times$ on average.

Notes

Orpheus is a musician and poet from ancient Greek mythology with the ability to raise the dead with his music, much like OrpheusDB has the ability to retrieve old (“dead”) dataset versions on demand.
We also tried alternative join methods, and the findings were unchanged; we discuss this further in Sect. 4.1. We also tried adding a secondary index on vlist for split-by-vlist, which reduced the checkout time but increased the commit time even further.
Table 3 shows the commit and checkout times for split-by-rid-vid without an index on vid for the versioning table. When an index is built on vid for the versioning table, the checkout time for split-by-rid-vid drops to 69.382 s, while the commit time increases to 21.235 s.
A lyre was the musical instrument of choice for Orpheus.
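In each iteration $r$, the topological sorting algorithm finds the set of vertices $V'$ whose in-degree equals 0, removes $V'$ from the graph, and updates the in-degrees of the remaining vertices. Every vertex removed in iteration $r$ is assigned that level: $l(v_i) = r, \forall v_i \in V'$.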
If the version graph is a DAG instead, we first transform it into a version tree as discussed in Sect. 5.1.
If the attributes differ in size, we can simply replace “the number of attributes” with “the number of bytes” throughout the algorithm.
PostgreSQL 9.5 added the ability to dynamically adjust the number of buckets for hash joins.
Shingles are calculated as signatures of each partition using a min-hashing-based technique.
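
The level assignment described in the topological-sorting note above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `topological_levels`, the edge-list input format, and the choice to start counting iterations at 1 are all assumptions made here for the example.

```python
from collections import defaultdict


def topological_levels(vertices, edges):
    """Assign each vertex the iteration r in which topological sorting removes it.

    vertices: iterable of version ids (hypothetical input format).
    edges: iterable of (parent, child) pairs in the version graph.
    Iterations are numbered from 1 here; that starting value is an assumption.
    """
    children = defaultdict(list)
    in_degree = {v: 0 for v in vertices}
    for parent, child in edges:
        children[parent].append(child)
        in_degree[child] += 1

    level = {}
    # V' for the first iteration: all vertices with in-degree 0.
    frontier = [v for v in in_degree if in_degree[v] == 0]
    r = 1
    while frontier:
        next_frontier = []
        for v in frontier:
            level[v] = r  # l(v_i) = r for every v_i in V'
            for c in children[v]:
                # Removing v lowers the in-degree of each of its children.
                in_degree[c] -= 1
                if in_degree[c] == 0:
                    next_frontier.append(c)
        frontier = next_frontier
        r += 1

    if len(level) != len(in_degree):
        raise ValueError("graph contains a cycle; no topological order exists")
    return level


# A small version graph in which v4 merges v2 and v3.
levels = topological_levels(
    ["v1", "v2", "v3", "v4"],
    [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v3", "v4")],
)
print(levels)
```

In this example the root version is removed in the first iteration, its two children in the second, and the merged version only once both of its parents are gone.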

Acknowledgements

We thank the anonymous reviewers for their valuable feedback. We acknowledge support from ISTC for Big Data, Grants IIS-1513407, IIS-1633755, and IIS-1652750, awarded by the National Science Foundation, Grant 1U54GM114838 awarded by NIGMS and 3U54EB020406-02S1 awarded by NIBIB through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov), and funds from Adobe, Google, and the Siebel Energy Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies and organizations.

Author information

Authors and Affiliations

Microsoft Research, Redmond, WA, USA
Silu Huang
University of Illinois (UIUC), Champaign, IL, USA
Liqi Xu & Jialin Liu
University of Chicago, Chicago, IL, USA
Aaron J. Elmore
University of California, Berkeley, CA, USA
Aditya Parameswaran

Corresponding author

Correspondence to Silu Huang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Huang, S., Xu, L., Liu, J. et al. OrpheusDB: bolt-on versioning for relational databases (extended version). The VLDB Journal 29, 509–538 (2020). https://doi.org/10.1007/s00778-019-00594-5

Received: 01 December 2018
Revised: 12 November 2019
Accepted: 06 December 2019
Published: 20 December 2019
Issue Date: January 2020
DOI: https://doi.org/10.1007/s00778-019-00594-5
