
Exploiting SSDs in operational multiversion databases

  • Special Issue Paper

Abstract

Multiversion databases store both current and historical data. Rows are typically annotated with timestamps representing the period when the row is/was valid. We develop novel techniques to reduce index maintenance in multiversion databases, so that indexes can be used effectively for analytical queries over current data without being a heavy burden on transaction throughput. To this end, we re-design persistent index data structures in the storage hierarchy to employ an extra level of indirection. The indirection level is stored on solid-state disks that can support very fast random I/Os, so that traversing the extra level of indirection incurs a relatively small overhead. The extra level of indirection dramatically reduces the number of magnetic disk I/Os that are needed for index updates and localizes maintenance to indexes on updated attributes. Additionally, we batch insertions within the indirection layer in order to reduce physical disk I/Os for indexing new records. In this work, we further exploit SSDs by introducing novel DeltaBlock techniques for storing the recent changes to data on SSDs. Using our DeltaBlock, we propose an efficient method to periodically flush the recently changed data from SSDs to HDDs such that, on the one hand, we keep track of every change (or delta) for every record and, on the other hand, we avoid redundantly storing the unchanged portion of updated records. By reducing the index maintenance overhead on transactions, we enable operational data stores to create more indexes to support queries. We have developed a prototype of our indirection proposal by extending the widely used generalized search tree (GiST) open-source project, which is also employed in PostgreSQL. Our working implementation demonstrates that we can reduce index maintenance and/or query processing cost by a factor of 3. For the insertion of new records, our novel batching technique can save up to 90% of the insertion time. For updates, our prototype demonstrates that we can reduce the database size by up to 80% even with a modest space allocated for DeltaBlocks on SSDs.
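
To make the indirection and batching ideas concrete, the sketch below models them in Python. It is a minimal illustration under our own naming (IndirectionMap, SecondaryIndex, LIDBlock, capacity), not the authors' GiST-based prototype: HDD-resident indexes store stable logical identifiers (LIDs), an SSD-resident table maps each LID to the current physical row identifier (RID), and a LIDBlock buffers new LIDs so that HDD index insertions happen in batches.

# Minimal sketch of the indirection idea; all names are illustrative,
# not the paper's implementation.

class IndirectionMap:
    """SSD-resident LID -> RID mapping (modeled here as a dict)."""
    def __init__(self):
        self._lid_to_rid = {}

    def lookup(self, lid):           # one fast SSD random read
        return self._lid_to_rid[lid]

    def update(self, lid, new_rid):  # one SSD write; HDD indexes untouched
        self._lid_to_rid[lid] = new_rid


class SecondaryIndex:
    """HDD-resident index mapping key -> LIDs (stable across updates)."""
    def __init__(self):
        self._entries = {}

    def insert(self, key, lid):
        self._entries.setdefault(key, set()).add(lid)

    def search(self, key, imap):
        # Traverse the extra level of indirection to reach current rows.
        return {imap.lookup(lid) for lid in self._entries.get(key, set())}


class LIDBlock:
    """SSD-resident buffer that batches new LIDs before a bulk HDD
    index update, amortizing disk I/O over many insertions."""
    def __init__(self, capacity=256):
        self.capacity = capacity
        self.pending = []

    def add(self, lid, index, key):
        self.pending.append((key, lid))
        if len(self.pending) >= self.capacity:   # flush in one batch
            for k, l in self.pending:
                index.insert(k, l)
            self.pending.clear()


imap, price_idx = IndirectionMap(), SecondaryIndex()
imap.update(lid=1, new_rid=("page7", "slot3"))
price_idx.insert(key=9.99, lid=1)

# An update relocates the row (e.g., a new version is appended): only
# the SSD mapping entry changes; indexes on unmodified attributes are
# left untouched.
imap.update(lid=1, new_rid=("page42", "slot0"))
assert price_idx.search(9.99, imap) == {("page42", "slot0")}

lid_block = LIDBlock(capacity=2)
lid_block.add(lid=2, index=price_idx, key=19.99)  # buffered on SSD
lid_block.add(lid=3, index=price_idx, key=19.99)  # triggers batch flush
assert lid_block.pending == []

Because an update rewrites only the LID-to-RID entry on the SSD, every HDD index on an unmodified attribute is untouched, which is the source of the reduced maintenance cost.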


Notes

  1. In the case of the modified attribute, the position of the record in the index may also change.

  2. If one wanted to cluster the main table by a temporal attribute to improve temporal locality, then updates would not be in-place and additional indexes would need to be updated. Our proposed solution would reduce the burden of such index updates.

  3. Frequently accessed LIDs would naturally be cached in RAM by the database bufferpool manager, further reducing the overhead.

  4. One can shift the burden of LIDBlock flushing outside of running transactions by triggering an asynchronous flush once a LIDBlock becomes full (or nearly full), rather than waiting until it overflows.

  5. Notably, the cost of additional SSD I/Os is negligible with respect to the cost of HDD I/O.

  6. A single DeltaBlock may hold several small updates (e.g., updating only one column at a time) or may hold a single large update (e.g., updating many columns in one transaction).

  7. This might be too pessimistic, particularly for LIDBlocks and DeltaBlocks that could be small enough to be cached in RAM.

  8. For DeltaBlock techniques with the indirect chaining method, only one pointer is needed to point to the last delta, and this pointer can be included as part of the indirection mapping; thus, no additional read is required (see the sketch following these notes).

  9. Amortization is expected to be high at the tail of a table, where insertions are concentrated.

  10. For example, the average record size in the LINEITEM table in TPC-H is around 150 bytes when start/end timestamps for tracking versions are also accounted for [60].

  11. Toshiba Enterprise SSD (eSSD): http://toshiba.semicon-storage.com/us/product/storage-products/enterprise-ssd/px02ssb-px02ssfxxx.html

  12. Device characteristics are based on an 80 GB SLC FusionIO device.

  13. Again, this estimate is pessimistic because it assumes no RAM caching of SSD pages.

  14. We based our calculation on Toshiba PX02SSF010 eMLC 100 GB, but we limit ourselves to only 80 GB for comparable analysis with our FusionIO device.

  15. The recovery of LIDBlocks and DeltaBlocks follows the same principle.

  16. Our SLC FusionIO can support up to 88,000 I/O operations per second at 50 μs latency. While the FusionIO devices are high-end devices that are relatively expensive, we remark that recent SSDs such as the Intel 520 series can store about 500 GB, cost about $500, and can support 50,000 I/O operations per second at 85 μs latency, more than sufficient for the workloads described here.

  17. We do not include old versions of records for these experiments; in a true multiversion database the base tables would contain some old record versions that would need to be filtered out during the scan, using the valid time attributes.

  18. The starting value of the range in our range queries is chosen randomly.

  19. This behavior can be avoided if optimizer hints are provided.

  20. If an indexed attribute is updated, then extra I/O is needed for that index to relocate the index entry. Indexes that are neither traversed nor updated in the Indirection method are said to be “unaffected.”

  21. Among our DeltaBlock techniques, Indirect Chained DeltaBlock provides the best balance between performance and flexibility; we therefore adopt it as the representative DeltaBlock approach in our empirical studies.

  22. Except when a very small weight is assigned to update time.

  23. In fact, our indirection mapping can be seen as yet another unique index that is accessed by every query. Thus, the object placement optimization in [17] can be utilized to determine the globally optimal placement of our indirection object.
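
Notes 6 and 8 describe the indirect chained DeltaBlock layout. The following Python fragment is a minimal, hypothetical rendering of that chaining, with our own names (Delta, IndirectionEntry) and the assumption from note 8 that the indirection entry itself carries the pointer to the most recent delta: each update stores only the changed columns, and reconstruction replays the chain oldest-to-newest over the HDD-resident base row.

# Illustrative sketch of indirect chained DeltaBlocks (names are ours,
# not the paper's implementation). Each SSD-resident delta records only
# the changed columns plus a pointer to the previous delta; the
# indirection entry keeps a single pointer to the *last* delta, so
# reaching the newest changes needs no extra read (cf. note 8).

class Delta:
    def __init__(self, changed_cols, prev):
        self.changed_cols = changed_cols  # e.g., {"price": 10.49}
        self.prev = prev                  # pointer to the previous delta


class IndirectionEntry:
    def __init__(self, rid):
        self.rid = rid          # physical location of the base row on HDD
        self.last_delta = None  # head of the delta chain on SSD

    def apply_update(self, changed_cols):
        # One small SSD write; unchanged columns are never re-stored.
        self.last_delta = Delta(changed_cols, self.last_delta)

    def reconstruct(self, base_row):
        # Walk the chain newest-to-oldest, then replay oldest-to-newest
        # on top of the HDD base row.
        chain, d = [], self.last_delta
        while d is not None:
            chain.append(d)
            d = d.prev
        row = dict(base_row)
        for d in reversed(chain):
            row.update(d.changed_cols)
        return row


entry = IndirectionEntry(rid=("page7", "slot3"))
entry.apply_update({"price": 10.49})           # small update (one column)
entry.apply_update({"price": 9.99, "qty": 4})  # larger update (cf. note 6)
assert entry.reconstruct({"price": 12.0, "qty": 1}) == {"price": 9.99, "qty": 4}

The periodic SSD-to-HDD flush described in the abstract would fold such a chain into the base row and truncate it; that step is omitted from this sketch.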

References

  1. BioPostgres: Data management for computational biology. http://www.biopostgres.org/

  2. IBM DB2 Database for Linux, UNIX, and Windows. www.ibm.com/software/data/db2/linux-unix-windows/

  3. IBM DB2 with BLU Acceleration. www.ibm.com/software/data/db2/linux-unix-windows/db2-blu-acceleration/

  4. OpenFTS: Open source full text search engine. http://openfts.sourceforge.net/

  5. PostGIS: Geographic information systems. http://postgis.refractions.net/

  6. PostgreSQL: Open source object-relational database system. http://www.postgresql.org/

  7. YAGO2: High-quality knowledge base. http://www.mpi-inf.mpg.de/yago-naga/yago/

  8. Agrawal, D., Ganesan, D., Sitaraman, R.K., Diao, Y., Singh, S.: Lazy-adaptive tree: an optimized index structure for flash devices. PVLDB 2(1), 361–372 (2009)

  9. Ang, C.-H., Tan, K.-P.: The interval B-tree. Inf. Process. Lett. 53(2), 85–89 (1995)

  10. Arpaci-Dusseau, R., Arpaci-Dusseau, A.: Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books, 0.5 edition (2012)

  11. Athanassoulis, M., Chen, S., Ailamaki, A., Gibbons, P.B., Stoica, R.: MaSM: efficient online updates in data warehouses. In: SIGMOD Conference, pp. 865–876 (2011)

  12. Becker, B., Gschwind, S., Ohler, T., Seeger, B., Widmayer, P.: An asymptotically optimal multiversion B-Tree. VLDB J. 5(4), 264–275 (1996)

  13. Bhattacharjee, B., Lim, L., Malkemus, T., Mihaila, G., Ross, K., Lau, S., McArthur, C., Toth, Z., Sherkat, R.: Efficient index compression in DB2 LUW. Proc. VLDB Endow. 2(2), 1462–1473 (2009)

  14. Bhattacharjee, B., Malkemus, T., Lau, S., Mckeough, S., Kirton, J.-A., Boeschoten, R.V., Kennedy, J.: Efficient bulk deletes for multi dimensionally clustered tables in DB2. In: VLDB, pp. 1197–1206 (2007)

  15. Bhattacharjee, B., Ross, K.A., Lang, C.A., Mihaila, G.A., Banikazemi, M.: Enhancing recovery using an SSD buffer pool extension. In: DaMoN, pp. 10–16 (2011)

  16. Bozkaya, T., Özsoyoğlu, M.: Indexing valid time intervals. Lect. Notes Comput. Sci. 1460, 541–550 (1998)

  17. Canim, M., Bhattacharjee, B., Mihaila, G.A., Lang, C.A., Ross, K.A.: An object placement advisor for DB2 using solid state storage. PVLDB 2(2), 1318–1329 (2009)

  18. Canim, M., Mihaila, G.A., Bhattacharjee, B., Ross, K.A., Lang, C.A.: SSD bufferpool extensions for database systems. PVLDB 3(2), 1435–1446 (2010)

  19. Chaudhuri, S., Narasayya, V.: Automating statistics management for query optimizers. IEEE Trans. Knowl. Data Eng. 13(1), 7–20 (2001)

  20. Chaudhuri, S., Narasayya, V.R.: An efficient cost-driven index selection tool for Microsoft SQL Server. In: Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB ’97, pp. 146–155. Morgan Kaufmann Publishers Inc., San Francisco (1997)

  21. Chen, F., Luo, T., Zhang, X.: CAFTL: a content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In: FAST, pp. 77–90 (2011)

  22. Chen, S.: Time travel query or bi-temporal. In: DB2 for z/OS Technical Forum (2010)

  23. Do, J., Zhang, D., Patel, J.M., DeWitt, D.J., Naughton, J.F., Halverson, A.: Turbocharging DBMS buffer pool using SSDs. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, SIGMOD ’11, pp. 1113–1124. ACM, New York (2011)

  24. Dou, A.J., Lin, S., Kalogeraki, V.: Real-time querying of historical data in flash-equipped sensor devices. In: IEEE Real-Time Systems Symposium, pp. 335–344 (2008)

  25. Drossel, G.: Methodologies for calculating SSD usable life. In: Storage Developer Conference (2009)

  26. Elmasri, R., Wuu, G.T.J., Kouramajian, V.: The time index and the monotonic B+-tree. In: Temporal Databases, pp. 433–456 (1993)

  27. Fusion-io breaks one billion IOPS barrier. http://www.fusionio.com/press-releases/fusion-io-breaks-one-billion-iops-barrier/

  28. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems: The Complete Book, 2nd edn. Prentice Hall Press, Upper Saddle River, NJ (2008)

  29. The GiST indexing project. http://gist.cs.berkeley.edu/

  30. Gunadhi, H., Segev, A.: Efficient indexing methods for temporal relations. IEEE Trans. Knowl. Data Eng. 5(3), 496 (1993)

  31. Hellerstein, J.M., Naughton, J.F., Pfeffer, A.: Generalized search trees for database systems. In: Proceedings of the 21st International Conference on Very Large Data Bases, VLDB ’95, pp. 562–573. Morgan Kaufmann Publishers Inc., San Francisco (1995)

  32. Hinshaw, F.D., Harris, C.S., Sarin, S.K.: Controlling visibility in multi-version database systems. US 7305386 Patent, Netezza Corporation (2007)

  33. Hitz, D., Lau, J., Malcolm, M.: File system design for an NFS file server appliance. In: Proceedings of the USENIX Winter 1994 Technical Conference, WTEC’94, p. 19. USENIX Association, Berkeley (1994)

  34. DB2 10 for z/OS. ftp://public.dhe.ibm.com/software/systemz/whitepapers/DB210_for_zOS_Upgrade_ebook

  35. Inmon, W.H.: Building the Operational Data Store, 2nd edn. Wiley, New York (1999)

  36. Jouini, K., Jomier, G.: Indexing multiversion databases. In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM ’07, pp. 915–918. ACM, New York (2007)

  37. Kang, W.-H., Lee, S.-W., Moon, B.: Flash-based extended cache for higher throughput and faster recovery. PVLDB 5(11), 1615–1626 (2012)

  38. Larson, P.-A., Blanas, S., Diaconu, C., Freedman, C., Patel, J.M., Zwilling, M.: High-performance concurrency control mechanisms for main-memory databases. Proc. VLDB Endow. 5(4), 298–309 (2011)

  39. Levandoski, J.J., Lomet, D.B., Sengupta, S.: The Bw-Tree: a B-tree for new hardware platforms. In: Proceedings of the 2013 IEEE 29th International Conference on Data Engineering, ICDE ’13. IEEE Computer Society, Washington (2013)

  40. Leventhal, A.: Flash storage memory. Commun. ACM 51(7), 47–51 (2008)

  41. Li, Y., He, B., Luo, Q., Yi, K.: Tree indexing on flash disks. In: Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 1303–1306. IEEE Computer Society, Washington (2009)

  42. Lomet, D., Barga, R., Mokbel, M.F., Shegalov, G., Wang, R., Zhu, Y.: Immortal DB: transaction time support for SQL server. In: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, SIGMOD ’05, pp. 939–941. ACM, New York (2005)

  43. Lomet, D., Hong, M., Nehme, R., Zhang, R.: Transaction time indexing with version compression. Proc. VLDB Endow. 1(1), 870–881 (2008)

  44. Menon, P., Rabl, T., Sadoghi, M., Jacobsen, H.: CaSSanDra: an SSD boosted key-value store. In: IEEE 30th International Conference on Data Engineering, ICDE 2014, Chicago, IL, USA, March 31–April 4, 2014, pp. 1162–1167 (2014)

  45. Murphy, G., Compher, D.: DB2 storage observations (2011)

  46. Omiecinski, E., Liu, W., Akyildiz, I.F.: Analysis of a deferred and incremental update strategy for secondary indexes. Inf. Syst. 16(3), 345–356 (1991)

  47. O’Neil, P.E., Cheng, E., Gawlick, D., O’Neil, E.J.: The log-structured merge-tree (LSM-Tree). Acta Inf. 33(4), 351–385 (1996)

  48. Oracle database 11g workspace manager overview. http://www.oracle.com/technetwork/database/twp-appdev-workspace-manager-11g-128289

  49. Oracle total recall/flashback data archive. http://www.oracle.com/technetwork/issue-archive/2008/08-jul/flashback-data-archive-whitepaper-129145

  50. Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H.-A., Mankovskii, S.: Solving big data challenges for enterprise application performance management. Proc. VLDB Endow. 5(12), 1724–1735 (2012)

  51. Rosenblum, M., Ousterhout, J.K.: The design and implementation of a log-structured file system. ACM Trans. Comput. Syst. 10(1), 26–52 (1992)

  52. Sadoghi, M., Canim, M., Bhattacharjee, B., Nagel, F., Ross, K.A.: Reducing database locking contention through multi-version concurrency. Proc. VLDB Endow. 7(13), 1331–1342 (2014)

  53. Sadoghi, M., Ross, K.A., Canim, M., Bhattacharjee, B.: Making updates disk-I/O friendly using SSDs. Proc. VLDB Endow. 6(11), 997–1008 (2013)

  54. Salzberg, B., Tsotras, V.J.: Comparison of access methods for time-evolving data. ACM Comput. Surv. 31(2), 158–221 (1999). doi:10.1145/319806.319816

  55. Samy, V., Lu, W., Rada, A., Punit, S., Srinivasan, S.: Best practices physical database design for online transaction processing (OLTP) environments (2011)

  56. Saracco, C.M., Nicola, M., Gandhi, L.: A matter of time: temporal data management in DB2 for z/OS (2010)

  57. Sears, R., Ramakrishnan, R.: bLSM: a general purpose log structured merge tree. In: SIGMOD Conference, pp. 217–228 (2012)

  58. Shen, H., Ooi, B.C., Lu, H.: The TP-Index: a dynamic and efficient indexing mechanism for temporal databases. In: Proceedings of the Tenth International Conference on Data Engineering, pp. 274–281. IEEE (1994)

  59. Snodgrass, R.T.: A case study of temporal data. Teradata Corporation, Dayton (2010)

  60. TPC-H, decision support benchmark. http://www.tpc.org/tpch/

  61. Vo, H.T., Wang, S., Agrawal, D., Chen, G., Ooi, B.C.: LogBase: a scalable log-structured database system in the cloud. PVLDB 5(10), 1004–1015 (2012)

  62. Volos, H., Tack, A.J., Swift, M.M.: Mnemosyne: lightweight persistent memory. In: Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pp. 91–104. ACM, New York (2011)

  63. Wu, C.-H., Kuo, T.-W., Chang, L.-P.: An efficient B-tree layer implementation for flash-memory storage systems. ACM Trans. Embedded Comput. Syst. 6(3) (2007). doi:10.1145/1275986.1275991

Author information

Corresponding author

Correspondence to Mustafa Canim.

About this article

Cite this article

Sadoghi, M., Ross, K.A., Canim, M. et al. Exploiting SSDs in operational multiversion databases. The VLDB Journal 25, 651–672 (2016). https://doi.org/10.1007/s00778-015-0410-5
