
Exploiting SSDs in operational multiversion databases

  • Special Issue Paper

Abstract

Multiversion databases store both current and historical data. Rows are typically annotated with timestamps representing the period when the row is/was valid. We develop novel techniques to reduce index maintenance in multiversion databases, so that indexes can be used effectively for analytical queries over current data without being a heavy burden on transaction throughput. To this end, we re-design persistent index data structures in the storage hierarchy to employ an extra level of indirection. The indirection level is stored on solid-state disks that can support very fast random I/Os, so that traversing the extra level of indirection incurs a relatively small overhead. The extra level of indirection dramatically reduces the number of magnetic disk I/Os that are needed for index updates and localizes maintenance to indexes on updated attributes. Additionally, we batch insertions within the indirection layer in order to reduce physical disk I/Os for indexing new records. In this work, we further exploit SSDs by introducing novel DeltaBlock techniques for storing the recent changes to data on SSDs. Using our DeltaBlock, we propose an efficient method to periodically flush the recently changed data from SSDs to HDDs such that, on the one hand, we keep track of every change (or delta) for every record and, on the other hand, we avoid redundantly storing the unchanged portion of updated records. By reducing the index maintenance overhead on transactions, we enable operational data stores to create more indexes to support queries. We have developed a prototype of our indirection proposal by extending the widely used generalized search tree (GiST) open-source project, which is also employed in PostgreSQL. Our working implementation demonstrates that we can reduce index maintenance and/or query processing cost by a factor of 3. For the insertion of new records, our novel batching technique can save up to 90% of the insertion time. For updates, our prototype demonstrates that we can reduce the database size by up to 80% even with a modest space allocated for DeltaBlocks on SSDs.
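
To make the indirection and batching ideas concrete, the sketch below models them in Python. It is a minimal illustration under our own naming (IndirectionMap, SecondaryIndex, LIDBlock, capacity), not the authors' GiST-based prototype: HDD-resident indexes store stable logical identifiers (LIDs), an SSD-resident table maps each LID to the current physical row identifier (RID), and a LIDBlock buffers new LIDs so that HDD index insertions happen in batches.

# Minimal sketch of the indirection idea; all names are illustrative,
# not the paper's implementation.

class IndirectionMap:
    """SSD-resident LID -> RID mapping (modeled here as a dict)."""
    def __init__(self):
        self._lid_to_rid = {}

    def lookup(self, lid):           # one fast SSD random read
        return self._lid_to_rid[lid]

    def update(self, lid, new_rid):  # one SSD write; HDD indexes untouched
        self._lid_to_rid[lid] = new_rid


class SecondaryIndex:
    """HDD-resident index mapping key -> LIDs (stable across updates)."""
    def __init__(self):
        self._entries = {}

    def insert(self, key, lid):
        self._entries.setdefault(key, set()).add(lid)

    def search(self, key, imap):
        # Traverse the extra level of indirection to reach current rows.
        return {imap.lookup(lid) for lid in self._entries.get(key, set())}


class LIDBlock:
    """SSD-resident buffer that batches new LIDs before a bulk HDD
    index update, amortizing disk I/O over many insertions."""
    def __init__(self, capacity=256):
        self.capacity = capacity
        self.pending = []

    def add(self, lid, index, key):
        self.pending.append((key, lid))
        if len(self.pending) >= self.capacity:   # flush in one batch
            for k, l in self.pending:
                index.insert(k, l)
            self.pending.clear()


imap, price_idx = IndirectionMap(), SecondaryIndex()
imap.update(lid=1, new_rid=("page7", "slot3"))
price_idx.insert(key=9.99, lid=1)

# An update relocates the row (e.g., a new version is appended): only
# the SSD mapping entry changes; indexes on unmodified attributes are
# left untouched.
imap.update(lid=1, new_rid=("page42", "slot0"))
assert price_idx.search(9.99, imap) == {("page42", "slot0")}

lid_block = LIDBlock(capacity=2)
lid_block.add(lid=2, index=price_idx, key=19.99)  # buffered on SSD
lid_block.add(lid=3, index=price_idx, key=19.99)  # triggers batch flush
assert lid_block.pending == []

Because an update rewrites only the LID-to-RID entry on the SSD, every HDD index on an unmodified attribute is untouched, which is the source of the reduced maintenance cost.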


Notes

  1. In the case of the modified attribute, the position of the record in the index may also change.

  2. If one wanted to cluster the main table by a temporal attribute to improve temporal locality, then updates would not be in-place and additional indexes would need to be updated. Our proposed solution would reduce the burden of such index updates.

  3. Frequently accessed LIDs would naturally be cached in RAM by the database bufferpool manager, further reducing the overhead.

  4. One can shift the burden of LIDBlock flushing outside of running transactions by triggering an asynchronous flush once a LIDBlock becomes full (or nearly full), rather than waiting until it overflows.

  5. Notably, the cost of additional SSD I/Os is negligible with respect to the cost of HDD I/O.

  6. A single DeltaBlock may hold several small updates (e.g., updating only one column at a time) or may hold a single large update (e.g., updating many columns in one transaction).

  7. This might be too pessimistic, particularly for LIDBlocks and DeltaBlocks that could be small enough to be cached in RAM.

  8. For DeltaBlock techniques with the indirect chaining method, only one pointer is needed to point to the last delta, and this pointer can be included as part of the indirection mapping; thus, no additional read is required (see the sketch following these notes).

  9. Amortization is expected to be high at the tail of a table, where insertions are concentrated.

  10. For example, the average record size in the LINEITEM table in TPC-H is around 150 bytes when start/end timestamps for tracking versions are also accounted for [60].

  11. Toshiba Enterprise SSD (eSSD): http://toshiba.semicon-storage.com/us/product/storage-products/enterprise-ssd/px02ssb-px02ssfxxx.html

  12. Device characteristics are based on an 80 GB SLC FusionIO device.

  13. Again, this estimate is pessimistic because it assumes no RAM caching of SSD pages.

  14. We based our calculation on Toshiba PX02SSF010 eMLC 100 GB, but we limit ourselves to only 80 GB for comparable analysis with our FusionIO device.

  15. The recovery of LIDBlocks and DeltaBlocks follows the same principle.

  16. Our SLC FusionIO can support up to 88,000 I/O operations per second at 50 μs latency. While the FusionIO devices are high-end devices that are relatively expensive, we remark that recent SSDs such as the Intel 520 series can store about 500 GB, cost about $500, and can support 50,000 I/O operations per second at 85 μs latency, more than sufficient for the workloads described here.

  17. We do not include old versions of records for these experiments; in a true multiversion database the base tables would contain some old record versions that would need to be filtered out during the scan, using the valid time attributes.

  18. The starting value of the range in our range queries is chosen randomly.

  19. This behavior can be avoided if optimizer hints are provided.

  20. If an indexed attribute is updated, then extra I/O is needed for that index to relocate the index entry. Indexes that are neither traversed nor updated in the Indirection method are said to be “unaffected.”

  21. Among our DeltaBlock techniques, Indirect Chained DeltaBlock provides the best balance between performance and flexibility; we therefore adopt it as the representative DeltaBlock approach in our empirical studies.

  22. Except when a very small weight is assigned to update time.

  23. In fact, our indirection mapping can be seen as yet another unique index that is accessed by every query. Thus, the object placement optimization in [17] can be utilized to determine the globally optimal placement of our indirection object.
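
Notes 6 and 8 describe the indirect chained DeltaBlock layout. The following Python fragment is a minimal, hypothetical rendering of that chaining, with our own names (Delta, IndirectionEntry) and the assumption from note 8 that the indirection entry itself carries the pointer to the most recent delta: each update stores only the changed columns, and reconstruction replays the chain oldest-to-newest over the HDD-resident base row.

# Illustrative sketch of indirect chained DeltaBlocks (names are ours,
# not the paper's implementation). Each SSD-resident delta records only
# the changed columns plus a pointer to the previous delta; the
# indirection entry keeps a single pointer to the *last* delta, so
# reaching the newest changes needs no extra read (cf. note 8).

class Delta:
    def __init__(self, changed_cols, prev):
        self.changed_cols = changed_cols  # e.g., {"price": 10.49}
        self.prev = prev                  # pointer to the previous delta


class IndirectionEntry:
    def __init__(self, rid):
        self.rid = rid          # physical location of the base row on HDD
        self.last_delta = None  # head of the delta chain on SSD

    def apply_update(self, changed_cols):
        # One small SSD write; unchanged columns are never re-stored.
        self.last_delta = Delta(changed_cols, self.last_delta)

    def reconstruct(self, base_row):
        # Walk the chain newest-to-oldest, then replay oldest-to-newest
        # on top of the HDD base row.
        chain, d = [], self.last_delta
        while d is not None:
            chain.append(d)
            d = d.prev
        row = dict(base_row)
        for d in reversed(chain):
            row.update(d.changed_cols)
        return row


entry = IndirectionEntry(rid=("page7", "slot3"))
entry.apply_update({"price": 10.49})           # small update (one column)
entry.apply_update({"price": 9.99, "qty": 4})  # larger update (cf. note 6)
assert entry.reconstruct({"price": 12.0, "qty": 1}) == {"price": 9.99, "qty": 4}

The periodic SSD-to-HDD flush described in the abstract would fold such a chain into the base row and truncate it; that step is omitted from this sketch.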

References

  1. BioPostgres: Data management for computational biology. http://www.biopostgres.org/

  2. IBM DB2 Database for Linux, UNIX, and Windows. www.ibm.com/software/data/db2/linux-unix-windows/

  3. IBM DB2 with BLU Acceleration. www.ibm.com/software/data/db2/linux-unix-windows/db2-blu-acceleration/

  4. OpenFTS: Open source full text search engine. http://openfts.sourceforge.net/

  5. PostGIS: Geographic information systems. http://postgis.refractions.net/

  6. PostgreSQL: Open source object-relational database system. http://www.postgresql.org/

  7. YAGO2: High-quality knowledge base. http://www.mpi-inf.mpg.de/yago-naga/yago/

  8. Agrawal, D., Ganesan, D., Sitaraman, R.K., Diao, Y., Singh, S.: Lazy-adaptive tree: an optimized index structure for flash devices. PVLDB 2(1), 361–372 (2009)

  9. Ang, C.-H., Tan, K.-P.: The interval B-tree. Inf. Process. Lett. 53(2), 85–89 (1995)

  10. Arpaci-Dusseau, R., Arpaci-Dusseau, A.: Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books, 0.5 edition (2012)

  11. Athanassoulis, M., Chen, S., Ailamaki, A., Gibbons, P.B., Stoica, R.: MaSM: efficient online updates in data warehouses. In: SIGMOD Conference, pp. 865–876 (2011)

  12. Becker, B., Gschwind, S., Ohler, T., Seeger, B., Widmayer, P.: An asymptotically optimal multiversion B-Tree. VLDB J. 5(4), 264–275 (1996)

  13. Bhattacharjee, B., Lim, L., Malkemus, T., Mihaila, G., Ross, K., Lau, S., McArthur, C., Toth, Z., Sherkat, R.: Efficient index compression in DB2 LUW. Proc. VLDB Endow. 2(2), 1462–1473 (2009)

  14. Bhattacharjee, B., Malkemus, T., Lau, S., Mckeough, S., Kirton, J.-A., Boeschoten, R.V., Kennedy, J.: Efficient bulk deletes for multi dimensionally clustered tables in DB2. In: VLDB, pp. 1197–1206 (2007)

  15. Bhattacharjee, B., Ross, K.A., Lang, C.A., Mihaila, G.A., Banikazemi, M.: Enhancing recovery using an SSD buffer pool extension. In: DaMoN, pp. 10–16 (2011)

  16. Bozkaya, T., Özsoyoğlu, M.: Indexing valid time intervals. Lect. Notes Comput. Sci. 1460, 541–550 (1998)

  17. Canim, M., Bhattacharjee, B., Mihaila, G.A., Lang, C.A., Ross, K.A.: An object placement advisor for DB2 using solid state storage. PVLDB 2(2), 1318–1329 (2009)

  18. Canim, M., Mihaila, G.A., Bhattacharjee, B., Ross, K.A., Lang, C.A.: SSD bufferpool extensions for database systems. PVLDB 3(2), 1435–1446 (2010)

  19. Chaudhuri, S., Narasayya, V.: Automating statistics management for query optimizers. IEEE Trans. Knowl. Data Eng. 13(1), 7–20 (2001)

  20. Chaudhuri, S., Narasayya, V.R.: An efficient cost-driven index selection tool for Microsoft SQL Server. In: Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB ’97, pp. 146–155. Morgan Kaufmann Publishers Inc., San Francisco (1997)

  21. Chen, F., Luo, T., Zhang, X.: CAFTL: a content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In: FAST, pp. 77–90 (2011)

  22. Chen, S.: Time travel query or bi-temporal. In: DB2 for z/OS Technical Forum (2010)

  23. Do, J., Zhang, D., Patel, J.M., DeWitt, D.J., Naughton, J.F., Halverson, A.: Turbocharging DBMS buffer pool using SSDs. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, SIGMOD ’11, pp. 1113–1124. ACM, New York (2011)

  24. Dou, A.J., Lin, S., Kalogeraki, V.: Real-time querying of historical data in flash-equipped sensor devices. In: IEEE Real-Time Systems Symposium, pp. 335–344 (2008)

  25. Drossel, G.: Methodologies for calculating SSD usable life. In: Storage Developer Conference (2009)

  26. Elmasri, R., Wuu, G.T.J., Kouramajian, V.: The time index and the monotonic B+-tree. In: Temporal Databases, pp. 433–456 (1993)

  27. Fusion-io breaks one billion IOPS barrier. http://www.fusionio.com/press-releases/fusion-io-breaks-one-billion-iops-barrier/

  28. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems: The Complete Book, 2nd edn. Prentice Hall Press, Upper Saddle River, NJ (2008)

  29. The GiST indexing project. http://gist.cs.berkeley.edu/

  30. Gunadhi, H., Segev, A.: Efficient indexing methods for temporal relations. IEEE Trans. Knowl. Data Eng. 5(3), 496 (1993)

  31. Hellerstein, J.M., Naughton, J.F., Pfeffer, A.: Generalized search trees for database systems. In: Proceedings of the 21st International Conference on Very Large Data Bases, VLDB ’95, pp. 562–573. Morgan Kaufmann Publishers Inc., San Francisco (1995)

  32. Hinshaw, F.D., Harris, C.S., Sarin, S.K.: Controlling visibility in multi-version database systems. US 7305386 Patent, Netezza Corporation (2007)

  33. Hitz, D., Lau, J., Malcolm, M.: File system design for an NFS file server appliance. In: Proceedings of the USENIX Winter 1994 Technical Conference, WTEC’94, p. 19. USENIX Association, Berkeley (1994)

  34. DB2 10 for z/OS. ftp://public.dhe.ibm.com/software/systemz/whitepapers/DB210_for_zOS_Upgrade_ebook

  35. Inmon, W.H.: Building the Operational Data Store, 2nd edn. Wiley, New York (1999)

  36. Jouini, K., Jomier, G.: Indexing multiversion databases. In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM ’07, pp. 915–918. ACM, New York (2007)

  37. Kang, W.-H., Lee, S.-W., Moon, B.: Flash-based extended cache for higher throughput and faster recovery. PVLDB 5(11), 1615–1626 (2012)

  38. Larson, P.-A., Blanas, S., Diaconu, C., Freedman, C., Patel, J.M., Zwilling, M.: High-performance concurrency control mechanisms for main-memory databases. Proc. VLDB Endow. 5(4), 298–309 (2011)

  39. Levandoski, J.J., Lomet, D.B., Sengupta, S.: The Bw-Tree: a B-tree for new hardware platforms. In: Proceedings of the 2013 IEEE 29th International Conference on Data Engineering, ICDE ’13. IEEE Computer Society, Washington (2013)

  40. Leventhal, A.: Flash storage memory. Commun. ACM 51(7), 47–51 (2008)

  41. Li, Y., He, B., Luo, Q., Yi, K.: Tree indexing on flash disks. In: Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 1303–1306. IEEE Computer Society, Washington (2009)

  42. Lomet, D., Barga, R., Mokbel, M.F., Shegalov, G., Wang, R., Zhu, Y.: Immortal DB: transaction time support for SQL server. In: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, SIGMOD ’05, pp. 939–941. ACM, New York (2005)

  43. Lomet, D., Hong, M., Nehme, R., Zhang, R.: Transaction time indexing with version compression. Proc. VLDB Endow. 1(1), 870–881 (2008)

  44. Menon, P., Rabl, T., Sadoghi, M., Jacobsen, H.: CaSSanDra: an SSD boosted key-value store. In: IEEE 30th International Conference on Data Engineering, ICDE 2014, Chicago, IL, USA, March 31–April 4, 2014, pp. 1162–1167 (2014)

  45. Murphy, G., Compher, D.: DB2 storage observations (2011)

  46. Omiecinski, E., Liu, W., Akyildiz, I.F.: Analysis of a deferred and incremental update strategy for secondary indexes. Inf. Syst. 16(3), 345–356 (1991)

  47. O’Neil, P.E., Cheng, E., Gawlick, D., O’Neil, E.J.: The log-structured merge-tree (LSM-Tree). Acta Inf. 33(4), 351–385 (1996)

  48. Oracle database 11g workspace manager overview. http://www.oracle.com/technetwork/database/twp-appdev-workspace-manager-11g-128289

  49. Oracle total recall/flashback data archive. http://www.oracle.com/technetwork/issue-archive/2008/08-jul/flashback-data-archive-whitepaper-129145

  50. Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H.-A., Mankovskii, S.: Solving big data challenges for enterprise application performance management. Proc. VLDB Endow. 5(12), 1724–1735 (2012)

  51. Rosenblum, M., Ousterhout, J.K.: The design and implementation of a log-structured file system. ACM Trans. Comput. Syst. 10(1), 26–52 (1992)

  52. Sadoghi, M., Canim, M., Bhattacharjee, B., Nagel, F., Ross, K.A.: Reducing database locking contention through multi-version concurrency. Proc. VLDB Endow. 7(13), 1331–1342 (2014)

  53. Sadoghi, M., Ross, K.A., Canim, M., Bhattacharjee, B.: Making updates disk-I/O friendly using SSDs. Proc. VLDB Endow. 6(11), 997–1008 (2013)

  54. Salzberg, B., Tsotras, V.J.: Comparison of access methods for time-evolving data. ACM Comput. Surv. 31(2), 158–221 (1999). doi:10.1145/319806.319816

  55. Samy, V., Lu, W., Rada, A., Punit, S., Srinivasan, S.: Best practices physical database design for online transaction processing (OLTP) environments (2011)

  56. Saracco, C.M., Nicola, M., Gandhi, L.: A matter of time: temporal data management in DB2 for z/OS (2010)

  57. Sears, R., Ramakrishnan, R.: bLSM: a general purpose log structured merge tree. In: SIGMOD Conference, pp. 217–228 (2012)

  58. Shen, H., Ooi, B.C., Lu, H.: The TP-Index: a dynamic and efficient indexing mechanism for temporal databases. In: Proceedings of the Tenth International Conference on Data Engineering, pp. 274–281. IEEE (1994)

  59. Snodgrass, R.T.: A case study of temporal data. Teradata Corporation, Dayton (2010)

  60. TPC-H, decision support benchmark. http://www.tpc.org/tpch/

  61. Vo, H.T., Wang, S., Agrawal, D., Chen, G., Ooi, B.C.: LogBase: a scalable log-structured database system in the cloud. PVLDB 5(10), 1004–1015 (2012)

  62. Volos, H., Tack, A.J., Swift, M.M.: Mnemosyne: lightweight persistent memory. In: Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pp. 91–104. ACM, New York (2011)

  63. Wu, C.-H., Kuo, T.-W., Chang, L.-P.: An efficient B-tree layer implementation for flash-memory storage systems. ACM Trans. Embedded Comput. Syst. 6(3) (2007). doi:10.1145/1275986.1275991

Author information

Corresponding author

Correspondence to Mustafa Canim.

About this article

Cite this article

Sadoghi, M., Ross, K.A., Canim, M. et al. Exploiting SSDs in operational multiversion databases. The VLDB Journal 25, 651–672 (2016). https://doi.org/10.1007/s00778-015-0410-5
