
Zen+: a robust NUMA-aware OLTP engine optimized for non-volatile main memory

  • Regular Paper
The VLDB Journal

Abstract

Emerging non-volatile memory (NVM) technologies such as 3DXpoint promise significant performance potential for OLTP databases. However, transactional databases need to be redesigned because two key assumptions no longer hold: that non-volatile storage is orders of magnitude slower than DRAM, and that it supports only block-oriented access. NVM is byte-addressable and almost as fast as DRAM, and its capacity is much larger (4-16x) than that of DRAM. These characteristics make it possible to build OLTP databases entirely in NVM main memory. This paper studies the structure of OLTP engines with hybrid NVM and DRAM memory. We observe three challenges in designing an OLTP engine for NVM: tuple metadata modifications, NVM write redundancy, and NVM space management. We propose Zen, a high-throughput log-free OLTP engine for NVM. Zen addresses the three design challenges with three novel techniques: a metadata-enhanced tuple cache, log-free persistent transactions, and light-weight NVM space management. We further propose Zen+, which extends Zen with two mechanisms, MVCC-based adaptive execution and NUMA-aware soft partition, to robustly and effectively support long-running transactions and NUMA architectures. Experimental results on a real machine equipped with Intel Optane DC Persistent Memory show that, compared with existing solutions that run an OLTP database as large as the size of NVM, Zen achieves a 1.0x-10.1x improvement while attaining fast failure recovery, and supports ten types of concurrency control methods. Experiments also demonstrate that Zen+ robustly supports long-running transactions and efficiently exploits NUMA architectures.

Notes

  1. For simplicity, Zen assumes that the tuple size is fixed. For example, varchar(n) can be regarded as char(n). We discuss how to support variable-sized tuples in Sect. 6.

  2. Only 48 bits in a 64-bit address are used in current systems. The highest bit is always 0 in user-mode programs.

  3. T must have committed. If T were still running, then E’s active bit would be 1 and E could not be chosen as the victim. If T had aborted, then T would have cleared E’s copy bit.

  4. Please note that the choice of MVCC-style concurrency control method is only required by Zen+’s support for long-running transactions. Other techniques in this paper can flexibly support a wide range of concurrency control methods.

  5. A per-page counter can be kept in the NVM-tuple manager to keep track of the number of allocated slots in the page. The counter is updated for tuple allocations and frees. When the counter decreases to 0, we can return the page to the NVM page manager.

  6. This is similar to the interleaved NUMA allocation policy in the operating system. However, when NVM is in the App Direct mode, the OS policy cannot be directly applied to NVM.
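
The per-page counter described in note 5 can be illustrated with a minimal sketch. This is not the paper's implementation: the class names `NvmPageManager` and `NvmTupleManager`, their methods, and the in-memory dictionaries are hypothetical stand-ins chosen for illustration, and the sketch ignores persistence and concurrency entirely. It only shows the bookkeeping: the counter is incremented on tuple allocation, decremented on free, and the page is returned to the page manager when the counter reaches 0.

```python
class NvmPageManager:
    """Hypothetical stand-in for the NVM page manager: hands out and takes back page ids."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))

    def allocate_page(self):
        return self.free_pages.pop()

    def return_page(self, page_id):
        self.free_pages.append(page_id)


class NvmTupleManager:
    """Hypothetical tuple manager that keeps one allocation counter per page."""

    def __init__(self, page_manager, slots_per_page):
        self.page_manager = page_manager
        self.slots_per_page = slots_per_page
        self.allocated = {}   # page_id -> number of allocated slots (the per-page counter)
        self.free_slots = {}  # page_id -> list of free slot indexes in that page

    def allocate_tuple(self):
        # Reuse a free slot in a page we already hold, if there is one.
        for page_id, slots in self.free_slots.items():
            if slots:
                self.allocated[page_id] += 1
                return (page_id, slots.pop())
        # Otherwise obtain a fresh page from the page manager.
        page_id = self.page_manager.allocate_page()
        self.free_slots[page_id] = list(range(self.slots_per_page))
        self.allocated[page_id] = 1
        return (page_id, self.free_slots[page_id].pop())

    def free_tuple(self, page_id, slot):
        self.free_slots[page_id].append(slot)
        self.allocated[page_id] -= 1
        # When the counter drops to 0, the page holds no tuples: give it back.
        if self.allocated[page_id] == 0:
            del self.allocated[page_id]
            del self.free_slots[page_id]
            self.page_manager.return_page(page_id)
```

In this sketch, freeing the last tuple of a page is the only event that triggers a page return, so the cost of tracking page occupancy is one counter update per allocation or free.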

Acknowledgements

This work is partially supported by the National Key R&D Program of China (2018YFB1003303) and the Natural Science Foundation of China (62172390). Shimin Chen is the corresponding author.

Author information

Corresponding author

Correspondence to Shimin Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Liu, G., Chen, L. & Chen, S. Zen+: a robust NUMA-aware OLTP engine optimized for non-volatile main memory. The VLDB Journal 32, 123–148 (2023). https://doi.org/10.1007/s00778-022-00737-1

  • DOI: https://doi.org/10.1007/s00778-022-00737-1
