Zen+: a robust NUMA-aware OLTP engine optimized for non-volatile main memory

Liu, Gang; Chen, Leying; Chen, Shimin

doi:10.1007/s00778-022-00737-1

Regular Paper
Published: 06 April 2022

The VLDB Journal, Volume 32, pages 123–148 (2023)

Abstract

Emerging non-volatile memory (NVM) technologies such as 3D XPoint promise significant performance gains for OLTP databases. However, transactional databases need to be redesigned, because two key assumptions no longer hold: that non-volatile storage is orders of magnitude slower than DRAM, and that it supports only block-oriented accesses. NVM is byte-addressable and almost as fast as DRAM, and its capacity is 4-16x larger than that of DRAM. These characteristics make it possible to build OLTP databases entirely in NVM main memory. This paper studies the structure of OLTP engines with hybrid NVM and DRAM memory. We identify three challenges in designing an OLTP engine for NVM: tuple metadata modifications, NVM write redundancy, and NVM space management. We propose Zen, a high-throughput log-free OLTP engine for NVM. Zen addresses the three design challenges with three novel techniques: a metadata-enhanced tuple cache, log-free persistent transactions, and light-weight NVM space management. We further propose Zen+, which extends Zen with two mechanisms, MVCC-based adaptive execution and NUMA-aware soft partition, to robustly and effectively support long-running transactions and NUMA architectures. Experimental results on a real machine equipped with Intel Optane DC Persistent Memory show that, compared with existing solutions that run an OLTP database as large as the size of NVM, Zen achieves a 1.0x-10.1x improvement while attaining fast failure recovery, and supports ten types of concurrency control methods. Experiments also demonstrate that Zen+ robustly supports long-running transactions and efficiently exploits NUMA architectures.

Notes

For simplicity, Zen assumes that the tuple size is fixed. For example, varchar(n) can be regarded as char(n). We discuss how to support variable-sized tuples in Sect. 6.
Only 48 bits in a 64-bit address are used in current systems. The highest bit is always 0 in user-mode programs.
T must have committed. If T were still running, E’s active bit would be 1 and E could not be chosen as the victim. If T had aborted, T would have cleared E’s copy bit.
Please note that the choice of MVCC-style concurrency control method is only required by Zen+’s support for long-running transactions. Other techniques in this paper can flexibly support a wide range of concurrency control methods.
A per-page counter can be kept in the NVM-tuple manager to keep track of the number of allocated slots in the page. The counter is updated for tuple allocations and frees. When the counter decreases to 0, we can return the page to the NVM page manager.
This is similar to the interleaved NUMA allocation policy in the operating system. However, when NVM is in the App Direct mode, the OS policy cannot be directly applied to NVM.
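
The per-page counter described in the note on NVM space management can be illustrated with a minimal sketch. This is not the paper's implementation: the class name `PageSlotCounter`, the `slots_per_page` parameter, and the `returned_pages` list (standing in for handing a page back to the NVM page manager) are all hypothetical names chosen for illustration.

```python
class PageSlotCounter:
    """Tracks allocated tuple slots per page (all names are hypothetical)."""

    def __init__(self, slots_per_page):
        self.slots_per_page = slots_per_page
        self.allocated = {}        # page id -> number of allocated slots
        self.returned_pages = []   # stand-in for returning pages to the page manager

    def allocate(self, page_id):
        # A tuple allocation increments the page's counter.
        count = self.allocated.get(page_id, 0)
        if count >= self.slots_per_page:
            raise ValueError("page is full")
        self.allocated[page_id] = count + 1

    def free(self, page_id):
        # A tuple free decrements the page's counter.
        count = self.allocated.get(page_id, 0)
        if count == 0:
            raise ValueError("no allocated slots on this page")
        count -= 1
        if count == 0:
            # The counter reached 0, so the page can be returned.
            del self.allocated[page_id]
            self.returned_pages.append(page_id)
        else:
            self.allocated[page_id] = count


counter = PageSlotCounter(slots_per_page=2)
counter.allocate(7)
counter.allocate(7)
counter.free(7)
counter.free(7)
print(counter.returned_pages)  # [7]
```

The sketch keeps the counters in an ordinary dictionary and ignores concurrency and persistence, both of which a real engine would have to address.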

Acknowledgements

This work is partially supported by the National Key R&D Program of China (2018YFB1003303) and the Natural Science Foundation of China (62172390). Shimin Chen is the corresponding author.

Author information

Authors and Affiliations

SKL of Computer Architecture, ICT, CAS, University of Chinese Academy of Sciences, Beijing, China
Gang Liu, Leying Chen & Shimin Chen

Corresponding author

Correspondence to Shimin Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Liu, G., Chen, L. & Chen, S. Zen+: a robust NUMA-aware OLTP engine optimized for non-volatile main memory. The VLDB Journal 32, 123–148 (2023). https://doi.org/10.1007/s00778-022-00737-1

Received: 08 July 2021
Revised: 15 February 2022
Accepted: 22 February 2022
Published: 06 April 2022
Issue Date: January 2023
DOI: https://doi.org/10.1007/s00778-022-00737-1
