1 Introduction

The development of data management and analysis systems benefits greatly from advances in hardware and software technologies. Hardware and software are the two major elements that constitute a computing system. Software performance benefits directly from advances in hardware, but it is also limited by the characteristics of that hardware. Therefore, trade-offs are unavoidable in the design of software frameworks and systems. Meanwhile, the demand for software performance also drives advances and innovations in hardware technologies. For data management and analysis systems, the hardware determines the performance limit of data access and query processing. To fully utilize the hardware, the software should tailor the design of its algorithms and data structures to the hardware. Both traditional compute-intensive applications and recent data-intensive applications place high demands on hardware characteristics such as access latency, capacity, bandwidth, energy consumption, and cost performance. Under such varied application workloads, traditional data management and analysis techniques are facing unprecedented challenges. Essentially, the challenges posed by big data stem from a mismatch: existing data processing infrastructures are incapable of meeting the diverse demands of data processing.

Data processing infrastructure generally includes the underlying hardware environment and the upper software system. The design principles, architecture selection, core functions, strategy mode, and optimization techniques of the upper software system largely depend on the computer hardware. Today’s hardware technologies and environments are undergoing dramatic changes. Specifically, the advent of high-performance processors and hardware accelerators, new nonvolatile memory (NVM), and high-speed networks is rapidly changing the foundation on which traditional data management and analysis systems are built. This new hardware is expected to reshape the architecture of the entire computer system and to overturn the assumptions of the upper software. It also requires that the architecture of data management and analysis software, and the related technologies, be hardware-aware.

2 The Trend of Hardware

In recent years, storage, processor, and network technologies have achieved major breakthroughs. As shown in Fig. 1, a growing set of new hardware, architectures, and features is becoming the foundation of future computing platforms. Current trends indicate that these technologies, including high-performance processors and hardware accelerators, NVM, and RDMA-capable (remote direct memory access) networks, are significantly changing the underlying environment of traditional data management and analysis systems. Notably, the emerging environments, marked by heterogeneous multi-core architectures and hybrid storage hierarchies, make the already complicated software design space even more complex [1,2,3,4].

Fig. 1 New hardware and environment

2.1 The Trend of Processor Technologies

The development of processor technology spans more than 40 years. Its roadmap has shifted from scale-up to scale-out: the aim has moved away from pursuing higher clock speeds and toward placing more cores on each processor. In the era of serial computing, with transistor budgets growing according to Moore’s law, continuously raising the clock frequency of the processor was one of the most important ways to improve computer performance. At the same time, many optimization techniques, such as instruction-level parallelism (ILP), pipelining, prefetching, branch prediction, out-of-order execution, multi-level caches, and hyper-threading, could be identified and exploited automatically by the processor and the compiler. Software could therefore enjoy free and regular performance gains in a consistent and transparent way. However, limited by heat, power consumption, the available instruction-level parallelism, manufacturing processes, and other factors, the scale-up approach has reached its ceiling.

Since 2005, high-performance processor technology has entered the multi-core era, and multi-core parallel processing has become mainstream. Although data processing capability has been significantly enhanced in multi-core architectures, software cannot gain the benefits automatically. Instead, programmers have to transform traditional serial programs into parallel programs and optimize algorithm performance for the LLC (last-level cache) of multi-core processors. Today, the performance of multi-core processors has improved significantly along with semiconductor technology. For example, the 14-nm Xeon processor currently integrates up to 24 cores, supporting up to 3.07 TB memory and 85 GB/s memory bandwidth. However, x86 processors still have the disadvantages of low integration, high power consumption, and high price. In addition, general-purpose multi-core processors can hardly meet the demands of highly concurrent applications. Processor development is therefore moving toward designs optimized for specific applications, i.e., specialized hardware accelerators.

GPUs, Xeon Phi coprocessors, and field-programmable gate arrays (FPGAs) are representative dedicated hardware accelerators. By exploiting them, parts of compute-intensive and data-intensive workloads can be offloaded from the CPU efficiently. Some fundamental hardware characteristics of these accelerators are given in Table 1. The processing environment within the computer system is undoubtedly becoming more and more complicated, and data management and analysis systems correspondingly need to seek diversified ways to adapt actively to the new situation.

Table 1 Processor characteristics

2.2 The Trend of Storage Technologies

As high-performance processor and hardware accelerator technologies develop rapidly, the performance gap between CPU and storage widens year by year [5]. The “memory wall” makes data access a non-negligible performance bottleneck. Faced with the slow I/O of traditional secondary storage devices, data management and analysis systems have had to adopt design strategies such as cache pools, concurrency control, and disk-oriented algorithms and data structures to mitigate or hide the I/O performance gap. However, I/O bottlenecks still severely constrain the processing power of data-intensive computing.

Notably, the new storage media represented by NVM [6] provide a potential avenue to break the I/O bottleneck. NVM is a general term for a class of storage technologies rather than a specific storage technology or medium. It is also referred to as storage class memory (SCM) in some research literature [7]. Typical NVMs include phase change memory (PCM), magnetoresistive random access memory (MRAM), resistive random access memory (RRAM), and ferroelectric RAM (FeRAM). Although the characteristics and manufacturing processes of these memories differ considerably, they generally share some common features, including durability, high storage density, low-latency random read/write, and fine-grained byte addressing. The specifications are given in Table 2. From a performance point of view, NVM is close to DDR memory but is also nonvolatile. Therefore, it may gradually become the main storage device, while DDR memory is used as a temporary data cache. Flash memory is already a mature technology. Take a single PCIe flash device as an example: its capacity can reach up to 12.8 TB, and its read/write performance is also high. It can therefore serve as a cache between RAM and hard disk, or replace the hard drive as a persistent storage device. In terms of energy consumption, DRAM consumes less energy than other storage devices under high load; under low load, however, it consumes more, because the entire DRAM must be refreshed. The common feature of NVMs is that they combine DRAM-like high-speed access with disk-like persistence, effectively breaking the “performance wall” that traditional storage media cannot overcome.

Table 2 Performance metrics of different storage devices

At the same time, the development of new storage technologies has also had a significant impact on processor technology. 3D stacking technology, which provides higher bandwidth, can be applied to the on-board storage of many-core processors, delivering a high-performance data cache for powerful parallel processing. With NVM technology, the multi-level hybrid storage environment will certainly upset the balance among the CPU, main memory, system bus, and external memory in the traditional computer architecture. It will also change the existing storage hierarchy and optimize critical data access paths to bridge the performance gap between storage tiers, providing new opportunities for data management and analytics [8].

2.3 The Trend of Network Technologies

In addition to the local storage I/O bottleneck, the network I/O bottleneck is also a major performance issue in the datacenter. Under traditional Ethernet networks, the limited data transmission capability and the non-trivial CPU overhead of the TCP/IP stack have severely impacted the performance of distributed data processing. As a result, the overall throughput of a distributed database system drops sharply when the proportion of distributed transactions is high, because such transactions lead to potentially heavy network I/O. Existing data management systems therefore have to resort to specific strategies such as coordinated partitioning, relaxed consistency assurance, and deterministic execution schemes to control or reduce the ratio of distributed transactions. However, most of these measures rest on unreasonable assumptions and restrictive applicability conditions, or are not transparent to application developers. In particular, the scalability of the system is still greatly limited, especially when the workload lacks distinguishable characteristics that allow it to be split into independent partitions.

It is important to note that increased contention likelihood is the most cited reason when discussing the scalability issue of distributed transactions, but in [9], the authors showed that the most important factor is the CPU overhead of the TCP/IP stack incurred by traditional Ethernet networks. In other words, software-oriented optimization will not fundamentally address the scalability issue in distributed environments. In recent years, high-performance RDMA-enabled networks have dramatically reduced network latency and allow users to bypass the CPU when transferring data over the network. InfiniBand, iWARP, and RoCE are all RDMA-enabled network protocols that, with appropriate hardware, can accelerate operations for the benefit of applications. As the price of RDMA-related hardware falls, more and more industry clusters are adopting RDMA-enabled network environments, which requires a fundamental rethinking of the design of data management and analysis systems, including but not limited to distributed query processing, transaction processing, and other core functions [10].

Although new hardware is developing in many directions and the composition of future hardware environments remains uncertain, it is foreseeable that these technologies will eventually become standard hardware components. Data management and analysis on modern hardware will become a new research hotspot.

3 Research Status

The new hardware will change traditional computing, storage, and network systems and have a direct impact on the architecture and design of data management and analysis systems. It will also pose challenges to their core functionalities and related key technologies, including indexing, analysis, and transaction processing. In the following subsections, we introduce the present state of relevant domestic and international research.

3.1 System Architecture and Design Schemes in Platforms with New Hardware

The advent of high-performance processors and new accelerators has led to a shift from single-CPU architectures to heterogeneous, hybrid processing architectures. Data processing strategies and optimization techniques have evolved from standardization to customization and from software optimization to hardware optimization. From the classical iterator-based pipeline processing model [11] to the column processing model [12] and finally to the vector processing model [13], which combines the former two, the traditional software-based data processing model has reached a mature stage. The advent of JIT (just-in-time) compilation techniques [14] and of optimization techniques that combine them with vector processing models [15] does provide new optimization space at the register level. However, as research deepens, software-based optimization techniques are gradually touching their “ceiling.” Academia and industry are beginning to explore new ways to accelerate data processing through software/hardware co-design. Instruction-level optimization [16], coprocessor query optimization [17, 18], hardware customization [19], workload hardware migration [20], increasing hardware-level parallelism [21], hardware-level operators [22], and so on are used to provide hardware-level performance optimization. However, the differences between the new processors and x86 processors fundamentally change the assumptions that traditional database software design makes about hardware. Databases are facing more complicated architectural issues on heterogeneous computing platforms. In the future, researchers need to move beyond the software-centric design philosophy that has been effective for traditional database systems.
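The contrast between the tuple-at-a-time iterator model and the vector (batch) processing model mentioned above can be illustrated with a minimal Python sketch. All names here (`scan_tuples`, `scan_batches`, `filter_sum_iterator`, `filter_sum_vectorized`) and the batch size of 1024 are illustrative assumptions and are not taken from any of the cited systems.

```python
from typing import Iterator, List


def scan_tuples(column: List[int]) -> Iterator[int]:
    """Tuple-at-a-time scan: hands out one value per step."""
    for value in column:
        yield value


def filter_sum_iterator(column: List[int], threshold: int) -> int:
    """Iterator model: every operator processes a single tuple per call."""
    selected = (v for v in scan_tuples(column) if v > threshold)
    return sum(selected)


def scan_batches(column: List[int], batch_size: int = 1024) -> Iterator[List[int]]:
    """Vectorized scan: hands out fixed-size slices of one column."""
    for start in range(0, len(column), batch_size):
        yield column[start:start + batch_size]


def filter_sum_vectorized(column: List[int], threshold: int,
                          batch_size: int = 1024) -> int:
    """Vector model: per-call overhead is amortized over a whole batch."""
    total = 0
    for batch in scan_batches(column, batch_size):
        total += sum(v for v in batch if v > threshold)
    return total


data = list(range(10_000))
iterator_result = filter_sum_iterator(data, 9_990)
vectorized_result = filter_sum_vectorized(data, 9_990)
```

Both functions compute the same answer; the difference lies only in how much work is done per operator invocation. In a real engine the inner loop over a batch would be a tight, compiler-friendly (often SIMD) loop, which pure Python cannot show.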

The design of traditional databases has to trade off among a number of important factors such as latency, storage capacity, cost effectiveness, and the choice between volatile and nonvolatile storage devices. The unique properties of NVM bring opportunities for the development of data management systems, but they also introduce new constraints. The study in [23] conducted forward-looking research in this area. Its results show that neither disk-oriented systems nor memory-oriented systems are ideally suited to an NVM-based storage hierarchy, especially when the skew in the workload is high. The authors also found that storage I/O is no longer the main performance bottleneck in the NVM storage environment [24]; instead, the significant cost of organizing, updating, and replacing data becomes the new performance bottleneck. For an NVM-based storage hierarchy, WAL (write-ahead logging) and logical logs, which are common in traditional databases, also involve many unnecessary operations [25]. These issues indicate that diversified NVM-based storage hierarchies will impose new requirements on cache replacement [26], data distribution [27], data migration [28], metadata management [29], query execution plans [30], fault recovery [25], and other aspects, for which corresponding design strategies must be explored to better adapt to the new environments. Therefore, NVM-specific or NVM-oriented architectures designed to exploit the nonvolatile property of NVM are necessary. Research in this area is just beginning; CMU’s N-Store [31] presents exploratory research on how to build a prototype database system on an NVM-based storage hierarchy.

RDMA-enabled networks are changing the assumption in traditional distributed data management systems that network I/O is the primary performance bottleneck. Some systems [32, 33] have introduced RDMA, but only as add-on optimizations applied after the fact, while the original architecture remains outdated. As a result, they cannot take full advantage of the opportunities presented by RDMA. It has been demonstrated [34] that simply migrating a legacy system to an RDMA-enabled network cannot fully exploit the benefits of the high-performance network; neither the shared-nothing architecture nor the distributed shared-memory architecture can bring out the full potential of RDMA. For a shared-nothing architecture, the optimization goal is to maximize data locality, but neither static partitioning [35] nor dynamic partitioning fundamentally resolves the problem of frequent network communication [36]. Even with IPoIB (IP over InfiniBand) support, it is difficult for a shared-nothing architecture to gain the most improvement on the data-flow model and the control-flow model simultaneously [9]. Similarly, for distributed shared-memory architectures, there is no built-in support for cache coherency; accessing a cache via RDMA can have significant performance side effects if it is mismanaged by the client [37]. Garbage collection and cache management might also be affected. In RDMA networks, a uniform abstraction for remote and local memory has proven to be inefficient [38]. Research [9] shows that emerging RDMA-enabled high-performance network technologies necessitate a fundamental redesign of the way distributed data management systems are built.

3.2 Storage and Indexing Techniques in Platforms with New Hardware

Since NVM can serve as both internal and external memory, the boundary between the two is blurred, making the composition of the underlying NVM storage diverse. Because different NVMs have their own characteristics in access latency, durability, and so on, it is theoretically possible for them to replace traditional storage media without changing the storage hierarchy [39,40,41] or to be mixed with them [42, 43]. In addition, NVMs can enrich the storage hierarchy as an LLC [44] or as a new cache layer between RAM and HDD [45], which further reduces read/write latency across storage tiers. The variety of storage environments also adds considerable complexity to implementing data management and analysis techniques.

From the data management perspective, how to integrate NVM into the I/O stack is a very important research topic. There are two typical ways to abstract NVM: as a persistent heap [46] or as a file system [47]. Because NVM can exchange data directly with the processor over a memory bus or a dedicated bus, memory objects can be created directly on the heap without being serialized to disk. Typical research works include NV-heap [46], Mnemosyne [48], and HEAPO [49]. The NVM memory environment also brings a series of new problems to be solved, such as ordering [47], atomic operations [50], and consistency guarantees [51]. In contrast, a file-based NVM abstraction can take advantage of the semantics of existing file systems, such as namespaces, access control, and read–write protection. In its design, however, besides taking full advantage of the fine-grained addressing and in-place update capability of NVM, the impact of frequent writes on NVM lifetime needs to be considered [52]. Moreover, the file abstraction has a long data access path, which implies some unnecessary software overhead [53]. PRAMFS [54], BPFS [47], and SIMFS [55] are all typical NVM-oriented file systems. Either abstraction, the persistent memory heap or the file system, can become the basic building block for upper-level data processing technology [56]. However, both can only provide low-level assurance of atomicity. Some high-level features (such as transaction semantics [57] and nonvolatile data structures [58]) also require corresponding changes and improvements in the upper data management system.

The different performance characteristics of NVMs will affect the data access and processing strategies of heterogeneous processor architectures. Conversely, the computing characteristics of the processors also affect the storage policy. For example, under a coupled CPU/GPU architecture, data distribution and exchange should be designed around the low-latency data access of the CPU and the large-granularity data processing of the GPU. In addition, if NVM is used to add a large-capacity, low-cost, high-performance storage tier beneath traditional DRAM, hardware accelerators such as GPUs can access NVM storage directly through a dedicated data channel, bypassing main memory and reducing the data transfer cost of traditional storage hierarchies [59]. We need to recognize that hybrid storage and heterogeneous computing architectures will exist for a long time. The ideal technical route is to divide data processing into stages according to the type of workload and to concentrate computation on a smaller data set, so that critical workloads can be accelerated by hardware accelerators. The ideal state is to concentrate 80% of the computational load on 20% of the data [60], which simplifies data distribution and computation distribution strategies across hybrid storage and heterogeneous computing platforms.

In addition to improving data management and analytics support at the storage level, indexes are also a key technology for organizing data efficiently to accelerate upper-level data analytics. Traditional local indexing and optimization techniques based on the B+ tree, R-tree, or KD-tree [61,62,63] are designed for block-level storage. Because NVM differs significantly from disk, the effectiveness of existing indexes can be severely affected in NVM storage environments. For NVM, the large number of writes caused by index updates not only shortens the device lifespan but also degrades performance. To reduce frequent updates and small writes, merge updates [64] or late updates [65], which are frequently used in flash indexes, are typical approaches. Future indexing technologies for NVM should control more effectively the read and write paths and the regions affected by index updates, and should enable layered indexing techniques for NVM storage hierarchies. At the same time, concurrency control on indexes such as the B+ tree also shows obvious scalability bottlenecks in highly concurrent heterogeneous computing architectures. The coarse-grained range locks of traditional index structures and the critical sections protected by latches are the main factors limiting the degree of concurrency. Optimization techniques such as multi-granularity locking [66] and latch avoidance [67] increase the concurrency of index accesses and updates, but they also unavoidably introduce consistency verification, increased transaction failure rates, and higher overhead. In the future, indexing technology for highly concurrent heterogeneous computing infrastructure needs a more lightweight and flexible lock protocol that balances consistency, maximum concurrency, and protocol simplicity.
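The idea behind merge or late updates can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the method of any cited work: the class `BufferedIndex`, its `buffer_capacity` parameter, and the `nvm_writes` counter are hypothetical, and a plain dictionary stands in for the NVM-resident index.

```python
from typing import Dict, Optional


class BufferedIndex:
    """Key-value index that batches updates in a (hypothetical) DRAM buffer."""

    def __init__(self, buffer_capacity: int = 4) -> None:
        self.buffer_capacity = buffer_capacity
        self.dram_buffer: Dict[int, str] = {}  # pending updates (volatile)
        self.nvm_store: Dict[int, str] = {}    # stands in for the NVM index
        self.nvm_writes = 0                    # entries written to "NVM"

    def put(self, key: int, value: str) -> None:
        # Repeated updates to the same key are absorbed in the buffer.
        self.dram_buffer[key] = value
        if len(self.dram_buffer) >= self.buffer_capacity:
            self.merge()

    def merge(self) -> None:
        # One batched merge replaces many small in-place NVM writes.
        for key, value in self.dram_buffer.items():
            self.nvm_store[key] = value
            self.nvm_writes += 1
        self.dram_buffer.clear()

    def get(self, key: int) -> Optional[str]:
        # Reads check the buffer first so they always see the latest value.
        if key in self.dram_buffer:
            return self.dram_buffer[key]
        return self.nvm_store.get(key)


index = BufferedIndex(buffer_capacity=4)
for i in range(10):
    index.put(1, f"v{i}")  # ten updates to the same key
index.merge()
```

In this run, ten logical updates result in a single write to the simulated NVM store. The price is that buffered updates are volatile until merged, so a real design would need logging or another mechanism to protect them against a crash.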

3.3 Query Processing and Optimization in Platforms with New Hardware

The basic assumptions that traditional query algorithms and data structures make about the underlying storage environment do not hold in an NVM storage environment. Therefore, traditional algorithms and data structures can hardly achieve ideal results there.

Reducing NVM-oriented writes is a major strategy in previous studies. A number of techniques, including unnecessary write avoidance [39], write cancelation and write pausing strategies [68], dead write prediction [69], cache coherence enabled refresh policy [70], and PCM-aware swap algorithms [71], are used to optimize NVM writes. Algorithms benefit directly from these underlying optimizations in drivers, the FTL, and the memory controller, but they can also be optimized at a higher level. At this level, there are two ways to control or reduce NVM writes: one is to use extra caches [72, 73] to absorb NVM write requests with the help of DRAM; the other is to exploit low-overhead NVM reads and on-the-fly computation to avoid costly NVM writes [58, 74]. To reduce NVM writes further, some constraints on data structures or algorithms [74,75,76] can even be relaxed appropriately. How to design, organize, and operate write-limited algorithms and data structures is an urgent question for future work. It is important to note, however, that because of the asymmetric read/write costs of NVM, the major performance concern has shifted from the ratio of sequential to random disk I/O to the ratio of NVM reads to writes. As a result, previous cost models [77] inevitably fail to characterize NVM access patterns accurately. Furthermore, heterogeneous computing architectures and hybrid storage hierarchies will further complicate cost estimation in new hardware environments [30, 73]. Therefore, how to ensure the validity and correctness of the cost model under the new hardware environment is also a challenging issue. In short, the basic design principle for NVM-oriented algorithms and data structures is to reduce NVM write operations as much as possible; in other words, write-limited (NVM-aware, NVM-friendly) algorithms and data structures are a promising strategy.
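Why a cost model that ignores read/write asymmetry can pick the wrong plan is easy to show with a small sketch. The function names, the two plans, and the write-to-read cost ratio of 10 are illustrative assumptions only; they are not measurements and do not come from the cited cost models.

```python
from typing import Dict, Tuple


def access_cost(reads: int, writes: int, write_read_ratio: float) -> float:
    """Cost in units of one read; each write costs `write_read_ratio` reads."""
    return reads + writes * write_read_ratio


def cheaper_plan(plans: Dict[str, Tuple[int, int]],
                 write_read_ratio: float) -> str:
    """Return the name of the plan with the lowest modeled cost."""
    return min(plans,
               key=lambda name: access_cost(*plans[name], write_read_ratio))


# Hypothetical plans as (reads, writes):
# "materialize" writes intermediate results, "recompute" rereads instead.
plans = {
    "materialize": (1_000, 1_000),
    "recompute": (3_000, 100),
}

symmetric_choice = cheaper_plan(plans, write_read_ratio=1.0)
asymmetric_choice = cheaper_plan(plans, write_read_ratio=10.0)
```

With symmetric costs the model prefers the plan that materializes intermediate results (2,000 versus 3,100 cost units); once writes are assumed to be ten times as expensive as reads, the read-heavy plan wins (4,000 versus 11,000). This is the sense in which write-limited algorithms trade extra reads and computation for fewer writes.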

From the point of view of processor development, query optimization technologies have gone through several stages along with the evolution of hardware. Query optimization goals differ significantly across these stages, from mitigating disk-oriented I/O to designing cache-conscious data structures and access methods [78,79,80,81] and developing efficient parallel algorithms [40, 82, 83]. Today, processor technology is moving from multi-core to many-core, which differs greatly from multi-core in terms of core integration, number of threads, cache structure, and memory access. The optimization targets have become SIMD [84, 85], GPUs, APUs, Xeon Phi coprocessors, and FPGAs [18, 86,87,88]. Query optimization is becoming more and more dependent on the underlying hardware. But current query optimization techniques for new hardware are in an awkward position: lacking holistic consideration of evolving hardware, algorithms require constant changes to accommodate different hardware features, from predicate processing [89] to join [90] to index [91]. From the perspective of the overall architecture, query optimization becomes even more difficult under new hardware architectures.

Among optimization techniques for join algorithms, a hot research topic in recent years is whether hardware-conscious or hardware-oblivious algorithm designs are the better choice for new hardware environments. Hardware-conscious algorithms pursue the highest performance, and their guiding principle is to optimize the join algorithm by fully considering hardware-specific characteristics; hardware-oblivious algorithms, in contrast, pursue generality and simplicity, and their guiding principle is to design the join algorithm around the common characteristics of the hardware. The competition between the two technical routes has intensified in recent years, from the basic CPU platform [92] to the NUMA platform [93], and it will certainly extend to new processor platforms in the future. The underlying reason is that the effect of optimization techniques is difficult to quantify. In the field of in-memory databases, for example, although numerous hash structures and join algorithms exist [94,95,96], the simple question of which in-memory hash join algorithm is best remains unanswered [97]. As the new hardware environment grows more complicated, performance should not be the only indicator for evaluating the advantages and disadvantages of an algorithm. More attention should be paid to improving the adaptability and scalability of algorithms on heterogeneous platforms.
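The two design philosophies can be contrasted with a simplified Python sketch of an equi-join. A no-partitioning hash join stands in for the hardware-oblivious style, and a radix-partitioned hash join stands in for the hardware-conscious style. The function names and the `bits` parameter are assumptions made for illustration; in a real hardware-conscious implementation, the number of partitions would be tuned to cache and TLB sizes, which this sketch cannot capture.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]
Joined = Tuple[int, str, str]


def hash_join(build: List[Row], probe: List[Row]) -> List[Joined]:
    """Hardware-oblivious style: one hash table, no tuning parameters."""
    table: Dict[int, List[str]] = defaultdict(list)
    for key, payload in build:
        table[key].append(payload)
    return [(key, b, p) for key, p in probe for b in table.get(key, [])]


def radix_partition(rows: List[Row], bits: int) -> List[List[Row]]:
    """Split rows into 2**bits partitions by the low bits of the join key."""
    mask = (1 << bits) - 1
    parts: List[List[Row]] = [[] for _ in range(1 << bits)]
    for row in rows:
        parts[row[0] & mask].append(row)
    return parts


def partitioned_hash_join(build: List[Row], probe: List[Row],
                          bits: int = 2) -> List[Joined]:
    """Hardware-conscious style: partition first so that each per-partition
    hash table is small; `bits` is the hardware-dependent tuning knob."""
    result: List[Joined] = []
    for b_part, p_part in zip(radix_partition(build, bits),
                              radix_partition(probe, bits)):
        result.extend(hash_join(b_part, p_part))
    return result


r = [(1, "a"), (2, "b"), (5, "c")]
s = [(1, "x"), (5, "y"), (5, "z"), (7, "w")]
plain = hash_join(r, s)
partitioned = partitioned_hash_join(r, s, bits=2)
```

Both variants return the same set of joined rows, possibly in a different order. The partitioned variant does extra work up front and carries a parameter that must be re-tuned per platform, which is exactly the adaptability cost discussed above.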

3.4 Transaction Processing in Platforms with New Hardware

Recovery and concurrency control are the core functions of transaction processing in DBMS. They are closely related to the underlying storage and computing environment. Distributed transaction processing is also closely related to the network environment.

WAL-based recovery methods [98] can be significantly affected in NVM environments. First, because data written to NVM are persistent, transactions need not be forced to disk when they commit, and the flush-before-commit design rule of WAL no longer applies. Moreover, because of the high-speed random read/write capability of NVM, the advantages of caching turn into disadvantages. In extreme cases, the data updated by a transaction can also be stored redundantly in several locations (log buffers, swap areas, disks) [25]. The NVM environment not only affects the assumptions and strategies of WAL but also brings new issues. The most fundamental problem is how to ensure that NVM write operations are atomic. Through hardware-level primitives [25] and optimizations of the processor cache [50], this problem can be partially solved. In addition, because modern processors reorder operations through out-of-order execution, writes to NVM must be serialized to ensure the order of log records. Memory barriers [47] have become the main solution, but the read/write latency they cause in turn degrades the throughput of WAL-based transactions. Thus, the introduction of NVM changes the assumptions in log design, which will inevitably introduce new technical issues.
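The ordering requirement described above can be illustrated with a toy simulation. This is only a sketch under strong assumptions: the class `SimulatedNVM` and the functions `commit` and `recover` are hypothetical, stores are modeled as sitting in a volatile cache until a barrier persists them, and actual instruction reordering is not modeled. On real hardware the barrier would correspond to cache-line flush and fence instructions.

```python
from typing import List, Tuple


class SimulatedNVM:
    """Toy model: stores stay in a volatile cache until a barrier persists them."""

    def __init__(self) -> None:
        self.cache: List[Tuple[str, str]] = []       # not yet persistent
        self.persistent: List[Tuple[str, str]] = []  # survives a crash
        self.barriers = 0

    def store(self, kind: str, payload: str) -> None:
        self.cache.append((kind, payload))

    def persist_barrier(self) -> None:
        # Stands in for cache-line flush plus memory fence.
        self.persistent.extend(self.cache)
        self.cache.clear()
        self.barriers += 1

    def crash(self) -> None:
        # Everything still in the cache is lost.
        self.cache.clear()


def commit(nvm: SimulatedNVM, txn_id: str, updates: List[str]) -> None:
    for update in updates:
        nvm.store("log", f"{txn_id}:{update}")
    nvm.persist_barrier()        # log records are durable before the commit mark
    nvm.store("commit", txn_id)
    nvm.persist_barrier()        # commit mark is durable before acknowledging


def recover(nvm: SimulatedNVM) -> List[str]:
    """Return the log records of transactions whose commit mark persisted."""
    committed = {p for kind, p in nvm.persistent if kind == "commit"}
    return [p for kind, p in nvm.persistent
            if kind == "log" and p.split(":")[0] in committed]


nvm = SimulatedNVM()
commit(nvm, "T1", ["x=1", "y=2"])
nvm.store("log", "T2:z=3")
nvm.persist_barrier()
nvm.store("commit", "T2")
nvm.crash()                      # T2's commit mark never became persistent
recovered = recover(nvm)
```

After the simulated crash, recovery sees only the records of T1. Each commit costs two barriers in this sketch, which hints at why barrier latency can reduce the throughput of WAL-based transaction processing.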

There is a tight coupling between logging technology and the NVM environment. When NVM directly replaces external storage, the goal of log optimization is to reduce the software side effects incurred by ARIES-style logs [99]. When NVM is used as memory, the environment can be further subdivided into different levels, including hybrid DRAM/NVM and NVM only. Currently there are many optimization techniques for NVM-oriented logging, including two-layer logging architecture [56], log redundancy elimination [100], cost-effective approaches [101], decentralized logging [102], and others. In some sense, the existing logging technologies for NVM are stop-gap solutions [102]. For future NVM-based systems, the ultimate solution is to develop logging technology for a pure-NVM environment, where the entire storage system, including the multi-level processor caches, consists of NVM. Current research generally pays little attention to the write endurance of NVM; thus, finding a more appropriate trade-off among high-speed I/O, nonvolatility, and poor write tolerance is the focus of future NVM logging research.

Concurrency control protects the isolation property of transactions. Judging from the upper-layer execution logic of different concurrency control protocols, the underlying storage appears isolated and transparent. In essence, however, the specific implementation of concurrency control and its share of system overhead are closely related to the underlying storage environment. Under the traditional two-tier storage hierarchy, the overhead of the in-memory lock manager for concurrency control is almost negligible, because the disk I/O bottleneck is predominant. But in an NVM storage environment, as the cost of I/O decreases, the memory overhead incurred by the lock manager becomes a new bottleneck [103]. In addition, with multi-core processors, the tension between the high parallelism offered by rich hardware contexts and the need to maintain data consistency further aggravates the complexity of lock-based concurrency control [104,105,106]. Traditional waiting strategies, such as blocking synchronization and busy-waiting, are difficult to apply [107]. To reduce the overhead of concurrency control, it is necessary to control resource contention in the lock manager. Research mainly focuses on reducing the number of locks through three main approaches: latch-free data structures [108], lightweight concurrency primitives [109], and distributed lock managers [110]. In addition, for MVCC, there is a tight location coupling between the index and the multi-version records in physical storage, which results in serious performance degradation when the index is updated. In a hybrid NVM environment, an intermediate layer built from low-latency NVM can be used to decouple the physical representation from the logical representation of the multi-version log [111]. This approach of building a new storage layer to ease the read/write bottleneck is worth learning from.
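One way such an intermediate layer could decouple the index from the physical version chain is an indirection table, sketched below. This is an assumption-laden illustration rather than the design of [111]: the class `VersionStore`, its fields, and the `index_updates` counter are hypothetical, and ordinary Python containers stand in for the NVM-resident mapping and the version storage.

```python
from typing import Dict, List, Optional, Tuple


class VersionStore:
    """Toy MVCC store with an indirection layer between index and versions."""

    def __init__(self) -> None:
        # Append-only versions: (value, position of previous version or None).
        self.versions: List[Tuple[str, Optional[int]]] = []
        # Indirection layer: logical record id -> position of newest version.
        self.indirection: Dict[int, int] = {}
        # Index: key -> logical record id (never a physical position).
        self.index: Dict[str, int] = {}
        self.index_updates = 0

    def insert(self, key: str, value: str) -> int:
        rid = len(self.indirection)
        self.versions.append((value, None))
        self.indirection[rid] = len(self.versions) - 1
        self.index[key] = rid
        self.index_updates += 1
        return rid

    def update(self, rid: int, value: str) -> None:
        previous = self.indirection[rid]
        self.versions.append((value, previous))          # link to older version
        self.indirection[rid] = len(self.versions) - 1   # only the mapping changes

    def read_latest(self, key: str) -> str:
        return self.versions[self.indirection[self.index[key]]][0]

    def history(self, key: str) -> List[str]:
        """Walk the version chain from newest to oldest."""
        position: Optional[int] = self.indirection[self.index[key]]
        values: List[str] = []
        while position is not None:
            value, position = self.versions[position]
            values.append(value)
        return values


store = VersionStore()
rid = store.insert("k", "v0")
store.update(rid, "v1")
store.update(rid, "v2")
```

Because the index stores logical record identifiers, creating new versions touches only the indirection table; the index is written once, at insert time, however many versions follow.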

Improving the scalability of distributed transactions has always been the central question in building distributed transactional systems. On the basis of a large number of previous studies [112,113,114], researchers have reached a basic consensus that it is difficult to guarantee the scalability of systems with a large number of distributed transactions. Therefore, research focuses on how to avoid distributed transactions [115, 116] and how to control and optimize the proportion of distributed transactions [117,118,119]. Most of these technologies are not transparent to application developers and need to be carefully controlled or preprocessed at the application layer [120]. On the other hand, a large number of studies are also exploring how to relax strict transaction semantics. Against this background, the paradigm of data management has been shifting from SQL to NoSQL [121] and then to NewSQL [122, 123]. This development once again shows that, for a large number of critical applications, it is impossible to forgo the transaction mechanism even under scalability requirements [124]. However, these requirements are hard to meet in traditional network environments, where limited bandwidth, high latency, and high overhead prevent distributed transactions from scaling. With RDMA-enabled high-performance networks, the previously unmanageable hardware limitations and software costs are expected to be fully mitigated [38]. The RDMA-enabled network addresses the two most important difficulties encountered in scaling traditional distributed transactions: limited bandwidth and high CPU overhead in data transfers. Some RDMA-aware data management systems [32] have also emerged, but such systems are primarily concerned with direct RDMA support at the storage tier, and transaction processing is not their focus. Other RDMA-aware systems [125] focus on transaction processing, but they still adopt centralized managers that limit the scalability of distributed transactions. In addition, although some data management systems that fully adopt RDMA support distributed transactions, they offer only a limited set of isolation levels, such as serializability [38] and snapshot isolation [9, 34]. Relevant research [9] shows that the underlying architecture of data management should be redesigned, for example by separating storage and computing. Only in this way is it possible to fully exploit the advantages of RDMA-enabled networks to achieve fully scalable distributed transactions.

The new hardware and environment have many encouraging and superior features, but it is impossible to automatically enjoy a “free lunch” by simply migrating existing technologies onto the new platform. Traditional data management and analysis techniques are based on the x86 architecture, the two-tier storage hierarchy, and TCP/IP over Ethernet. The huge differences introduced by heterogeneous computing architectures, nonvolatile hybrid storage environments, and high-performance networking systems mean that the traditional design principles and rules of thumb are difficult to apply. In addition, in the new hardware environment, inefficient or useless components and technologies in traditional data management and analysis systems largely limit the efficiency of the hardware. Meanwhile, as the hardware environment becomes more diversified, the corresponding architectures and technologies are lacking. In the future, it is necessary to study both the overall system and its core functions. In view of the coexistence of traditional and new hardware, as well as the common problems of extracting and abstracting differentiated hardware, future research will rely on perception, customization, integration, adaptation, and even reconstruction to study the appropriate architectures, components, strategies, and technologies, in order to release the computing and storage capacity brought by the new hardware.

4 Research Challenges and Future Research Directions

4.1 Challenges

Whether software can become more successful depends on whether its designers can gain accurate insight into the holistic impacts on system design, define the performance bounds of the hardware, put forward new assumptions for the new environment, and find good trade-offs. These are all challenges that data management and analytics systems must cope with.

  1. Firstly, at the system level, new hardware and environments have a systemic impact on existing technologies. Eliminating existing performance bottlenecks may introduce new ones. Therefore, it is necessary to examine their impact in a higher-level context. In the heterogeneous computing environment composed of new processors and accelerators, the insufficiency of large-scale parallel capabilities can be offset, but the problems of the memory wall, the von Neumann bottleneck, and the energy wall may become even worse. The communication delay between heterogeneous processing units, limited cache capacity, and non-uniform storage access costs may become new performance problems. In the environment of new nonvolatile storage, the restrictions of the disk I/O stack can be eliminated, but the new NVM I/O stack will significantly magnify the software overhead that is typically ignored in the traditional I/O stack. Therefore, redesigning the software stack to reduce its share of the overhead has become a more important design principle. In high-performance network architectures, network I/O latency is no longer a major constraint in system design, so the efficiency of processor caches and local memory becomes more important.

  2. Secondly, the design philosophy of algorithms and data structures needs to change in the new hardware environment. Directly migrating or partially tuning algorithms and data structures cannot fully exploit the characteristics of the new hardware. At the processor level, data structures and cache-centric algorithms designed for x86-based processors do not match the hardware features of compute-centric many-core processors. Many mature query processing techniques may fail on platforms with many-core processors. Databases have long been designed around serial and small-scale parallel programming, which makes traditional query algorithms difficult to convert to a large-scale parallel processing mode. At the storage level, although NVM combines the advantages of internal and external memory, it also has negative features such as asymmetric read/write costs and low write endurance. These features differ significantly from those of traditional storage environments, so previous studies on memory, disk, and flash cannot achieve ideal performance within a new storage hierarchy containing NVM. At the network level, the RDMA-enabled cluster is a new hybrid architecture, distinct from both the message-passing architecture and the shared-memory architecture. Therefore, technologies developed for the non-uniform memory access architecture cannot be directly applied to the RDMA cluster environment.

  3. Thirdly, the impact of new hardware and environments on data management and analytics technologies is comprehensive, deep, and crosscutting. Because of the new features of the new hardware environment, the functionalities of data management systems cannot simply be tailored to fit it. In heterogeneous computing environments with new processors and accelerators, parallel processing capacity is greatly improved. However, the richer hardware contexts also pose stringent challenges for achieving high throughput while maintaining data consistency. The impact of NVM on logging will fundamentally change the length of the critical path of transactions. The reduction in transaction commit time will result in lock contention, affecting the overall concurrency capacity and throughput of the system. With low latency and high bandwidth, high-performance networks will change the system’s basic assumption that distributed transactions are difficult to scale. The same holds for the optimization objective of minimizing network latency when designing distributed algorithms. Cache utilization in multi-core architectures will become the new optimization direction. In addition, some existing data management components have composite functions. For example, the existing buffer not only relieves the I/O bottleneck of the entire system but also reduces the overhead of the recovery mechanism. A more complicated scenario is the crosscutting effect between the new hardware and the environment. For example, out-of-order instruction execution in processors can cause cached data not to be accessed and executed in the application’s logical order. If a single NVM tier is used to simplify the traditional storage tiers, the serialization of data on the NVM must be addressed.

  4. Finally, under the new hardware environment, hardware/software co-design is the inevitable path for data management and analysis systems. The new hardware technology has inherent advantages and disadvantages and cannot completely replace the original hardware. Traditional and new hardware will inevitably coexist for a long time. While this provides diversified hardware choices, it also leads to more complicated design strategies, more difficult optimization techniques, and a harder performance-tuning space. In a heterogeneous computing environment, using coprocessors or co-placement to achieve customized data processing acceleration has created significant differences in system architecture. Moreover, the threshold of parallel programming has become increasingly high, and the gap between software and hardware is larger than ever before. In some cases, the development of software lags behind hardware. In many applications, the actual utilization of the hardware is well below its performance upper limit [126]. Meanwhile, the new memory devices are highly differentiated and diversified. There is great flexibility and uncertainty in using the new nonvolatile memory to construct an NVM environment. Whether the components form a simple or a mixed configuration, and whether their status is equivalent or undetermined, poses great challenges for research on upper-layer data management and analysis technology. In high-performance network systems, InfiniBand has considered RDMA from the beginning, while traditional Ethernet has also proposed solutions to support RDMA. There is currently no definite answer as to which approach will eventually form a complete industry ecosystem. Therefore, it is even more necessary to conduct cutting-edge research as soon as possible to explore a new data management architecture suitable for high-performance network environments.
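The first challenge above argues that a fast NVM I/O stack magnifies software overhead that was negligible behind a slow disk. The arithmetic behind that claim can be sketched as follows. The function name `software_share` and all latency values are illustrative assumptions chosen only to show the effect; they are not measurements from this survey or from any device.

```python
def software_share(software_us: float, device_us: float) -> float:
    """Fraction of total I/O latency spent in the software stack."""
    return software_us / (software_us + device_us)


# Hypothetical, illustrative latencies in microseconds (not measurements).
SOFTWARE_US = 20.0        # assumed fixed software-stack cost per I/O
DISK_LIKE_US = 5000.0     # assumed slow, disk-like device latency
NVM_LIKE_US = 1.0         # assumed fast, NVM-like device latency

disk_share = software_share(SOFTWARE_US, DISK_LIKE_US)
nvm_share = software_share(SOFTWARE_US, NVM_LIKE_US)

print(f"software share with a disk-like device: {disk_share:.1%}")  # 0.4%
print(f"software share with an NVM-like device: {nvm_share:.1%}")   # 95.2%
```

With the same software cost, the software stack moves from a rounding error to the dominant term once the device latency shrinks, which is why reducing the software overhead ratio becomes a primary design principle.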

4.2 Future Research

New hardware environments such as heterogeneous computing architectures, hybrid storage environments, and high-performance networks will surely change the base support of traditional data management and analysis systems. They will bring significant opportunities for the development of key technologies. Future research can start from the following aspects.

  1. Lightly coupled system architecture and collaborative design scheme. The computing, storage, and networking environments are heterogeneous, diverse, and hybrid. The different environmental components have a significant impact on the design of the upper-layer data management system architecture. To effectively leverage the power of new hardware, the seamless integration of new hardware into the data management stack is an important fundamental research topic. To be compatible with diverse hardware environments and to reduce the risk of failure of optimization techniques that are highly coupled with specific hardware, the different heterogeneous and hybrid hardware environments must be effectively abstracted and virtualized. Abstraction technology can extract the common features of hardware, reduce low-level overcoupling while preserving hardware awareness, and provide flexible customization and service support for upper-layer technologies. In the meantime, the execution cost and the proportions of different operations will change under the new hardware environment, and the bottleneck of the system is also shifting. As a result, overhead that is negligible in traditional software stacks will be significantly magnified. Therefore, new performance bottlenecks need to be identified, a reasonable software stack needs to be redesigned, and the software overhead in the new hardware environment needs to be reduced. In addition, the new hardware environment offers advantages including low latency, high capacity, high bandwidth, and high-speed reads and writes. This brings new opportunities for integrating OLTP and OLAP systems into a convergent OLTAP system.

  2. Storage and index management with mixed heterogeneous hardware environments. Because it has both internal and external memory capabilities, the new nonvolatile memory blurs the clear boundaries between existing storage tiers. This also offers a considerable degree of freedom in constructing the new nonvolatile storage environment and its storage methods, which can provide a powerful guarantee for accelerating data processing in the upper layer. Although the high-speed I/O capabilities of NVM offer opportunities for enhancing data access performance, NVM ensures nonvolatility only at the device level. At the system level, however, the caching mechanisms may introduce inconsistency. Therefore, in the future, collaborative technology needs to be studied at different levels such as architecture, strategy, and implementation. In addition, as dedicated acceleration hardware, the FPGA has unique advantages in accelerating data processing. In particular, combining it with NVM features can further enhance data processing efficiency. Therefore, with data storage engine optimization and reconstruction techniques, together with data access acceleration and data filtering in the FPGA, preprocessing of the original data can be completed effectively. This will reduce the amount of data to be transferred, thereby alleviating data access bottlenecks in large-scale data processing. Moreover, the NVM environment has a richer storage hierarchy, while new processor technology also provides additional data processing capabilities for indexing. Therefore, multi-level and processor-conscious indexing technology is also a future research direction.

  3. Hardware-aware query processing and performance optimization. Query processing is the core operation in data analysis, involving a series of complex activities in the data extraction process. The high degree of parallelism and customizable capabilities provided by heterogeneous computing architectures, as well as the new I/O features of NVM environments, make previous query processing and optimization mechanisms difficult to apply. Future research may focus on two aspects. One is query optimization technology in the NVM environment: features such as high-speed reads and writes, byte addressability, and asymmetric read/write costs will exert a significant impact on traditional query operations such as join, sort, and aggregation. At the same time, NVM has changed the composition of the traditional storage hierarchy and has also affected the traditional assumption for estimating query cost, which is based on the cost of disk I/O. Therefore, it is necessary to study the cost model in the NVM environment and the design and optimization of write-limited algorithms and data structures, so that the negative impact of NVM write operations is alleviated. The other is query optimization technology on heterogeneous processor platforms: the introduction of a new processor increases the dimensionality of heterogeneous computing platforms, which increases the complexity of query optimization techniques. This poses a huge challenge to the design of query optimizers. Collaborative query processing technology, query optimization technology, and hybrid query execution plan generation technology are all ways to improve query efficiency on heterogeneous computing platforms.

  4. New hardware-enabled transaction processing technologies. Concurrency control and recovery are core functions in data management that ensure transaction isolation and durability. Their design and implementation are tightly related to the underlying computing and storage environment. At the same time, high-performance network environments also provide new opportunities for distributed transaction processing, which has been difficult to scale out. First of all, the storage hierarchy and the new access characteristics have the most significant impact on transaction recovery technology. Database recovery technology needs to be optimized according to the features of NVM. Recovery technology for NVM, partitioning technology, and concurrency control protocols based on NVM are all urgent research topics. Second, transactions typically involve multiple types of operations as well as synchronization among them. However, general-purpose processors and specialized accelerators have different data processing modes. Separating out the transactional processing performed by the general-purpose CPU and offloading part of the workload to specialized processors can enhance the performance of transaction processing. Therefore, studying load balancing and I/O optimization technologies that accelerate transaction processing is an effective way to address its performance bottlenecks. Furthermore, in RDMA-enabled networks, the assumption that distributed transactions cannot scale will no longer hold. RDMA-enabled distributed commit protocols and pessimistic and optimistic concurrency control approaches for RDMA may all be potential research directions.
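The third direction above calls for a cost model that reflects asymmetric NVM read and write costs instead of symmetric disk I/O. A minimal sketch of why this matters is given below. The linear model, the names `plan_cost` and `cheapest`, the two plans, and the write-to-read cost ratios are all hypothetical assumptions used for illustration; they do not come from any cited system.

```python
from typing import Dict, Tuple


def plan_cost(reads: int, writes: int,
              read_cost: float = 1.0, write_ratio: float = 1.0) -> float:
    """Assumed linear cost model: each write costs write_ratio times a read."""
    return reads * read_cost + writes * read_cost * write_ratio


def cheapest(plans: Dict[str, Tuple[int, int]], write_ratio: float) -> str:
    """Return the name of the plan with the lowest modeled cost."""
    return min(plans, key=lambda name: plan_cost(*plans[name],
                                                 write_ratio=write_ratio))


# Hypothetical plans as (reads, writes) counts.
plans = {
    "write_heavy": (1000, 1000),   # fewer reads, many intermediate writes
    "write_limited": (3000, 100),  # extra reads traded for far fewer writes
}

print(cheapest(plans, write_ratio=1.0))  # symmetric costs: write_heavy
print(cheapest(plans, write_ratio=5.0))  # asymmetric costs: write_limited
```

Under symmetric costs the write-heavy plan costs 2000 units against 3100, but with writes assumed to be five times as expensive as reads it costs 6000 against 3500, so the optimizer's choice flips. A real cost model would also need to account for byte addressability and the deeper storage hierarchy discussed above.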

5 Conclusion

The new hardware and the environment built on it will deeply influence the architecture of the entire computing system and change the previous assumptions of the upper-layer software. While the new hardware provides higher physical performance, it also requires software architectures and related technologies for data management and analysis to sense and adapt to its features. The new hardware environment has made the trade-offs in the design space of data management and analytics systems more complex, posing multidimensional research challenges. In the future, there is an urgent need to break away from the outdated designs of traditional data management and analysis software. It is also necessary to reconsider the core system functions in light of the characteristics of the hardware environment and to explore new data processing modes, architectures, and technologies from the bottom up.