Keywords: Active Storage · Pushdown · Hybrid Memory Cube (HMC) · Modern Workloads · Powerful Processing Elements
In brief, Active Storage refers to an architectural hardware and software paradigm based on the co-location of storage and compute units. Ideally, it allows executing application-defined data- or compute-intensive operations in situ, i.e., within (or close to) the physical data storage. Active Storage thus seeks to minimize expensive data movement, improving performance, scalability, and resource efficiency. Its effective use mandates new architectures, algorithms, interfaces, and development toolchains.
Over the last decade, we have been witnessing a clear trend toward the fusion of the compute-intensive and data-intensive paradigms at the architectural, system, and application levels. On the one hand, large computational tasks (e.g., simulations) tend to feed growing amounts of data into their complex computational models; on the other hand, database applications execute computationally intensive ML and analytics-style workloads on increasingly large data sets. Both result in massive data transfers across the memory hierarchy, which cause unnecessary CPU stalls and thus impair performance, scalability, and resource efficiency. The root cause of this phenomenon lies in the generally low data locality as well as in traditional architectures and algorithms, which operate on the data-to-code principle: data and program code must be transferred to the processing elements for execution. Although data-to-code simplifies development and system architectures, it is inherently bounded by the von Neumann bottleneck.
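The data-movement cost of the data-to-code principle can be made concrete with a toy model. The following sketch (record size, record count, and selectivity are invented for illustration) compares the bytes that must cross the storage boundary when all records travel to the CPU versus when a selective filter runs in situ and only matches travel:

```python
# Toy model of data-to-code vs. code-to-data data movement.
# All sizes and counts are illustrative assumptions, not measurements.

RECORD_SIZE = 128  # bytes per record (assumed)

def bytes_moved_data_to_code(num_records: int) -> int:
    """Data-to-code: every record travels to the CPU before filtering."""
    return num_records * RECORD_SIZE

def bytes_moved_code_to_data(num_records: int, selectivity: float) -> int:
    """Code-to-data: filtering happens in situ; only matches travel."""
    return int(num_records * selectivity) * RECORD_SIZE

moved_host = bytes_moved_data_to_code(1_000_000)
moved_insitu = bytes_moved_code_to_data(1_000_000, 0.01)
print(moved_host // moved_insitu)  # -> 100: transfers shrink 100x at 1% selectivity
```

At low selectivity the transfer volume shrinks proportionally, which is exactly the effect Active Storage exploits.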
These trends are impacted by the following recent developments: (a) Moore’s law is said to be cooling down for different types of semiconductor elements, and Dennard scaling is coming to an end. The latter postulates that performance per watt grows at approximately the rate mandated by Moore’s law. (Besides the scalability of cache coherence protocols, the end of Dennard scaling is among the frequently quoted reasons why modern many-core CPUs do not have the 128 cores that would otherwise be technically possible by now; see also Muramatsu et al. (2004) and Hardavellas et al. (2011).) As a result, compute performance improvements cannot be based on the expectation of increasing clock frequencies and therefore mandate changes in hardware and software architectures. (b) Modern systems can offer much higher levels of parallelism, yet scalability and the effective use of parallelism are limited by the programming models as well as by the amount and type of data transfers. (c) Access gap and Memory Wall: storage (DRAM, Flash, HDD) is getting larger and cheaper; however, access latencies decrease at much lower rates. This trend also contributes to slow data transfers and to blocked processing at the CPU. (d) Modern data sets are large in volume (machine data, scientific data, text) and are growing fast (Szalay and Gray 2006). (e) Modern workloads (hybrid/HTAP or analytics-based such as OLAP or ML) tend to have low data locality and incur large (sometimes iterative) scans that result in massive data transfers.
In essence, due to current system architectures and processing principles, modern workloads require transferring growing volumes of data through the virtual memory hierarchy, from the physical storage location to the processing elements, which limits performance and scalability and worsens resource and energy efficiency.
Nowadays, three important technological developments open an opportunity to counter these drawbacks. Firstly, hardware manufacturers are able to fabricate combinations of storage and compute elements at reasonable costs and package them within the same device. Secondly, this trend covers virtually all levels of the memory hierarchy: (a) CPU and caches, (b) memory and compute, (c) storage and compute, (d) accelerators – specialized CPUs and storage, and eventually (e) network and compute. Thirdly, as magnetic/mechanical storage is being replaced with semiconductor nonvolatile technologies (Flash, Non-Volatile Memories – NVM), another key trend emerges: the device-internal bandwidth, parallelism, and access latencies are significantly better than the external (device-to-host) ones. This is due to various reasons: interfaces, interconnects, physical design, and architectures.
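The internal/external bandwidth asymmetry translates directly into scan time. A back-of-the-envelope sketch (the bandwidth figures below are made up purely for illustration) shows why scanning inside the device can beat shipping the data to the host:

```python
# Back-of-the-envelope model of the internal vs. external bandwidth gap.
# Bandwidth numbers are invented for illustration, not device specs.

def scan_time_s(data_gb: float, bandwidth_gbs: float) -> float:
    """Time to stream data_gb gigabytes at bandwidth_gbs GB/s."""
    return data_gb / bandwidth_gbs

DATA_GB = 100.0
EXTERNAL_GBS = 4.0   # device-to-host link (assumed)
INTERNAL_GBS = 16.0  # aggregate internal channel bandwidth (assumed)

host_side = scan_time_s(DATA_GB, EXTERNAL_GBS)  # ship everything to the host
in_situ = scan_time_s(DATA_GB, INTERNAL_GBS)    # scan inside the device
print(host_side / in_situ)  # -> 4.0: the in-situ scan is 4x faster here
```

The ratio equals the internal-to-external bandwidth ratio, which is why devices whose internal parallelism outpaces their host link are natural Active Storage candidates.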
Interfaces: hardware and software interfaces need to be extended, and new abstractions need to be introduced. This includes device and storage interfaces as well as operating system and I/O abstractions: operations and conditions, records/objects vs. blocks, atomic primitives, and transactional support.
Heterogeneity in terms of storage and compute hardware interfaces needs to be addressed.
Toolchain: extensive tool support is necessary to utilize Active Storage: compilers, hardware generators, debugging and monitoring tools, and advisors.
Placement of data and computation across the hierarchy is crucial to efficiency.
Workload adaptivity is a major goal, as static assignments diminish the benefits of placement and co-location.
These challenges already attract research focus, as today’s accelerators exhibit Active Storage-like characteristics in a simplistic manner; i.e., GPUs and FPGAs are defined by a considerably higher level of internal parallelism and bandwidth compared to their connection to the host system. Specialized computation already uses hardware programmable with high-level synthesis toolchains like TaPaSCo (Korinth et al. 2015), and co-location with storage elements often necessitates shared virtual memory. Classical research questions, e.g., about dynamic workload distribution, data dependence, and flexible data placement, are approached by Fan et al. (2016), Hsieh et al. (2016), and Chen and Chen (2012), respectively. The significant potential arising with Near-Data Processing is investigated under perfect conditions by Kotra et al. (2017), who report performance boosts of about 75%.
Key Research Findings
The concept of Active Storage is not new. Historically it is deeply rooted in the concept of database machines (DeWitt and Gray 1992; Boral and DeWitt 1983) developed in the 1970s and 1980s. Boral and DeWitt (1983) discuss approaches such as processor-per-track or processor-per-head as early attempts to combine storage and simple computing elements to accelerate data processing. Existing I/O bandwidth and parallelism were claimed to be the limiting factor, justifying parallel DBMS. While this conclusion is not surprising given the characteristics of magnetic/mechanical storage combined with Amdahl’s balanced systems law, it is being revised with modern technologies. Modern semiconductor storage technologies (NVM, Flash) offer high raw bandwidth and high levels of parallelism. Boral and DeWitt (1983) also raise the issue of temporal locality in database applications, which was questioned back then and is considered to be low in modern workloads, causing unnecessary data transfers. Near-Data Processing and Active Storage present an opportunity to address it.
The concept of Active Disk emerged toward the end of the 1990s and early 2000s. It is most prominently represented by systems such as Active Disk (Acharya et al. 1998), IDISK (Keeton et al. 1998), and Active Storage/Disk (Riedel et al. 1998). While database machines attempted to execute fixed primitive access operations, Active Disk targets executing application-specific code on the drive. Active Storage/Disk (Riedel et al. 1998) relies on a processor-per-disk architecture. It yields significant performance benefits for I/O-bound scans in terms of bandwidth, parallelism, and reduction of data transfers. IDISK (Keeton et al. 1998) assumed a higher complexity of data processing operations compared to Riedel et al. (1998) and targeted mainly analytical workloads, business intelligence, and DSS systems. Active Disk (Acharya et al. 1998) targets an architecture based on on-device processors and pushdown of custom data processing operations. Acharya et al. (1998) focuses on programming models and explores a streaming programming model, expressing data-intensive operations as so-called disklets, which are pushed down and executed on the disk processor.
An extension of the above ideas (Sivathanu et al. 2005) investigates executing operations on the RAID controller. Yet, classical RAID technologies rely on general-purpose CPUs that operate well with slow mechanical HDDs but are easily overloaded and turn into a bottleneck with modern storage technologies (Petrov et al. 2010).
Although the Active Disk concept increases the scope and applicability, it is equally impacted by bandwidth limitations and high manufacturing costs. Nowadays, two trends have an important impact. On the one hand, semiconductor storage technologies (NVM, Flash) offer significantly higher bandwidths, lower latencies, and higher levels of parallelism. On the other hand, hardware vendors are able to economically fabricate combinations of storage and compute units and package them on storage devices. Combined, both result in a new generation of Active Storage devices.
Smart SSDs (Do et al. 2013) and multi-stream SSDs aim to achieve better data processing performance by utilizing on-device resources and pushing data processing operations down, close to the data. Programming models such as SSDlets are being proposed. One trend is In-Storage Processing (Jo et al. 2016; Kim et al. 2016), which demonstrates significant performance increases on embedded CPUs for standard DBMS operators. Combinations of storage and GPGPUs demonstrate increases of up to 40x (Cho et al. 2013a). IBEX (Woods et al. 2013, 2014) is a system demonstrating operator pushdown on FPGA-based storage.
Do et al. (2013) is one of the first works to explore offloading parts of data processing to Smart SSDs, indicating the potential for significant performance improvements (up to 2.7x) and energy savings (up to 3x). Do et al. (2013) define a new session-based communication protocol (DBMS–SmartSSD) comprising three operations: OPEN, CLOSE, and GET. In addition, they define a set of APIs for on-device functionality: Command API, Thread API, Data API, and Memory API. This enables not only pushdown but also workload-dependent, cooperative processing. In addition, Do et al. (2013) identify two research questions: (i) How can Active Storage handle on-device processing in the presence of a more recent version of the data in the buffer? (ii) How efficient is operation pushdown in the presence of large main memories? The latter becomes obvious in the context of large data sets (Big Data) and computationally intensive operations.
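The shape of such a session-based protocol can be sketched as follows. This is a hypothetical stand-in, not the actual DBMS–SmartSSD API: the class, the predicate-as-SSDlet representation, and all method bodies are invented; only the OPEN/CLOSE/GET structure follows the description above.

```python
# Hypothetical sketch of a session-based pushdown protocol in the spirit
# of the OPEN/CLOSE/GET operations described above. Names are invented.

class SmartSSDSession:
    def __init__(self, data):
        self.data = data     # rows stored "on device"
        self.program = None  # pushed-down predicate (an SSDlet stand-in)
        self.is_open = False

    def open(self, predicate):
        """OPEN: start a session and push the filter program down."""
        self.program = predicate
        self.is_open = True

    def get(self):
        """GET: run the program on-device; only results cross the boundary."""
        assert self.is_open, "session not open"
        return [row for row in self.data if self.program(row)]

    def close(self):
        """CLOSE: end the session and release on-device state."""
        self.program, self.is_open = None, False

session = SmartSSDSession([("a", 3), ("b", 42), ("c", 7)])
session.open(lambda row: row[1] > 5)
print(session.get())  # -> [('b', 42), ('c', 7)]
session.close()
```

The session boundary is what makes cooperative processing possible: between OPEN and CLOSE the host can decide per GET whether the device or the host executes the operation.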
Similarly, Seshadri et al. (2014) propose and describe a user-programmable SSD called Willow, which allows users to augment the storage device with application-specific logic. In order to provide new SSD functionality for a certain application, three subsystems must be appropriately modified to support a new set of RPC commands: the application, the kernel driver, and the operating system running on each storage processing unit inside the Flash SSD.
The initial ideas of Do et al. (2013) have recently been extended in Kim et al. (2016), Jo et al. (2016), and Samsung (2015). Kim et al. (2016) demonstrate between 5x and 47x performance improvements for scans and joins. Jo et al. (2016) describe a similar approach for In-Storage Computing based on the Samsung PM1725 SSD with ISC option (Samsung 2015), integrated in MariaDB. Other approaches include Cho et al. (2013b) and Tiwari et al. (2013); the latter stress the importance of in situ processing.
Woods et al. (2013, 2014) demonstrate with Ibex an intelligent storage engine for commodity relational databases. By offloading complex query operators, they tackle the bandwidth bottlenecks arising when moving large amounts of data from storage to processing nodes. In addition, energy consumption is reduced due to the use of FPGAs rather than general-purpose processors. Ibex supports aggregation (GROUP BY), projection, and selection. Najafi et al. (2013) and Sadoghi et al. (2012) explore approaches for flexible query processing on FPGAs.
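Why aggregation pushdown saves bandwidth is easy to see in miniature: a GROUP BY returns one row per group instead of every input row. The sketch below is a pure-Python stand-in with invented names, not Ibex's FPGA implementation; it only illustrates the operator class Ibex offloads.

```python
# Minimal sketch of a storage-side GROUP BY/SUM, the kind of
# size-reducing operator Ibex offloads to FPGA. Names are invented.
from collections import defaultdict

def device_group_by_sum(rows, key_col, val_col):
    """Runs 'on device': emits one row per group instead of all rows."""
    acc = defaultdict(int)
    for row in rows:
        acc[row[key_col]] += row[val_col]
    return dict(acc)

rows = [
    {"store": "north", "sales": 10},
    {"store": "south", "sales": 5},
    {"store": "north", "sales": 7},
]
print(device_group_by_sum(rows, "store", "sales"))  # -> {'north': 17, 'south': 5}
```

With millions of rows and a handful of groups, the host receives a few aggregate rows rather than the whole table, which is the bandwidth win the paragraph describes.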
JAFAR is an Active Storage approach for column stores by Xi et al. (2015) and Babarinsa and Idreos (2015). JAFAR is based on MonetDB and aims at reducing data transfers by pushing down size-reducing DB operators such as selection; joins are not considered. Xi et al. (2015) stress the importance of on-chip accelerators but do not consider Active Storage and accelerators for complex computations in situ.
Memory: Processing-In-Memory (PIM)
Manufacturing costs of DRAM decrease, and memory volume increases steadily, while access latencies improve at a significantly lower rate, yielding the so-called Memory Wall (Wulf and McKee 1995). On the one hand, technologies such as the Hybrid Memory Cube (HMC) attempt to address this issue by locating processing units close to memory and by utilizing novel interfaces. On the other hand, new types of memory are introduced, characterized by even higher density and therefore larger volumes, shorter latencies, higher bandwidth and internal parallelism, as well as non-volatile persistence.
Balasubramonian (2016) discusses the features that can be meaningfully added to memory devices. Not only do these features execute parts of an application, but they may also take care of auxiliary operations that maintain high efficiency, reliability, and security. Research combining memory technologies with the Active Storage concept, often referred to as Processing-In-Memory (PIM), is very versatile. In the late 1990s, Patterson et al. (1997) proposed IRAM as a first attempt to address the Memory Wall by unifying processing logic and DRAM, starting with general computer science research questions such as communication, interfaces, cache coherence, and addressing schemes. Hall et al. (1999) propose combining their Data-IntensiVe Architecture (DIVA) PIM memories with external host processors and defining a PIM-to-PIM interconnect; Vermij et al. (2017) present an extension to the CPU architecture that enables NDP capabilities close to the main memory by introducing a new component, attached to the system bus, responsible for the communication; Boroumand et al. (2017) propose a new hardware cache coherence mechanism designed specifically for PIM; Picorel et al. (2017) show that the historically important flexibility to map any virtual page to any page frame is unnecessary for NDP and introduce the Distributed Inverted Page Table (DIPTA) as an alternative near-memory structure.
Studying the upcoming HMC interface, Azarkhish et al. (2017) analyze its support for NDP in a modular and flexible fashion. The authors propose a fully backward-compatible extension to the standard HMC, called the smart memory cube, and design a high-bandwidth, low-latency, AXI(4)-compatible logic base interconnect featuring a novel address scrambling mechanism for the reduction of vault/bank conflicts. A completely different approach to Active Storage in today’s memory is presented by Gao et al. (2016b). It introduces DRAF, an architecture for bit-level reconfigurable logic that uses DRAM subarrays to implement dense lookup tables, because FPGAs introduce significant area and power overheads, making them difficult to use in datacenter servers. Leaving the existing sequential programming models intact by extending the instruction set architecture, Ahn et al. (2015) propose new PIM-enabled instructions. Firstly, the proposed instruction set is interoperable with existing programming models, cache coherence protocols, and virtual memory mechanisms. Secondly, the instructions can be executed either in memory or on the processors, depending on data locality. A conceptual near-memory acceleration architecture is presented by Kim et al. (2017b), who claim the need for adopting a high-level synthesis approach. Lim and Park (2017) analyze kernel operations that can greatly improve with PIM, arriving at three necessary categories of processing engines for NDP logic: an in-order core, a coarse-grain reconfigurable array (CGRA), and dedicated hardware.
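The locality-aware dispatch behind PIM-enabled instructions can be sketched in a few lines. This toy model is an invention for illustration (dictionaries standing in for cache and memory, a write-through policy chosen for simplicity); it is not the mechanism of Ahn et al. (2015), only its decision rule: run on the host when the data is already cached, in memory otherwise.

```python
# Sketch of locality-aware dispatch for a PIM-enabled atomic add.
# Cache/memory are modeled as dicts; all details are invented.

def pim_add(address, value, cache, memory):
    """Atomic add: host-side on a cache hit, in-memory on a miss."""
    if address in cache:                  # locality: use the cached copy
        cache[address] += value
        memory[address] = cache[address]  # write through for simplicity
        return "host"
    memory[address] += value              # executed by the PIM logic
    return "memory"

cache = {0x10: 5}
memory = {0x10: 5, 0x20: 1}
print(pim_add(0x10, 3, cache, memory))  # -> host   (line was cached)
print(pim_add(0x20, 3, cache, memory))  # -> memory (miss: compute in situ)
print(memory)                           # -> {16: 8, 32: 4}
```

The point of the rule is that cached data is cheap to touch on the host, while uncached data would have to be fetched across the memory bus anyway, so computing it in memory avoids the transfer.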
Proposing Caribou, an intelligent distributed storage layer, István et al. (2017) target NDP on DRAM/NVRAM storage accessed over the network through a simple key-value store interface. Utilizing FPGAs, each storage node provides high-bandwidth NDP capabilities and fault tolerance through replication based on Zookeeper’s atomic broadcast.
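A key-value interface with node-side filtering and replicated writes can be modeled in miniature. The sketch below is a deliberately simplified invention, not Caribou's FPGA design: the class names are made up, and the `replicated_put` helper merely applies a write to every replica in order, standing in for a real atomic broadcast.

```python
# Toy model of a replicated key-value NDP node in the spirit of Caribou.
# Interface and replication scheme are simplified inventions.

class KVNode:
    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value

    def scan_filter(self, predicate):
        """Near-data filtering: only matching entries leave the node."""
        return {k: v for k, v in self.store.items() if predicate(v)}

def replicated_put(nodes, key, value):
    """Stand-in for an atomic broadcast: apply the write on every replica."""
    for node in nodes:
        node.put(key, value)

nodes = [KVNode(), KVNode(), KVNode()]
replicated_put(nodes, "k1", 41)
replicated_put(nodes, "k2", 7)
print(nodes[0].scan_filter(lambda v: v > 10))  # -> {'k1': 41}
```

The combination is what makes the design attractive: replication gives fault tolerance, while `scan_filter` keeps the filtering bandwidth inside each node.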
The application of the Active Storage concept to memories beyond data management is often based on analytical scenarios or neural networks but comprises a variety of different approaches. Gao et al. (2016a) develop hardware and software for an NDP architecture for in-memory analytics frameworks, including MapReduce, graph processing, and deep neural networks. One year later, Gao et al. (2017) present the hardware architecture and the software scheduling and partitioning techniques for TETRIS, a scalable neural network accelerator using 3D memory. For a similar use case, Chi et al. (2016) propose PRIME, providing microarchitecture and circuit designs for a ReRAM-based PIM architecture that enables morphable functions with insignificant area overhead. A compiler-based allocation strategy for PIM architectures is proposed by Wang et al. (2017). Focusing on convolutional neural networks, it offers thread-level parallelism that can fully exploit the computational power of embedded processors. Another hardware/software co-design for data analytics is presented with the Mondrian Data Engine (Drumond et al. 2017). It focuses on sequential access patterns to enable simple hardware that accesses memory in streams. A standardization of the NDP architecture, allowing PIM stacks to be used with different GPU architectures, is proposed by Kim et al. (2017a). Their approach intends to allow data to be spread across multiple memory stacks, as is the norm in high-performance systems.
With the slight variation that data is not persisted but rather streamed through, active networks are another widespread application of the Active Storage concept. Powerful processing elements near the network adapters, often integrated into the network controller itself as a System-on-Chip (SoC), are no longer solely responsible for conventional protocol interpretation but also take over further tasks like security verification, scheduling of in-transit services and data processing, or simply improving network performance.
Tennenhouse and Wetherall (1996) first introduced the term Active Network as an approach to performing sophisticated computation within the network. By injecting customized program features into the nodes of the network, it is possible to execute them at each traversed network router/switch. Continuing the research of Tennenhouse and Wetherall (1996), Sykora and Koutny (2010) present an Active Network node called the Smart Active Node (SAN). They focus on its ability to translate data flows transparently between the IP network and the active network to further improve the performance of IP applications.
Active Networks are often also referred to as software-defined networking (SDN) and already comprise an advanced state of research. Especially the area of security comprises progressive research on authentication and authorization, access control, threats, and DoS attacks, as summarized by Ahmad et al. (2015). The utilization of RDMA-capable networks has also become a trend since the demand for higher bandwidth arose with the introduction of dedicated GPUs into computation. Ren et al. (2017) propose iRDMA, an RDMA-based parameter server architecture optimized for high-performance network environments, supporting both GPU- and CPU-based training.
- Acharya A, Uysal M, Saltz J (1998) Active disks: programming model, algorithms and evaluation. In: Proceedings of the eighth international conference on architectural support for programming languages and operating systems (ASPLOS VIII), pp 81–91
- Ahn J, Yoo S, Mutlu O, Choi K (2015) PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture. In: Proceedings of the 42nd annual international symposium on computer architecture (ISCA’15), pp 336–348
- Babarinsa OO, Idreos S (2015) JAFAR: near-data processing for databases. In: SIGMOD
- Chen C, Chen Y (2012) Dynamic active storage for high performance I/O. In: 2012 41st international conference on parallel processing. IEEE, pp 379–388
- Chi P, Li S, Xu C, Zhang T, Zhao J, Liu Y, Wang Y, Xie Y (2016) PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In: Proceedings of the 2016 43rd international symposium on computer architecture (ISCA 2016), pp 27–39
- Cho BY, Jeong WS, Oh D, Ro WW (2013a) XSD: accelerating MapReduce by harnessing the GPU inside an SSD. In: WoNDP: 1st workshop on near-data processing, in conjunction with IEEE MICRO-46
- Cho S, Park C, Oh H, Kim S, Yi Y, Ganger GR (2013b) Active disk meets flash: a case for intelligent SSDs. In: Proceedings of ICS, pp 91–102
- Do J, Kee YS, Patel JM, Park C, Park K, DeWitt DJ (2013) Query processing on smart SSDs: opportunities and challenges. In: Proceedings of SIGMOD, pp 1221–1230
- Fan S, He Z, Tan H (2016) An active storage system with dynamic task assignment policy. In: 2016 12th international conference on natural computation, fuzzy systems and knowledge discovery (ICNC-FSKD 2016), pp 1421–1427
- Gao M, Ayers G, Kozyrakis C (2016a) Practical near-data processing for in-memory analytics frameworks. In: Parallel architecture and compilation techniques – conference proceedings (PACT), pp 113–124
- Gao M, Delimitrou C, Niu D, Malladi KT, Zheng H, Brennan B, Kozyrakis C (2016b) DRAF: a low-power DRAM-based reconfigurable acceleration fabric. In: 2016 ACM/IEEE 43rd annual international symposium on computer architecture. IEEE, pp 506–518
- Gao M, Pu J, Yang X, Horowitz M, Kozyrakis C (2017) TETRIS: scalable and efficient neural network acceleration with 3D memory. ASPLOS 51(2):751–764
- Hall M, Kogge P, Koller J, Diniz P, Chame J, Draper J, LaCoss J, Granacki J, Brockman J, Srivastava A, Athas W, Freeh V, Shin J, Park J (1999) Mapping irregular applications to DIVA, a PIM-based data-intensive architecture. In: ACM/IEEE conference on supercomputing (SC 1999), p 57
- Hsieh K, Ebrahim E, Kim G, Chatterjee N, O’Connor M, Vijaykumar N, Mutlu O, Keckler SW (2016) Transparent offloading and mapping (TOM): enabling programmer-transparent near-data processing in GPU systems. In: Proceedings of the 2016 43rd international symposium on computer architecture (ISCA 2016), pp 204–216
- Kim G, Chatterjee N, O’Connor M, Hsieh K (2017a) Toward standardized near-data processing with unrestricted data placement for GPUs. In: Proceedings of the international conference on high performance computing, networking, storage and analysis (SC’17), pp 1–12
- Korinth J, Chevallerie Ddl, Koch A (2015) An open-source tool flow for the composition of reconfigurable hardware thread pool architectures. In: Proceedings of the 2015 IEEE 23rd annual international symposium on field-programmable custom computing machines (FCCM’15). IEEE Computer Society, Washington, DC, pp 195–198
- Kotra JB, Guttman D, Chidambaram Nachiappan N, Kandemir MT, Das CR (2017) Quantifying the potential benefits of on-chip near-data computing in manycore processors. In: 2017 IEEE 25th international symposium on modeling, analysis, and simulation of computer and telecommunication systems, pp 198–209
- Muramatsu B, Gierschi S, McMartin F, Weimar S, Klotz G (2004) If you build it, will they come? In: Proceedings of the 2004 joint ACM/IEEE conference on digital libraries (JCDL’04), p 396
- Petrov I, Almeida G, Buchmann A, Ulrich G (2010) Building large storage based on flash disks. In: Proceedings of ADMS’10
- Picorel J, Jevdjic D, Falsafi B (2017) Near-memory address translation. In: 2017 26th international conference on parallel architectures and compilation techniques, pp 303–317. arXiv:1612.00445
- Ren Y, Wu X, Zhang L, Wang Y, Zhang W, Wang Z, Hack M, Jiang S (2017) iRDMA: efficient use of RDMA in distributed deep learning systems. In: IEEE 19th international conference on high performance computing and communications, pp 231–238
- Riedel E, Gibson GA, Faloutsos C (1998) Active storage for large-scale data mining and multimedia. In: Proceedings of the 24th international conference on very large data bases (VLDB’98), pp 62–73
- Sadoghi M, Javed R, Tarafdar N, Singh H, Palaniappan R, Jacobsen HA (2012) Multi-query stream processing on FPGAs. In: 2012 IEEE 28th international conference on data engineering, pp 1229–1232
- Samsung (2015) In-storage computing. http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2015/20150813_S301D_Ki.pdf
- Seshadri S, Gahagan M, Bhaskaran S, Bunker T, De A, Jin Y, Liu Y, Swanson S (2014) Willow: a user-programmable SSD. In: Proceedings of OSDI’14
- Sivathanu M, Bairavasundaram LN, Arpaci-Dusseau AC, Arpaci-Dusseau RH (2005) Database-aware semantically-smart storage. In: Proceedings of the 4th USENIX conference on file and storage technologies (FAST’05), pp 18–18
- Sykora J, Koutny T (2010) Enhancing performance of networking applications by IP tunneling through active networks. In: 9th international conference on networks (ICN 2010), pp 361–364
- Tiwari D, Boboila S, Vazhkudai SS, Kim Y, Ma X, Desnoyers PJ, Solihin Y (2013) Active flash: towards energy-efficient, in-situ data analytics on extreme-scale machines. In: Proceedings of FAST, pp 119–132
- Wang Y, Zhang M, Yang J (2017) Towards memory-efficient processing-in-memory architecture for convolutional neural networks. In: Proceedings of the 18th ACM SIGPLAN/SIGBED conference on languages, compilers, and tools for embedded systems (LCTES 2017), pp 81–90
- Woods L, Teubner J, Alonso G (2013) Less watts, more performance: an intelligent storage engine for data appliances. In: Proceedings of SIGMOD, pp 1073–1076
- Wulf WA, McKee SA (1995) Hitting the memory wall: implications of the obvious. SIGARCH CAN 23(1):20–24
- Xi SL, Babarinsa O, Athanassoulis M, Idreos S (2015) Beyond the wall: near-data processing for databases. In: Proceedings of DaMoN, pp 2:1–2:10