The Dynamic Random Access Memory Challenge in Embedded Computing Systems

Current and emerging embedded applications require ever larger amount of data that have to be processed. Due to their large size, this data has to be stored off-chip in dynamic random access memories (DRAMs). The challenges introduced by DRAMs in those systems are manifold. These include limited bandwidth and latency, as well as power consumption and reduced reliability due to technology scaling. In this chapter we highlight the major challenges and possible solutions for the integration of DRAM subsystems into embedded computing systems.


Bandwidth and Latency
Bandwidth is the amount of data that can be transferred between DRAM and a computational unit within 1 s. The maximum DRAM bandwidth is limited to the number of data pins times the interface frequency. Latency is the access time that it takes to complete an access. In fact, latency helps bandwidth, but not vice versa [33]. For instance, lower DRAM latency results in more accesses per second, and therefore higher bandwidth, whereas increasing the number of data pins increases the bandwidth without decreasing latency. A fast execution of applications on embedded systems must not only be supported by the computational units, but the memory subsystem must be designed to avoid hitting the memory wall [43]. For example, embedded applications for autonomous driving will require between 400 and 1024 GB/s of memory bandwidth [16], which is hard to realize with the current DRAM technologies. To put the problem in perspective, we survey current memory architectures. To avoid pin limitations, designers and vendors are using Buffer on Board (BoB) organizations [7], in which an additional logic component is interposed between the CPU and DRAM to control the memory and to communicate with the CPU over a narrow, high-speed, serial interface. This technique is mainly used in server applications where several terabytes of DRAM are required. The required storage capacity in embedded systems is much smaller than in the high-performance systems BoB targets, and thus this organization is inappropriate. All the other following DRAM devices can achieve easily several GB capacity, which is enough for most of the embedded applications.
Package on Package (PoP) organizations reduce the distance between the DRAM and the MPSoC (from centimeters to millimeters), providing higher bandwidth, lower latency, better power efficiency, and smaller form factors, all of which are especially important for smartphones and tablets. Low power DDR DRAMs (e.g., LPDDR4) can be used either as a device on a PCB or mounted directly as PoP. The latter organization permits only one device to be connected, requiring DRAM commands to be serialized due to the resulting low pin count. For example, if eight LPDDR4 devices are used on a PCB, they deliver a theoretical bandwidth of 137 GB/s.
To address the huge memory demand of highly parallel GPUs, graphic DDR DRAMs (e.g., GDDR5X or GDDR6) use techniques like quad data rate (QDR) to deliver high bandwidth compared to conventional DDR DRAM. While LPDDR4 devices are designed and optimized for ultra-low power consumption with aggressive power gating and higher-threshold transistors, GDDR5X/6 devices focus on delivering the highest achievable bandwidth. Both use an architecture with distributed banks (heavy sub-banking) due to the wider data I/O widths of ×16/×32 and the larger data prefetch of up to 16 bit per data I/O. However, GDDR5X/6 devices improve the column-to-column cycle time (t CCD ) by reducing data path delays from primary sense amplifiers to the global sense amplifiers. Furthermore, GDDR5X/6 chips use an on-die phase lock loop (PLL) to achieve very high I/O performance in QDR mode. In contrast, LPDDR4 devices have no on-die PLL or delay lock loop (DLL). Combining 16 GDDR6 devices in QDR mode yields a theoretical bandwidth of 1 TB/s, as shown in Fig. 3

.2.
Another way of achieving high bandwidth is 3D stacking: examples include WIDE I/O, Micron's Hybrid Memory Cube (HMC), and Samsung's High Bandwidth Memory (HBM). These memories reduce the distance between CPU and external RAM from centimeters to micrometers by means of through-silicon via (TSV) technology. The available bandwidth increases due to more pins provided by the TSVs, but, more importantly, this technology provides a major boost in energy efficiency compared to standard off-chip (G)DDR devices.
The combination of high bandwidth communication and the lower power consumption of 3D integrated memory is an ideal fit for embedded systems. For example, four parallel HBM2 devices on a 2.5D silicon interposer can provide up to 1 TB/s [16]. However, 3D memories suffer from thermal issues, which we discuss in Sect. 3.4. From an application point of view, the DRAM subsystem has non-deterministic timing behavior [8] due to its complex protocol (i.e., the latency of a DRAM request depends on previous issued commands) and the runtime optimization of the memory controller; this makes it difficult to provide predictable performance and thus to guarantee real-time task predictability [1]. Figure 3.4 shows a histogram of the CHStone ADPCM benchmark [11] simulated on the DRAMSys framework [18]. 2 Although the average latency is concentrated around 40 ns, the memory latency can easily vary by an order of magnitude.
As with the bandwidth issues discussed above, the memory controller plays an integral role in this non-deterministic timing behavior. The memory controller has to manage, on one side, accesses to the DRAM memory from the compute fabric and, on the other side, the complex interface protocol of the DRAMs. In the following we discuss the main contributions to the DRAM latency that origin from the complex internal memory architecture and the memory controller.
• Row Misses: The latency of a bank access varies depending on the state of its row buffer. If a memory access targets the same row as the row currently cached in the buffer (a row hit), it results in lower latency and lower energy memory accesses. On the other hand, if a memory access targets a different row from that currently in the buffer (a row miss), it results in higher latency and energy consumption. • Close vs. Open Page Policy: Commercial off-the-shelf (COTS) DRAM controllers usually support two major modes: an open page policy (OPP) and closed page policy (CPP). The OPP keeps the current row active after a read or write, whereas the CPP precharges the row automatically after each access. The latter makes the latency for each access more predictable, but it also decreases performance for access patterns with high row-hit potential. • Refresh: DRAMs must be refreshed regularly due to their charge-based bit storage architecture. The memory controller has to issue this refresh operation periodically (e.g., every 64 ms). Normal accesses to the DRAM have to be blocked for the duration of the refresh operation t RF C (350 ns for DDR4), degrading performance with respect to both bandwidth and latency and increasing energy consumption. 3 If a memory access arrives at the same time that a refresh happens it will experience unpredictable latency. • Scheduling: COTS memory controllers are optimized for average case performance and therefore employ runtime scheduling of requests (c.f. Sect. 3.2) for online optimization. For example, with schedulers that attempt to maximize row hits it is possible that a request that misses the row could starve, which again results in a hardly predictable latency. • Arbitration: A major challenge arises when several computational units are issuing read and write requests to the memory controller. The different applications running on these compute units will place their requests in different input buffers, and arbitration must be performed. This leads to interference that can cause high unpredictability. • Command/Address and Data bus Contention: All banks in a DRAM share the same command/address and data buses, which can limit overall performance. If the data bus utilization is 100%, the maximum bandwidth is reached. On the other hand if the command bus utilization is 100%, WR and RD commands must be issued in later cycles that negatively impacts the bandwidth and the latency. • Current Limiting and Power Supply Network: In order to limit peak currents there exists a rolling time-frame, in which a maximum of four banks can be activated, called four activate window (t FAW ). There is also a minimum time interval between two ACT commands to different banks, (t RRD ). Also these constraints can influence bandwidth, latency, and predictability in specific scenarios. • Further Effects: Bank-Groups in DDR4 or GDDR or rank-to-rank switching constraints in DDR memories also impact the predictability.
Due to this unpredictable timing behavior, processors for embedded applications with real-time and strict latency constraints have thus far largely avoided using DRAM. For example, Infineon's Aurix CPU, which is widely used for safety-critical applications, does not provide a DRAM controller.
In past years there were many investigations with respect to DRAM controllers for real-time and mixed-criticality applications in embedded systems. A detailed book which summarizes those approaches has been presented by Goossens et al. [8].
Most of these approaches concentrate operating the DRAM with statically precomputed command patterns which guarantee a predictable behavior. However, this predictability often comes with a degradation of sustainable bandwidth. Moreover, the bandwidth numbers presented in Fig. 3.2 are theoretical maxima: the sustainable memory bandwidth is much less, and it strongly depends on how the data is stored in the memories, i.e., the memory access pattern [12]. Therefore, it is not only important to choose a memory that provides high bandwidth, it is also important to design a DRAM controller that can bring the sustainable bandwidth closer to the theoretical maximum.
As already mentioned, general-purpose DRAM controllers use online scheduling techniques to improve the sustainable bandwidth, e.g., by reducing the number of row misses or read/write transitions. In order to reduce the number of read/write transitions, DRAM controllers buffer read and write commands in two distinct queues. An arbiter switches between read and write mode to diminish the t W T R penalty, the minimum time interval between the end of a WR burst and a RD command.
However, in embedded systems, many applications (e.g., signal, image, or neural network processing) have regular, fixed, and deterministic memory access patterns. On the compute side, inherent application-specific knowledge has been heavily exploited for efficient compute architectures. However, on the memory side there is limited research that exploits application knowledge to improve the memory access behavior. In [12] we presented an application-specific memory controller (ASMC). Key of this controller is an optimized mapping of the logical addresses to physical DRAM addresses such that the row misses in the access pattern stream are minimized. The corresponding mathematical optimization problem is an integer linear programming problem. The solution of this problem maximizes the number of row buffer hits and exploits the bank-level parallelism of the DRAM device in order to reduce the latency and therefore to keep up the sustainable bandwidth near to the maximum. Therefore, such an ASMC can outperform online schedulers because it was designed with a global application view. Furthermore, for real-time embedded systems with this method we can easily determine WCET bounds, since no non-deterministic online scheduling is involved.
The efficiency of this approach is demonstrated on an industrial embedded image processing application that consists of image rotation and FFT. Due to realtime requirements this application requires a minimum bandwidth of 9.57 GB/s. Figure 3.5 shows the bandwidth and energy for the standard address mappings of a standard memory controller with standard row-bank-column (RBC) mapping and bank-row-column (BRC) mapping, a manual optimization of the mapping of an The ASMC approach has a runtime of ∼50 min, whereas the manual approach requires ∼1 week for an engineer to fully understand the application and analyze the behavior. Furthermore, by using the generated address mapping, all the online scheduling capabilities of the memory controller could be removed, which reduced the required area of the memory controller by 35%.
As mentioned already in Sect. 3.2 the refresh has a large impact on DRAM's bandwidth and latency. The overhead of refreshes can be reduced by only refreshing the memory cells inside the DRAM that hold data that are still alive. A large body of research exists developing schemes that manually refresh the DRAM row-by-row, characterizing each row's ability to retain data and eliminating unnecessary Refresh operations on rows that can be refreshed less often. These schemes have been shown to be extremely efficient. Since eliminating refresh improves both energy and performance of the memory system, these schemes offer the potential for significant gains in DRAM-system efficiency. However, these schemes are incompatible with the modern auto-refresh mechanism that is widely used: auto-refresh operates on multiple rows at once and not on a row-by-row basis. In addition, auto-refresh cannot skip any row, whether that row needs to be refreshed or not. Thus, the manual schemes use explicit row-level Activate (ACT) and Precharge (PRE) commands to refresh row-by-row, called row granular refresh (RGR). However, it was shown in [4] that techniques based on RGR could never be as effective as the DRAM's internal auto-refresh.
In [28] we presented a technique called optimized RGR which allows a rowby-row refresh with the same efficiency as the auto-refresh. Here, we investigated the timings that are relevant to Activate and Precharge commands and showed that these DRAM timing parameters can be reduced for performing the Refresh 1X mode 2X mode 4X mode

Average Response Latency [ns]
0 100 operation row-by-row. We could demonstrate a reduction of latency and increase of bandwidth compared to standard auto-refresh, as shown in Fig. 3.6. The results can be even improved if only alive data is refreshed (ORGR select). Additionally, ORGR improved the energy efficiency compared to RGR. It is becoming clear that embedded applications must concentrate on DRAM solutions like GDDR and HBM in combination with ASMCs and sophisticated refresh mechanisms in order to cope with their high bandwidth and low latency requirements.

Power Consumption
Power is one of the major challenges in today's embedded system development. According to Fig. 3.3, the preliminary choice for low power designs is LPDDR4 and Wide I/O2 due to their very low energy consumption. However, when aiming at high memory bandwidth, e.g., 1 TB/s these devices are not optimal. For example, to achieve the aforementioned bandwidth with LPDDR4, 64 devices (×32) are required. Although the average power would be only ∼17 W at a peak frequency of 2000 MHz, the high number of resulting I/O pins (2048) becomes unfeasible. Hence, the only alternative candidates for high bandwidth are HBM2 and GDDR5X/6. According to Figs. 3.2 and 3.3 the average power consumed 4 by the HBM (4 stack ×1024) and GDDR6 (16 devices, QDR, ×32) devices are ∼60 W and ∼150 W, respectively. These numbers show that DRAM will be a significant power contributor to embedded systems which require a high memory bandwidth. Therefore, it is mandatory to efficiently use DRAM's power-down modes in order to reduce power consumption.
In state-of-the-art memory controllers the entry to a power-down mode is scheduled when there was no activity in a period of time called timeout. DRAMs In [35] we showed that a highly opportunistic SREF entry results in an increased power consumption, since the SREF will always execute an internal refresh in the beginning. Therefore, the timeout for a SREF entry should be at least 500 clock cycles for a Wide I/O DRAM.
In [19] we presented an optimized power-down policy, called staggered powerdown, which considers all three available DRAM power-down modes to achieve the maximum saving in energy and the minimum in slow-down on the execution of the applications. The basic idea is to change to the next more efficient power-down state on a refresh event. With this method, unnecessary SREF entries will be avoided and the hardware timeout counters, as used in state-of-the-art controllers, are not required anymore. As shown in Fig. 3.7 for Wide I/O DRAMs an energy reduction up to 10% in high activity periods and up to 13% in idle phases is feasible.
A high power consumption also fosters a high thermal dissipation that largely impacts the reliability of a DRAM. This challenge is discussed in more detail in the next paragraph.

Temperature vs. Reliability
DRAMs are very sensitive to high temperature, which increases the leakage in the memory cells. Figure 3.8 shows the different leakage paths in a DRAM cell:  [23,34] • Drain Leakage (1), which includes the P-N junction leakage as well as gate induced drain leakage (GIDL). GIDL is mainly caused by trap assisted tunneling (TAT), and it is influenced by the number and distribution of traps in the band-gap region as well as the electric field. Since the negative wordline voltage and the positive charge stored in the cell capacitor (when a 1 is stored in the cell) increase the electric field in the band-gap region (gate-drain overlap region), GIDL is the major source of leakage for a stored 1 in the DRAM cell [31]. • Sub-threshold Leakage (2), which is the drain-source leakage of the cell transistor when it is in the OFF state. This current depends on various factors such as negative wordline voltage, bulk voltage, etc. When the bitlines are in precharged state (V DD /2) this can slightly charge the cell capacitor and therefore cause the degradation of a 0 stored in the cell. It can also degrade a 1 stored in the cell by discharging to the bitline, but the leakage will be very small due to the increased threshold voltage of the access transistor when a 1 is stored (body-bias effect). • Cell Capacitor Leakage (3), which is the leakage through the cell capacitor dielectric. With the technology scaling, also the capacitor area is decreasing. Therefore, to maintain the cell capacitance at the previous value, dielectric thickness has to be reduced, which increases the leakage. The use of new metal insulator metal (MIM) structure with high-k dielectric materials has helped to reduce this leakage. Capacitor leakage influences both stored 0's and stored 1's.
In order to avoid data corruption by retention errors due to leakage, the refresh frequency needs to be increased. The general rule of thumb is to double the refresh rate for every 10 • C increase over 85 • C [20]. For example, the refresh period must be decreased from 64 ms to 4-8 ms for 125 • C, which leads to a serious collapse of the sustainable bandwidth [16].
This situation is even worse for today's 3D stacked DRAM systems (e.g., Wide I/O, HBM, HMC, etc.), which aggravate the thermal crisis: i.e., these DRAMs are even more sensitive to temperature changes because of the stacked thin dies. Additionally, when aiming for highest bandwidths with HBM or HMC, these devices will consume, as mentioned before, a significant amount of power on a small area compared to their commodity counterparts. Thus, the self-heating of 3D-DRAMs is even more accelerated. Besides the leakage currents, crosstalk on bitlines and wordlines can also disturb the data stored in the cells or disturb their sensing. Due to the aforementioned effects and shrinking technology nodes, reliability is a major concern in DRAMs. Many techniques exist to improve the reliability, e.g., using error correcting codes (ECC) and/or spatial redundancy.
Approximate and probabilistic computing evolved as design paradigms that exploit the error resilience of applications to increase their performance and decrease the power consumption [10]. This paradigm can be extended to DRAMs resulting in approximate DRAMs (ADRAM) that enable a trade-off between energy efficiency, performance, and reliability. The inherent error resilience of applications allows sacrificing data storage robustness and stability by lowering the refresh rate or disabling refresh in DRAMs completely, as shown in Fig. 3.9. However, to apply ADRAM the statistical DRAM behavior with respect to retention time, process variation, and temperature has to be characterized.
Several studies for the usage of ADRAM are presented in [13][14][15]20]. One scenario is safe refresh disabling, i.e., if the data lifetime is smaller than the refresh period, the refresh can be completely switched off without impact on the system Approximate DRAM design space [42] [18] [6] [38] Exploring ADRAM in a system context is challenging, since a trade-off between accuracy and simulation performance must be considered. Our framework relies on SystemC Transaction Level Models (TLM) for fast and accurate simulation. Figure 3.10 shows the closed loop simulation. This simulation loop consists of (1) DRAM and gem5 Core Models [5,18], (2) a DRAM power model [6,29], which uses either parameters from datasheets, or real measurements [13,15], (3) a thermal model based on 3D-ICE [38], and (4) a DRAM retention error model [42].
As mentioned before, ECC is an efficient technique to improve DRAM's reliability, e.g., retention errors or errors induced by crosstalk. State-of-the-art ECC DDR DIMMs, for instance, consist of 8 DRAM devices and a further device for storing the ECC redundancy. Moreover, vendors recently introduced on-die ECC for LPDDR4 [24,26] to correct retention errors. With ECC the refresh rate can be lowered by 4×, which largely reduces the power consumption. Finding an efficient ECC is a non-trivial task. Traditional ECC techniques for DRAMs assume a symmetric behavior of the retention errors, i.e., the error probability for a stored 0 and 1 is identical. In [22] and [23] we presented a more accurate error model for the retention behavior that exploits the internal cell structure (the so-called true-or anticells) of a DRAM. This model is asymmetric and we could show that the channel capacity according to Shannon's capacity definition (the memory cell is considered as a noisy channel) of a single memory cell is larger than in the traditional commonly used symmetrical model. Hence, a more efficient coding must exist. In [23] we presented a new and low-overhead coding scheme that improves the reliability with respect to retention errors.

Safety and Security
Since DRAMs are more and more used in safety-critical applications like automotive, safety and security are major concerns for DRAMs that were originally mainly developed for consumer applications. Apart from the temperature based retention errors discussed in Sect. 3.4, DRAMs are also prone to transient soft errors, i.e., effects of cosmic particle strikes [9]. Moreover, due to the high frequency of DRAM interfaces transient transmission errors on the DRAM bus can occur. Furthermore, there can be hard errors related to stuck at failures or aging, which could result in a defect column decoder. There exists only a limited amount of studies on DRAM error rates in the field since manufacturers and data centers are very careful to share this sensitive information [30,36,41]. Sridharan et al. report 20-66 FIT for a single DRAM device [39,40]. This highlights the need for appropriate safety mechanisms in order to decrease the FIT rates. For example, the memory controller contains an additional logic that tests the interface periodically in order to detect errors or a strong ECC that is able to correct errors online in order to guarantee functional safety.
Apart from random failures, malicious causes can lead to a safety goal violation, too. Because of transitions to open environments for IoT or Car2X communication the vulnerability of DRAMs for embedded systems must be considered. As DRAM process technology scales down, the electrical interference between the memory cells increases, which leads to disturbance errors. Recently, the row-hammer problem [21,32] and its exploits [25,37] have caused a lot of attention in research and newspapers. By repeatedly opening and closing a DRAM row, called hammering, bits in adjacent rows can flip. This effect can be exploited to write on memory locations with prohibited access rights to, e.g., gain kernel privileges or escape a sandbox or hypervisor. In [25] the author showed that secret data can be read with a combination of row-hammer and data dependencies [23]. The row-hammer security attack [21] is a potential malicious behavior that has to be avoided. Controller triggered techniques like target row refresh where rows will be refreshed when their activation count exceeds a threshold or techniques on the device level like [2,44] can alleviate this problem. In [17] a methodology for reverse engineering DRAMs by reconstructing the physical location of memory cells without opening the device package and microscoping the device was presented. This method consists of a retention error analysis while a temperature gradient is applied to the DRAM device. With this insight into the internal DRAM structure row-hammer countermeasure techniques can be improved.

Conclusion
Emerging applications executed on embedded computing systems require ever increasing main memory sizes. Thus, DRAMs are indispensable to be integrated in such systems. However, the use of DRAMs implies many new challenges.
In this chapter, we highlighted some of the major challenges for the integration of DRAM subsystems into embedded computing systems. These challenges are namely: bandwidth, latency, power, temperature, reliability, safety, and security. Furthermore, we showed several solutions from our recent research activities in order to tackle and overcome these challenges.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.