
1 Introduction

GPUs target applications with high computational requirements and substantial parallelism that are insensitive to latency [1]. Large caches are ineffective for GPUs due to the execution of thousands of parallel threads [2]. These factors cause GPUs and many GPU applications to require memory interfaces with significantly higher DRAM bandwidth than regular CPUs. GPUs usually achieve this high memory bandwidth by using special graphics DRAM with lower capacity but wider and faster interfaces, such as GDDR5. These high-throughput memory interfaces consume a significant amount of power. Modeling their power consumption accurately is thus important for architectural GPU power simulators.

In our previous work, we have shown that data values significantly influence the energy consumption of GPU ALU operations [3]. While executing the same sequence of instructions, the power consumption changed from 155 W to 257 W when the processed data values were changed. In this work we demonstrate that the energy cost of memory transactions is also influenced significantly by the data values written to or read from the DRAM. MEMPower provides predictions that consider the data values used in the transaction as well as the location of the transaction.

Most current discrete GPUs employ GDDR5 or GDDR5X memories [4, 5]. Both use pseudo open drain (POD) signaling [6]. In POD signaling, current flows when transmitting a zero, while no current flows when transmitting a one. To reduce energy consumption and to limit the number of simultaneously switching outputs, both types of memory use data bus inversion (DBI) [7, 8]. DBI encoding transmits data inverted if that results in lower energy consumption and uses an extra signal line that allows the receiver to reverse the inversion if required. POD signaling, together with DBI encoding, is a source of data-dependent energy consumption in the memory interface.
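
As a concrete illustration, the following sketch models per-byte DBI encoding of the kind used by GDDR5 (a simplified host-side C++ model written for this explanation; the function name and interface are ours, not taken from any vendor code): under POD signaling only zeros draw current, so a byte is transmitted inverted whenever more than four of its eight bits would otherwise be zero, and the DBI line tells the receiver to undo the inversion.

#include <cstdint>

// Simplified model of per-byte DBI encoding: invert a byte if that reduces
// the number of zeros driven on the DQ lines under POD signaling.
void dbi_encode(const uint8_t *in, uint8_t *out, bool *dbi, int n)
{
    for (int i = 0; i < n; ++i) {
        int zeros = 8 - __builtin_popcount(in[i]);
        if (zeros > 4) {                          // inverting transmits fewer zeros, saving current
            out[i] = static_cast<uint8_t>(~in[i]);
            dbi[i] = true;                        // receiver must undo the inversion
        } else {
            out[i] = in[i];
            dbi[i] = false;
        }
    }
}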

CMOS circuits consume dynamic power when their internal circuit nodes are recharged to a different state. How much energy is consumed depends on the load capacitance of the node and the voltages involved. Bus wires providing long on-chip routing usually have high load capacitance. External off-chip interfaces also contain large loads in their drivers, receivers, and wires, as well as parasitic package capacitances. How often each wire is recharged depends on the data and on the encoding of the data transmitted over the wire. The recharging of wires and other circuit nodes partly explains why the energy cost of a memory transaction depends on the transmitted data.
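
To first order (this is the standard CMOS dynamic-energy estimate, not a measured GPU parameter), each transition of a node with load capacitance \(C\) between ground and the supply voltage \(V_{DD}\) dissipates

\[ E_{\mathrm{switch}} \approx \tfrac{1}{2}\, C\, V_{DD}^{2}, \]

so the energy of a transaction scales with the number of wires that toggle and with their capacitance.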

Memory transactions are generated within the GPU cores, also called streaming multiprocessors (SMs). In the GTX580 GPU, the SMs are organized into graphics processor clusters (GPCs) [9]. Each GPC contains four SMs. The GTX580 uses a full GF100 die with all four SMs activated in each of the four GPCs.

This paper is structured as follows: We present related work in Sect. 2. Section 3 describes our experimental setup including our microbenchmarks. The following Sect. 4 shows how latency measurements can be used to discover the mapping between memory addresses and memory channels. It also describes the properties of the mapping and insights gained from latency measurements. Section 5 introduces the design of the data dependent power model and evaluates the accuracy of the model. Section 6 concludes the paper.

2 Related Work

GPUWattch [10] and GPUSimPow [11] do not take data values and locations into account when predicting the energy cost of each memory transaction. MEMPower takes data values into account and thus bridges the gap between architectural simulators and slow but precise RTL power simulators.

Wattch [12] collects some activity factors related to data for some memories and busses but does not model high performance GPUs and graphics DRAM.

Wong et al. used microbenchmarking to reveal various latency and cache characteristics of the GT200 [13], but did not consider energy or the memory channel mapping. Mei and Chu used microbenchmarks to analyze the structure of the caches and the shared memory as well as the latency and throughput of the DRAM in more recent NVIDIA GPUs [14].

Table 1. GPU configuration in experimental evaluation.

3 Experimental Setup

For our experiments, we used an NVIDIA GTX580 GPU with a full GF100 chip based on the Fermi architecture [9]. A short overview of its parameters is provided in Table 1. This GPU was selected for two main reasons: 1. GPGPU-Sim currently does not support more recent GPU architectures. 2. Our previous work resulted in a data-dependent power model for the ALUs of this GPU [3]. This work adds the missing memory power model to enable the creation of an architectural power model of the GTX580 GPU that includes both ALU and memory data-dependent power. Energy was measured using a GPU power measurement testbed that has been described in previous work [11].

In order to measure the power consumption of memory transactions, we developed custom microbenchmarks. These microbenchmarks execute the tested memory transaction millions of times, which allows us to measure the small amount of energy used per transaction. To isolate the data-dependent energy of each transaction, we measure every transaction twice: once with the test vector and once with a baseline vector of all ones. The energy consumed with the baseline vector is then subtracted to obtain the energy difference caused by the specific test vector. Both measurements are performed at nearly the same time to ensure that the GPU temperature stays approximately constant in both measurements; without this step, GPU temperature variations could result in different amounts of static (leakage) power.
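
Expressed as a formula, with \(N\) the number of repetitions of the transaction in the microbenchmark, the data-dependent energy attributed to a test vector \(v\) is

\[ E_{\mathrm{data}}(v) \approx \frac{E_{\mathrm{total}}(v) - E_{\mathrm{total}}(\mathbf{1})}{N}, \]

where \(E_{\mathrm{total}}(\mathbf{1})\) is the energy measured with the all-ones baseline vector.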

The microbenchmarks use inline PTX assembly to generate special load and store instructions that mostly bypass the L2 cache (ld.global.cv.u32 and st.wt.u32). Even with these instructions, we detected with the nvprof profiler that multiple accesses to the same address, issued at nearly the same time, are still combined at the DRAM. The microbenchmark was therefore redesigned to ensure that different SMs do not generate accesses to the same location at nearly the same time. The profiler was used to verify that our microbenchmark generates the expected number of memory transactions. Each measurement was performed 128 times and averaged, and the order of the measurements was randomized.
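
A minimal sketch of such microbenchmark kernels is shown below. This is an illustrative reconstruction, not the authors' exact code: the kernel names, grid configuration, address striding, and the measures against DRAM-level combining are omitted or chosen by us, and the store is written with an explicit .global qualifier.

__global__ void store_bench(unsigned int *dst, const unsigned int *pattern,
                            int iterations)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int val = pattern[idx];       // per-thread word of the test vector
    unsigned int *p  = dst + idx;
    for (int i = 0; i < iterations; ++i) {
        // write-through store that mostly bypasses the L2 cache
        asm volatile("st.global.wt.u32 [%0], %1;" :: "l"(p), "r"(val) : "memory");
    }
}

__global__ void load_bench(const unsigned int *src, unsigned int *sink,
                           int iterations)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int *p = src + idx;
    unsigned int acc = 0, v;
    for (int i = 0; i < iterations; ++i) {
        // volatile load that is not served from the caches
        asm volatile("ld.global.cv.u32 %0, [%1];" : "=r"(v) : "l"(p));
        acc += v;
    }
    if (acc == 0xdeadbeef) *sink = acc;    // keep the loads from being optimized away
}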

4 Memory Layout

According to NVIDIA, the GTX580 features six different memory channels [9]. CUDA allows us to allocate space in the GDDR5 but does not provide any control over which memory channels are used for the allocation. We suspected that the different memory channels might have different energy properties due to the different PCB layout of the memory channels as well as internal layout differences of the GF100. Allocations are typically spread over all memory channels so that the full capacity and the full memory bandwidth can be utilized. However, to measure a specific memory channel we need to identify where a specific memory location is actually allocated. As no public API is available to query that information, we hypothesized that the differences in physical distance between the GPU cores and the memory channels would also result in slightly different latencies when accessing the memory. CUDA offers a special %smid register that can be used to identify the SM executing the code and a %clock register that allows very fine-grained time measurements. We used these two features to measure the latency of reading each location from each SM. We measured the latency of each location 32 times and averaged the results to reduce measurement noise. For each location, this yields a 16-element latency vector, where each element is the average memory read latency from one SM to that location. We found that the latency to the same memory location indeed differs between SMs and that different memory locations show different latency patterns. We also noticed that the latency pattern stays constant for 256 consecutive naturally aligned bytes. This means the granularity of the mapping from addresses to memory channels is 256 bytes, and we only need to perform our latency measurements once per 256-byte block to identify the location of the whole block.
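
The latency probe can be sketched as follows (again an illustrative reconstruction with names chosen by us; the real measurement repeats this 32 times per 256-byte block and for every SM):

__global__ void latency_probe(const unsigned int *addr,
                              unsigned int *out_sm,
                              unsigned int *out_cycles)
{
    unsigned int start, stop, sm, v;

    asm volatile("mov.u32 %0, %%clock;" : "=r"(start));
    // cache-bypassing load, so the measured time reflects the DRAM round trip
    asm volatile("ld.global.cv.u32 %0, [%1];" : "=r"(v) : "l"(addr));
    // the add consumes the loaded value, so the second clock read cannot
    // retire before the load has completed
    asm volatile("add.u32 %0, %1, 0;\n\t"
                 "mov.u32 %0, %%clock;" : "=r"(stop) : "r"(v));
    asm("mov.u32 %0, %%smid;" : "=r"(sm));

    *out_sm     = sm;                      // which SM executed the probe
    *out_cycles = stop - start;
}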

As the memory latency is not completely deterministic but changes slightly, e.g. due to background framebuffer accesses running in parallel to the measurement, all the latency vectors are slightly different. We solved this issue using k-means clustering [15]. We initially tried to map our latency vectors into six clusters corresponding to the six memory controllers listed in NVIDIA’s descriptions of the GF100 [9]. This, however, failed to provide a plausible mapping of the memory locations, but mapping the latency vectors into twelve clusters was successful.

When we assume twelve clusters, all latency vectors are located close to one of the twelve centroids, and the second closest centroid is much farther away. The number of points assigned to each cluster is also approximately equal. When we access only locations mapped to one centroid, we achieve approximately 1/12 of the bandwidth achieved when locations from all channels are used. This pattern continues if we select larger subsets of the centroids, e.g. selecting locations from two clusters results in 1/6 of the bandwidth. The nvprof profiler provides additional hints that the identified mapping is correct: many DRAM counters are provided twice, one counter for subpartition 0 and another for subpartition 1. If we access only locations from a single cluster, only one of these two performance counters is incremented significantly, while the other stays very close to zero. This indicates that all locations in a cluster belong to the same subpartition.

Lopes et al. list six L2 cache banks with two slices each for the GTX580 [16]. The GTX580 has a 384-bit wide memory interface. Six 64-bit wide channels together with the 8n prefetch of GDDR5 would result in a fetch granularity of 64 bytes per burst. Memory access patterns that only access 32 consecutive bytes and do not touch the next 32 bytes would then always overfetch 32 bytes per transaction and would achieve an effective bandwidth of less than half the peak bandwidth. However, our experiments showed better than expected performance for 32-byte fetches. An additional hint at 32-byte transactions is provided by the NVIDIA profiler, where many DRAM-related performance counters are incremented by one per 32 bytes. This indicates that the GTX580 can fetch 32 bytes at a time, which is consistent with twelve 32-bit channels. From these findings, we estimate that the GTX580 uses six memory controllers with two subpartitions per controller and one 32-bit wide channel per subpartition.

As twelve is not a power of two, the GTX580 cannot simply use a few address bits to select the memory channel. Round-robin mapping of addresses to memory channels is conceptually simple but would require a division of the addresses by twelve.

Fig. 1. 1 MB memory block with recovered memory channel mapping; each pixel corresponds to a 256-byte block

Figure 1 provides a graphical representation of the recovered memory mapping of a 1 MB block of memory. Each pixel represents a 256-byte block, and each of the 64 rows represents \(64 \times 256\,\mathrm{B}=16\) kB. The memory mapping appears to be structured but does not use any simple round-robin scheme. With this mapping, twelve consecutive 256 B blocks use 10.6 different memory channels on average. A simple round-robin scheme would likely result in some applications having biased memory transaction patterns that favor some memory channels over others, which would reduce performance. The mapping is likely the output of a simple hash function that makes it unlikely for applications to exhibit biased memory access patterns by chance. Sell describes a similar scheme used by the Xbox One X Scorpio Engine [17].
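
The actual GF100 hash function is undocumented. The toy function below is purely illustrative of the kind of scheme that could produce such a mapping, spreading 256-byte blocks over twelve channels while breaking up regular power-of-two strides; the constants and shift amounts are arbitrary choices, not reverse-engineered values.

__host__ __device__ unsigned int toy_channel_hash(unsigned long long addr)
{
    unsigned long long block = addr >> 8;      // 256-byte mapping granularity
    unsigned int x = static_cast<unsigned int>(block ^ (block >> 7) ^ (block >> 13));
    return x % 12u;                            // twelve channels, not a power of two
}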

Table 2. DRAM latency.
Fig. 2. GF100 organization

We also analyzed the latency vectors (Table 2) to reveal more information about the internal structure of the GPU. All SMs in the same GPC show nearly the same latency pattern across the memory channels. The first SM in each GPC has the lowest latency; the other SMs are approximately 2, 6 and 8 cycles slower. This additional latency within the GPC does not depend on the memory channel addressed and is identical for all four GPCs. This indicates an identical layout of all four GPCs and a shared connection of all SMs of a GPC to the main interconnect. Four of the twelve memory channels show their lowest latency when accessed from GPC1; the same holds for GPC2 and GPC3, while there is no memory channel for which GPC0 provides the lowest latency. We suspect this is the result of a layout such as the one shown in Fig. 2. It also matches the PCB layout of a GTX580, where DRAM chips are located on three of the four sides of the GF100 and the PCIe interface is found at the bottom.

5 Data-Dependent Energy Consumption

As already described in the introduction, we expect two main reasons for data-dependent energy consumption: 1. special signaling lines, such as the GDDR5 DQ lines, that consume additional energy at a certain signal level; 2. state changes of wires and other circuit nodes. Because our model should allow a fast and simple evaluation, we selected a simple linear model. Every memory transaction is mapped to a small vector that describes the relevant properties of the block. A dot product of this vector with a coefficient vector yields the estimated energy consumption of the transaction. The coefficient vector is calculated in a calibration process.
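
Formally, for a transaction \(t\) with feature vector \(\mathbf{f}(t)\) and calibrated coefficient vector \(\mathbf{c}\), the estimated energy is

\[ \hat{E}(t) = \mathbf{c}^{\top}\,\mathbf{f}(t). \]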

Fig. 3. Memory datapath

The following properties of the block are used to estimate the energy consumption. We model signal-level related energy consumption by including the population count of the block, i.e. the number of set bits. We also need to estimate the amount of recharging of internal wires and circuitry caused by the transaction. Memory transactions travel through several units and various connections until they finally reach the DRAM; a simplified diagram is shown in Fig. 3. We know that the transaction travels through a 32-bit wide interface between DRAM and memory controller. Unless a reordering of bits is performed, we know which bits will be transmitted over the same wire and can cause switching activity on that wire, e.g. bits 0, 32, 64, ... are transmitted on the same DQ line, bits 1, 33, 65, ... on the next DQ line, and so on. While we know the width of the DRAM interface itself, the widths of the various internal interconnections are unknown. We assume the internal link widths are powers of two and at least one byte wide. Coefficients for all potential link sizes are first added to the model. During calibration, the best subset of coefficients is selected, and we thereby indirectly gain knowledge about the internal interconnections. Because GDDR5 memory can use DBI-encoded data, an extra version of each of the previously described coefficients is added to our model; this second version assumes DBI-encoded data.
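
The sketch below illustrates this feature extraction for a 128-byte block (host-side code; the function names, feature ordering, and the set of link widths from 1 to 64 bytes are our assumptions). The DBI variants of the toggle features would be obtained by first passing the block through a DBI encoder, such as the one sketched in Sect. 1, and repeating the toggle counts.

#include <cstdint>

static const int BLOCK_BYTES = 128;
static const int NUM_WIDTHS  = 7;            // link widths of 1, 2, 4, ..., 64 bytes

void extract_features(const uint8_t block[BLOCK_BYTES],
                      double features[1 + NUM_WIDTHS])
{
    // population count: models level-dependent current of the POD-terminated lines
    int pop = 0;
    for (int i = 0; i < BLOCK_BYTES; ++i)
        pop += __builtin_popcount(block[i]);
    features[0] = pop;

    // toggle counts: on a link that is 'width' bytes wide, byte i and
    // byte i + width are transmitted back to back over the same wires
    for (int w = 0; w < NUM_WIDTHS; ++w) {
        int width   = 1 << w;
        int toggles = 0;
        for (int i = 0; i + width < BLOCK_BYTES; ++i)
            toggles += __builtin_popcount(block[i] ^ block[i + width]);
        features[1 + w] = toggles;
    }
}

// the estimated energy is the dot product with the calibrated coefficients
double estimate_energy(const double features[1 + NUM_WIDTHS],
                       const double coeff[1 + NUM_WIDTHS])
{
    double e = 0.0;
    for (int i = 0; i < 1 + NUM_WIDTHS; ++i)
        e += coeff[i] * features[i];
    return e;
}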

Fig. 4. Store access prediction accuracy vs. \(\alpha\)

A synthetic set of test vectors was generated to calibrate the model. The calibration test vectors are designed to span a wide range of combinations in terms of toggles at various positions and in terms of population count. We measured the real energy consumption of these test vectors. Initially, the model uses a large number of coefficients, and some of them likely have no corresponding hardware structure in the GPU. This causes a significant risk of overfitting the coefficients to our calibration measurements. We avoid this issue by using LASSO regression instead of a regular least-squares fit [18]. Instead of only fitting the calibration data as closely as possible, LASSO also tries to reduce the number of used coefficients and their magnitude. The hyperparameter \(\alpha\) controls the trade-off between the number and size of the coefficients and the prediction error on the calibration set.
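
In the usual formulation (as implemented, e.g., in standard statistics packages), LASSO determines the coefficient vector \(\mathbf{c}\) from the vector of measured energies \(\mathbf{E}\) and the feature matrix \(F\) of the calibration set by solving

\[ \min_{\mathbf{c}} \; \frac{1}{2N}\,\lVert \mathbf{E} - F\mathbf{c} \rVert_2^2 + \alpha\,\lVert \mathbf{c} \rVert_1 , \]

where \(N\) is the number of calibration vectors; the \(\ell_1\) penalty drives unneeded coefficients to exactly zero.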

In addition to the set of calibration vectors, we generated another set of test vectors to validate our model. The validation vectors are generated to mimic real application data. They use various integer and floating-point data types, and a mixture of random distributions with different parameters was used to generate realistic data. Because real application data is often highly correlated, some test vectors use a Gaussian process to provide correlated data.

Table 3. 128B transaction coefficients
Fig. 5. MEMPower energy prediction for store access

Figure 4 shows the prediction error at various values of \(\alpha\). \(\alpha =0.007\) results in the smallest error on the validation set for store transactions. Smaller values of \(\alpha\) overfit the calibration set, while larger values discard important coefficients. Table 3 shows the coefficients; note that the coefficients were calculated per 512 bit flips for numerical reasons. None of the DBI coefficients are used, which indicates that the GPU does not use DBI encoding for stores. The largest coefficient corresponds to a 32-byte wide link. Coefficients for 4- and 8-byte wide links are small. Narrow 1- or 2-byte wide links are not employed. The large coefficient for a 64-byte wide link could be linked to SM-internal power consumption, as the SMs use 16-wide SIMD units with 32 bits per lane.

Fig. 6. MEMPower energy prediction for read access

Fig. 7. Normalized memory channel energy consumption

The heatmap in Fig. 5 shows the prediction accuracy of our model for 128-byte store transactions. If the model offered perfect prediction, all points would lie on the dashed white line. All our predictions are very close to this line, which indicates high prediction accuracy. The RMS error is 0.39 nJ and the relative error is just 3.1%. Smaller transactions use different coefficients; their results are not shown here due to limited space. One interesting result is that register values from disabled threads influence the energy consumption; these register values are likely still transmitted through parts of the interconnect but marked as inactive. Taking data values into account instead of assuming a constant average energy per transaction reduces the average prediction error from 1.7 nJ to just 0.39 nJ.

Figure 6 shows the prediction accuracy of our load model. In general, the model achieves a good prediction accuracy of 9.1% but tends to underestimate the energy required for cheaper transactions. Our load kernel achieves a significantly lower bandwidth than the store kernel because it does not issue the next load transaction before the previous transaction has returned, while stores are pipelined. The lower bandwidth results in a reduced signal-to-noise ratio of the measurements. The load coefficients in Table 3 indicate that load transactions employ DBI encoding. The error improves from 2.3 nJ to 1.43 nJ.

We combined the microbenchmarks with the memory channel identification technique from Sect. 4 to check for energy differences between different memory channels and SMs. We tested the first SM of each GPC and used simplified test vectors to check for changes in our most important coefficients. The normalized results are shown in Fig. 7. We detected only small differences between the different SMs; however, the coefficient for switching activity on a 4-byte wide bus (shown in blue) shows a large variance between the memory channels. Memory transactions to channels 8 to 11 are significantly cheaper than transactions to channels 0 to 3 and 5 to 7, and transactions to channels 3 and 4 are more expensive. As these results are consistent for all four GPCs, the differences are likely the result of slightly different PCB layouts of the memory channels rather than chip-internal routing.

6 Conclusion

In this paper, we have presented the MEMPower power model for GPU memory transactions. Our contributions can be summarized as follows:

  • We presented a novel technique to identify in which memory channel a specific memory address is located.

  • Our microbenchmarks uncovered previously unknown architectural details of GF100-based GPUs.

  • We show that memory channels are not completely identical, but differ in latency and energy consumption.

  • The MEMPower model improves the energy prediction accuracy compared to non-data-dependent models by 37.8% on average for loads and by 77.1% on our validation set for stores.

At peak bandwidth, data-dependent changes in energy can influence the total power consumption of the GTX580 GPU by more than 25 W, or around 10% of the total power. Future work includes software and hardware techniques to reduce this energy consumption. Common but expensive data patterns could be recoded to patterns with lower energy consumption. As memory transactions are significantly more expensive than simple ALU operations, even software solutions could be beneficial. Programmer control over data allocation could allow rarely used data to be placed in memory channels with costlier access and frequently used data in channels with reduced energy consumption.