Introduction

Applications of optimization problems abound in society. Today, there are many examples of optimization-based decision support tools that improve important processes both in industry and the public sector. Such tools are becoming more powerful, more widespread, and more critical to the performance of their users. A successful tool substantially improves key factors for the user organization. Examples are savings in economic and environmental costs, enhanced customer service, higher revenues, less use of critical resources, and improvement of human factors. Vehicle routing software (Partyka and Hall 2012) is but one example.

The impact of such tools is to a large degree dependent on their optimization performance, i.e., the quality of solutions produced within a given response time requirement. Optimization performance is largely determined by the selected optimization method, the implementation of this method on the targeted hardware platform, and the computational performance of the hardware. These three factors are closely intertwined.

More often than not, the optimization problem to be solved is computationally hard. This is particularly true for discrete optimization problems (DOPs). Over the past few decades, there has been a tremendous increase in the ability to solve ever more complex optimization problems. Bixby (2002) reminds us that the performance of commercial Linear Programming solvers increased by a factor of one million in the period 1987–2000. Roughly a factor of 1,000 is due to better methods, and a similar factor stems from the general performance increase of computers.

For many applications, there is still a large gap between the requirements and the performance of today’s optimization-based decision support systems. Improving the optimization performance of existing tools, that is, providing better solutions in shorter time, will give substantial savings. Moreover, applications that are too complex to be effectively solved by the technology of today may come within reach of the optimization technology of tomorrow. More integrated, larger, and richer optimization problems may be solved. Again, further performance increase will result from a combination of better optimization algorithms that are implemented in more efficient ways on more powerful computers.

For many decades, Moore’s law materialized in the form of a doubling of clock speed for commodity processors every 18 months or so. This was the realm of the tongue-in-cheek “Beach law” Footnote 1. Around the year 2000, the architecture of processors for commodity computers started to change. Multi-core processors with an increasing number of cores and higher total theoretical performance than their single-core predecessors emerged, but each core had lower clock speed. Hence, developers of sequential software could no longer enjoy the pleasant, serendipitous effects of the Beach law. From then on, algorithms for computationally hard tasks such as the solution of optimization problems have needed an efficient, task-parallel implementation to fully utilize multiple CPU cores Footnote 2.

In addition, the past decade has seen a drastic improvement in the performance and general programmability of massively parallel stream-processing (data-parallel) accelerators. Data parallelism, also called stream processing, means that each processor performs the same task on different pieces of distributed data. The origin was the graphics processing unit (GPU), a standard component in common PCs. Primarily driven by requirements from the gaming industry, the computational performance of GPUs developed rapidly. Thus, it became more and more interesting to utilize GPUs as accelerators for compute-bound tasks in general purpose computing. This trend became a natural driver for better programmability of GPUs through industry-standard languages and high-quality development tools.

GPUs of today have a large number of relatively simple processors that have general purpose computing capabilities and their architecture supports data parallelism. The theoretical GPU performance has lately increased far more rapidly than the theoretical CPU performance, as illustrated in Fig. 1. The GPU is now regarded as an accelerator to be used in tandem with a multi-core CPU. Leading processor manufacturers have recently developed an integrated multi-core CPU and GPU on a single die.

Fig. 1 Performance of CPUs versus GPUs in terms of theoretical peak performance in single precision, and memory bandwidth. Data constructed from processor specifications as reported by Intel (2012) and NVIDIA (2012)

To fully profit from recent and future hardware developments in modern PC architectures, optimization methods that combine task and data parallelism must be developed. Ideally, such methods should be flexible and self-adaptable to the hardware at hand. The parallel, heterogeneous architecture of modern processors also motivates a fundamental re-thinking of solution methods. Algorithms that are obviously inefficient in a sequential computing model may be optimal on a massively parallel architecture.

This paper has two main goals. First, we provide a tutorial style introduction to the modern PC architecture and how to exploit it through parallel computing. Second, we give a critical survey of the literature on discrete optimization for such architectures with a focus on routing problems. For selected papers, we discuss implementation details and insights. Our intended main audience consists of researchers and practitioners in discrete optimization, routing problems in particular, who are not proficient in modern PC hardware and heterogeneous computing. We hope the paper will serve as a useful basis for increased, high quality research and development efforts in this combined research area of high importance.

The area of GPU-based methods for discrete optimization is still in its infancy. The bulk of the limited literature consists of reports from rather basic implementations of existing optimization methods on the GPU, with measurement of speedup relative to a CPU implementation of unknown quality. This is not necessarily uninteresting. A speedup of existing solution methods has great pragmatic value. It enables resolution of large, complex, and time-critical applications of discrete optimization that are beyond the reach of current technology. Also, it enables more comprehensive and thorough empirical scientific investigations in discrete optimization and, hence, a deeper understanding. However, it is our opinion that research in this area should be performed in a more scientific fashion: with thorough and fair measurement of speedup, and also with focus on the efficiency of the implementation. An important research avenue is the design of novel methods that exploit the full heterogeneity of modern PCs in an efficient, flexible, and possibly self-adaptable way. As far as we can see, there are no such publications in the literature. If this paper inspires research in this direction, a main objective will have been fulfilled. We strongly believe that the potential is huge.

The remainder of Part I of this paper is organized as follows. The “Parallel computing” section gives a brief introduction to parallel computing in general. In “Modern computer architectures”, we describe modern computer architectures with multi-core processors for task parallelism and accelerators for data parallelism (stream processing). Alternative programming environments for such hardware are discussed in “Development of modern GPU technology”. In “Programming example in CUDA”, a simple prototype of a GPU-based local search procedure is presented to illustrate the execution model of GPUs. We proceed in “Development strategies” with guidelines and strategies for optimizing GPU code. For illustrative purposes, we then take a closer look at the performance of GPU codes in “Profiling the local search example” and sum up Part I in “Summary and conclusion”. In Part II (Schulz et al. 2013), we give a survey of the literature on GPU-based methods in discrete optimization, with focus on routing problems.

Parallel computing

The idea of parallel computing dates back to the Italian mathematician Menabrea and his “Sketch of the Analytical Engine Invented by Charles Babbage” (Menabrea 1842). Menabrea’s paper has extensive notes by the now famous translator, Lady Lovelace. In the notes, she wrote what has been recognized as the world’s first computer program. It was not until the late 1960s, however, that computers with multiple processors emerged and parallel computing was realized.

There are several types of parallel computing. Apart from the low-level instruction-level parallelism offered by modern processors, there are two main categories: task parallelism and data parallelism. In task parallelism, different procedures are performed on possibly different sets of data, typically using different processes or threads. Normally, but not necessarily, the parallel threads or processes execute on multiple processors, and there is communication between them. In the basic form of data parallelism, the same procedure, often referred to as the kernel, is executed on multiple data in parallel, on multiple processors. There is also a distinction between fine-grained parallelism, where processes or threads synchronize or communicate many times per second, coarse-grained parallelism, where they communicate less frequently, and embarrassingly parallel computations, where they only rarely need to communicate or synchronize.

Parallel computer systems can be categorized by the nature of their processors, their processor interconnection, their memory, and the communication between the processors. The set of processors may be homogeneous or heterogeneous. They may be integrated on the same chip and communicate via a high-bandwidth bus, as in modern multi-core PC processors, or be physically distributed around the globe and communicate over the Internet, as in grid computing. Main memory may be either shared between the processors or distributed. Computer clusters are groups of fully fledged, typically general purpose, not necessarily identical computers that are coupled together and communicate through a network.

In this paper, we concentrate on modern commodity processors with multiple cores that share memory, and one or more data parallel accelerators with separate memory such as the GPU, as the platform for parallel, heterogeneous computing. There is a substantial literature on scientific computing that exploits such hardware (Brodtkorb et al. 2010; Owens et al. 2008).

Modern computer architectures

From the emergence of the first microprocessor in the 1970s up until 2004, virtually all mainstream computers used a serial execution model, in which one instruction is executed after another. The exponentially growing performance of such CPUs has traditionally come from two main factors: an increasing number of transistors and an increasing frequency. Around 2004, however, we saw an abrupt halt to this growth in serial performance. Increasing the number of transistors yielded only marginal performance increases, and the frequency had reached the physical limits of what the chip could withstand. Since then, we have instead seen an increase in parallelism. Whilst the increasing number of transistors was previously used for executing instructions more efficiently, the extra transistors today are spent on creating multi-core designs.

At the same time as we have seen growing parallelism in CPUs, we have also seen alternative architectures emerge. Around the year 2000, researchers started exploring how GPUs could be used to solve non-graphics problems. GPUs utilize a SIMD (single-instruction-multiple-data) type of execution model. SIMD was originally developed in the 1970s for vector supercomputers Footnote 3. Although SIMD machines that could execute up to 64,000 instructions in parallel were developed, such computers were very specialized and expensive. In comparison, parallel computers based on several main-stream processors running independent tasks offered more flexibility at a lower cost. With the development of GPUs, a cheap, powerful SIMD-based accelerator became easily accessible. Programming the GPU for non-graphics tasks was originally an error prone and cumbersome process, but it showed that GPUs could solve a multitude of problems faster than the CPU. Since then, GPUs have become highly programmable using modern C-based languages, and have received widespread adoption. In fact, three of the world’s five fastest supercomputers today use GPU acceleration (http://www.top500.org/), and there is an increasing number of libraries, such as MAGMA and CULA sparse, and commercial software products, such as Adobe Photoshop and MATLAB, that incorporate GPU acceleration.

The reason for the widespread adoption of GPUs is twofold. The first reason is that GPUs are inexpensive and readily available in everything from laptops to supercomputers. The second reason is that they offer enormous performance, especially when considering performance per watt or performance per dollar. This difference between the GPU and the CPU is due to their differing design intents. The CPU is a highly complex processor, and modern CPUs can have over two billion transistors Footnote 4. However, most of these transistors are spent on caches, complex logic for instruction execution and latency hiding, and operating system functionality, leaving only a small percentage for computational units. GPUs, on the other hand, have up to three billion transistors, a slight increase compared to CPUs, and spend most of these transistors on computational units. This means that GPUs cannot replace CPUs, as they do not contain enough complex functionality, but they can at the same time offer extreme floating point performance. A further difference between these architectures is that CPUs are optimized for single-thread performance, meaning they are very efficient at making one task run quickly. GPUs, on the other hand, are designed for throughput instead of single-thread performance, meaning they can perform a lot of computations quickly, even though each individual computation may be slower.

The most recent trend in modern computer architectures is the incorporation of GPU cores and CPU cores on the same physical chip. This combines the best of both worlds by incorporating traditional CPU cores, which are efficient for serial tasks, and GPU cores, which are efficient for throughput tasks. There are also other alternatives to GPUs for accelerated computing. For example, in 2006 Sony, Toshiba and IBM released the Cell processor (Chen et al. 2007), used in both the PlayStation 3 and the first petaflops supercomputer (Barker et al. 2008). This processor was based on one traditional CPU core coupled with eight lightweight accelerator cores on the same chip, and it delivered unprecedented performance. However, the programming model was cumbersome and has been openly criticized by many, and there has not been an updated version yet, making it a one-off architecture. Another alternative is to use FPGAs (field programmable gate arrays). FPGAs are essentially reprogrammable integrated circuits that offer an extreme performance per watt ratio, as power is only spent on actual computation. However, as with application specific integrated circuits (ASICs), programming them is both cumbersome and error prone, as one has to consider low-level details such as timing. Nevertheless, over the last five years there has been a tremendous improvement in programmability through the development of C-like languages. However, programming FPGAs is still a challenging process.

Development of modern GPU technology

GPUs were originally designed for offloading demanding graphics functions from the CPU to a dedicated co-processor. As such, they originally accelerated a fixed set of graphics operations, such as vertex transformations and lighting calculations of a 3D game world. In the early days of GPU computing, one had to use these graphics-specific APIs, such as OpenGL (Shreiner et al. 2012) or DirectX (Luna 2012), to perform computations; see also Fig. 2. This was a cumbersome and error prone process, as one had to rephrase the problem into operations on graphical primitives. As a trivial example, the addition of two matrices could be performed by creating a window with one pixel per output element, and rendering one quadrilateral that covered the whole window. This quadrilateral would then be textured with two textures, in which the matrix values would be represented as colors, and the GPU would add these colors together, unaware that it was performing a matrix addition. For more complex algorithms, such as matrix multiplication or Gaussian elimination, however, this process becomes quite difficult.

Fig. 2 Evolution of GPU programming languages. Initially, the graphics card was programmed using dedicated graphics languages, but since 2007 general purpose languages, such as CUDA, DirectCompute, and OpenCL, have appeared

The earliest GPUs accelerated a fixed set of graphics functions through the so-called fixed function graphics pipeline, and in the early 2000s parts of this pipeline became programmable, supported by languages such as C for Graphics (Cg) (Fernando and Kilgard 2003). Figure 3 shows this programmable graphics pipeline, in which the input is a set of vertices in 3D space that typically represent triangles of a 3D model. These vertices are first transformed into the so-called clip space, essentially the world as seen from the camera, by the vertex shader Footnote 5. This is a programmable stage, meaning that we can calculate the new position of the vertex using a program. After the vertices have been transformed into clip space, the GPU typically creates triangles from them in the primitive assembly stage, and removes triangles that are not seen by the camera in the primitive processing stage. Then, the GPU converts the triangles into fragments in a process called rasterization. Fragments are candidates for an output pixel, and each triangle that covers a pixel position gives rise to one fragment. This means that we may have multiple fragments per pixel, for example, if two triangles cover the same pixel. All these fragments are then processed by the fragment shader, which determines the color of the fragment using, for example, textures Footnote 6 and lighting calculations. Finally, all the fragments enter the framebuffer operations stage, which determines the final pixel color from all the input fragments. This stage can give rise to transparency by blending the fragments from two overlapping triangles, or simply choose the one closest to the camera; see Fig. 4. The latter is done using the depth test.

Fig. 3 The programmable OpenGL graphics pipeline with programmable stages marked in orange. Input to the pipeline are vertices that often represent triangles in 3D, and the output is a 2D image on screen

Fig. 4 Illustration of the framebuffer operations stage of the graphics pipeline. The left figure uses blending of the primitives for each fragment, and the right selects the fragment closest to the camera using the depth test

In 2007, however, NVIDIA released the first general-purpose language for programming GPUs, called CUDA. This release met the demands of researchers who up until then had either used the graphics APIs or the various abstractions of these, and subsequently we saw a whole new level of interest in GPU computing. Since then, two alternatives to CUDA have emerged, namely OpenCL and DirectCompute. All three are quite similar when it comes to the basic programming concept, which is often referred to as data-parallel execution.

Programming example in CUDA

To illustrate the execution model of GPUs and how they are programmed using CUDA, we present a simple prototype of a local search procedure running on the GPU. This programming example is designed for clarity and used to demonstrate how GPUs work, and thus we have neglected important performance optimizations. A more thorough discussion on algorithm design and optimization can be found in (Schulz 2013).

A local search procedure starts with a given (feasible) initial solution to the problem at hand as the current solution. In our example, it is an instance of the well-known Euclidean Travelling Salesman Problem: Given a number of cities, find a permutation with minimal total traveling cost of visiting all cities in the order of the permutation. Traveling cost between cities is defined as the Euclidean distance. Local search with steepest descent (or best improvement) then examines all solutions that lie within a certain neighborhood of the current solution. The best improving neighboring solution is accepted as the new current solution, and the local search procedure continues in the same way. The procedure stops when there is no improving neighbor, i.e., a local minimum for the defined neighborhood is found.

Typically, the neighborhood is not defined explicitly as a set of solutions, but described implicitly in terms of a type of change to the current solution defined by a neighborhood operator. Given a current solution and the operator, the neighborhood is generated by applying the operator in all possible ways to the solution. Each individual change is called a move. In our example, we use a simple swap neighborhood where a move simply exchanges the position of two specific cities in the permutation. In our representation, we keep the first city fixed to avoid rotating the solution.

The quality of a move can be expressed by the difference, or delta value, between the cost of the neighboring solution the move leads to and the cost of the current solution. A negative delta value thus means that the neighboring solution has a lower cost than the current one, i.e., for minimization problems like the TSP it is better.

Let us start our example by showing how this can be implemented as a traditional CPU algorithm. A fixed random permutation is our initial solution. Let n be the number of cities in the problem instance to be solved, leading to a swap neighborhood size of (n − 1)(n − 2)/2 moves. We systematically generate these moves, evaluate the incremental cost of each, and select the best move. We include a feasibility check of each move to illustrate where such a test can be performed (here on the CPU and later on the GPU), although a swap move for the TSP will always be feasible. After having evaluated the full neighborhood, we apply the best move, or exit if we have found a local minimum.
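A minimal C++ sketch of one such steepest-descent iteration, assuming the current solution is stored as a permutation array solution (with position 0 kept fixed) and the city coordinates in arrays x and y, could look as follows; all identifiers are ours and purely illustrative:

#include <cmath>      // sqrtf
#include <utility>    // std::swap

// Euclidean distance between cities a and b.
static float dist(int a, int b, const float* x, const float* y) {
    float dx = x[a] - x[b], dy = y[a] - y[b];
    return sqrtf(dx * dx + dy * dy);
}

// One steepest-descent iteration over the swap neighborhood.
// Returns false if no improving move exists (a local minimum is reached).
static bool bestImprovementStep(int* solution, int n, const float* x, const float* y) {
    float bestDelta = 0.0f;
    int bestI = -1, bestJ = -1;
    for (int i = 1; i < n - 1; ++i) {                 // position 0 is kept fixed
        for (int j = i + 1; j < n; ++j) {
            int pi = solution[i - 1], ci = solution[i], ni = solution[i + 1];
            int pj = solution[j - 1], cj = solution[j], nj = solution[(j + 1) % n];
            float delta;                              // cost change of swapping positions i and j
            if (j == i + 1) {                         // adjacent positions share an edge
                delta = dist(pi, cj, x, y) + dist(ci, nj, x, y)
                      - dist(pi, ci, x, y) - dist(cj, nj, x, y);
            } else {
                delta = dist(pi, cj, x, y) + dist(cj, ni, x, y)
                      + dist(pj, ci, x, y) + dist(ci, nj, x, y)
                      - dist(pi, ci, x, y) - dist(ci, ni, x, y)
                      - dist(pj, cj, x, y) - dist(cj, nj, x, y);
            }
            // a feasibility check of the move could be placed here (always true for TSP swaps)
            if (delta < bestDelta) { bestDelta = delta; bestI = i; bestJ = j; }
        }
    }
    if (bestI < 0) return false;                      // local minimum: no improving move
    std::swap(solution[bestI], solution[bestJ]);      // apply the best move
    return true;
}

The local search then simply calls bestImprovementStep repeatedly until it returns false.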

This problem is well suited for execution on GPUs due to its highly parallel nature: the evaluation of each move can be performed independently of all other moves. However, finding the move that improves the solution the most is a serial process. Let us start by showing how the evaluation can be done in parallel on the GPU. We first allocate storage space for the solution on the GPU and copy the initial configuration to the GPU. Similarly, we allocate space for the city coordinates on the GPU and copy them from the CPU to the GPU. We can then write a kernel that evaluates the cost of moves and stores the results on the GPU. A kernel is a function that is invoked by a large number of threads in parallel on the GPU. Our approach is to write a kernel in which each thread evaluates a subset of the total number of moves and stores the best move of its subset in main GPU memory (which must be allocated similarly to gpu_solution); a sketch of such a kernel is given below.

In the kernel, the keyword __global__ marks the function as a kernel, and the number of parallel invocations is determined by the grid and block configuration. The built-in CUDA variable blockDim.x contains our one-dimensional block size, threadIdx.x the index of the thread inside its block, and blockIdx.x the index of the block inside the grid. A block is simply a collection of threads, and a grid is a collection of blocks. In our example, we have chosen a total of 8,192 threads split into blocks of 128 threads, giving us a total of 64 blocks. These numbers are somewhat arbitrarily chosen, but still follow some fundamental guidelines: the block size should be a multiple of 32, as the GPU executes 32 threads in SIMD Footnote 7 fashion, and we want enough blocks to occupy all of the 16 multiprocessors on current GPUs.
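The evaluation kernel could, for instance, look like the sketch below. Only gpu_solution appears in the text above; Move, gpuDist, evaluateMoves, and the remaining names are ours, the delta computation is the same as in the CPU sketch, and the mapping of moves to threads is deliberately kept simple compared to a tuned implementation:

struct Move { int i, j; float delta; };

__device__ float gpuDist(const float* x, const float* y, int a, int b) {
    float dx = x[a] - x[b], dy = y[a] - y[b];
    return sqrtf(dx * dx + dy * dy);
}

// Each thread evaluates a subset of the swap moves (here: a strided set of rows of the
// triangular move matrix) and stores its locally best move, one entry per thread.
__global__ void evaluateMoves(const int* solution, const float* x, const float* y,
                              int n, Move* best_moves) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // unique global thread index
    int stride = gridDim.x * blockDim.x;               // total number of threads (8,192 here)
    Move best = { -1, -1, 0.0f };
    for (int i = 1 + tid; i < n - 1; i += stride) {
        int pi = solution[i - 1], ci = solution[i], ni = solution[i + 1];
        for (int j = i + 1; j < n; ++j) {
            int pj = solution[j - 1], cj = solution[j], nj = solution[(j + 1) % n];
            float delta;
            if (j == i + 1) {                          // adjacent positions share an edge
                delta = gpuDist(x, y, pi, cj) + gpuDist(x, y, ci, nj)
                      - gpuDist(x, y, pi, ci) - gpuDist(x, y, cj, nj);
            } else {
                delta = gpuDist(x, y, pi, cj) + gpuDist(x, y, cj, ni)
                      + gpuDist(x, y, pj, ci) + gpuDist(x, y, ci, nj)
                      - gpuDist(x, y, pi, ci) - gpuDist(x, y, ci, ni)
                      - gpuDist(x, y, pj, cj) - gpuDist(x, y, cj, nj);
            }
            // a feasibility test of the move could be inserted here
            if (delta < best.delta) { best.i = i; best.j = j; best.delta = delta; }
        }
    }
    best_moves[tid] = best;                            // one candidate per thread
}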

The next thing we need to do is to reduce the best moves from the 8,192 different subsets into the globally best move, and apply this move. We can do this in another kernel, but this time we only invoke one block consisting of 512 threads. This is because threads within one block can cooperate, whilst different blocks are independent. In the first part of the kernel, we use parallel reduction Footnote 8 in shared memory Footnote 9 to find the best move, and we then apply it. With these two GPU kernels, we can find the best move for the current configuration in parallel and then apply it; what remains is the CPU logic for launching the kernels and stopping execution when no move improves the solution. A sketch of the reduction kernel and this CPU logic is given below.

Both the GPU and the CPU version of this code end up with the same solution in the same number of iterations Footnote 10, but there is a dramatic difference in execution speed. For 1,000 cities, the GPU version takes just over 2 s to find the local minimum, whilst the CPU uses over 175 s to complete the same task, a more than 80-fold increase in speed.
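The reduction-and-apply kernel and the CPU driver could look roughly as follows; applyBestMove, localSearchOnGpu, gpu_done, and the other identifiers are ours, and error checking is omitted for brevity:

__global__ void applyBestMove(int* solution, const Move* best_moves,
                              int num_candidates, int* done) {
    __shared__ Move shared[512];                  // one entry per thread in this block
    int tid = threadIdx.x;

    // Each thread first scans a strided subset of the candidates sequentially.
    Move best = { -1, -1, 0.0f };
    for (int c = tid; c < num_candidates; c += blockDim.x) {
        if (best_moves[c].delta < best.delta) best = best_moves[c];
    }
    shared[tid] = best;
    __syncthreads();

    // Parallel reduction in shared memory to find the globally best move.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && shared[tid + s].delta < shared[tid].delta) {
            shared[tid] = shared[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 applies the winning move, or signals that a local minimum was found.
    if (tid == 0) {
        if (shared[0].i >= 0) {
            int tmp = solution[shared[0].i];
            solution[shared[0].i] = solution[shared[0].j];
            solution[shared[0].j] = tmp;
            *done = 0;
        } else {
            *done = 1;
        }
    }
}

void localSearchOnGpu(const int* cpu_solution, const float* cpu_x, const float* cpu_y, int n) {
    const int num_threads = 8192, block_size = 128;
    int* gpu_solution;  float *gpu_x, *gpu_y;  Move* gpu_best_moves;  int* gpu_done;
    cudaMalloc((void**)&gpu_solution, n * sizeof(int));
    cudaMalloc((void**)&gpu_x, n * sizeof(float));
    cudaMalloc((void**)&gpu_y, n * sizeof(float));
    cudaMalloc((void**)&gpu_best_moves, num_threads * sizeof(Move));
    cudaMalloc((void**)&gpu_done, sizeof(int));
    cudaMemcpy(gpu_solution, cpu_solution, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(gpu_x, cpu_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(gpu_y, cpu_y, n * sizeof(float), cudaMemcpyHostToDevice);

    int done = 0;
    while (!done) {
        evaluateMoves<<<num_threads / block_size, block_size>>>(gpu_solution, gpu_x, gpu_y,
                                                                n, gpu_best_moves);
        applyBestMove<<<1, 512>>>(gpu_solution, gpu_best_moves, num_threads, gpu_done);
        cudaMemcpy(&done, gpu_done, sizeof(int), cudaMemcpyDeviceToHost);
    }
    // ... copy gpu_solution back to the CPU and free the GPU buffers ...
}

In this sketch, keeping the done flag on the GPU and copying back only a single integer per iteration avoids transferring the whole solution between host and device in every iteration.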

Our parallel local search on the GPU was able to achieve an 80-fold speed increase compared to the CPU, a figure that is representative for many publications. However, this “speedup” is nothing more than an indication that the GPU has potential. It is highly likely that both the GPU and the CPU version are operating at only a fraction of peak performance, and it remains a major challenge to optimize both of them. In “Profiling the local search example”, we will show that our approach in fact falls far short of utilizing the full potential of the GPU.

Development strategies

GPU programming differs from traditional multi-core CPU programming, because the hardware architecture is dramatically different. It is rather simple to get started with GPU programming, and it is often relatively easy to get speedups over existing CPU codes. But these first attempts at GPU computing are often sub-optimal, and do not utilize the hardware to a satisfactory degree. Achieving a scalable high-performance code that uses hardware resources efficiently is still a difficult task that can take months and years to master.

In this section, we present techniques for achieving a high resource utilization when it comes to GPUs Footnote 11. These techniques target NVIDIA GPUs using CUDA, but as both the programming model and hardware is similar for other GPUs and languages, many of these techniques are also applicable in a broader context.

The GPU execution model

The execution model of the GPU is based around the concept of launching a kernel on a grid consisting of blocks, as shown in Fig. 5. Each block is composed of a set of threads. All threads in the same block can synchronize and cooperate using fast shared memory. These blocks are executed by the GPU so that a block runs on a single multiprocessor. However, we can have far more blocks than we have multiprocessors, since each multiprocessor can execute multiple blocks in a time-sliced fashion. The grid and blocks can be one-, two-, or three-dimensional, and determine the number of threads that will be used. Each thread has a unique identifier within its block, and each block has a unique identifier within the grid. By combining these two, we get a unique global identifier per thread.
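In code, this combination looks as follows for a one-dimensional configuration; someKernel and its arguments are illustrative only, and the sketch simply doubles each element of an array:

__global__ void someKernel(float* data, int n) {
    // Unique global identifier: the block's offset in the grid plus the
    // thread's index within the block.
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (global_id < n) {
        data[global_id] *= 2.0f;    // each thread works on its own element
    }
}

// Launched on a grid of blocks, e.g.: someKernel<<<64, 128>>>(gpu_data, n);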

Fig. 5 The CUDA concept of a grid, blocks, and threads. The domain consists of distinct blocks, which again are made up of a set of threads that can communicate and cooperate. Each thread in the global grid can be identified uniquely by the use of its block index in combination with its thread index

Latency hiding and thread performance

The GPU uses the massively threaded execution model to hide memory latencies. Even though the GPU has a vastly superior memory bandwidth compared to CPUs, it still takes on the order of hundreds of clock cycles to transfer a single element from main GPU memory. This latency is hidden by the GPU as it automatically switches between threads. Once the current thread stalls on a memory fetch, the GPU activates another waiting thread in a fashion similar to Hyper-Threading (Marr et al. 2002) on Intel CPUs. This strategy is most efficient when there are enough available threads to completely hide the memory latency, however, meaning that we need a lot of threads. As there is a maximum number of threads a GPU can support concurrently, we can calculate how large a percentage of this figure we are using. This is referred to as the occupancy of the GPU, and is a rough measure of how good the GPU program is at hiding memory and other latencies. As a rule of thumb, it is good to keep a relatively high occupancy, but a higher occupancy does not necessarily equate to higher performance: once all latencies are hidden, a higher occupancy may actually degrade performance as it also affects other performance metrics.

Hardware support for multiple threads is available on Intel CPUs as Hyper-Threading, but a GPU thread operates quite differently from these CPU threads. One of the differences from traditional CPU programming is that the GPU executes instructions in a 32-way SIMD fashion, in which the same instruction is simultaneously executed by 32 neighboring threads, called a warp. This is illustrated in Fig. 6, in which different code paths are taken by different threads within one warp. This means that all threads within the warp must execute both parts of the branch, which in the worst case slows down the program by a factor of 32. Conversely, the cost of such an if-statement is minimal when all threads in a warp take the same branch.

Fig. 6 Thread divergence on 32-wide SIMD GPU architectures. All threads perform the same computations, but the result is masked out for the dashed boxes

Sorting is one technique that can be used to avoid expensive branching within a kernel: by sorting the different elements according to the branch, we make sure that the threads within each warp all execute their code without diverging. Another way of preventing branching is to perform the branch once on the CPU instead of once for each warp on the GPU. This can be done, for example, using templates: by replacing the branch variable with a template variable, we can generate two kernels, one for condition true and one for condition false, and let the CPU select the correct kernel (see the sketch below). The use of templates in this example is not particularly powerful, as the overhead of running a simple if-statement in the kernel would be small. When there are a lot of parameters, however, there can be a large performance gain from using template kernels (Harris 2011; Brodtkorb et al. 2012a). Another example of the benefit of kernel template arguments is the ability to specify different shared memory sizes at compile time, thus allowing the compiler to issue warnings for out-of-bounds access. Templates can also be used to perform compile-time loop unrolling, which can have a great performance impact: by generating separate kernels for different for-loop sizes, performance can be greatly improved.
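A hedged sketch of the template technique is given below; the kernel, the stand-in feasibility test, and the launcher are ours and purely illustrative:

// Moving a branch from the GPU to the CPU using a template parameter.
template<bool use_feasibility_check>
__global__ void evaluateKernel(const float* input, float* output, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= n) return;
    float value = input[gid];
    if (use_feasibility_check) {       // resolved at compile time: no run-time branch remains
        value = fmaxf(value, 0.0f);    // stand-in for a more expensive feasibility test
    }
    output[gid] = value * value;
}

// Host side: the branch is taken once on the CPU instead of once per warp on the GPU.
void launchEvaluate(const float* input, float* output, int n, bool feasibility) {
    int block = 128, grid = (n + block - 1) / block;
    if (feasibility) evaluateKernel<true><<<grid, block>>>(input, output, n);
    else             evaluateKernel<false><<<grid, block>>>(input, output, n);
}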

Memory guidelines

The memory wall, in which transferring data to the processor is far more expensive than computing on that data, has limited the performance increase of CPU programs for a long time. It can also be a major problem on GPUs, which makes memory optimizations important. The first rule in optimizing memory is to reuse data and keep it as close as possible to the processor. The memory hierarchy on GPUs consists of three main memories, listed in decreasing order of speed: registers, shared memory, and global memory. The use or misuse of these can often determine the efficiency of GPU programs.

Registers are the fastest memory units on a GPU, and each multiprocessor on the GPU has a large, but limited, register file. This register file is divided amongst threads residing on that multiprocessor, and the registers are private to each thread. If the threads in one block use more registers than are physically available, registers will spill to the L1 cache Footnote 12 and global memory, which means that when you have a high number of threads, the number of registers available to each thread is very restricted. This is one of the reasons why a high occupancy may actually hurt performance. Thus, thread-level parallelism is not the only way of increasing performance; it is also possible to increase performance by decreasing the occupancy to allow more registers per thread.

The second fastest memory on the GPU is the shared memory. Shared memory is a very powerful tool in GPU computing because it allows all threads in a block to share data. Shared memory can be thought of as a kind of programmable cache, or scratchpad, in which the programmer is responsible for placing data explicitly. However, as with caches, its size is limited (up to 48 KB), which can limit the number of threads per block. Shared memory is physically organized into 32 banks that serve one warp with data simultaneously. For full speed, each thread must access a distinct bank, which can be achieved, for example, if the threads access consecutive 32-bit elements.
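The sketch below illustrates conflict-free use of shared memory: consecutive threads read and write consecutive 32-bit elements, so the threads of a warp touch 32 distinct banks. The kernel is ours and assumes, for brevity, that n is a multiple of the block size of 128 threads:

__global__ void reverseBlocks(const float* input, float* output, int n) {
    __shared__ float tile[128];                        // one 32-bit element per thread
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Consecutive threads write consecutive elements: distinct banks, full speed.
    tile[threadIdx.x] = input[gid];
    __syncthreads();                                   // wait until the whole tile is filled

    // Reuse the staged data within the block; here we simply write each block out reversed.
    // The reversed shared-memory read still maps the threads of a warp to distinct banks.
    output[gid] = tile[blockDim.x - 1 - threadIdx.x];
}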

The third type of memory on the GPU is the global memory. This is the main memory of the GPU, and even though it has an impressive bandwidth, it has a high latency as discussed earlier. The latencies are preferably hidden by a large number of threads, but there are other pitfalls. First of all, just as with CPUs, the GPU transfers full cache lines Footnote 13 across the memory bus (called coalesced reads). Transferring a single element, therefore, consumes the same bandwidth as transferring a full cache line as a rule of thumb. To achieve full memory bandwidth, we should therefore program the kernel such that warps access contiguous regions of memory. Furthermore, we want to transfer full cache lines, which is done by starting at a quadword boundary (the start address of a cache line) and transferring full quadwords (128 bytes) as the smallest unit. The address alignment is typically achieved by padding arrays. Alternatively, for non-cached loads, it is sufficient to align to word boundaries and transfer words (32 bytes). To fully occupy the memory bus, the GPU also uses memory parallelism, in which a large number of outstanding memory requests are used to occupy the bandwidth. This is both a reason for the high memory latency and a reason for the high bandwidth utilization.
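One way of obtaining such padding is to let the CUDA runtime pad each row of a 2D array via cudaMallocPitch; the sketch below (scaleRows and the launch configuration are ours) accesses the padded rows so that consecutive threads read consecutive elements:

__global__ void scaleRows(float* data, size_t pitch, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Each row starts pitch bytes after the previous one, on an aligned boundary.
        float* row = (float*)((char*)data + y * pitch);
        row[x] *= factor;            // consecutive threads access consecutive elements
    }
}

void scaleMatrix(int width, int height) {
    float* d_data;
    size_t pitch;                    // padded row length in bytes, chosen by the runtime
    cudaMallocPitch((void**)&d_data, &pitch, width * sizeof(float), height);
    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scaleRows<<<grid, block>>>(d_data, pitch, width, height, 2.0f);
    cudaFree(d_data);
}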

In addition to the above-mentioned memory areas, NVIDIA GPUs of the recent Fermi architecture have hardware L1 and L2 caches. The L2 cache size is fixed and shared between all multiprocessors on the GPU, whilst the L1 cache is per multiprocessor. The L1 cache can be configured to be either 16 or 48 KB, at the expense of shared memory. Caching of global loads in L1, on the other hand, can be turned on or off at compile time, or by using inline PTX assembly instructions in the kernel. The benefit of turning this caching off is that the GPU is allowed to transfer smaller amounts of data than a full cache line, which will often improve the performance of sparse and other random-access algorithms.

In addition to the L1 and L2 caches, the GPU also has caches related to traditional graphics functions. The constant memory cache is one example, which is typically used for arguments sent to a CUDA kernel. It is a cache tailored for broadcast, in which all threads in a block access the same data. The GPU also has a texture cache that can be used to accelerate reading global memory. However, the L1 cache has a higher bandwidth, so the texture cache is mostly useful if combined with texture functions, such as linear interpolation between elements.

Further guidelines

The CPU and the GPU operate asynchronously because they are different processors. This enables simultaneous execution on both processors, which is a key ingredient of heterogeneous computing: the efficient use of multiple different computational resources by letting each resource perform the tasks for which it is best suited. In the CUDA API, this is exposed as streams. Each stream is an in-order queue of operations that will be performed by the GPU, including memory transfers and kernel launches. A typical use-case is that the CPU schedules a memory copy from the CPU to the GPU, a kernel launch, and a copy of results from the GPU to the CPU. The CPU then continues processing while the GPU executes its operations, and synchronization is only performed when the GPU results are needed. There is also support for multiple streams, which can execute simultaneously as long as they obey the order of operations within their respective streams. Current GPUs support up to 16 concurrent kernel launches (NVIDIA 2011), which means that we can have both data parallelism, in terms of a computational grid of blocks, and task parallelism, in terms of different concurrent kernels. GPUs furthermore support overlapping kernel execution with memory copies between the CPU and the GPU. This means that we can simultaneously copy data from the CPU to the GPU, execute 16 different kernels, and copy data from the GPU back to the CPU, if all these operations are scheduled properly to different streams. In practice, however, it can be a challenge to achieve such high levels of task parallelism.
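A minimal sketch of the typical use-case with a single CUDA stream is shown below; processChunk, doOtherWorkOnCpu, the buffers, and the launch configuration are illustrative placeholders, and the host buffers should be page-locked (see below) for the copies to overlap with computation:

void runAsync(float* h_input, float* h_output, float* d_input, float* d_output,
              size_t bytes, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueue an upload, a kernel launch, and a download in the same in-order stream.
    cudaMemcpyAsync(d_input, h_input, bytes, cudaMemcpyHostToDevice, stream);
    processChunk<<<64, 128, 0, stream>>>(d_input, d_output, n);   // placeholder kernel
    cudaMemcpyAsync(h_output, d_output, bytes, cudaMemcpyDeviceToHost, stream);

    doOtherWorkOnCpu();               // the CPU continues while the GPU works through its queue

    cudaStreamSynchronize(stream);    // block only when the GPU results are actually needed
    cudaStreamDestroy(stream);
}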

When transferring data between the CPU and the GPU, it can be beneficial to use so-called page-locked memory. Page-locked memory is guaranteed to be contiguous and resident in physical RAM (not swapped out to disk, for example), and is thus not pageable by the operating system. However, page-locked memory is scarce and rapidly exhausted if used carelessly. A further optimization for page-locked memory is to use write-combining allocation. This disables CPU caching of a memory area that the CPU will only write to, and can increase the bandwidth utilization by up to 40 % (NVIDIA 2011). It should also be noted that enabling error-correcting code (ECC) memory will negatively affect both the bandwidth utilization and available memory, as ECC requires extra bits for error control.
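Page-locked memory is allocated through the CUDA runtime rather than with malloc; a minimal sketch, with the write-combining flag used for a buffer the CPU only writes to (the buffer names are illustrative):

void allocatePinnedBuffers(size_t bytes) {
    float* h_upload;    // the CPU only writes this buffer; the GPU only reads it
    float* h_results;   // the CPU also needs to read this buffer efficiently
    cudaHostAlloc((void**)&h_upload,  bytes, cudaHostAllocWriteCombined);
    cudaHostAlloc((void**)&h_results, bytes, cudaHostAllocDefault);

    // ... use the buffers with cudaMemcpyAsync for truly asynchronous transfers ...

    cudaFreeHost(h_upload);
    cudaFreeHost(h_results);
}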

CUDA supports a unified address space, in which the physical location of a pointer is automatically determined. That is, data can be copied from the GPU to the CPU (or the other way round) without specifying the direction of the copy. While this might not seem like a great benefit at first, it greatly simplifies code needed to copy data between CPU and GPU memories, and enables advanced memory accesses. The unified memory space is particularly powerful when combined with mapped memory. A mapped memory area is a continuous block of memory that is available directly from both the CPU and the GPU at the same time. When using mapped memory, data transfers between the CPU and the GPU are automatically executed asynchronously with kernel execution when possible.
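A hedged sketch of mapped (zero-copy) memory combined with the unified address space is given below; someKernel, d_other, and the sizes are illustrative placeholders:

void mappedMemoryExample(size_t bytes, int n, float* d_other) {
    cudaSetDeviceFlags(cudaDeviceMapHost);       // must be set before the CUDA context is created

    float* h_buffer;
    cudaHostAlloc((void**)&h_buffer, bytes, cudaHostAllocMapped);

    float* d_alias;                              // device-side alias of the same memory area
    cudaHostGetDevicePointer((void**)&d_alias, h_buffer, 0);
    someKernel<<<64, 128>>>(d_alias, n);         // the GPU reads/writes host memory directly

    // With unified addressing, the direction of a copy can be left to the runtime:
    cudaMemcpy(d_other, h_buffer, bytes, cudaMemcpyDefault);

    cudaFreeHost(h_buffer);
}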

The most recent version of the CUDA API has become thread safe (NVIDIA 2011), so that one CPU thread can control multiple CUDA contexts (e.g., one for each physical GPU), and conversely multiple CPU threads can share a single CUDA context. The unified memory model together with the new thread safe context handling enables much faster transfers between different GPUs. The CPU thread can simply issue a direct GPU–GPU copy, bypassing a superfluous copy in CPU memory.
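A minimal sketch of such a direct copy between two GPUs, issued from a single CPU thread, might look as follows (buffer names and sizes are illustrative, and error checking is omitted):

void copyBetweenGpus(size_t bytes) {
    float *d_src, *d_dst;
    cudaSetDevice(0);
    cudaMalloc((void**)&d_src, bytes);           // buffer in the memory of GPU 0
    cudaDeviceEnablePeerAccess(1, 0);            // allow direct access to GPU 1 where supported
    cudaSetDevice(1);
    cudaMalloc((void**)&d_dst, bytes);           // buffer in the memory of GPU 1

    // Copy from GPU 0 to GPU 1; with peer access available, no CPU staging buffer is used.
    cudaMemcpyPeer(d_dst, 1, d_src, 0, bytes);
}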

Profile driven development

A quote often attributed to Donald Knuth is that “premature optimization is the root of all evil” (Knuth 1974). The lesson in this statement is to make sure that the code produces the correct results before trying to optimize it, and optimize only where it will matter. Optimization always starts with identifying the major bottlenecks of the application, as performance will increase the most when removing these. However, locating the bottleneck is hard enough on a CPU, and can be even more difficult on a GPU. Optimization should also be considered a cyclic process, because after having found and removed one bottleneck, we need to repeat the process to find the next bottleneck in the application. This cyclic optimization can be repeated until the kernel operates close to the theoretical hardware limits or all optimization techniques have been exhausted.

To identify the performance bottleneck in a GPU application, it is important to choose an appropriate performance metric, and compare attained performance to the theoretical peak performance. When programming GPUs there are several bottlenecks one can encounter. For a GPU kernel there are essentially three main bottlenecks: the kernel may be limited by instruction throughput, memory throughput, or latencies. It may, however, also be that CPU–GPU communication and synchronization is a bottleneck, or that other overheads dominate the run-time.

When profiling a CUDA kernel, there are two main approaches to locate the performance bottleneck. The first and most obvious is to use the CUDA visual profiler. The profiler is a program that samples different hardware counters, and the correct interpretation of these numbers is required to identify bottlenecks. The second option is to strategically modify the source code in an attempt to single out what takes most time in the kernel.

The visual profiler can be used to identify whether a kernel is limited by bandwidth or by arithmetic operations. This is done by simply looking at the instruction-to-byte ratio, or in other words, finding out how many arithmetic operations your kernel performs per byte it reads. The ratio can be found by comparing the instructions issued counter (multiplied by the warp size, 32) to the sum of the global store transactions and L1 global load miss counters (both multiplied by the cache line size, 128 bytes), or directly through the instruction/byte counter. We then compare this ratio to the theoretical ratio for the specific hardware the kernel is running on, which is available in the profiler as the ideal instruction/byte ratio counter. Footnote 14

Unfortunately, the profiler does not always report accurate figures as the number of load and store instructions may be lower than the actual number of memory transactions (e.g., it depends on address patterns and individual transfer sizes). To get the most accurate figures, we can compare the run-time of different versions of the kernel: the original kernel, one Math version in which all memory loads and stores are removed, and one Memory version in which all arithmetic operations are removed, see Fig. 7. If the Math version is significantly faster than the original and Memory kernels, we know that the kernel is memory bound, and conversely for arithmetics. This method has the added benefit of showing how well memory operations and arithmetic operations overlap.

Fig. 7 Run-time of modified kernels which are used to identify bottlenecks: (top left) a well-balanced kernel, (top right) a latency bound kernel, (bottom left) a memory bound kernel, and (bottom right) an arithmetic bound kernel. “Total” refers to the total kernel time, whilst “Memory” refers to a kernel stripped of arithmetic operations, and “Math” refers to a kernel stripped of memory operations. It is important to note that latencies are part of the measured run-times for all kernel versions

To create the Math kernel, we simply comment out all load operations and move every store operation inside a conditional that will always evaluate to false. We do this to fool the compiler so that it does not optimize away the parts we want to profile, since the compiler will strip away all code not contributing to the final output to global memory. However, to make sure that the compiler does not move the computations inside the conditional as well, the result of the computations must also be used in the condition, as shown in Fig. 8. Creating the Memory kernel, on the other hand, is much simpler. Here, we can simply comment out all arithmetic operations, and instead sum all data read by the kernel and write out this sum as the result.
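A sketch of what such a Math-only modification could look like is shown below; the kernel and its arithmetic are ours, and flag is a kernel argument that the host always sets to 0, as in Fig. 8:

__global__ void someKernelMathOnly(float* output, int n, int flag /* always 0 at run time */) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= n) return;

    // float value = input[gid];              // memory loads commented out
    float value = (float)gid;                 // stand-in initial value

    for (int k = 0; k < 100; ++k) {           // the arithmetic we actually want to time
        value = value * 0.99f + 0.01f;
    }

    // The store is never executed because flag is 0, but since the condition also depends
    // on value, the compiler must keep the computation above and cannot sink it into the branch.
    if (flag * (int)value != 0) {
        output[gid] = value;
    }
}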

Fig. 8 Compiler trick for the arithmetic-only kernel. By adding the kernel argument flag (which we always set to 0), we prevent the compiler from optimizing away the if-statement, and simultaneously disable the global store operations

If control flow or addressing is dependent on data in memory, as is often the case in discrete optimization, the method becomes less straightforward and requires special care. A further complication with modifying the source code is that the register count can change, which again can alter the occupancy and thereby invalidate the measured run-time. This can be solved by increasing the shared memory parameter in the launch configuration of the kernel, someKernel<<<grid_size, block_size, shared_mem_size, ...>>>(...), until the occupancy of the unmodified version is matched. The occupancy can easily be examined using the profiler or the CUDA occupancy calculator.

When a kernel appears to be well balanced (i.e., neither memory nor arithmetics appear to be the bottleneck), it does not necessarily mean that it operates close to the theoretical performance numbers. The kernel can be limited by latencies, which typically are caused by problematic data dependencies or the inherent latencies of arithmetic operations. Thus, if your kernel is well balanced, but operates at only a fraction of the theoretical peak, it is probably bound by latencies. In this case, a reorganization of memory requests and arithmetic operations can be beneficial: the goal should be to have many outstanding memory requests that can overlap with arithmetic operations.

Debugging

Debugging GPU programs has become almost as easy as debugging traditional CPU programs as more advanced debugging tools have emerged. Many CUDA programmers have encountered the “unspecified launch failure”, which used to be notoriously hard to debug. Such errors were typically only found by either modification and experimenting, or by careful examination of the source code. Today, however, there are powerful CUDA debugging tools for commonly used operating systems.

CUDA-GDB, which is available for Linux and Mac, can step through a kernel line by line at the granularity of a warp, e.g., to identify where an out-of-bounds memory access occurs, in a similar fashion to debugging a CPU program with GDB. In addition to stepping, CUDA-GDB also supports breakpoints, variable watches, and switching between blocks and threads. Other useful features include reports on the currently active CUDA threads on the GPU, reports on current hardware and memory utilization, and in-place substitution of changed code in a running CUDA application. The tool enables debugging on hardware in real time, and the only requirement for using CUDA-GDB is that the kernel is compiled with the -g and -G flags. These flags make the compiler add debugging information to the executable and force all variables to be spilled to memory.

Parallel NSight is a plug-in for Microsoft Visual Studio and Eclipse which offers conditional breakpoints, assembly level debugging, and memory checking directly in the IDE. It furthermore offers an excellent profiling tool, and is freely available to developers. Debugging used to require two distinct GPUs (one for display, and one for running the actual code to be debugged), but this requirement has been lifted as of version 2.2. Support for Linux and the Eclipse development IDE was also released with version 2.2, making Parallel NSight an excellent tool on all platforms.

Profiling the local search example

For illustrative purposes, we will profile the local search code described in “Programming example in CUDA” and show how we determine its performance using Parallel NSight and the Visual Profiler tool. It is often good to get an overview of the application by generating a timeline of the different GPU operations and measure a set of metrics for the kernels, as shown in Fig. 9. The Visual Profiler also offers the option of displaying averaged measurements for each kernel, as shown in Fig. 10. It is often possible to identify application bottlenecks by examining these different measurements, and a few selected measurements are presented in Table 1. The table shows that the neighborhood evaluation kernel takes the most time, and if we double the problem size it completely dominates the run-time. This means that we should focus our optimization efforts on this kernel first.

Fig. 9 NSight generated timeline, which shows how long the different parts of the code take

Fig. 10 Result of profiling the local search example on a GeForce GTX 480 with the NVIDIA Visual Profiler

Table 1 Selected profiling results for the local search example on a GeForce GTX 480

The first thing we can look at for this kernel is the achieved FLOPS counter, which indicates a performance of 142 gigaflops. The hardware maximum is over 1 teraflop, meaning we are way off. However, if our kernel is memory bound, this might still be ok, as measuring gigaflops for a memory bound kernel makes little sense.

We have to acknowledge that our problem is quite small in terms of memory usage. For each node in our problem, we need 12 bytes of storage (4 bytes for its place in the solution, 2 × 4 bytes for the 2D coordinates), yielding a total of 12 KB. The GTX480 has an L1 cache which holds 16 KB by default, more than enough to hold our whole problem. This is clearly visible in the profiling by a 100 % L1 Global Hit Rate counter as shown in Table 1. Each value is only read once from the global memory (DRAM), which explains the very low DRAM read throughput and efficiency. Unfortunately, this does not mean that our memory access pattern is well designed in general. In fact, reading the coordinates of a node in the cost computation means reading data at a random location, as the node is specified by a permutation (the solution). The effects of this can be observed when studying the instruction replay overhead. If threads within a warp cause non-coalesced reads, several instructions are necessary to read all needed data from memory, or as in our case, the L1 cache. For a coalesced read only one instruction would be necessary.

Each thread writes only the best move of the ones it has evaluated to global memory. As neighboring threads write to neighboring memory locations, we get coalesced writes and an excellent global memory store efficiency of 100 %. There is also good news regarding warp divergence: the branch efficiency counter is at 99.8 %, which means that, of all branches taken, virtually none were divergent.

Summing up, we know now that our example has several shortcomings that limit its performance, and is operating far from the peak performance of the hardware. Some of these issues are simpler to address than others, but all require some redesign of the original local search algorithm. Therefore, we should revisit the algorithm with our new insight to achieve higher performance. In the Appendix, we apply some simple adjustments to the kernels to achieve a considerable improvement.

Summary and conclusion

Over the last decade, we have seen GPUs go from curious hardware exploited by a few researchers to mainstream processors that power the world’s fastest supercomputers. The field of discrete optimization has also joined the trend, with an increasing level of research on mapping solution methods for these problems to the GPU. In the foreseeable future, it is clear that GPUs and parallel computers will play an important role in all of computational science, including discrete optimization, and it is important for researchers to consider how to utilize these kinds of architectures.

This paper is Part I in a series of two papers. Here, we introduce graphics processing units in the context of discrete optimization. We give a short historical introduction to parallel computing, and the GPU in particular. We also show how local search can be written as a parallel algorithm and mapped to the GPU with an impressive speed-up, yet with room for major improvements, as the implementation falls far short of utilizing the full potential of the GPU. We also discuss general development and optimization strategies for writing algorithms that may approach peak performance. In Part II (Schulz et al. 2013), we give a literature survey focusing on the use of GPUs for routing problems.