PALLAS: Mapping Applications onto Manycore

  • Michael Anderson
  • Bryan Catanzaro
  • Jike Chong
  • Ekaterina Gonina
  • Kurt Keutzer
  • Chao-Yue Lai
  • Mark Murphy
  • Bor-Yiing Su
  • Narayanan Sundaram


Parallel programming using current state-of-the-art software engineering techniques is hard. Expertise in parallel programming is necessary to deliver good application performance; however, domain experts very often lack that expertise. To drive computer science research toward effective use of the available parallel hardware platforms, it is very important to make parallel programming systematic and productive. We believe that the key to designing parallel programs in a systematic way is software architecture, and the key to improving the productivity of developing parallel programs is software frameworks. The basis of both is design patterns and a pattern language.

We illustrate how design patterns can be used to architect a wide variety of real applications, including image recognition, speech recognition, optical flow computation, video background subtraction, compressed sensing MRI, computational finance, video games, and machine translation. By exploring software architectures of our applications, we achieved 10x-140x speedups in each of the applications. We also illustrate how parallel programs can be developed productively using application frameworks and programming frameworks. We achieve 50%-100% of the performance of hand-optimized code while using four times fewer lines of code.


Keywords: PALLAS, Software Architecture, Application Framework, Programming Framework, Design Pattern, Pattern Language


PALLAS stands for Parallel Applications, Libraries, Languages, Algorithms, and Systems. We believe that productive development of applications for an emerging generation of highly parallel microprocessors is the preeminent programming challenge of our time. Consequently, our goal is to enable the productive development of efficient parallel applications by domain experts, not just parallel programming experts. We believe that the key to the design of parallel programs is software architecture, and that software frameworks [1] are the key to their efficient implementation. In our approach, the basis of both is design patterns and a pattern language. Borrowed from civil architecture, a design pattern refers to a generalizable solution to a recurring design problem. A pattern language is simply an organized way of navigating through a collection of design patterns to produce a design (Fig. 4.1). The computational elements of Our Pattern Language [2, 3] are built up from a series of computational patterns drawn largely from thirteen motifs [4] (Fig. 4.1(b)). We see these as the fundamental software building blocks that are then composed using the structural patterns of Our Pattern Language drawn from common software architectural styles [5], such as pipe-and-filter (Fig. 4.1(a)). A software architecture is then the hierarchical composition of computational and structural patterns, which we subsequently refine using lower-level design patterns.
Fig. 4.1

Our Pattern Language

This software architecture and its refinement, although useful, are entirely conceptual. To implement the software, we rely on frameworks. We define a pattern-oriented software framework as an environment built on top of a software architecture in which customization is only allowed in harmony with the framework's architecture. For example, if the framework is based on pipe-and-filter, then customization involves only modifying pipes or filters. We see application developers being served by application frameworks. These application frameworks have two advantages. First, the application programmer works within a familiar environment using concepts drawn from the application domain. Second, we prevent the expression of many annoying problems of parallel programming such as non-determinism, races, deadlock, and starvation.
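
To make the idea of restricted customization concrete, the following Python sketch shows a toy pipe-and-filter framework. It is illustrative only: the class name `Pipeline` and its `run` method are hypothetical and are not taken from any framework described in this chapter. The only thing a user can customize is the list of filters; the way data flows between them is fixed by the framework.

```python
class Pipeline:
    """Toy pipe-and-filter framework (hypothetical, for illustration)."""

    def __init__(self, *filters):
        # The framework accepts only filters: callables mapping one item to one item.
        for f in filters:
            if not callable(f):
                raise TypeError("every filter must be callable")
        self._filters = list(filters)

    def run(self, items):
        # The pipes are owned by the framework: each filter consumes the
        # output stream of the previous one, in order.
        stream = iter(items)
        for f in self._filters:
            stream = map(f, stream)
        return list(stream)
```

Because users never touch the connections between stages, a framework of this kind is free to run the stages concurrently without exposing races or deadlock to the application programmer.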

To test and demonstrate our approach, we have applied pattern-oriented parallel software development to a broad range of applications in computer vision, speech recognition, quantitative finance, games, and natural language translation. We first used patterns and Our Pattern Language as conceptual tools to aid in the design and implementation of the applications. This work is described in Section 4.2. As our understanding of the use of patterns matured, we used patterns to define pattern-oriented frameworks for a speech recognition application and a programming framework for data parallelism. This work is described in Section 4.4.

4.2 Driving Applications

We describe eight applications from a broad set of domains, including image recognition, speech recognition, optical flow computation, video background subtraction, compressed sensing Magnetic Resonance Imaging (MRI), computational finance, video games, and machine translation. In each of these applications we demonstrate how our pattern-based approach establishes a common vocabulary, aids in understanding parallelism opportunities and bottlenecks, and leads to the development of efficient parallel implementations of the underlying algorithms. We first describe the overall software architecture of an application, then illustrate how pattern decomposition helps highlight parallelism opportunities and bottlenecks, and finally discuss the execution speedups achieved.

4.2.1 Content-Based Image Retrieval

The Content-Based Image Retrieval (CBIR) application is used to select images that match a set of training samples from a large image database. As shown in Fig. 4.2, the user selects some exemplar images as input, and the CBIR application then collects features from images in the image database, trains the classifier based on the chosen exemplar images, and exercises the classifier to find images that match the characteristics of the exemplars. For example, a user may want to find all the photos of roses in a large database of pictures of flowers. If there are incorrect classification results, the user can provide feedback to the system, and the system will retrain and reexamine the image database to generate more accurate results.
Fig. 4.2

Architecture of the CBIR application

A wide variety of features can be used to describe an image, such as SIFT, SURF, HOG, MSER, color, texture, edges, contours, etc. We studied the state-of-the-art contour detection algorithm, the gPb algorithm [6]. This algorithm finds the boundaries between semantically meaningful objects in images without prior knowledge of the content of the images. The computations of the gPb algorithm can be architected using the pipe-and-filter pattern as shown in Fig. 4.2. Gradients on color, brightness, and texture represent local cues of the image contours. The eigenvectors of an affinity matrix on pair-wise pixel similarity represent the global cues of the image contours. The gPb algorithm combines the local cues and the global cues to find image contours. Each computation in the gPb algorithm can be further architected by structural and computational patterns. For example, the k-means algorithm can be described by the iterative refinement pattern: each iteration computes sample means using the dense linear algebra pattern and computes sample labels using the map-reduce pattern.
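
The k-means decomposition above can be sketched in a few lines of Python. This is a minimal serial illustration written with the standard library only; the function name `kmeans` and its parameters are assumptions made for this example and do not come from the gPb implementation.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):  # iterative refinement
        # Labels (map-reduce): for each point, map over the centers to get
        # squared distances and reduce with min. Points are independent.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Means (dense linear algebra): column-wise average of each cluster.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers, labels
```

The labeling step has no dependencies between points, which is where most of the available parallelism lies.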

Given the feature vectors of images, we need to distinguish images that the end user recognizes as matches from the images the end user does not recognize as matches. Many machine learning algorithms can achieve this goal, such as k-nearest neighbor, naïve Bayes, logistic regression, support vector machines, decision trees, AdaBoost, etc. We studied a particular state-of-the-art classifier approach known as support vector machines (SVM) [7]. The training phase and the classification phase of the SVM algorithm are architected in Fig. 4.2. The iterative refinement pattern is used to describe the computation of the training phase. Within the iterator, the map-reduce pattern is used to describe the computation of updating the optimality conditions, selecting the working set, and solving the quadratic programming problem. For the classification phase, the dense linear algebra pattern is used to represent the computation of dot products on vectors, and the map-reduce pattern is used to represent the computation of the kernel function, summation, and scaling.
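
The classification phase can be illustrated with a short Python sketch. It assumes a radial basis function kernel and hypothetical names (`dot`, `rbf_kernel`, `svm_decision`, `classify`); the kernel choice and the parameter `gamma` are assumptions for this example, not details taken from the implementation in [10].

```python
import math

def dot(u, v):
    # Dense linear algebra: dot product of two feature vectors.
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(x, z, gamma=0.5):
    # ||x - z||^2 expressed through dot products, then the kernel function.
    sq_dist = dot(x, x) - 2.0 * dot(x, z) + dot(z, z)
    return math.exp(-gamma * sq_dist)

def svm_decision(x, support_vectors, coeffs, bias, kernel=rbf_kernel):
    # Map: one scaled kernel evaluation per support vector (independent).
    # coeffs[i] is assumed to hold alpha_i * y_i from training.
    terms = [c * kernel(sv, x) for sv, c in zip(support_vectors, coeffs)]
    # Reduce: sum the contributions and add the bias.
    return sum(terms) + bias

def classify(x, support_vectors, coeffs, bias):
    return 1 if svm_decision(x, support_vectors, coeffs, bias) >= 0 else -1
```

Each kernel evaluation is independent of the others, so the map step scales with the number of support vectors and the number of images being classified.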

By examining efficient parallel algorithms for performing image contour detection, along with careful implementation on highly parallel, commodity processors from Nvidia, we reduced the runtime of the gPb algorithm from 237 seconds on the Intel Core i7 920 (2.66 GHz) platform to 1.8 seconds on the Nvidia GTX 280 GPGPU, a speedup of about 130x, with uncompromised results [8]. We implemented the SVM using Platt's Sequential Minimal Optimization algorithm with an adaptive first- and second-order working set selection heuristic, in parallel on the Nvidia GeForce 8800 GTX GPGPU, and achieved 9-35x speedup in the training phase and 81-138x speedup in the classification phase against LIBSVM [9] on the Intel Core 2 Duo (2.66 GHz) platform [10].

4.2.2 Optical Flow and Tracking

Optical flow computation is a crucial first step for almost all dense motion feature extraction in video. Optical flow models have become far more reliable over the years, and recent parallel hardware, especially GPUs, also offers the potential to meet the speed requirements of high-throughput video analysis. Currently the dominant class of optical flow techniques is based on extensions of the variational model of Horn and Schunck. In particular, we use the Large Displacement Optical Flow (LDOF) algorithm [11], which integrates discrete point matches with a continuous energy formulation in order to obtain accurate flow for large displacements of small structures. This helps us track objects such as limbs in human motion or balls in sports videos much more accurately than other techniques.

A quite general numerical scheme that can efficiently compute solutions of essentially all variational models is based on a coarse-to-fine warping scheme, where each level provides an update by solving a nonlinear system given by the Euler-Lagrange equations followed by fixed-point iterations and a linear solver [12]. This strategy is very general and accommodates even non-convex regularizers (we use a convex regularizer that approximates the L1 norm). On serial hardware, a very efficient and quite straightforward linear solver is given by Gauss-Seidel with successive overrelaxation. On parallel hardware, however, there is a need to investigate which algorithms perform better for optical flow problems. It is necessary to fully characterize the properties of the matrices involved; in this case, they are positive definite, enabling the use of the conjugate gradient algorithm.
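
For reference, the basic (unpreconditioned) conjugate gradient iteration can be sketched as follows. This is a serial, standard-library illustration, not the GPU solver discussed in this section; the names `conjugate_gradient` and `tridiagonal_matvec` are hypothetical, and the tridiagonal test matrix is a stand-in for the much larger sparse systems arising in optical flow.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A given as a matvec."""
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x for the initial guess x = 0
    p = list(r)                 # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)          # the only place the matrix is touched
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

def tridiagonal_matvec(v):
    # Multiplies by tridiag(-1, 2, -1), a small positive definite example matrix.
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]
```

Every step is a sparse matrix-vector product, a dot product, or a vector update, all of which are data parallel; this is what makes the method attractive on GPUs compared with the inherently sequential sweeps of Gauss-Seidel.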

Figure 4.3(a) shows the overall architecture of the application. There are three main iterative refinement blocks: one for the coarse-to-fine refinement, one for the fixed-point iterations (linearizing the non-linear regularizer), and one for the iterative sparse linear solver. Our implementation uses the preconditioned conjugate gradient method to solve the linear system of equations (third loop) for all the fixed points at all scales. Compared to other parallel solvers such as red-black relaxation, the preconditioned conjugate gradient algorithm performs more work per iteration (2.1x more) but requires fewer iterations (about 3x fewer), yielding about 40% better performance overall. For all solvers, it was necessary to take advantage of the sparse matrix structure (block penta-diagonal) to achieve high memory throughput. Compared to a serial Gauss-Seidel solver running on an Intel Core2 Quad Q9550, our conjugate gradient solver on an Nvidia GTX480 achieves a 47x speedup. We achieve a 78x speedup on the full application at an equivalent error rate on the Middlebury optical flow dataset [13]. The runtime of LDOF on a pair of 640x480 frames has been brought down from over 2 minutes to 1.8 seconds, making large displacement optical flow practical to use in a wide variety of motion estimation tasks.
Fig. 4.3

(a) Architecture of the large displacement optical flow application. (b) High-level description of the point tracker based on large displacement optical flow

By using the efficient optical flow solver, we developed a point tracking system [14]. Figure 4.3(b) shows the high level description of the tracker. The optical flow computation dominates the runtime of the tracker, taking up 93% of the total runtime even after parallelization. Compared to the most commonly used KLT tracker [15], we can track three orders of magnitude more points while achieving 46% better accuracy. Compared to the Particle Video tracker [16], we achieved 66% better accuracy while running an order of magnitude faster. In addition to the improved accuracy and speed, the tracker based on LDOF also provides improved tracking density and the ability to track large displacements. This has been possible through algorithmic exploration and an efficient parallel implementation of the large displacement optical flow algorithm on highly parallel processors (GPUs).

4.2.3 Stationary Video Background Subtraction

Stationary-video background subtraction is the problem of extracting the moving parts from a video where the camera does not move for the duration of the video, as is the case in surveillance videos. One tool used in solving this problem is a singular value decomposition (SVD) of a matrix with one column for each frame and a row for each pixel of the video [17]. The SVD operation makes it possible to extract the parts of the video that are common to every frame (i.e. the background).

The shape of this video matrix is extremely tall and skinny, because the number of pixels in a frame is typically far greater than the number of frames. For matrices of this shape, the SVD can be found very efficiently by first computing the QR decomposition of the matrix. The QR decomposition is an operation in which a matrix is factored into a product of two matrices, Q and R, where Q is orthogonal and R is upper triangular. The main computation in this approach to stationary-video background subtraction is therefore the QR decomposition of a tall-skinny matrix.

Figure 4.4 shows the architecture used for a stationary-video background subtraction algorithm. The video matrix is the main data structure. We can apply geometric decomposition to break down the data structure into many small blocks that fit in cache, as shown in Fig. 4.4. A recent algorithm for the QR decomposition, Communication-Avoiding QR [18], allows us to factor the entire matrix using only a sequence of small operations on these blocks. We get good performance because we are able to operate on each block in parallel in each processor's cache. Using this approach on an Nvidia GTX480 card, we achieve a 27x speedup for the entire application compared to using Intel's Math Kernel Library.
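
The block-wise idea can be illustrated with a one-level tall-skinny QR reduction in Python. This sketch is an assumption-laden simplification: it computes only the R factor, uses modified Gram-Schmidt for the small factorizations, and requires every block to have at least as many rows as columns and full column rank. The names `r_factor` and `tsqr_r` are hypothetical, and the algorithm in [18] may differ in its details.

```python
def r_factor(block):
    """Upper-triangular R of a QR factorization via modified Gram-Schmidt."""
    m, n = len(block), len(block[0])
    cols = [[block[i][j] for i in range(m)] for j in range(n)]  # column copies
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = sum(v * v for v in cols[j]) ** 0.5
        q = [v / R[j][j] for v in cols[j]]          # j-th orthonormal column
        for k in range(j + 1, n):
            R[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [c - R[j][k] * a for c, a in zip(cols[k], q)]
    return R

def tsqr_r(A, block_rows):
    """R factor of a tall-skinny matrix A using small per-block factorizations."""
    # Geometric decomposition: split the rows into blocks that fit in fast memory.
    blocks = [A[i:i + block_rows] for i in range(0, len(A), block_rows)]
    # Each block is factored independently, so this step can run in parallel.
    local_rs = [r_factor(b) for b in blocks]
    # Stack the small R factors and factor the stack once more.
    stacked = [row for R in local_rs for row in R]
    return r_factor(stacked)
```

The singular values of the tall matrix equal those of the small n-by-n matrix R, so the SVD can then be completed on R at negligible cost.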
Fig. 4.4

Architecture of stationary-video background subtraction application

4.2.4 Automatic Speech Recognition

The Automatic Speech Recognition (ASR) application takes a speech audio waveform as input and produces a sequence of words representing the most likely utterance the speaker intended to communicate. As shown in Fig. 4.5, ASR does this by first extracting acoustic features from the waveform and then decoding the feature sequence to produce a word sequence.
Fig. 4.5

Architecture of the ASR application

The feature extraction process involves a sequence of signal processing steps in the form of the pipe-and-filter pattern. The filters aim to remove variations among speakers and room acoustics and to preserve the features most useful for distinguishing word sequences. The decoding process performs statistical inference on a hidden Markov model using the Viterbi algorithm. The inference is performed by comparing each extracted feature to a speech model, which is trained off-line using a set of powerful statistical learning techniques. The module that performs inference, shown in Fig. 4.5 as the Inference Engine, has an iterative outer loop (iterative refinement pattern) that handles one input speech feature vector at a time. In each of the loop iterations, the algorithm performs a sequence of data-parallel steps (pipe-and-filter pattern). Modern manycore processors take advantage of the parallelism within each algorithmic step to accelerate the inference process (map-reduce pattern).
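
The loop structure of Viterbi decoding can be seen in the following Python sketch on a small, dense hidden Markov model. It is illustrative only: a production inference engine operates on a large irregular network rather than dense dictionaries, and the function name `viterbi` and its argument layout are assumptions made for this example.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely state sequence of an HMM, computed in the log domain."""
    # best[s]: log-probability of the best path that ends in state s.
    best = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    back = []
    for obs in observations[1:]:       # outer loop: one observation at a time
        prev, best, pointers = best, {}, {}
        for s in states:               # independent per state: data parallel
            # Reduce over predecessors to find the best incoming transition.
            p = max(states, key=lambda q: prev[q] + log_trans[q][s])
            best[s] = prev[p] + log_trans[p][s] + log_emit[s][obs]
            pointers[s] = p
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]
```

The outer loop is inherently sequential, while the work inside each iteration is a map over states followed by a max-reduction, which is the parallelism a manycore implementation exploits.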

An implementation of such an inference engine involves a parallel graph traversal through an irregular graph-based knowledge network with millions of states and arcs (graph algorithm pattern). The challenge is not only to define a software architecture that exposes sufficient fine-grained application concurrency but also to efficiently synchronize between an increasing number of concurrent tasks and to effectively utilize parallelism opportunities in today’s highly parallel processors.

Chong, You et al. [19, 20] demonstrated substantial speedups of 3.4x on Intel Core i7 and 10.5x on NVIDIA GTX280 compared to a highly optimized sequential implementation on Core i7 without sacrificing accuracy. The parallel implementations contain less than 2.5% sequential overhead, promising scalability and significant potential for further speedup on future platforms. Further parallel optimizations were demonstrated in Chong et al. [21] through speech model transformations using domain knowledge and exploring other speech models on the latest manycore platforms [22].

Additional opportunities on the basis of such parallel implementations include accelerating the Hidden Markov Model (HMM) training algorithm, for example the Baum-Welch algorithm, and performing realtime multistream decoding, e.g. for audiovisual speech recognition. Both of these applications can make use of the highly parallelized likelihood computations that have already been optimized for the purpose of ASR, and can be expected to obtain similar gains in performance due to their highly regular and parallel structure.

4.2.5 Compressed Sensing MRI

Compressed Sensing is an approach to signal acquisition that enables high-fidelity reconstruction of certain signals sampled significantly below the Nyquist rate. The signal to be reconstructed must satisfy certain "sparsity" conditions [23], but these conditions are satisfied at least approximately by signals in many applications. Compressed Sensing has been applied to speed up the acquisition of MRI data [24], increasing the applicability of MRI to pediatric medicine. Compressed Sensing reconstruction is computationally difficult, requiring the solution of a non-linear L1 minimization problem [23]. L1 minimization problems are more difficult than, for example, least squares problems due to the non-differentiability of the L1 objective function. The difficulty is compounded by the size of the problems to be solved: we must determine the value of each voxel in a 3D MRI scan, so our L1 minimization typically involves billions of variables. This computational difficulty leads to long runtimes, limiting the clinical applicability of the technique. MRI images must be available interactively (i.e. in a few minutes) to the radiologist performing the examination, so that time-critical decisions can be made about further images to be taken.

Our solver for this problem implements a Projections Onto Convex Sets (POCS) method, which is shown in Fig. 4.6: we iteratively project the solution onto convex sets representing sparse signals and the feasible region of the minimization problem. Since these sets are convex and their intersection is nonempty, the procedure is guaranteed to converge. While L1 minimization problems can be cast as linear programs (LP) and solved by e.g. the Simplex method or Interior Point methods, the high numerical accuracy of LP solvers is unnecessary in our case. The POCS algorithm is much faster, and produces sufficiently high-quality images. Even so, the original Matlab implementation required approximately 30 seconds per 2D slice; a full 3D scan typically has several hundred slices, and an entire scan required hours to reconstruct. The L1 minimization is necessarily preceded by a "calibration" phase, which requires the solution of a number of least-squares problems. We solve these systems directly, using standard linear algebra libraries. The solutions of these systems provide a "self-consistency" model that incorporates information from up to 32 redundant acquisitions (channels) of the MR image.
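
The alternating-projection structure of POCS can be illustrated on a toy problem in Python. The two sets used here, a hyperplane and a box, are stand-ins chosen because their projections are one-liners; they are not the sparsity and data-consistency sets of the MRI solver, and all function names are hypothetical.

```python
def project_hyperplane(x, a, b):
    # Closest point to x on the hyperplane {y : a . y = b}.
    scale = (sum(ai * xi for ai, xi in zip(a, x)) - b) / sum(ai * ai for ai in a)
    return [xi - scale * ai for xi, ai in zip(x, a)]

def project_box(x, lo, hi):
    # Closest point to x in the box {y : lo <= y_i <= hi}: clip each coordinate.
    return [min(max(xi, lo), hi) for xi in x]

def pocs(x0, a, b, lo, hi, iterations=200):
    """Alternate projections onto two convex sets with a nonempty intersection."""
    x = list(x0)
    for _ in range(iterations):
        x = project_box(project_hyperplane(x, a, b), lo, hi)
    return x
```

Each projection is cheap and element-wise, which is why the method trades the high accuracy of a general LP solver for many inexpensive, highly parallel iterations.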
Fig. 4.6

Architecture of the compressed sensing MRI application

We have produced two highly efficient parallel implementations of the POCS algorithm. Our evaluation platform is a 12-core 2.67 GHz Intel Xeon E5650 machine with four 30-core, 8-wide-SIMD 1.3 GHz Nvidia Tesla C1060 GPGPUs. For typically sized datasets with 8 channels, our OpenMP-parallelized calibration runs in 20 seconds (140 ms per slice) on average. Forty iterations of our OpenMP POCS solver, sufficient for most datasets to converge, run in 334 seconds (2.1 seconds per slice) using all 12 CPU cores. On a single GPGPU, our CUDA POCS solver runs in 75 seconds (480 ms per slice), 4.5x faster than the version on 12 CPU cores. Using multiple GPUs we get nearly linear speedup: the POCS solver runs in 20 seconds. Our GPU wavelet implementation is bandwidth-inefficient: a more highly optimized implementation will be up to 50% faster. Also, multi-GPU parallelization will provide additional 3-4x speedup. Using our OpenMP calibration and our CUDA POCS solver results in 40-second reconstruction times: this is the first clinically feasible compressed sensing MRI reconstruction implementation [25].

4.2.6 Market Value-at-Risk Estimation in Computational Finance

The proliferation of algorithmic trading, derivative usage and highly leveraged hedge funds has necessitated the estimation of market Value-at-Risk (VaR) in future market scenarios to measure the severity of potential financial losses. VaR reports are typically generated daily to summarize the vulnerabilities to market movements in the positions financial business units take. They are the central tenet of financial institutions’ market risk management operations.

VaR estimation is a direct application of the Monte Carlo computation pattern. It broadly entails simulating the effects of thousands to millions of potential market scenarios to collect statistics about the future portfolio loss distribution. Each VaR simulation involves four steps, as illustrated in Fig. 4.7(a). There exist significant parallelism opportunities for executing each of the steps over all the scenarios (Fig. 4.7(b)). We use the loss distribution of the resulting portfolio valuation to estimate the exposure of a portfolio to a severe loss. The VaR is typically taken to be the value associated with a specific frequency in the range from the 1-in-100 to the 1-in-20 loss event.
Fig. 4.7

Architecture of the market value-at-risk estimation in computational finance application

For an implementation optimized on a highly parallel platform such as today’s GPUs [26], we use the geometric decomposition pattern to partition the workload into blocks such that a block of scenarios can fit into a given size of fast memory in an implementation platform. Specifically, a small block of steps (1) and (2) can be merged and made to fit in lowest level cache, and a block of all four steps should fit into the memory on the device in a GPU-based platform (Fig. 4.7(c)).

We evaluated a standard implementation of quadratic VaR estimation using a portfolio-based approach, where financial outcomes for all instruments in a portfolio are aggregated from quadratic approximations of risk factor losses. For step (1), we use a Sobol quasi-random number sequence; for step (2), we use the Box-Muller transformation; for step (3), we use the quadratic estimation for the loss estimation; and for step (4), we use parallel reduction within a block and sequential reduction across blocks.
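
The four steps can be sketched serially in Python. Several things here are assumptions made for illustration: the standard library pseudo-random generator is swapped in for a Sobol sequence, the quadratic loss is written as -(delta . z + 0.5 z^T Gamma z) with hypothetical inputs `delta` and `gamma`, and the final step sorts the losses to read off a quantile instead of performing a blocked parallel reduction.

```python
import math
import random

def box_muller(u1, u2):
    # Step 2: turn two uniform samples into two standard normal samples.
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

def simulate_losses(delta, gamma, scenarios, seed=0):
    rng = random.Random(seed)
    n = len(delta)
    losses = []
    for _ in range(scenarios):            # scenarios are independent
        z = []
        while len(z) < n:
            u1 = 1.0 - rng.random()       # Step 1: uniform in (0, 1], avoids log(0)
            u2 = rng.random()
            z.extend(box_muller(u1, u2))
        z = z[:n]
        # Step 3: quadratic approximation of the portfolio loss (assumed form).
        linear = sum(d * zi for d, zi in zip(delta, z))
        quad = sum(z[i] * gamma[i][j] * z[j] for i in range(n) for j in range(n))
        losses.append(-(linear + 0.5 * quad))
    return losses

def value_at_risk(losses, confidence=0.99):
    # Step 4: aggregate the scenario losses into a single quantile estimate.
    ordered = sorted(losses)
    index = min(len(ordered) - 1, int(confidence * len(ordered)))
    return ordered[index]
```

With `confidence=0.99` the result corresponds to the 1-in-100 loss event, and with `confidence=0.95` to the 1-in-20 loss event.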

For a portfolio with up to 4096 risk factors, we achieved an 8.21x speedup on the GPU compared to an algorithmically equivalent multicore CPU implementation. Steps (1) and (2) attained a speedup of 500x by more effectively utilizing computation and memory locality through the geometric decomposition pattern. Step (3) attained a speedup of 5x, and is limited by the capabilities of Basic Linear Algebra Subprograms (BLAS) implementations. Step (4) takes a proportionally negligible runtime. Noting the key computation bottleneck in the loss estimation, we reformulated step (3) algorithmically and gained a further 60x speedup in loss estimation.

4.2.7 Games

A typical video game is a composition of several large subsystems such as physics, artificial intelligence (AI), and graphics. Subsystems can be large reusable libraries or “engines”, or functions created for a specific game. A primary concern of the game designer is how to efficiently manage communication between subsystems. The communication becomes more complex if the subsystems are to be run in parallel on a multicore device, requiring special coordination or locking for shared data. Also, each subsystem should have a well defined interface so it can easily be swapped with another similar library if necessary [27].

A solution to this problem is the application of the puppeteer pattern (Fig. 4.8). A puppeteer sits above the subsystems and acts as an intermediary for communication between subsystems. Suppose the AI subsystem changes a character’s direction and needs to inform the Physics subsystem, which will in turn update the character’s position and velocity. Instead of interfacing directly with the Physics subsystem, the AI subsystem informs the puppeteer of the change. The puppeteer passes on the information to any interested subsystems. The main benefit of the puppeteer pattern is that it reduces the total number of subsystem interfaces, which allows for greater flexibility and scalability.
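
The AI-to-Physics example above can be sketched in Python as follows. Realizing the puppeteer with a topic-based publish/subscribe interface is an assumption of this sketch, as are all class, method, and topic names; the pattern itself only requires that subsystems talk to the intermediary and never to each other.

```python
class Puppeteer:
    """Intermediary that routes messages between subsystems."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        # Forward the message to every subsystem interested in this topic.
        for callback in self._subscribers.get(topic, []):
            callback(payload)

class Physics:
    def __init__(self, puppeteer):
        self.velocity = {}
        puppeteer.subscribe("direction_changed", self.on_direction_changed)

    def on_direction_changed(self, event):
        dx, dy = event["direction"]
        speed = event["speed"]
        self.velocity[event["character"]] = (dx * speed, dy * speed)

class AI:
    def __init__(self, puppeteer):
        self._puppeteer = puppeteer  # the AI knows the puppeteer, not Physics

    def turn(self, character, direction, speed):
        self._puppeteer.publish(
            "direction_changed",
            {"character": character, "direction": direction, "speed": speed},
        )
```

Replacing the physics engine then only requires the new subsystem to subscribe to the same topics; the AI subsystem is unchanged.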
Fig. 4.8

Architecture of the game application

Graphics Processing Units were developed specifically to enable more computation in the graphics subsystem. For other subsystems to take advantage of these parallel devices, they must be decomposed into their patterns and sped up individually. A simple AI subsystem, for example, is a collection of character state machines that read and write from a set of shared data. The shared data could contain the locations and orientations of the characters. This system can be architected structurally using the Agent & Repository pattern. Since the AI state machines operate independently from one another within a single frame, the task parallelism implementation pattern can be applied to speed up computation on parallel hardware.

Video games have a real-time constraint, the frame rate. The worst-case amount of computation must meet this constraint for a user with the minimum required hardware, unless the computation does not affect game play and can be optionally skipped. Another challenge is effectively managing access to the scene graph, the main data structure containing the game’s state. Data transfer can also be prohibitively expensive, especially when moving between devices.

4.2.8 Machine Translation

Machine translation (MT) is one of the classic problems in computer science and a vast area of research in the field of natural language processing (NLP). High-quality and fast MT enables a variety of exciting applications, such as real-time translation in foreign environments on handheld devices as well as defense and surveillance applications. A fast machine translator will also enable people speaking different languages to communicate and share resources on the Internet.

The most prevalent approach to machine translation is the CKY algorithm [28, 29], which is composed of three phases: using a translation model to translate phrases, combining the translated phrases in a bottom-up fashion, and extracting the most likely translation with a top-down traversal. The architecture of the MT application is summarized in Fig. 4.9, where the three phases are represented by the pipe-and-filter pattern. The bottleneck of the CKY algorithm is the second phase, in which we examine the probabilities of all possible combinations of the translated phrases using an N-gram language model; this computation can be represented by the dynamic programming pattern. We parallelized the second phase of the CKY algorithm on both GPU and CPU. When translating 1000 sentences with an average length of 28 words from Spanish to English, we achieved 1.8x speedup on GTX 480 and 2.3x speedup on Core i7 using 4 threads. When translating 350 sentences with a length of more than 40 words, we achieved 2.3x speedup on GTX 480 and 2.6x speedup on Core i7 using 4 threads. This shows that our parallelization works better with longer sentences because more concurrency is available.
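
The bottom-up dynamic programming structure of the second phase can be illustrated with a toy probabilistic CKY chart in Python. This sketch combines adjacent spans using a small hypothetical grammar in place of a translation model and N-gram language model, so it shows only the shape of the computation; the function name `cky_best` and its data layout are assumptions, and the input is assumed to be non-empty.

```python
def cky_best(words, lexicon, rules):
    """Best score per symbol for the whole input (toy probabilistic CKY).

    lexicon: {word: {symbol: probability}}
    rules:   {(left_symbol, right_symbol): {parent_symbol: probability}}
    """
    n = len(words)
    chart = {}  # (start, end) -> {symbol: best probability}
    for i, w in enumerate(words):
        chart[(i, i + 1)] = dict(lexicon.get(w, {}))
    for length in range(2, n + 1):            # bottom-up over span lengths
        for start in range(n - length + 1):   # cells of one length are independent
            end = start + length
            cell = {}
            for split in range(start + 1, end):
                for ls, lp in chart[(start, split)].items():
                    for rs, rp in chart[(split, end)].items():
                        for parent, p in rules.get((ls, rs), {}).items():
                            score = lp * rp * p
                            if score > cell.get(parent, 0.0):
                                cell[parent] = score
            chart[(start, end)] = cell
    return chart[(0, n)]
```

All cells of a given span length depend only on shorter spans, so they can be filled concurrently, and longer sentences produce more such cells per level, which is consistent with the better speedups observed on longer sentences.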
Fig. 4.9

Architecture of the MT application

4.2.9 Summary

In this section, we have explored eight applications from a variety of domains, and demonstrated how patterns can serve as a shared vocabulary that allows software developers to quickly articulate and communicate the architecture of a piece of software. We have also touched on how patterns provide a set of known tradeoffs to inform software developers of potential bottlenecks in a design. These known tradeoffs help software developers identify key design decisions impacting the performance of an application. In the next section, we provide some perspective on the parallel speedups achieved in these applications.

4.3 Perspectives on Parallel Performance

When one writes parallel software, performance considerations are always close at hand. This is natural, since one could always forgo parallelization and use a sequential implementation, were it not for performance requirements. These performance considerations raise many questions: How can we tell if a program has been successfully parallelized? How can we compare performance between parallel platforms? How generally can we extrapolate from one performance claim to performance projections for another algorithm or architecture? Varying assumptions and perspectives lead to a surprising diversity of opinions on these questions, which is why it is important to be explicit about the assumptions one makes when making and evaluating performance claims.

We consider performance results under the following three guidelines, which we will explain, along with their justifications and implications.
  1. Perfect linear speedup, under strong scaling, is not a necessary condition for successful parallelization.
  2. The most useful kind of performance information comes from measured performance of a real application, running on real hardware.
  3. Some algorithms are inherently more difficult to parallelize than others.


4.3.1 Linear Scaling Not Required

In the past, when one evaluated parallel software, it was important to achieve linear speedup under strong scaling, meaning that if one doubled the number of processors, keeping the problem size the same, the computation should take half the runtime. This was primarily due to economic reasons. Since a computer with twice as many cores cost at least twice as much as a smaller computer, in order to recoup one’s investment, linear scaling was required.

The situation has now changed, since we integrate large numbers of cores on a single die. Consider the status quo before the advent of on-die parallelism. Processor vendors created new microarchitectures, spending ever larger amounts of transistors on increasingly sophisticated single-thread processors. However, it was widely known that new microarchitectures did not provide performance gains in proportion to their increased complexity. Pat Gelsinger, of Intel, famously stated that processor performance increases only with the square root of transistor count [30]. Although the industry did not see linear increases in performance with respect to transistor count, the resulting performance gains realized through the uniprocessor era were still sufficient to propel the industry forward, providing end users new capabilities through increased performance.

On-die parallelism has exposed architectural complexity to the programmer as increased core counts. Today, increases in transistor count, to a first-order approximation, are accompanied by linear increases in exposed parallelism, although not by linear increases in cost, thanks to Moore’s law. Accordingly, sublinear performance scaling as we increase the number of transistors, and hence the number of cores, should still provide end-users with the increased capabilities they have come to expect from the computer industry. In addition, workload sizes tend to scale as problems get harder, making parallelization easier to use in practice than Amdahl’s law and strong scaling assumptions would suggest [31]. We do not need to apply parallelism to every computation, only to those which are computationally intensive, which tend to have better parallelization characteristics because they are larger problems. In summary, when evaluating the success of the parallelization of a particular piece of software, we believe it is important to remember that the economics of parallelism today have made it possible for even modestly parallelized software to be successful.
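
The contrast between strong scaling and scaled workloads can be made concrete with a small calculation. The following is a minimal sketch, not a measurement from our applications: the function names and the 5% serial fraction are assumptions chosen for illustration. It evaluates Amdahl's law for a fixed problem size and Gustafson's scaled speedup [31] for a problem whose parallel part grows with the core count.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Strong scaling: the problem size stays fixed as cores are added."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def gustafson_speedup(serial_fraction: float, cores: int) -> float:
    """Scaled speedup: the parallel part of the work grows with the core count.

    Here serial_fraction is the serial share of the runtime observed on the
    parallel machine, following Gustafson's formulation.
    """
    return serial_fraction + (1.0 - serial_fraction) * cores


if __name__ == "__main__":
    s = 0.05  # assumed serial fraction, for illustration only
    for p in (2, 8, 32, 128):
        print(f"{p:4d} cores: Amdahl {amdahl_speedup(s, p):7.2f}x, "
              f"Gustafson {gustafson_speedup(s, p):7.2f}x")
```

Under the fixed-size assumption the speedup can never exceed the reciprocal of the serial fraction, whereas the scaled-workload view keeps growing with the core count. Which model applies depends on whether the workload grows with the machine.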

4.3.2 Measure Real Problems on Real Hardware

It is often tempting to examine the parallel performance of the kernels of an application. They capture the heavy computational load of the application, and so their performance is critical. However, excessive focus on the kernels can be a mistake, since the glue that holds an application together can quickly become a bottleneck when the kernels are composed to form an application. Data structures often have to be transformed between kernels, serial work must be done to decide how the application should proceed, and kernels must coordinate to ensure correct results. Accordingly, the most important performance data is obtained from complete applications, taking into account the composition of the entire application.

It is also important to examine realized, delivered application performance on concrete hardware, rather than comparing peak kernel-performance claims across various hardware platforms and trying to generalize and extrapolate expected performance. Peak, theoretical numbers are useful bounds, but they can be distracting. Most computations are not as easy to parallelize as kernels like Linpack [32], even though the kernels provide bounds on application performance. Some parallel platforms are significantly more brittle than others, in the sense that they may do very well on isolated kernels while their general performance is fairly poor. In the end, the most important performance results concern complete applications on concrete hardware; all other performance results are useful primarily as bounds.

4.3.3 Consider the Algorithms

Successful parallelization requires consideration of the algorithms being parallelized. This is important in two senses. Firstly, we must realize that certain algorithms are harder to parallelize than others. Algorithms which require a lot of data sharing between threads, have unpredictable memory access patterns, or are characterized by very branchy control flow, are often inherently more difficult to parallelize than others. Some algorithms are embarrassingly sequential. Some are mostly sequential, and can be parallelized only through heroics that often result in modest performance gains despite significant software complexity. For this reason, it’s important not to compare speedup results for one algorithm versus another, if the algorithms accomplish different tasks. One should not expect all algorithms to parallelize with the same efficiency.

Secondly, when parallelizing an application, it is often useful to rethink the algorithms involved. Sometimes it is better to use algorithms which do more work, but are more parallelizable. And of course, if rethinking the algorithm leads to algorithmic variations which improve parallel as well as sequential efficiency, those improvements should be capitalized on.
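
A standard illustration of trading extra work for parallelism is the prefix sum (scan). The sketch below is an assumption-laden toy, not code from our applications: `sequential_scan` performs one addition per element but each addition depends on the previous one, while `step_parallel_scan` follows the well-known step-doubling scheme, in which every update within a step is independent of the others. Both functions run sequentially here and simply count additions and steps.

```python
from typing import List, Tuple


def sequential_scan(xs: List[int]) -> Tuple[List[int], int]:
    """Inclusive prefix sum as a dependent chain: n additions, n dependent steps."""
    out, total, adds = [], 0, 0
    for x in xs:
        total += x
        adds += 1
        out.append(total)
    return out, adds


def step_parallel_scan(xs: List[int]) -> Tuple[List[int], int, int]:
    """Inclusive prefix sum by step doubling: more additions, few dependent steps."""
    cur = list(xs)
    adds, steps, offset = 0, 0, 1
    while offset < len(cur):
        # Each entry of nxt reads only from cur, so all updates in this step
        # are independent and could be executed concurrently.
        nxt = [cur[i] + cur[i - offset] if i >= offset else cur[i]
               for i in range(len(cur))]
        adds += len(cur) - offset
        cur, offset, steps = nxt, offset * 2, steps + 1
    return cur, adds, steps


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    print(sequential_scan(data))
    print(step_parallel_scan(data))
```

For eight elements the step-doubling version performs 17 additions instead of 8, but needs only 3 dependent steps. Whether that trade pays off depends on how many processing elements are available.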

4.3.4 Summing Up

At the end of the day, we parallelize applications because the increased performance leads to increased capabilities for end-users. Ultimately, we parallelize in order to solve bigger and harder problems, continuing to realize the full performance provided by Moore’s law in real applications.

4.4 Patterns to Frameworks

We define a software architecture as a hierarchical composition of structural and computational patterns. A pattern-oriented framework is a software environment (e.g. Ruby on Rails) that is based on a particular software architecture (e.g. Model View Controller) and in which all user customization must be in harmony with that software architecture. In other words, only particular customization points within the software architecture (e.g. elements of the Controller) are available for end-users to customize. Patterns and pattern-oriented frameworks help application developers prototype parallel software quickly and explore the software architectural design space rapidly. Two types of frameworks are being developed to target different developer usage models. Application frameworks provide an efficient reference implementation in an application domain, along with a set of extension points that allow customization of functions in selected modules without jeopardizing the efficiency of the underlying infrastructure. Programming frameworks provide a set of flexible tools to take advantage of parallel scalability in hardware without the burden of particular platform details. We motivate the need for these two types of frameworks and illustrate how they can be used.

4.4.1 Application Frameworks

Developing an efficient parallel application is often a significant undertaking. It requires not only a deep understanding of an application domain, but also advanced programming techniques for a parallel implementation platform. A deep understanding of an application domain enables domain experts to discover parallelization opportunities and make application-level design trade-offs to meet the requirements of the end user. Advanced programming techniques allow parallel programming experts to exploit the parallelization opportunities to utilize available parallel resources and navigate various levels of synchronization scope of an implementation platform.

In Automatic Speech Recognition (ASR) inference engine development, application domain knowledge includes topics such as: pruning heuristics to reduce required computation while maintaining recognition accuracy, and recognition-network construction techniques to handle periods of silence between word utterances. Advanced programming techniques include designing data structures for efficient vector processing, constructing program flows to minimize expensive synchronizations, and efficiently utilizing the atomic-operations supported on the implementation platform.

With the increasing complexity of parallel systems, domain experts often must make application-level design trade-offs without a full view of the parallel performance implications. On the other hand, parallel programming experts may not be aware of application-level design alternatives that move computation and synchronization away from the performance bottlenecks they discover.

With Our Pattern Language, an application domain expert can quickly gain insights into potential parallel performance implications of a design by architecting it using the structural and computational patterns and becoming aware of the trade-offs governing these patterns. For the most commonly recurring compositions of patterns in a domain, we can construct application frameworks pre-optimized for various parallel platforms. An example of such an application framework is proposed for the ASR application domain in Fig. 4.10. The application framework is based on an efficient parallel implementation of large vocabulary continuous speech recognition that achieved over 11x speedup over an optimized sequential implementation on a CPU [21].

Fig. 4.10 Application framework for the ASR application

The application framework for ASR is hierarchical, with the top level containing the Feature Extractor and the Inference Engine as fixed components (Fig. 4.10(a)). Users can customize the input format, intermediate data format, recognition network format, and output format according to a specific end-user usage model. The Feature Extractor component is a pipe-and-filter pattern-based framework in which the filters can be customized according to the needs of the end application (Fig. 4.10(b)). The Inference Engine component contains an Inference Engine Framework with a fixed structure of sequential steps wrapped in an iterative loop implementing the Viterbi algorithm (Fig. 4.10(c)). The computation within each step can be customized to incorporate many variations of the application.
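
The division between fixed structure and customization points described above can be sketched in a few lines. The following is a hypothetical toy, not the ASR framework itself: every class, function, and plug-in name is an assumption, and the "inference" loop only accumulates a score rather than implementing the Viterbi algorithm. It shows a pipe-and-filter extractor whose filters are plug-ins, feeding an engine whose loop structure is fixed while the per-step computations are supplied by the user.

```python
from typing import Callable, Iterable, List, Sequence

Frame = List[float]
Filter = Callable[[Frame], Frame]


class FeatureExtractor:
    """Pipe-and-filter: the pipeline structure is fixed, the filters are plug-ins."""

    def __init__(self, filters: Sequence[Filter]) -> None:
        self._filters = list(filters)

    def run(self, frame: Frame) -> Frame:
        for apply_filter in self._filters:
            frame = apply_filter(frame)
        return frame


class InferenceEngine:
    """Fixed sequence of steps inside an iterative loop; each step is a plug-in."""

    def __init__(self, observe: Callable[[Frame], float],
                 update: Callable[[float, float], float]) -> None:
        self._observe = observe
        self._update = update

    def run(self, features: Iterable[Frame], state: float) -> float:
        for feature in features:                # one iteration per input frame
            score = self._observe(feature)      # step 1: score the observation
            state = self._update(state, score)  # step 2: fold it into the state
        return state


def recognize(frames: Iterable[Frame], extractor: FeatureExtractor,
              engine: InferenceEngine) -> float:
    """Top level: extractor feeding engine, in a fixed order."""
    return engine.run((extractor.run(f) for f in frames), state=0.0)


# Toy plug-ins standing in for real signal processing and inference steps.
def remove_mean(frame: Frame) -> Frame:
    mean = sum(frame) / len(frame)
    return [v - mean for v in frame]


def halve(frame: Frame) -> Frame:
    return [0.5 * v for v in frame]


def energy(frame: Frame) -> float:
    return sum(abs(v) for v in frame)


def accumulate(state: float, score: float) -> float:
    return state + score


if __name__ == "__main__":
    extractor = FeatureExtractor([remove_mean, halve])
    engine = InferenceEngine(observe=energy, update=accumulate)
    print(recognize([[1.0, 3.0], [2.0, 6.0]], extractor, engine))
```

A user of such a framework writes only the plug-ins. The order of components and the loop structure, which are where a tuned parallel implementation would live, are not open to modification.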

For application domain experts, an application framework serves to restrict implementations to a software architecture that is known to be efficient while still providing ample opportunities for user customization. Different user customizations can result in a whole class of applications in an application domain. For parallel programming experts, the application framework serves to accentuate critical performance bottlenecks in a class of applications, where performance improvements in these bottlenecks can lead to performance improvement of the whole class of applications. We demonstrated the effectiveness of our ASR application framework by introducing it to a Matlab/Java programmer. She enabled lip-reading in speech recognition by extending the audio-only speech recognition application framework to an audio-video speech recognition application. She was able to achieve a 20x speedup in her application on the underlying manycore platform by implementing only plug-in modules for the Observation Probability Computation (Fig. 4.10(c)) and for file input/output.

An application framework captures an efficient software architecture that implements a commonly recurring composition of patterns for an application domain. It creates a productive interface between application domain experts and parallel programming experts.

4.4.2 Programming Frameworks

Efficiency & Portability Through Programming Frameworks

While application frameworks help application developers to create new and interesting applications within a specific architecture, application domain researchers and application framework developers need more flexibility to create the tools they need. It is essential to provide frameworks that will help them take advantage of the hardware scalability from parallelism while still being shielded from the particular platform details. We believe programming frameworks provide this abstraction. It might be tempting to assume that programming frameworks should be completely agnostic about the application domain. In practice, programming frameworks need to support specific application domains so that they can take advantage of data structures and transformations that are specific to a particular domain. In other words, in order to ensure good performance it is necessary to tailor the optimizations performed to a particular domain.

Application domains like computer vision and machine learning heavily employ regular data structures (dense matrices, vectors, structured sparse matrices etc.). In these cases the important optimizations that need to be performed are figuring out how much concurrency in the application needs to be exposed, and how to map this efficiently onto the hardware. In particular, modern parallel processors have several levels of parallelism – at the SIMD level, at the thread level, at the core level etc. Hardware and programming model restrictions may or may not allow us to exploit all these levels efficiently.

In addition, programming frameworks can also handle optimizations that are specific to particular architectures. For instance, the limited physical memory of current many-core architectures like CUDA-capable GPUs and the high cost of data transfer between the CPU and GPU mean memory management through efficient scheduling is important. Programming frameworks can help the application framework developers by performing high quality task and data transfer scheduling to ensure low overheads and better efficiency [33].

Copperhead

During our work investigating application parallelization, we discovered that the Data Parallelism, Strict Data Parallelism, and SIMD patterns predominate in many important computations. The Data Parallelism pattern involves finding parallelism in a computation by examining independent data elements in the computation. The Strict Data Parallelism pattern is an implementation pattern where the programmer exploits available data parallelism by mapping independent threads over independent data elements, and the SIMD pattern is an execution pattern where the programmer utilizes Single Instruction, Multiple Data hardware to efficiently execute operations over vectors.

In our opinion, Data Parallelism seems to be increasingly important, since it provides abundant, scalable parallelism for finely-grained parallel architectures, towards which the industry is headed. Accordingly, we decided to build a framework to enable more productive exploitation of Data Parallelism. This framework is called Copperhead.

Copperhead is a functional subset of the Python programming language, designed for expressing compositions of data-parallel operations, such as map, reduce, scan, sort, split, join, scatter, gather, and so forth. Parallelism in Copperhead arises entirely from mapping functions over independent data elements, and synchronization also arises entirely from joining independent arrays, or accessing non-local data.
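
To give a flavor of this style, the sketch below expresses three small computations as compositions of map, reduce, and gather. It is ordinary sequential Python using only the standard library, written in the functional style described above. It does not use the Copperhead runtime or compiler, and the function names are assumptions chosen for illustration.

```python
from functools import reduce
from operator import add, mul
from typing import List, Sequence


def axpy(a: float, x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Map: each output element depends only on one element of x and of y."""
    return list(map(lambda xi, yi: a * xi + yi, x, y))


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Map followed by reduce: elementwise products combined by addition."""
    return reduce(add, map(mul, x, y), 0)


def gather(values: Sequence[float], indices: Sequence[int]) -> List[float]:
    """Gather: read values at arbitrary (non-local) positions."""
    return [values[i] for i in indices]


if __name__ == "__main__":
    print(axpy(2, [1, 2, 3], [10, 20, 30]))
    print(dot([1, 2, 3], [4, 5, 6]))
    print(gather([10, 20, 30], [2, 0]))
```

None of these functions contains an explicit loop-carried dependence. The independence of the mapped elements is visible in the program text, which is what a data-parallel compiler can exploit.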

The specifics of high-performance data-parallel programming often depend critically on the particular composition of data-parallel operations. For example, when parallelism is nested, the compiler can choose to turn parallel map invocations into sequential iterations, but the choice of whether a particular map is executed in parallel depends on its composition into the rest of the computation, as well as the particulars of the parallel platform being targeted. Consequently, Copperhead makes use of Selective, Embedded, Just-In-Time Specialization [34] to use information from the computation being performed in order to specialize the resulting code to the platform being targeted. When a data-parallel function call is invoked, the runtime examines the composition of data-parallel operations, and compiles it into parallel C, which is then dispatched on the parallel platform.
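
Sparse matrix-vector multiplication is a convenient example of nested data parallelism: an outer map over rows, and within each row an inner map and reduce. The sketch below is an illustration under stated assumptions, not Copperhead output: the names and the per-row storage layout are hypothetical, the inner level is always run as a sequential loop, and the caller chooses how the outer map is executed by passing either the built-in `map` or the `map` method of a standard-library thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def row_dot(vals: Sequence[float], cols: Sequence[int],
            x: Sequence[float]) -> float:
    """Inner map (multiply) and reduce (sum), executed as a sequential loop."""
    return sum(v * x[c] for v, c in zip(vals, cols))


def spmv(row_vals: Sequence[Sequence[float]], row_cols: Sequence[Sequence[int]],
         x: Sequence[float], outer_map: Callable = map) -> List[float]:
    """Outer map over rows; outer_map decides how that map is executed."""
    return list(outer_map(lambda vals, cols: row_dot(vals, cols, x),
                          row_vals, row_cols))


if __name__ == "__main__":
    # Rows of [[1, 0, 2], [0, 3, 0]] stored as per-row values and column indices.
    row_vals = [[1.0, 2.0], [3.0]]
    row_cols = [[0, 2], [1]]
    x = [1.0, 2.0, 3.0]
    print(spmv(row_vals, row_cols, x))                # outer map run sequentially
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(spmv(row_vals, row_cols, x, pool.map))  # outer map run on threads
```

The thread pool only demonstrates that the same composition can be executed in more than one way. In standard Python, threads do not normally speed up CPU-bound code like this, so the example shows structure rather than performance. A specializing compiler makes the analogous choice per map, based on the surrounding composition and the target platform.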

Copperhead is designed to support aggressive restructuring of data-parallel computations in order to map well to parallel hardware, with the goal of minimizing synchronization and data movement, which are the enemies of successful parallel computing. By specializing Copperhead programs to Nvidia Graphics Processors, we achieved 45-100% of the performance with about four times fewer lines of code when compared to hand-tuned CUDA C++ code on sparse matrix vector multiplication, preconditioned conjugate gradient linear solver, and support vector machine training routines. Our goal is to develop Copperhead to the point that it can provide full support of implementing the computations we have been investigating in Computer Vision and Machine Learning, providing useful performance as well as high productivity [35].

4.5 Conclusions

Our goal is to enable the productive development of efficient parallel applications by domain experts, not just parallel programming experts. As the world of computing becomes more specialized, we believe that understanding particular domains, such as computer vision, will be challenging enough, and domain experts will not have the time or inclination to become expert programmers of parallel processors as well. Thus if domain experts are to benefit from computing advances in parallel processors, new programming environments tailored for domain experts will need to be provided. We believe that the key to the design of parallel programs is software architecture, and the key to their efficient implementation is software frameworks. In our approach, the basis of both is design patterns and a pattern language. Further, we believe that patterns can empower software developers to effectively communicate, integrate, and explore software designs.

To test our beliefs we have explored eight applications from a wide variety of domains. In particular, we have successfully applied patterns to architect software systems for content-based image retrieval, optical flow, video background subtraction, compressed-sensing MRI, automatic speech recognition, and value-at-risk analysis in quantitative finance. We are in the process of applying patterns to architect software systems for computer games and machine translation. Altogether these applications show very diverse computational characteristics and nearly cover the entire range of computational patterns in our pattern language. In our explorations we have demonstrated how patterns can serve as the basic vocabulary for the description of the architecture of these applications. We have also shown how the choice of patterns in which to describe an architecture naturally explores a set of known trade-offs that helps to inform software developers of potential bottlenecks in a design. These known trade-offs help software developers identify key design decisions impacting the performance of an application. In this process we have convinced ourselves that patterns are useful not only in helping to conceptualize the architecture of a software system and communicate it to others, but also in achieving efficient software implementations. In the process of creating parallel implementations of this wide variety of applications we also gained some general insights about speeding up applications on parallel processors, and we have reported those here as well.

We are also investigating how architectures based on patterns may be used to define application and programming frameworks. We define a (pattern-oriented) framework as a software environment in which all user customization must be in harmony with the underlying architecture. An application framework is a domain-specific framework that solves application-level problems like speech recognition, and a programming framework is a framework that solves a programming-implementation-level problem like the implementation of data parallelism. Application frameworks provide an efficient reference implementation in an application domain, along with a set of extension points that allow customization of functions in selected modules without jeopardizing the efficiency of the underlying infrastructure. Programming frameworks provide a set of flexible tools to take advantage of parallel scalability in hardware without the burden of particular platform details. We have motivated the need for these two types of frameworks and illustrated how they can be used. There are many open questions about the relative merit of application and programming frameworks versus alternative approaches to software implementation such as domain-specific languages. We are in the process of clarifying the advantages and disadvantages of frameworks and of these alternatives.

4.6 Appendices

4.6.1 Structural Patterns

  • Pipe-and-filter: A structure of a fixed sequence of filters that take input data from preceding filters, carry out computations on that data, and then pass the output to the next filter. The filters are side-effect free; i.e., the result of their action is only to transform input data into output data.

  • Iterative refinement: A structure of an initialization followed by refinement through a collection of steps repeatedly until a termination condition is met.

  • Map-reduce: A structure of two phases: (1) a map phase where items from an “input data set” are mapped onto a “generated data set”, and (2) a reduction phase where the generated data set is reduced or otherwise summarized to generate the final result.

  • Puppeteer: A structure in which a puppeteer encapsulates and controls references to a set of puppets, delegating operations to the puppets and collecting return data from them.

4.6.2 Computational Patterns

  • Dense linear algebra: A computation is organized as a sequence of arithmetic expressions acting on dense arrays of data. The operations and data access patterns are well defined mathematically so data can be pre-fetched and CPUs can execute close to their theoretically allowed peak performance. Applications of this pattern typically use standard building blocks defined in terms of the dimensions of the dense arrays with vectors (BLAS level 1), matrix-vector (BLAS level 2), and matrix-matrix (BLAS level 3) operations.

  • Graph algorithm: A computation that can be abstracted into operations on vertices and edges, where vertices represent objects and edges represent relationships among objects.

  • Monte Carlo: A computation that estimates a solution of a problem by statistically sampling its solution space with a set of experiments using different parameter settings.

  • Dynamic Programming: A computation that exhibits the properties of overlapping subproblems and optimal substructure. Overlapping subproblems means that a problem can be solved recursively through smaller subproblems that overlap. Optimal substructure means that the optimal solution of a problem can be obtained by properly combining the optimal solutions of its subproblems.
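
As a small illustration of the Monte Carlo pattern, the sketch below estimates pi by sampling random points in the unit square. It is a toy under stated assumptions: the function names are hypothetical, the per-experiment seeds 0, 1, 2, ... are an arbitrary choice, and the experiments are run one after another here. Each experiment owns its random generator and shares no state with the others, so the experiments could be mapped onto separate processing elements before the counts are reduced.

```python
import random
from typing import List


def count_hits(seed: int, samples: int) -> int:
    """One independent experiment: count random points inside the quarter circle."""
    rng = random.Random(seed)  # private generator, so experiments share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(experiments: int, samples_per_experiment: int) -> float:
    # Map: run each experiment independently with its own seed.
    hits: List[int] = [count_hits(seed, samples_per_experiment)
                       for seed in range(experiments)]
    # Reduce: combine the per-experiment counts into a single estimate.
    return 4.0 * sum(hits) / (experiments * samples_per_experiment)


if __name__ == "__main__":
    print(estimate_pi(experiments=8, samples_per_experiment=5000))
```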

4.6.3 Parallel Algorithm Strategy Patterns

  • Data Parallelism: An algorithm is organized as operations applied concurrently to the elements of a set of data structures. The concurrency is in the data. This pattern can be generalized by defining an index space. The data structures within a problem are aligned to this index space and concurrency is introduced by applying a stream of operations for each point in the index space.

  • Geometric decomposition: An algorithm is organized by (1) dividing the key data structures within a problem into regular chunks, and (2) updating each chunk in parallel. Typically, communication occurs at chunk boundaries, so an algorithm breaks down into three components: (1) exchange boundary data, (2) update the interior of each chunk, and (3) update boundary regions. The size of the chunks is dictated by the properties of the memory hierarchy, to maximize reuse of data from local memory or cache.
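
The geometric decomposition pattern can be sketched with a one-dimensional three-point averaging stencil. The code below is a hypothetical illustration, with assumed function names and a replicated-edge boundary condition chosen for simplicity. The array is split into chunks, each chunk receives one boundary ("halo") value from each neighbor, and each chunk is then updated using only its own data plus the halo. The chunks are processed in a sequential loop here, but no chunk reads another chunk's updated values, so the updates are independent.

```python
from typing import List, Sequence


def smooth(padded: Sequence[float]) -> List[float]:
    """Three-point average over the interior of a padded list (n + 2 -> n)."""
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]


def smooth_whole(u: Sequence[float]) -> List[float]:
    """Reference version: update the whole array at once, replicating the edges."""
    return smooth([u[0]] + list(u) + [u[-1]])


def smooth_chunked(u: Sequence[float], chunk: int) -> List[float]:
    """Decomposed version: each chunk is updated from its own data plus a halo."""
    out: List[float] = []
    for start in range(0, len(u), chunk):
        end = min(start + chunk, len(u))
        left = u[start - 1] if start > 0 else u[0]   # halo from the left neighbor
        right = u[end] if end < len(u) else u[-1]    # halo from the right neighbor
        out.extend(smooth([left] + list(u[start:end]) + [right]))
    return out


if __name__ == "__main__":
    data = [0.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
    print(smooth_whole(data))
    print(smooth_chunked(data, chunk=3))
```

Both versions perform the same arithmetic for every element, so they produce identical results. In a real implementation the chunk size would be chosen to fit the local memory or cache, as the pattern description notes.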


References

  1. Catanzaro B, Keutzer K (2010) Parallel Computing with Patterns and Frameworks. ACM Crossroads, vol. 16, no. 5, pp. 22–27.
  2. Our pattern language. Accessed 15 December 2009.
  3. Keutzer K, Mattson T (2009) A design pattern language for engineering (parallel) software. Intel Technology Journal, Addressing the Challenges of Tera-scale Computing, vol. 13, no. 4, pp. 6–19.
  4. Asanovic K et al (2006) The landscape of parallel computing research: A view from Berkeley. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2006-183.
  5. Garlan D, Shaw M (1994) An introduction to software architecture. Tech. Rep., Pittsburgh, PA, USA.
  6. Maire M, Arbelaez P, Fowlkes C, Malik J (2008) Using contours to detect and localize junctions in natural images. CVPR 2008, pp. 1–8.
  7. Cortes C, Vapnik V (1995) Support-vector networks. Machine Learning, 20: 273–297.
  8. Catanzaro B, Su B, Sundaram N, Lee Y, Murphy M, Keutzer K (2009) Efficient, high quality image contour detector. ICCV 2009, pp. 2381–2388.
  9. Chang C, Lin C (2001) LIBSVM: a library for support vector machines. Software available online. Accessed 15 December 2009.
  10. Catanzaro B, Sundaram N, Keutzer K (2008) Fast support vector machine training and classification on graphics processors. ICML 2008, pp. 104–111.
  11. Brox T, Malik J (2010) Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 99.
  12. Brox T, Bruhn A, Papenberg N, Weickert J (2004) High accuracy optical flow estimation based on a theory for warping. ECCV 2004, pp. 25–36.
  13. Baker S, Scharstein D, Lewis J, Roth S, Black M, Szeliski R (2007) A database and evaluation methodology for optical flow. ICCV 2007, pp. 1–8.
  14. Sundaram N, Brox T, Keutzer K (2010) Dense Point Trajectories by GPU-accelerated Large Displacement Optical Flow. ECCV 2010, pp. 438–451.
  15. Zach C, Gallup D, Frahm J M (2008) Fast gain-adaptive KLT tracking on the GPU. CVPR Workshop on Visual Computer Vision on GPU’s.
  16. Sand P, Teller S (2008) Particle video: Long-range motion estimation using point trajectories. International Journal of Computer Vision, pp. 72–91.
  17. Wang L, Wang L, Wen M, Zhuo Q, Wang W (2007) Background subtraction using incremental subspace learning. ICIP 2007, vol. 5, pp. 45–48.
  18. Demmel J, Grigori L, Hoemmen M, Langou J (2008) Communication-optimal parallel and sequential QR and LU factorizations. Tech. Rep. UCB/EECS-2008-89.
  19. Chong J, You K, Yi Y, Gonina E, Hughes C, Sung W, Keutzer K (2009) Scalable HMM-based inference engine in large vocabulary continuous speech recognition. ICME 2009, pp. 1797–1800.
  20. You K, Chong J, Yi Y, Gonina E, Hughes C, Chen Y, Sung W, Keutzer K (2009) Parallel scalability in speech recognition: Inference engine in large vocabulary continuous speech recognition. IEEE Signal Processing Magazine, 26(6): 124–135.
  21. Chong J, Gonina E, Yi Y, Keutzer K (2009) A fully data parallel WFST-based large vocabulary continuous speech recognition on a graphics processing unit. Proceedings of the 10th Annual Conference of the International Speech Communication Association, pp. 1183–1186.
  22. Chong J, Gonina E, You K, Keutzer K (2010) Exploring Recognition Network Representations for Efficient Speech Inference on Highly Parallel Platforms. Proceedings of the 11th Annual Conference of the International Speech Communication Association, pp. 1489–1492.
  23. Candès E J (2006) Compressive sampling. Proceedings of the International Congress of Mathematicians.
  24. Lustig M, Alley M, Vasanawala S, Donoho D L, Pauly J M (2009) Autocalibrating parallel imaging compressed sensing using L1 SPIR-iT with Poisson-Disc sampling and joint sparsity constraints. ISMRM Workshop on Data Sampling and Image Reconstruction.
  25. Murphy M, Keutzer K, Vasanawala S, Lustig M (2010) Clinically Feasible Reconstruction for L1-SPIRiT Parallel Imaging and Compressed Sensing MRI. ISMRM 2010.
  26. Dixon M, Chong J, Keutzer K (2009) Acceleration of market value-at-risk estimation. Workshop on High Performance Computing in Finance at Super Computing.
  27. Worth B, Lindberg P, Granatir (2009) Smoke: Game Threading Tutorial. Game Developers Conference.
  28. Cocke J, Schwartz J T (1970) Programming languages and their compilers: Preliminary notes. Courant Institute of Mathematical Sciences, New York University, Tech. Rep.
  29. Kasami T (1965) An efficient recognition and syntax-analysis algorithm for context-free languages. Scientific report AFCRL-65-758, Air Force Cambridge Research Lab, Bedford, MA.
  30. Pollack F (1999) Microarchitecture challenges in the coming generations of CMOS process technologies. MICRO-32.
  31. Gustafson J L (1988) Reevaluating Amdahl’s Law. CACM, 31(5): 532–533.
  32. Luszczek P, Bailey D, Dongarra J, Kepner J, Lucas R, Rabenseifner R, Takahashi D (2006) The HPC Challenge (HPCC) benchmark suite. SC06 Conference Tutorial.
  33. Sundaram N, Raghunathan, Chakradhar S (2009) A framework for efficient and scalable execution of domain specific templates on GPUs. IEEE International Parallel and Distributed Processing Symposium.
  34. Catanzaro B, Kamil S, Lee Y, Asanovic K, Demmel J, Keutzer K, Shalf J, Yelick K, Fox A (2009) SEJITS: Getting productivity and performance with Selective Embedded JIT Specialization. Programming Models for Emerging Architectures.
  35. Catanzaro B, Garland M, Keutzer K (2010) Copperhead: Compiling an Embedded Data Parallel Language. Tech. Rep. UCB/EECS-2010-124.

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Michael Anderson
  • Bryan Catanzaro
  • Jike Chong
  • Ekaterina Gonina
  • Kurt Keutzer
    • 1
  • Chao-Yue Lai
  • Mark Murphy
  • Bor-Yiing Su
  • Narayanan Sundaram
  1. University of California, Berkeley, USA
