In Chapters 5 to 7 we reviewed the methods, tools, and techniques for application tuning, explained by using examples of HPC applications and benchmarks. The whole process followed the top-down software optimization framework explained in Chapter 3. The general approach to the tuning process is based on a quantitative analysis of execution resources required by an application and how these match the capabilities of the platform the application is run on. The blueprint analysis of platform capabilities and system-level tuning considerations were provided in Chapter 4, based on several system architecture metrics discussed in Chapter 2.

In this final chapter we would like to generalize the approach to application performance analysis, and offer a different and higher level view of application and system bottlenecks. The higher level view is needed to see potentially new, undiscovered performance limitations caused by invisible details inside the internal implementations of software, firmware, and hardware components.

Abstraction and Generalization of the Platform Architecture

Middleware and software architectures play a big role in HPC and in other application areas. Today, almost nobody interacts with the hardware directly. Instead, the interaction of the programmer and the hardware is facilitated via an application programming interface (API). If you think that programming in assembly language today is direct interaction with hardware, we have to disappoint you; it is not. The instruction stream is decoded into sequences of special microcode operations that in the end serve as the commands to the execution units.

Software abstractions are an unavoidable part of modern applications design, and in this part of the book we will look at the software architecture from the point of view of abstraction and the consequences of using one set of abstractions over others. Selection of some abstractions may result in performance penalties because of the added translation steps; for others, the impact may be hidden by efficient pipelining (such as happens with microcode translation inside processors) and causes almost no visible overhead.

Types of Abstractions

An abstraction is a technique used to separate conceptualized ideas from specific instances and implementations of those at hand. These conceptualizations are used to hide the internal complexity of the hardware, allow portability of software, and increase the productivity of development via better reuse of components. Abstractions that are implemented in software, middleware, or firmware also allow for fixing hardware bugs with software that results in a reduced time to market for very complex systems, such as supercomputers. We believe it is generally good to have the right level of abstraction. Abstractions today are generally an unavoidable thing: we have to use different kinds of APIs because an interaction with the raw hardware is (almost) impossible. During performance optimization work, any performance overhead must be quantified to judge whether there is need to consider a lower level of abstraction that could gain more performance and increase efficiency.

Abstractions apply to both control flow and data structures. Control abstraction hides the order in which the individual statements, instructions, or function calls of a program are executed. The data abstraction allows us to use high-level types, classes, and complex structures without the need to know the details about how they are stored in a computer memory or disk, or are transferred over the network. One can regard the notion of an object in object-oriented programming as an attempt to combine abstractions of data and code, and to deal with instances of objects through their specific properties and methods. Object-oriented programming is sometimes a convenient approach that improves code modularity, reuses software components, and increases productivity of development and support of the application.

Some examples of control flow abstractions that a typical developer in high-performance computing will face include the following:

  • Decoding of processor instruction set into microcode. These are specific for a microarchitecture implementation of different processors. The details of the mapping between processor instructions and microcode operations are discussed in Chapter 7. The mapping is not a simple one-to-one or one-to-many relation. With technologies like macro fusion, 1 the number of internal micro-operations may end up smaller than the number of incoming instructions. This abstraction allows processor designers to preserve a common instruction set architecture (ISA) across different implementations and to extend the ISA while preserving backwards compatibility. The decoding of processor instructions into micro-operations is a pipeline process, and it usually does not cause performance penalties in HPC codes.

  • Virtual machine, involving just-in-time compilation (JIT, widely used, for example, in Java or in the Microsoft Common Language Runtime [CLR] virtual machines) or dynamic translation (such as in scripting or interpreted languages, such as Python or Perl). Here, compilation is done during execution of a program, rather than prior to execution. With JIT, the program can be stored in a higher level compressed byte-code that is usually a portable representation, and a virtual machine translates it into processor instructions on the fly. JIT implementations can be sufficiently fast for use even in HPC applications, and we have seen large HPC apps written in Java and Python. And, by the way, the number of such applications grows.

  • Programming languages. These are control abstractions. They offer notions such as functions, looping, conditional execution, and so on, to make it easier and more productive to write programs. Higher level languages, such as Fortran or C, require compilation of programs to translate code into a stream of processor-specific instructions to achieve high performance. Unlike instruction decoding or just-in-time compilation, this happens ahead of time, before the program executes. The approach ensures that the overhead of compiling the program code to machine instructions does not impact application execution.

  • Library of routines and modules. Most programming languages support extensions of programs with subprograms, modules, or libraries of routines. This enables modular architecture of final programs for faster development, better test coverage, and greater portability. Several well-known libraries provide de facto standard sets of routines for many HPC programs, such as basic linear algebra subprograms (BLAS), 2 linear algebra package (LAPACK), 3 and the FFTW 4 software library for computing discrete Fourier transforms (DFTs). These libraries not only hide the complexity of underlying algorithms but also enable vendors of hardware to provide highly tuned implementations for best performance on their computer architectures. For example, Intel Math Kernel Library (MKL), included in Intel Parallel Studio XE, provides optimized linear algebra (BLAS, LAPACK, sparse solvers, and ScaLAPACK for clusters), multidimensional (up to 7D) fast Fourier transforms and FFTW interfaces, vector math (including trigonometric, hyperbolic, exponential, logarithmic, power, root, and rounding) routines, random number generators (such as congruential, recursive, Wichmann-Hill, Mersenne twister, Sobol sequences, etc.), statistics (quantiles, min/max, variance-covariance, etc.), and data fitting (spline, interpolation, cell search) routines for the latest Intel microprocessors.

  • API calls. Any kind of API calls provided by the operating system (OS) hide the complexity of an interaction between operating system tasks and the hardware-supported context of execution exposed by the processors. Examples of these include calls from OS to the basic input/output subsystem (BIOS) abstracting the implementation of the underlying hardware platform or a threading API that creates, controls, and coordinates the threads of execution within the application.

  • Operating system. This, and specifically its scheduler, makes every program believe that it runs continuously on the system without any interruptions. In fact, the OS scheduler does interrupt execution, and even puts execution of a program on hold to give other programs access to the processor resources.

  • Full system virtualization. This includes using virtual machine monitors (VMMs), such as Xen, KVM, VMware, or others. VMMs usually abstract the entire platform so that every operating system believes it is the only one running on a system, while, in fact, VMMs are doing both control and data abstraction among all the different OS versions currently executing on a platform.

Data abstraction allows handling of data bits in meaningful ways. For example, data abstraction can be found behind:

  • Datatypes

  • Virtual memory

The notion of a datatype enforces a clear separation between the abstract properties of a datatype and the concrete details of its implementation in hardware. The abstract properties of a datatype are visible to client code and can be as simple as an integer datatype or as complex as a hash table or a class. The specific implementation (i.e., the way the bytes are stored in computer memory) is kept entirely private; it can differ from machine to machine (e.g., little-endian vs. big-endian storage) and can change over time to incorporate efficiency improvements. A specific example, relevant for high-performance computing, is the representation of real numbers using floating-point datatypes, which are limited in length.

As the length of processor registers is limited, it is not possible to represent all real numbers exactly in digital hardware. The number of possible representations is very large, and different encodings of floating-point numbers in a fixed-length register lead to significantly different numerical qualities of the computations, causing problems for application developers and users comparing results. IEEE 754-1985 was an industry standard for representing (and processing) floating-point numbers in computers, officially adopted in 1985 and superseded in 2008 by IEEE 754-2008. IEEE 754 characterizes numbers in binary, providing definitions of precision, as well as defining representations for positive and negative infinity, a “negative zero,” exceptions to handle invalid results like division by zero, special values called NaNs (Not-a-Number) for representing those exceptions, denormalized numbers, and rounding modes.

Virtual memory abstraction is provided by the OS’s virtual memory manager with the help of hardware. This abstraction makes every 64-bit program believe it has 2^64 bytes (16 exbibytes) of byte-addressable memory to use, while in fact the amount of physical memory available to be shared by multiple programs on a system is much lower: tens or, at best, hundreds of gigabytes. Virtual memory offers a significant reduction of complexity in writing software, and makes it run on a wide range of machines. However, the mechanisms implementing virtual memory have to translate each virtual address to a physical address, which may require a high-cost process called a page walk and relies on a lot of memory-management hardware inside the processor (e.g., translation lookaside buffers, or TLBs). The page walk and the entire virtual-to-physical address translation are invisible to the application. However, they have a hidden cost, which may be seen as a performance cost associated with the loading of page tables. Some measurements (such as one reported by Linus Torvalds) 5 provide an estimate of over 1000 processor cycles required for handling a page fault in the Linux operating system on modern processors.

Levels of Abstraction and Complexities

As we said previously, abstraction is an important notion in computer science, and it is used throughout many instances in computer and software engineering. In practice, software development abstractions are used to reduce duplication of information in a program. The basic mechanism of control abstraction is a function or subroutine; and the ones for data abstraction include various forms of type polymorphism. The more advanced mechanisms that combine data and control abstractions include abstract data types, such as classes, polytypism, and so on. These are the abstractions the software developer usually deals with.

In essence, the approach of abstraction is to deal with the problem at a higher level by focusing on the essential details, ignoring specifics and implementation at the lower level, and to reuse lower level implementations following the “DRY principle” (“Don’t repeat yourself”). This approach leads to layered architectures across the entire computer engineering discipline. The examples of layered architectures include Intel QuickPath Interconnect (QPI) protocol, 6 OSI model for computer network protocols, 7 the OpenGL library, 8 and the byte stream input/output (I/O) model used in most modern operating systems. Historically, in computer architecture the computer platform is represented as constituting five abstraction levels: hardware, firmware, assembler, operating system, and processes. 9 Recent developments in virtualization support add more layers to the stack. While those additional layers of abstraction are necessary to achieve higher productivity, the increase in stack depth may impact application performance.

Raw Hardware vs. Virtualized Hardware in the Cloud

One specific abstraction method that became widely used in enterprise and cloud computing, and is being greatly debated in relation to HPC applications, is full hardware virtualization. Hardware virtualization, or platform virtualization, is a method in which a virtual machine acts like a real computer for an operating system. Software executed on these virtual machines is separated from the underlying hardware resources, and the specific implementation details are hidden from it. Different levels of hardware virtualization use techniques like emulation, binary translation, and dynamic code generation. The virtual machines are created and managed by a hypervisor, or virtual machine monitor (VMM), which can be (and most often is) implemented in software, but may also be a firmware or even a hardware implementation.

The virtualization techniques have their roots in mainframe computers, and have been available in mainframes and RISC servers for a long time. Hardware assistance and support for hypervisors, introduced in x86 servers in 2005, started a growth of interest in and usage of virtualization in the x86 world. Hardware assistance helped reduce performance overhead considerably and removed the need for binary patching of the operating system kernel. The active development of several commercial (such as those by VMware, Parallels, etc.) and open-source (Xen, KVM, etc.) hypervisors helped establish hardware virtualization as a base technology for enterprise data center and cloud computing applications. It promoted the development of such currently popular directions as software-defined storage (SDS) and software-defined networks (SDN), and finally brought the concept of the software-defined data center (SDDC), which extends virtualization concepts such as abstraction, pooling, and automation to all of the data center’s resources and services to achieve IT as a service.

A complete system virtualization brings certain operational advantages, such as simplified provisioning (through a higher level of integration of application software with the operating system environment) and a stable software image for applications (with emulation of newer or obsolete hardware handled at the VMM level), which is beneficial in making legacy software work on modern hardware without software modifications. For enterprise and cloud applications, virtualization offers additional value, as a hypervisor allows for the implementation of several reliability techniques (virtual machine migration from one machine to another, system-level memory snapshotting, etc.) and utilization improvements via consolidation (i.e., putting several underutilized virtual machines on one physical server).

However, hardware virtualization has not progressed at the same pace within the HPC user community. Though the main quoted reason for not adopting hardware virtualization is the performance overhead caused by the hypervisor, it is probably the most debatable one. There are studies showing that the overhead is rather small for certain workloads, and running jobs using a pay-per-use model can be more cost-effective than buying and managing your own cluster. 10 We tend to believe there are other reasons; for example, that the values of virtualization recognized by enterprise and cloud application customers are not compelling for HPC users. Consolidation is almost of no use (though it is possible to implement it using popular HPC batch job schedulers), and live migration and snapshotting are not more scalable than checkpointing techniques used in HPC. However, the cost reduction of virtualized hardware, predominantly hosted by large cloud providers, in some sense already generates demand for exploring high-performance computing applications in hosted cloud services.

This trend will drive a need for optimization of HPC applications (which are tightly coupled, distributed memory applications) for execution in the hosted virtualized environments, and we see a great need for the tools and techniques to evolve to efficiently carry out this job.

Questions about Application Design

Abstractions are unavoidable. There are some abstractions we can choose (such as your own application architecture, programming language, and so on, or whether to run it under a virtualized or “bare-metal” operating system), while most others we have to live with (such as instruction decoding inside modern processors, or operating system virtual memory management). In any case, each abstraction layer will add a stage to a pipeline of queues for data flow and will complicate the control path, which may, or may not, become a bottleneck for the application performance. As the complexity of an application increases, so does the need to characterize the bottlenecks imposed by the abstractions involved and to quantify their impact on your application.

As it is not feasible to write a cookbook or produce a fully comprehensive set of recommendations to avoid any potential performance problem with an application, we would rather offer a different approach. While developing a new application or analyzing existing code, you will need to understand the available options, or the unavoidable limitations. Practically, there are tradeoffs between application performance and productivity and between maintainability and quality of the resulting program. It is important to consider several questions during your application or system design and optimization work so as to drive proper decision making in regard to programming and execution environments, and related middleware. These questions, when answered or addressed, will improve your knowledge about the application. At the same time, this approach allows development of structured understanding of the tradeoffs necessary to achieve those desired characteristics.

Designing for Performance and Scaling

HPC is about scalability of applications and the ability to solve large problems by using parallel computers. So, achieving high performance by enabling scalability is a key differentiation of an HPC approach. We dedicated a lot of material in Chapters 6 and 7 to methods for achieving great single-node and single-threaded performance, but we also spent significant time in Chapter 5 discussing how to achieve great parallel efficiency of MPI applications.

The main tools for high performance and scalable design are Amdahl’s Law and Gustafson’s observation, both of which we discussed in Chapter 2. They have to be kept in mind when asking questions related to application scaling. For instance:

  • What is the minimum share of time the application is running serially (non-parallel)? We assigned f to that share of time in the Amdahl’s Law formula.

  • How does the share of time taken by the serial part change when more computing nodes or threads are added? In other words, consider whether f is a constant or it depends on the number of processors p used.

Practical answers to these two questions are a sufficient start toward understanding the scaling limits for applications. Let us consider an example of running an application on 64 processors. If, in the specific implementation, approximately 10 percent of the time is serial execution (i.e., f = 0.1), then the maximum theoretical performance improvement (speedup) over a single processor will be limited to approximately 8.77. Usually, some amount of serial execution is unavoidable, but the cumulative contribution to the application runtime should not exceed some fraction of a percent to allow efficient use of large parallel machines.

Some of the most prominent sources of serialization in high-performance computing applications, which can be somehow addressed by the application developer, are:

  • Disk and network input/output: Though there may be parallel storage hardware, people tend to forget that widely used APIs are serial and synchronous. The local disk I/O takes significant amounts of time, and you could consider using the POSIX asynchronous I/O API (see, for example, an article by M. Tim Jones) 11 instead of traditional synchronous blocking system calls.

  • Explicit barriers and serial sections: These are synchronization patterns such as MPI_Barrier calls inside MPI programs, or OpenMP barrier or single directives. There are certainly cases in which synchronization between parallel sections of code is necessary, but manual serialization has to be used with care.

  • Serial (not vectorized, not threaded, or not parallelized in any other way) parts of the program: In many applications, specifically in HPC ones, the actual share of the codebase that runs serially may be the greatest. It does not make sense to parallelize parsing of the configuration files (which may result in a lot of extra code), validating provided input, or writing a logfile with execution progress and diagnostic information. It is fine for that entire code to remain serial as long as it does not take a significant share of the application’s runtime!

At the same time, there are other sources of serialization, coming from the specific control abstractions or APIs. Some examples would be:

  • Implicit barriers, such as the ones “hidden” at the end of OpenMP for/do or sections work-sharing constructs (if the nowait clause is omitted) or in many MPI library calls. Follow the recommendations in Chapter 5 to avoid superfluous synchronization and replace blocking collective operations by MPI-3 non-blocking ones. For the multithreaded applications using OpenMP, review the “Thread Synchronization and Locking” part in Chapter 6.

  • Internal synchronization APIs, such as many kernel routines or library calls. If any of the external library calls are identified as big-time consumers in your application, study the library documentation or contact its developer to find a better alternative.

Designing for Flexibility and Performance Portability

Coding to the lowest level of abstraction in pursuit of the best performance is not feasible in large-scale applications. The use of assembly language or low-level intrinsics is not recommended by Intel engineers, and though it is available in the Intel compilers, such low-level programming should be seen as the last resort. Code reuse is the best working approach for achieving maintainability and successful evolution of the applications. Again, the levels of abstraction in non-performance-critical parts of the program are of no importance; choose whatever abstraction you find suitable and keep it as flexible as possible to ensure smooth code evolution.

However, ask yourself a couple of questions about parts of the programs contributing most to overall runtime:

  • What are the predominant data layouts and the data access patterns inside the critical execution path?

  • How is the parallelism exposed in the critical execution path?

Sometimes the use of specialized, highly optimized libraries to implement time-consuming algorithms in the program will help achieve flexibility and portability, and will define the answers to these questions. As discussed earlier, software libraries, such as Intel MKL, will offer you a useful abstraction and will hide the complexity of the implementation. But let us discuss these questions in greater detail, in case you are working on an algorithm yourself.

Data Layout

The first question above is about data abstractions. Most, if not all, computer architectures benefit from sequential streaming of data accesses, and the ideal situation happens when the amount of repeatedly accessed data fits into the processor caches that are roughly 2.5MiB per core in modern Intel Core-based processors. Such behavior is a consequence of the double-data rate (DDR) dynamic random access memory (DRAM) module architecture used by modern computers. If the data access is wrapped into special containers (as often observed in C++ programs), frequent access to that data can add overhead from the “envelope” around data bits that may be higher than the actual time of computing with the values.

The data layout is very important to consider when ensuring efficient use of SIMD processing, as discussed in Chapter 7. Let’s consider an example where an assemblage of three values is defined within a single structure and corresponding values from each set are to be processed simultaneously, where the pointers to that enclosing structure are passed around as function arguments. This can be, for instance, a collection of three coordinates of points in space, x, y, and z; and our application has to deal with N of such points. To store the coordinates in memory we could consider two possible definitions for structures (using C language notation) presented in Listings 8-1 and 8-2.

  • Structure of arrays (SoA): Where each of the coordinates is stored in a dedicated array and three of these arrays are combined into one structure.

Listing 8-1. Definition of SoA (Structure of Arrays) in C

#define N 1024

typedef struct _SoA {
    double x[N];
    double y[N];
    double z[N];
} SoA_coordinates;

SoA_coordinates foo;

// access i'th element of array as foo.x[i], foo.y[i], and foo.z[i]

  • Array of structures (AoS): Where three coordinates constitute one structure and then an array of these structures is defined.

Listing 8-2. Example Definition of AoS (Array of Structures) in C

#define N 1024

typedef struct _AoS {
    double x;
    double y;
    double z;
} AoS_coordinates;

AoS_coordinates bar[N];

// access i'th element of array as bar[i].x, bar[i].y, and bar[i].z

The layouts in memory for each of the options are shown in Figure 8-1.

Figure 8-1. Layout in memory for SoA and AoS options

For an application developer, the latter case, the array of structures, will likely make more sense: the location of each point is represented by three coordinates (x, y, and z), so each point is described by one object (an instance of the structure AoS_coordinates), and then many points are put together into an array named bar.

However, for performance on SIMD-capable processors, the former case, the structure of arrays, usually proves to be better. In “A Case Study Comparing AoS (Arrays of Structures) and SoA (Structures of Arrays) Data Layouts for a Compute-intensive Loop Run on Intel Xeon Processors and Intel Xeon Phi Product Family Coprocessors,” 12 the advantages of the SoA over the AoS layout for vectorization were clearly demonstrated. The compiler is almost always able to produce better and faster running code by vectorizing the SoA data layout than the AoS data layout. So, when in doubt or unless you can prove otherwise, use SoA instead of AoS, especially on the Intel Xeon Phi coprocessor. However, the SoA data layout comes with the cost of reduced locality between accesses to multiple fields of the original structure instance, and may result in increased TLB pressure and visible costs of page-fault handling.

A data organization in memory that is beneficial for one computer architecture may end up not being the best for another. What can be done to achieve performance portability of the code, as the different data layouts may result in different observed efficiencies of the application on various computer systems? To achieve performance portability, the developer could abstract data storage and implement different access mechanisms for different machines. As an example, in the widely used Berlin Quantum ChromoDynamics application, or BQCD, 13 authors Hinnerk Stüben and Yoshifumi Nakamura allowed several data layout options in memory for the key data structures, such as the arrays representing two dimensional matrices of complex numbers.

Some of the supported layouts of arrays of complex numbers are shown in Figure 8-2.

  • Standard layout, where each complex number is represented as a structure of two elements: the real (re) and imaginary (im) parts of the complex number, and the array is stored in a typical AoS layout.

  • Vector layout, in which SoA is used to store the real and imaginary parts in separate arrays packed into one structure. This layout is usually more beneficial for use with vector instructions.

  • SIMD layout, which is specifically optimized for the SIMD instruction sets, such as Intel SSE or Intel AVX. It is sometimes referred to as “Array of Structure of Arrays” (AoSoA) and is indeed a combination of the other two approaches: several elements of the real part are stored sequentially in memory to fit one SIMD register (for instance, four double-precision floating-point values in one 256-bit AVX register), followed by the same number of elements storing the imaginary parts and occupying another SIMD register, and so on. This layout allows a more efficient instruction stream generation for the latest Intel processors.

Figure 8-2. Data layouts in memory for the arrays of complex numbers available in BQCD

The BQCD build system provides simple selection of storage layouts and also permits choosing a different code path for performance-critical sections of the application dealing with that data. The developers of BQCD invested a great effort in developing highly optimized instruction code for the several computer architectures on which BQCD is typically run.

The results obtained by the BQCD developers 14 on a server with two Intel Xeon E5-2695 v2 processors are summarized in Table 8-1 and show that the SIMD, or AoSoA, layout with an optimized code path delivers the best performance for hot loops, ahead of the vector and standard layouts.

Table 8-1. Performance in MFLOPS/Core of BQCD Matrix-Vector Multiply Routines with Different Layouts

Note

Often, for memory bandwidth-bound kernels, when the dataset fits into Level 2 cache, the performance of compute kernels can be 10 times higher than when data resides in main memory.

Structured Approach to Express Parallelism

The second question asked above is how the parallelism is exposed in the application. This question embraces a better understanding of the control abstractions used. Selecting the right control abstraction for parallel processing, along with the data distribution method between processors, is key to achieving great performance and scalability of your code.

There are many ways to express parallelism. Depending on the specific algorithm, the optimal parallel implementation may employ different control and data distribution patterns, such as the ones presented in Figure 8-3. 5 These patterns can be used to express computations in a wide variety of domains, and they usually take two things into account: task decomposition and data decomposition. To achieve scalable parallel performance, algorithms need to be designed to respect data locality and isolation, which are important for distributed memory and NUMA system architectures.

Figure 8-3. Overview of parallel control and data management patterns (Source: Structured Parallel Programming: Patterns for Efficient Computation)

Often found among HPC application patterns is the partition pattern. It provides a simple way to ensure both data locality and isolation: the data is broken down into a set of nonoverlapping regions, and then multiple instances of the same compute task are assigned to operate on the partitions. As the partitions do not overlap, the computational tasks can operate in parallel on their own working sets without interfering with each other. When selecting the control flow and the data layouts, one specific issue to watch for is load imbalance. The best application performance will be achieved when all computing elements are loaded to the maximum and the computational load is evenly distributed among them.
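
As a minimal sketch of the partition pattern, the following Python fragment splits an index range into contiguous, nonoverlapping slices and computes a simple imbalance metric. The helper names `partition` and `load_imbalance` are hypothetical, and the metric (maximum load divided by mean load) is only one of several possible definitions.

```python
def partition(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous, nonoverlapping slices
    whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_parts)
    bounds = []
    start = 0
    for p in range(n_parts):
        size = base + (1 if p < extra else 0)  # first `extra` parts get one more item
        bounds.append((start, start + size))
        start += size
    return bounds


def load_imbalance(work_per_part):
    """Ratio of the maximum load to the mean load; 1.0 means perfectly balanced."""
    mean = sum(work_per_part) / len(work_per_part)
    return max(work_per_part) / mean


parts = partition(10, 4)
sizes = [hi - lo for lo, hi in parts]
print(parts, load_imbalance(sizes))
```

In this sketch the work per partition is assumed to be proportional to the number of items. When the cost per item varies, the same metric can be applied to measured times per partition instead of sizes.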

Structured approaches to parallel programming and careful selection of parallel patterns are probably the best ways to achieve high performance and scalability of various parallel algorithms. For the interested reader, we recommend two great books on this topic: Structured Parallel Programming: Patterns for Efficient Computation 15 and Patterns for Parallel Programming. 16

Understanding Bounds and Projecting Bottlenecks

Whether you are writing a new application from scratch or working on an updated, more efficient implementation of an existing program, it is critical to analyze how the hardware will influence its performance. The detailed analysis should be done for the pieces of code that consume most of the time in the program. Some of the specific questions to be addressed are:

  • Will the new implementation be memory, storage, or compute bandwidth bound on the considered computer systems?

  • Is there an opportunity for a different implementation of the same algorithm that will not be impacted by the bounds of current implementation and may result in greater performance and scaling? This question is related to the previous one, but focuses on research for better algorithms and implementations. As soon as a better implementation is suggested, it has to be studied and the bottlenecks identified and their impact quantified.

  • How will the performance behavior of the application change with increased levels of concurrency?

    For instance, if a partition pattern is used, the more MPI ranks that are added, the less data (and work) there will be per MPI rank (in so-called strong scaling scenarios). Even if you do not have access to a machine with a very high level of concurrency today, the development of manycore processors follows Moore’s Law and, as a result, your application may be executed on such a machine sooner than you think. For example, in 2004 the mainstream computational nodes in a cluster had two to four processors and 4 to 8 GiB of memory. Just some eight years later, Intel Xeon Phi coprocessor chips had over 60 cores, each capable of executing four threads, with wider 512-bit SIMD execution units and 8 to 16 GiB of fast local memory on the coprocessor cards. So, to clarify this question:

    • At which scale will the dataset per thread fit into the cache inside the processors? This point in a scaling graph may lead to an observed superlinear performance improvement for memory bandwidth bound kernels, since the cache bandwidth is dramatically higher than the memory bandwidth, as we saw in Chapter 2. (We discussed this effect in the BQCD example earlier in this chapter.)

    • When can a further increase of concurrency impact vectorization and relative share of time for synchronization between the processes? If the concurrency level continues increasing, will it lead, at some point, to diminishing benefits of vectorization? For instance, the SIMD processing efficiency will drop when the loop trip count begins to approach values that are too small for vectorization to yield a positive impact.
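
The last two questions can be explored with a back-of-the-envelope model before any large machine is available. The Python sketch below is only an illustration under stated assumptions: the dataset is partitioned evenly, each rank has a private cache of a given size, and a loop is executed in full-width vector steps with unused lanes in the last step. The function names and the numbers are hypothetical.

```python
def ranks_to_fit_in_cache(dataset_bytes, cache_bytes_per_rank):
    """Smallest rank count at which an evenly partitioned dataset fits
    into the per-rank cache (ceiling division)."""
    return -(-dataset_bytes // cache_bytes_per_rank)


def simd_efficiency(trip_count, simd_width):
    """Fraction of SIMD lanes doing useful work when a loop of `trip_count`
    iterations is executed in full-width vector steps."""
    if trip_count == 0:
        return 0.0
    vector_steps = -(-trip_count // simd_width)
    return trip_count / (vector_steps * simd_width)


# Hypothetical example: an 8 GiB dataset and 256 KiB of cache per rank.
ranks = ranks_to_fit_in_cache(8 * 2**30, 256 * 2**10)
print(ranks)

# Lane utilization as the per-rank trip count shrinks with strong scaling
# (SIMD width of 8, e.g. eight doubles in a 512-bit register).
for trips in (1000, 9, 3):
    print(trips, simd_efficiency(trips, 8))
```

The model ignores loop peeling, alignment and synchronization costs, so it gives only an optimistic bound on vectorization efficiency. Its purpose is to indicate the range of rank counts where a closer look is warranted.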

The research and analysis in this part may end up requiring most of the time and dedication. The simulations and quantitative analysis done here should later be used to validate the performance observations of the running application. If the process is followed rigorously, it will bring great insight into how the code performs and will give ideas for additional improvements.
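
One common way to make the first question above quantitative is a roofline-style estimate, which compares a kernel's arithmetic intensity with the ratio of peak compute throughput to memory bandwidth. The Python sketch below uses hypothetical platform numbers (500 GFLOP/s peak, 100 GB/s memory bandwidth) and a simple triad kernel. It is an illustration, not a measurement of any specific system.

```python
def attainable_gflops(peak_gflops, mem_bw_gbs, flops_per_byte):
    """Upper bound on performance: limited either by peak compute
    or by memory bandwidth times arithmetic intensity."""
    return min(peak_gflops, mem_bw_gbs * flops_per_byte)


def limiting_resource(peak_gflops, mem_bw_gbs, flops_per_byte):
    """Classify a kernel as memory or compute bound in this simple model."""
    ridge = peak_gflops / mem_bw_gbs   # intensity at which both limits meet
    return "memory" if flops_per_byte < ridge else "compute"


# Triad a[i] = b[i] + s * c[i] in double precision:
# 2 floating-point operations and 24 bytes of traffic per iteration
# (two loads and one store, ignoring write-allocate traffic).
intensity = 2 / 24
print(attainable_gflops(500.0, 100.0, intensity))
print(limiting_resource(500.0, 100.0, intensity))
```

The same two functions can be evaluated with cache bandwidth in place of memory bandwidth to estimate how the bound moves once the working set fits into cache.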

Data Storage or Transfer vs. Recalculation

A more in-depth analysis of a specific parallel implementation of a selected algorithm may consider issues not often researched during single-node optimization projects. One of the areas to look at is a decision on recalculation versus storing or sending data over the network.

Imagine that you have a parallel program using many MPI ranks, and there is a single value or a small array that all MPI ranks must use at some point. One approach would be to compute the required value on one of the ranks and then send it to all other ranks (i.e., broadcast the value). Another approach would be to let every rank recompute the required values independently and avoid potential wait times caused by the broadcast.

Which approach would be better? There is no universal answer to this question; it will depend on the definition of “better,” as well as which inputs are required to compute the needed data and how long the calculation would take versus how much data there is to send over the network.

If by “better” we mean a lower application runtime, then in general computing the required values independently on every rank might be faster. However, the input data requirements need to be examined: the inputs for the calculation must be available to all ranks. If some ranks do not have access to the inputs, the time to transfer the input data over the network is added to the time required to compute the value. Also, if the code runs on a heterogeneous cluster (such as one with Intel Xeon hosts and Intel Xeon Phi coprocessors), the recalculation may produce slightly different values on different ranks because of the different processor architectures.

On the other hand, if the important metric is power or energy-to-solution, the answer may be different. When one MPI rank is computing, all other processors are usually paused or sleeping while they wait for the result, and this enables power savings on a potentially very large number of cluster nodes. Of course, some additional power will be needed to complete the broadcast. But assuming the calculation takes longer than the network operation, this approach may result in lower average power and energy-to-solution.
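
The tradeoff described above can be sketched with a simple cost model. The Python fragment below makes several assumptions that are not from the text: the broadcast follows a binomial tree under a latency-bandwidth model, all ranks already hold the inputs, waiting ranks draw a fixed idle power, and all ranks draw full power during the broadcast. All names and numbers are hypothetical.

```python
import math


def broadcast_time(n_ranks, msg_bytes, latency_s, bandwidth_bps):
    """Binomial-tree broadcast time under a latency-bandwidth model."""
    if n_ranks <= 1:
        return 0.0
    steps = math.ceil(math.log2(n_ranks))
    return steps * (latency_s + msg_bytes / bandwidth_bps)


def compare(n_ranks, compute_s, msg_bytes, latency_s, bandwidth_bps,
            busy_w, idle_w):
    """Return time (s) and energy (J) for 'compute once and broadcast'
    versus 'recompute on every rank'."""
    bcast = broadcast_time(n_ranks, msg_bytes, latency_s, bandwidth_bps)
    # Option A: one rank computes while the others idle, then broadcast.
    time_a = compute_s + bcast
    energy_a = (compute_s * (busy_w + (n_ranks - 1) * idle_w)
                + bcast * n_ranks * busy_w)
    # Option B: every rank recomputes the value; no communication.
    time_b = compute_s
    energy_b = compute_s * n_ranks * busy_w
    return {"broadcast": (time_a, energy_a), "recompute": (time_b, energy_b)}


# Hypothetical case: 1024 ranks, 1 ms calculation, 8-byte result,
# 1 microsecond latency, 10 GB/s links, 100 W busy and 40 W idle per rank.
result = compare(1024, 1e-3, 8, 1e-6, 1e10, 100.0, 40.0)
print(result)
```

With these particular inputs the recomputation finishes sooner, while the broadcast variant uses less energy, which mirrors the two conclusions in the text. Different latency, idle power, or calculation time can reverse either result, so the model is best used to locate the crossover for a given system.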

Total Productivity Assessment

Sometimes, optimization of an existing application requires rewriting some of its parts using a new computing paradigm or a different programming language. What will it take to implement the changes, and how will the new implementation impact the abstraction layers? The main question here is not how fast the application will run but, rather, what it will take to develop and optimize the application, as well as to support the code on future computational platforms. These are the final questions asked in our description, but they have to be thought through from the start. They should be considered from the angle of development productivity, ongoing maintenance, and potentially user support. Applications are rarely written and then forgotten. Users often require extensions of functionality, increases in performance, and support of new hardware features. Thus, in a majority of cases it is not only the application’s performance that matters but also the development team’s efficiency and the time it takes to extend functionality or port the code to new hardware. A detailed study aimed at selecting the most suitable programming model and languages for implementing the program may save a lot of effort in future support of the code.

Will you have all the necessary resources to implement the desired changes in the code? The resources will certainly include the complete suite of development tools for producing your application, debugging it, and profiling the final program and its components. Based on our own experience, if you have an established performance target, then working with powerful high-productivity tools like Intel Parallel Studio XE 2015 Cluster Edition will reduce the time and effort needed to reach the target. We schematically summarize this observation in Figure 8-4.

Figure 8-4. Achieved performance vs. effort, depending on tools usage

However, in addition to having the right tools, a successful optimization or development project will require knowledge of and access to new areas of expertise, and the time to learn new things. The tools, programming environments, and models based on open specifications or standards, such as OpenMP or the Message Passing Interface (MPI), which we have extensively covered in this book, allow easier access to knowledge and expertise and ensure portability of the code among different platforms whose vendors support the standard or specification. And since the standards and open specifications are supported by multiple vendors of hardware and middleware software, it is much easier to protect the investments made in your program development.

Summary

We discussed the data and control abstractions used on computer systems today across all hardware and software layers. Layered implementations are used to enable component-level design, increase code portability, and achieve investment protection. However, increased levels of abstraction add complexity and may impact performance. Very often the abstractions are unavoidable, as they are hidden inside the implementation of components that are outside of your control. At the same time, developers can often choose the coding abstractions used while implementing a program or improving the performance of an existing application.

There is no universal way to write the best and fastest performing application. Usually the performance is a compromise that involves many points of view. To find the best balance we suggest analyzing the abstractions involved and then judging whether the tradeoffs are reasonable and acceptable. We suggested several questions to be asked in addressing scaling versus performance, flexibility versus specialty, re-computing versus storing the data in memory or transferring over the network, as well as understanding the bounds and bottlenecks, and obtaining a total productivity assessment. Answering these questions will increase your understanding of the program internals and the ecosystem around it, and may result in new ideas about how to achieve even higher performance for your application.

References

  1. Intel Corporation, “Intel 64 and IA-32 Architectures Optimization Reference Manual,” www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html.

  2. C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, “Basic Linear Algebra Subprograms for Fortran Usage,” ACM Transactions on Mathematical Software (TOMS) 5, no. 3 (1979): 308–23.

  3. E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, et al., LAPACK Users’ Guide, 3rd ed. (Philadelphia: Society for Industrial and Applied Mathematics, 1999).

  4. M. Frigo and S. Johnson, “FFTW: an adaptive software architecture for the FFT,” in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 3 (Seattle: IEEE, 1998).

  5. L. Torvalds, “Linus Torvalds Blog,” https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6.

  6. Intel Corporation, “An Introduction to the Intel QuickPath Interconnect, Document Number: 320412,” January 2009, www.intel.com/content/dam/doc/white-paper/quick-path-interconnect-introduction-paper.pdf.

  7. ISO/IEC, “ISO/IEC International Standard 7498-1:1994 (E),” http://standards.iso.org/ittf/PubliclyAvailableStandards/s020269_ISO_IEC_7498-1_1994(E).zip.

  8. Khronos Group, “OpenGL: The Industry’s Foundation for High Performance Graphics,” www.opengl.org/.

  9. A. S. Tanenbaum, Structured Computer Organization (Englewood Cliffs, NJ: Prentice-Hall, 1979).

  10. A. Gupta, L. V. Kale, F. M. V. Gioachin, C. H. Suen, and Bu-Sung, “The Who, What, Why and How of High Performance Computing Applications,” HP Laboratories, www.hpl.hp.com/techreports/2013/HPL-2013-49.pdf.

  11. M. T. Jones, “Boost Application Performance Using Asynchronous I/O,” www.ibm.com/developerworks/library/l-async/.

  12. P. J. Besl, “A Case Study Comparing AoS (Arrays of Structures) and SoA (Structures of Arrays) Data Layouts for a Compute-intensive Loop Run on Intel Xeon Processors and Intel Xeon Phi Product Family Coprocessors,” https://software.intel.com/en-us/articles/a-case-study-comparing-aos-arrays-of-structures-and-soa-structures-of-arrays-data-layouts.

  13. H. Stüben and N. Yoshifumi, “BQCD,” www.rrz.uni-hamburg.de/bqcd.

  14. H. Stüben, “Lattice QCD Simulations on SuperMUC,” www.lrz.de/services/compute/supermuc/magazinesbooks/supermuc_results_2014/Hinnerk_Stueben_2014.pdf.

  15. M. McCool, J. Reinders, and A. Robison, Structured Parallel Programming: Patterns for Efficient Computation (San Francisco: Morgan Kaufmann, 2012).

  16. T. G. Mattson, B. A. Sanders, and B. L. Massingill, Patterns for Parallel Programming (Boston: Addison-Wesley Professional, 2006).