From the 1980s to the early 2000s, desktop PCs were the main computing platforms, with separate components such as the CPU, chipset, and discrete graphics cards. In this period, integrated graphics was in its infancy, starting with the Intel (R) 810 (TM) chipset and mainly targeting the low-cost market segment, and power consumption was not typically a concern. CPU speed was the overarching differentiator between one generation of platforms and the next. Consequently, when the micro-architecture of a CPU was being designed, one of the key questions was how to achieve higher performance. The traditional way to achieve that was to keep increasing the clock speed. However, growth in transistor speed had been approaching its physical limits, and this implied that the processor clock speed could not continue to increase. In the past few years, the maximum CPU speeds for desktops and tablets began to plateau and now range between 3–3.5 GHz and 1.5–2 GHz, respectively. With the advent of platforms with smaller form factors, keeping the processor frequency limited has become the new norm, while focus has shifted toward lowering the system power consumption and toward more efficient utilization of available system resources.

Digital video applications require huge amounts of processing. Additionally, real-time processing and playback requirements mandate certain capabilities and performance levels from the system. Only a couple of decades ago, real-time video encoding was possible only by using high-performance, special-purpose hardware or massively parallel computing on general-purpose processors, primarily in noncommercial academic solutions. Both hardware and software needed careful performance optimization and tuning at the system and application level to achieve reasonable quality in real-time video. However, with the tremendous improvement in processor speed and system resource utilization in recent years, encoding speeds that are orders of magnitude higher, with even better quality, can be achieved with today’s processors.

This chapter starts with a brief discussion of CPU clock speed and considers why indefinite increases in clock speed are impractical. The discourse then turns to motivations for achieving high video coding speed, and the tradeoffs necessary to achieve such performance. Then we discuss the factors affecting encoding speed, performance bottlenecks that can be encountered, and approaches to optimization. Finally, we present various performance-measurement considerations, tools, applications, methods, and metrics.

CPU Speed and its Limits

The following are the major reasons the CPU clock speed cannot continue to increase indefinitely:

  • High-frequency circuits consume power at a rate that increases with frequency; dissipating the resulting heat becomes impossible at a certain point. In 2001, Intel CTO Pat Gelsinger predicted, “Ten years from now, microprocessors will run at 10 GHz to 30 GHz.” But for their proportional size, “these chips will produce as much heat as a nuclear reactor.” [Footnote 1] Heat dissipation in high-frequency circuits is a fundamental problem with normal cooling technologies, and indefinite increases in frequency are not feasible from either economic or engineering points of view.

  • Contemporary power-saving techniques such as clock gating and power gating do not work with high-frequency circuits. In clock gating, a clock-enable is inserted before each state element such that the element is not clocked if the data remains unchanged. This saves significant charge/discharge that would be wasted in writing the same bit, but it introduces an extra delay into the critical clock path, which is not suitable for high-frequency design. In power gating, large transistors act as voltage sources for various functional blocks of the processor; the functional blocks can potentially be turned off when unused. However, owing to the extra voltage drop in power-gating transistors, the switching speed slows down; therefore, this technique is not amenable to high-frequency design, either.

  • Transistors themselves have reached a plateau in speed. While transistors are getting smaller, they are not getting much faster. To understand why, let’s consider the following fact from electronics: a thinner gate dielectric leads to a stronger electric field across the transistor channel, enabling it to switch faster. A reduction in transistor gate area means that the gate could be made thinner without adversely increasing the load capacitance necessary to charge up the control node to create the electric field. However, at 45 nm process technology, the gate dielectric was already approximately 0.9 nm thick, which is about the size of a single silicon-dioxide molecule. It is simply impossible to make this any thinner from the same material. With 22 nm, Intel has made use of the innovative tri-gate technology to combat this limitation. Further, changing the gate dielectric and the connection material helped increase the transistor speed but resulted in an expensive solution. Basically, the easy scaling we had in the 1980s and 1990s, when every shrink in transistor size would also lead to faster transistors, is not available anymore.

  • Transistors are no longer the dominant factor in processor speed. The wires connecting these transistors are becoming the most significant delay factor. As transistors become smaller, the connecting wires become thinner, offering higher resistances and allowing lower currents. Given the fact that smaller transistors are able to drive less current, it is easy to see that the circuit path delay is only partially determined by transistor switching speed. To overcome this, attempts are made during chip design to route the clock and the data signal on similar paths, thus obtaining about the same travel time for these two signals. This works effectively for data-heavy, control-light tasks such as a fixed-function video codec engine. However, the design of general-purpose microprocessors is complex, with irregular interactions and data travels to multiple locations that do not always follow the clock. Not only are there feedback paths and loops but there are also control-heavy centralized resources such as scheduling, branch prediction, register files, and so on. Such tasks can be parallelized using multiple cores, but thinner wires are required when processor frequencies are increased.

Motivation for Improvement

In the video world, performance is an overloaded term. In some literature, encoder performance refers to the compression efficiency in terms of the number of bits used to obtain a certain visual quality level. The average bit rate savings of the test encoder compared to a reference encoder is reckoned as the objective coding performance criterion. Examples of this approach can be found in Nguyen and Marpe [Footnote 2] and Grois et al. [Footnote 3] From another view, encoder performance means the encoding speed in frames per second (FPS). In this book, we adopt this latter meaning. We also note that FPS may be used for different purposes. A video clip generally has an associated frame rate in terms of FPS (e.g., 24 or 30 FPS), which means that the clip is supposed to be played back in real time (i.e., at that specified FPS) to offer the perception of smooth motion. However, when the compressed video clip is generated, the processing and compression tasks can be carried out many times faster than real time; this speed, also expressed in FPS, is referred to as the encoding speed or encoder performance. Note that in some real-time applications such as video conferencing, where the video frames are only consumed in real time, an encoding speed faster than real time is not necessary but is still beneficial, as faster processing allows the processor to go to an idle state early, thereby saving power.

However, there are several video applications and usages where faster than real-time processing is desirable. For example:

  • Long-duration video can be compressed in a much shorter time. This is useful for video editors, who typically deal with a large amount of video content and work within specified time limits.

  • Video archiving applications can call for compressing and storing large amounts of video, and can benefit from fast encoding.

  • Video recording applications can store the recorded video in a suitable compressed format; the speedy encoding allows concurrently running encoding and/or non-encoding tasks to share processing units.

  • Converting videos from one format to another benefits from fast encoding. For example, several DVDs can be simultaneously converted from MPEG-2 to AVC using popular video coding applications such as HandBrake.

  • Video transcoding for authoring, editing, uploading to the Internet, burning to discs, or cloud distribution can take advantage of encoding as fast as possible. In particular, by using multiple times faster than real-time encoding, many cloud-based video distribution-on-demand services can serve multiple requests simultaneously while optimizing the network bandwidth by packaging together multiple bitstreams for distribution.

  • Video transrating applications can benefit from fast encoding. Cable, telecommunications, and satellite video distribution is often made efficient by transrating a video to a lower bit rate, thereby accommodating more video programs within the same channel bandwidth. Although the overall delay in a transrating and repacketization system is typically constant and only real-time processing is needed, speedup in the transrating and constituent encoding tasks is still desirable from the point of view of scheduling flexibility and resource utilization.
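To make the notion of faster-than-real-time encoding concrete, the following minimal sketch relates a clip's playback frame rate to an encoder's sustained encoding speed. The function names and the numbers in the example (a two-hour clip at 30 FPS, an encoder sustaining 240 FPS) are hypothetical and chosen only for illustration; they are not measurements.

```python
def realtime_factor(clip_fps: float, encoding_fps: float) -> float:
    """How many times faster than real time the encoder runs."""
    return encoding_fps / clip_fps


def encode_time_seconds(duration_s: float, clip_fps: float, encoding_fps: float) -> float:
    """Wall-clock time needed to encode a clip of the given duration."""
    total_frames = duration_s * clip_fps
    return total_frames / encoding_fps


if __name__ == "__main__":
    # Hypothetical example: two-hour clip at 30 FPS, encoder sustaining 240 FPS.
    factor = realtime_factor(30.0, 240.0)
    seconds = encode_time_seconds(2 * 3600, 30.0, 240.0)
    print(f"{factor:.1f}x real time, {seconds / 60:.0f} minutes to encode")
```

Under these assumed numbers the encoder runs at eight times real time, which is the kind of headroom the editing, archiving, and transcoding usages above rely on.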

Typical video applications involve a series of tasks, such as video data capture; compression, transmission, or storage; decompression; and display, while trying to maintain a constant overall system delay. The delay introduced by the camera and display devices is typically negligible; quite often, the decoding, encoding, and processing times become the performance focus. Among these, the decoding tasks are usually specified by the video standards and they need a certain number of operations per second. But the magnitude of computation in video encoding and processing tasks exceeds by a large margin the computational need of the decoding tasks. Therefore, depending on the application requirements, the encoding and processing tasks are usually more appropriate candidates for performance optimization, owing to their higher complexities.

In particular, video encoding requires a large number of signal processing operations, on the order of billions of operations per second. Fortunately, video compression can easily be decomposed into pipelined tasks. Within the individual tasks, the video data can be further partitioned in either spatial or temporal dimensions into a set of independent sections, making it suitable for parallel processing. Taking advantage of this property, it is possible to obtain faster than real-time video encoding performance by using multiple processing units concurrently. These processing units may be a combination of dedicated special-purpose fixed-function and/or programmable hardware units. The advantage of specialized hardware is that it is usually optimized for specific tasks, so that those tasks are accomplished in a performance- and power-optimized manner. However, programmable units provide flexibility and do not become obsolete easily. Performance tuning for programmable units is also less expensive than for dedicated hardware units. Therefore, efficiently combining the specialized and programmable units into a hybrid solution can deliver an order of magnitude greater than real-time performance, as offered by the recent Intel (R) Core (TM) and Intel (R) Atom (TM) CPUs, where the heavy lifting of the encoding tasks is carried out by the integrated graphics processing units (GPUs).
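The spatial partitioning described above can be sketched as follows. This is only an illustration of the structure, not an encoder: the frame is a list of pixel rows, `process_slice` is a hypothetical stand-in for per-slice encoding work (it merely sums pixel values), and the slice count is an assumed parameter. Threads are used here to show how independent sections map onto workers; in CPython they do not provide true CPU parallelism for such work, which in practice would run on native threads or dedicated hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(height, num_slices):
    """Return (start, end) row ranges covering [0, height) as evenly as possible."""
    base, extra = divmod(height, num_slices)
    ranges, start = [], 0
    for i in range(num_slices):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def process_slice(frame, rows):
    """Hypothetical stand-in for per-slice encoding work: sum the pixel values."""
    start, end = rows
    return sum(sum(row) for row in frame[start:end])


def process_frame_parallel(frame, num_slices=4):
    """Process independent horizontal slices of one frame concurrently."""
    ranges = split_rows(len(frame), num_slices)
    with ThreadPoolExecutor(max_workers=num_slices) as pool:
        # map() preserves slice order, so results line up with the row ranges.
        return list(pool.map(lambda r: process_slice(frame, r), ranges))


if __name__ == "__main__":
    frame = [[r] * 4 for r in range(10)]  # tiny 10x4 test frame
    print(split_rows(len(frame), 4))
    print(process_frame_parallel(frame, 4))
```

Because the slices share no state, the per-slice results can be combined in any order; dependencies between sections, where they exist, are what limit this kind of decomposition in a real encoder.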

Performance Considerations

In video encoding and processing applications, performance optimization aims to appropriately change the design or implementation to improve the encoding or processing speed. Increasing the processor frequency alone does not yield the best-performing encoding solution, and as discussed before, there is a limit to such frequency increase. Therefore, other approaches for performance enhancement need to be explored. Note that some techniques implement the necessary design or implementation changes relatively cheaply, but others may need significant investment. For example, inexpensive approaches to obtaining higher performance include parallelization of encoding tasks, adjusting schedules of the tasks, optimization of resource utilization for individual tasks, and so on. It is interesting to note that higher performance can also be achieved by using more complex dedicated-hardware units, which in turn are more expensive to manufacture. A general consideration for performance optimization is to judiciously choose the techniques that would provide the highest performance with lowest expense and lowest overhead. However, depending on the nature of the application and available resources, it may be necessary to accommodate large dollar expenditures to provide the expected performance. For example, a bigger cache may cost more money, but it will likely help achieve certain performance objectives. Thus, the tradeoffs for any performance optimization must be well thought out.

Usually performance optimization is not considered by itself; it is studied together with visual quality and aspects of power consumption. For instance, a higher CPU or GPU operating frequency will provide faster encoding speed, but will also consume more energy. A tradeoff between energy consumed and faster encoding speed is thus necessary at the system design and architectural level. For today’s video applications running on resource-constrained computing platforms, a balanced tradeoff can be obtained by maximizing the utilization of available system resources when they are active and putting them to sleep when they are not needed, thereby achieving simultaneous power optimization.

However, note that higher encoding speeds can also be achieved by manipulating some video encoding parameters such as the bit rate or quantization parameters. By discarding a large percentage of high-frequency details, less information remains to be processed and the encoding becomes faster. However, this approach directly affects the visual quality of the resulting video. Therefore, a balance is also necessary between visual quality and performance achieved using this technique.
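The effect of a coarser quantizer can be illustrated with a minimal sketch. The coefficient values below are hypothetical, and the quantizer is a simplified uniform one that truncates toward zero; real encoders use more elaborate rounding and rate control. The point is only that a larger step size zeroes out more high-frequency coefficients, leaving less information for the later stages to process.

```python
def quantize(coeffs, step):
    """Simplified uniform quantization: divide by the step, truncate toward zero."""
    return [int(c / step) for c in coeffs]


def nonzero_count(levels):
    """Number of quantized coefficients that still carry information."""
    return sum(1 for v in levels if v != 0)


if __name__ == "__main__":
    # Hypothetical transform coefficients, ordered from low to high frequency.
    coeffs = [620, -184, 97, -45, 22, -13, 7, -3]
    for step in (8, 32, 64):
        levels = quantize(coeffs, step)
        print(step, levels, nonzero_count(levels))
```

With these assumed values, fewer coefficients survive as the step size grows, and the ones that disappear first are the small high-frequency terms, which is also where the loss in visual detail comes from.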

There are three major ways encoding performance can be maximized for a given period of time:

  • Ensure that available system resources, including the processor and memory, are fully utilized during the active period of the workload. However, depending on the workload, the nature of resource utilization may be different. For example, an encoding application should run at a 100 percent duty cycle of the processor. As mentioned earlier, such performance maximization can also include considerations for power optimization—for example, by running at 100 percent duty cycle for as long as necessary and quickly going to sleep afterwards. However, for a real-time playback application, it is likely that only a fraction of the resources will be utilized—say, at 10 percent duty cycle. In such cases, performance optimization may not be needed and power saving is likely to be emphasized instead.

  • Use specialized resources, if available. As these resources are generally designed for balanced performance and power for certain tasks, this approach would provide performance improvement without requiring explicit tradeoffs.

  • Depending on the application requirements, tune certain video parameters to enhance encoding speed. However, encoding parameters also affect quality, compression, and power; therefore, their tradeoffs against performance should be carefully considered.

Maximum Resource Utilization

Applications, services, drivers, and the operating system compete for the important system resources, including processor time, physical memory space and virtual address space, disk service time and disk space, network bandwidth, and battery power. To achieve the best performance per dollar, it is important to maximally utilize the available system resources for the shortest period of time possible. Thus, maximum performance is obtained at the cost of minimum power consumption. Toward this end, the following techniques are typically employed:

  • Task parallelization: Many tasks are independent of each other and can run in parallel, where resources do not need to wait until all other tasks are done. Parallelization of tasks makes full utilization of the processor. Often, pipelines of tasks can also be formed to keep the resources busy during the operational period, thereby achieving maximum resource utilization. (Task parallelization will be discussed in more detail in a later section.)

  • Registers, caches, and memory utilization: Optimal use of memory hierarchy is an important consideration for performance. Memory devices at a lower level are faster to access, but are smaller in size; they have higher transfer bandwidth with fewer transfer units, but are more costly per byte compared to the higher level memory devices. Register transfer operations are controlled by the processor at processor speed. Caches are typically implemented as static random access memories (SRAMs) and are controlled by the memory management unit (MMU). Careful use of multiple levels of cache in system-level programs can provide a balance between data access latency and the size of the data. Main memories are typically implemented as dynamic RAMs (DRAMs), are much larger than the cache, but require slower direct memory access (DMA) operations for data access. The main memory typically has multiple modules connected by a system bus or switching network. Memory is accessed randomly or on a block-by-block basis. In parallel memory organizations, both interleaved and pipelined accesses are practiced: interleaving spreads contiguous memory locations into different memory modules, while accesses to memory modules are overlapped in a pipelined fashion. Performance of data transfer between adjacent levels of memory hierarchy is represented in terms of hit (or miss) ratios, that is, the probability that an information item will be found at a certain memory level. The frequency of memory access and the effective access time depend on the program behavior and choices in memory design. Often, extensive analysis of program traces can lead to optimization opportunities.

  • Disk access optimization: Video encoding consists of processing large amounts of data. Therefore, often disk I/O speed, memory latency, memory bandwidth, and so on become the performance bottlenecks rather than the processing itself. Many optimization techniques are available in the literature addressing disk access. Use of redundant arrays of inexpensive disks (RAID) is a common but costly data-storage virtualization technique that controls data access redundancy and provides balance among reliability, availability, performance, and capacity.

  • Instruction pipelining: Depending on the underlying processor architecture, such as complex instruction set computing (CISC) processor, reduced instruction set computing (RISC) processor, very long instruction word (VLIW) processor, vector supercomputer, and the like, the cycles per instruction are different with respect to their corresponding processor clock rates. However, to achieve the minimum number of no operations (NOPs) and pipeline stalls, and thereby optimize the utilization of resources, there needs to be careful instruction pipelining and pipeline synchronization.
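The relationship between hit ratios and effective access time mentioned under memory utilization can be sketched with a commonly used model, in which the access frequency of a level is its hit ratio multiplied by the probability that every faster level missed. The model and all numbers below are assumptions for illustration (a three-level hierarchy with hypothetical hit ratios and access times in nanoseconds); they do not describe any particular processor.

```python
def effective_access_time(levels):
    """Effective access time for a memory hierarchy.

    levels: list of (hit_ratio, access_time) pairs, fastest level first.
    The last level should have a hit ratio of 1.0 so every access is served.
    """
    t_eff = 0.0
    miss_so_far = 1.0  # probability that all faster levels missed
    for hit_ratio, access_time in levels:
        t_eff += miss_so_far * hit_ratio * access_time
        miss_so_far *= 1.0 - hit_ratio
    return t_eff


if __name__ == "__main__":
    # Hypothetical hierarchy: (hit ratio, access time in ns).
    hierarchy = [(0.95, 1.0), (0.80, 10.0), (1.0, 100.0)]
    print(f"effective access time: {effective_access_time(hierarchy):.2f} ns")
```

Even with a 95 percent hit ratio at the fastest level, the rare accesses that reach the slowest level contribute a large share of the total in this example, which is why improving locality in the inner loops of an encoder tends to pay off.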

Resource Specialization

In addition to maximizing the utilization of resources, performance is enhanced by using specialized resources. Particular improvements in this area include the following:

  • Special media instruction sets: Modern processors have enhanced instruction sets that include special media instructions possessing inherent parallelism. For example, to calculate the sum of absolute differences (SAD) for a vector of eight 16-bit pixels, a 128-bit single instruction multiple data (SIMD) instruction can be used, expending one load and one parallel operation, as opposed to the traditional sequential approach where sixteen 16-bit loads, eight subtractions, eight absolute-value operations, and eight accumulation operations would have been needed. For encoding tasks such as motion estimation, such media instructions play the most important role in speeding up the compute-intensive task.

  • GPU acceleration: Traditionally, video encoding tasks have been carried out on multi-core CPUs. Operation-intensive tasks such as video encoding often run with high CPU utilization for all cores. For higher resolution videos, the CPU can be pushed beyond its capability so that the task would not be complete in real time. There are several research efforts to employ parallelization techniques on various shared-memory and distributed-memory platforms to deal with this issue, some of which are discussed in the next section. However, it is easy to see that to obtain a desirable and scalable encoding solution, CPU-only solutions are often not sufficient.

Recent processors such as Intel Core and Atom processors offer hardware acceleration for video encoding and processing tasks by using the integrated processor graphics hardware. While special-purpose hardware units are generally optimized for certain tasks, general-purpose computing units are more flexible in that they can be programmed for a variety of tasks. The Intel processor graphics hardware is a combination of fixed-function and programmable units, providing a balance among speed, flexibility, and scalability. Substantial attention is also paid to optimizing the systems running these graphics hardware for low power consumption, thus providing high performance with reduced power cost. Thus, using hardware acceleration for video encoding and processing tasks is performance and power friendly as long as the real-time supply of input video data is ensured.

Figure 5-1 shows CPU utilization of a typical encoding session with and without processor graphics hardware—that is, GPU acceleration. From this figure, it is obvious that employing GPU acceleration not only makes the CPU available for other tasks but also increases the performance of the encoding itself. In this example, the encoding speed went up from less than 1 FPS to over 86 FPS.

Figure 5-1. CPU utilization of typical encoding with and without GPU acceleration
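The sequential SAD computation described under special media instruction sets can be sketched as follows, with explicit operation counters to show where the sixteen loads, eight subtractions, eight absolute-value operations, and eight accumulations come from. The pixel values are hypothetical. The sketch shows only the scalar baseline; a SIMD instruction would operate on all eight lanes at once, which plain Python cannot express directly.

```python
def sad_scalar(a, b):
    """Sequential sum of absolute differences, counting each operation."""
    ops = {"loads": 0, "subs": 0, "abs": 0, "adds": 0}
    total = 0
    for i in range(len(a)):
        x, y = a[i], b[i]  # one load from each pixel vector
        ops["loads"] += 2
        d = x - y
        ops["subs"] += 1
        d = abs(d)
        ops["abs"] += 1
        total += d
        ops["adds"] += 1
    return total, ops


if __name__ == "__main__":
    # Hypothetical pixel vectors of eight 16-bit values each.
    current = [10, 20, 30, 40, 50, 60, 70, 80]
    reference = [12, 18, 33, 40, 45, 66, 70, 79]
    print(sad_scalar(current, reference))
```

Motion estimation evaluates this kind of sum for many candidate blocks per macroblock, so collapsing the per-pixel operations into a few wide instructions has a multiplied effect on encoding speed.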

Video Parameters Tuning

To tune the video parameters for optimum performance, it is important to understand the main factors that contribute to performance, and to identify and address the typical performance bottlenecks.

Factors Determining Encoding Speed

Many factors affect the video encoding speed, including system hardware, network configurations, storage device types, nature of the encoding tasks, available parallelization opportunities, video complexity and formats, and hardware acceleration possibilities. Interactions among these factors can make performance tuning complex.

System Configurations

There are several configurable system parameters that affect, to varying degrees, the performance of workloads such as the video encoding speed. Some of these parameters are the following:

  • Number of cores: The number of processing CPU and GPU cores directly contributes to workload performance. Distributing the workload into various cores can increase the speed of processing. In general, all the processing cores should be in the same performance states for optimum resource utilization. The performance states are discussed in Chapter 6 in detail.

  • CPU and GPU frequencies: The CPU and GPU core and package clock frequencies are the principal determining factors for the execution speed of encoding tasks. Given that such tasks can take advantage of full hardware acceleration, or can be shared between the CPU and the GPU, utilization of these resources, their capabilities in terms of clock frequencies, the dependences and scheduling among these tasks, and the respective data access latencies are crucial factors for performance optimization.

  • Memory size and memory speed: Larger memory size is usually better for video encoding and processing tasks, as this helps accommodate the increasingly higher video resolutions without excessive memory paging costs. Higher memory speed, obviously, also significantly contributes to speeding up these tasks.

  • Cache configurations: Cache memory is a fast memory built into the CPU or other hardware units, or located next to it on a separate chip. Frequently repeated instructions and data are stored in the cache memory, allowing the CPU to avoid loading and storing data from the slower system bus, and thereby improving overall system speed. Cache built into the CPU itself is referred to as Level 1 (L1) cache, while cache residing on a separate chip next to the CPU is called Level 2 (L2) cache. Some CPUs have both L1 and L2 caches built in and designate the cache chip as Level 3 (L3) cache. Use of L3 caches significantly improves the performance of video encoding and processing tasks. Similarly, integrated GPUs have several layers of cache. Further, recent processors with embedded dynamic random access memories (eDRAMs) generally yield 10 to 12 percent higher performance for video encoding tasks.

  • Data access speed: Apart from scheduling delays, data availability for processing depends on the non-volatile storage speed and storage type. For example, solid-state disk drives (SSDs) provide much faster data access compared to traditional spinning magnetic hard disk drives, without sacrificing reliability. Disk caching in hard disks uses the same principle as memory caching in CPUs. Frequently accessed hard-disk data is stored in a separate segment of RAM, avoiding frequent retrieval from the hard disk. Disk caching yields significantly better performance in video encoding applications where repeated data access is quite common.

  • Chipset and I/O throughput: Given that uncompressed video is input to the video encoding tasks, and some processing tasks also output the video in uncompressed formats, often I/O operations become the bottleneck in these tasks, especially for higher resolution videos. In I/O-bound tasks, an appropriately optimized chipset can remove this bottleneck, improving overall performance. Other well-known techniques to improve the efficiency of I/O operations and to reduce the I/O latency include intelligent video data placement on parallel disk arrays, disk seek optimization, disk scheduling, and adaptive disk prefetching.

  • System clock resolution: The default timer resolution in Windows is 15.625 msec, corresponding to 64 timer interrupts per second. For tasks such as video encoding, where all operations related to a video frame must be done within the specified time frame (e.g., 33 msec for 30 fps video), the default timer resolution is not sufficient. This is because a task may need to wait until the next available timer tick to get scheduled for execution. Since there are often dependences among the encoding tasks, such as DCT transform and variable length coding, scheduling these tasks must carefully consider timer resolution along with the power consumption for optimum performance. In many applications, a timer resolution of 1 msec is typically a better choice.

  • BIOS: Several performance-related parameters can be adjusted from the BIOS; among them are peripheral component interconnect express (PCIe) latency and clock gating, advanced configuration and power interface (ACPI) settings (e.g., disabling hibernation), CPU configuration (e.g., enabling adjacent cache line prefetch), CPU and graphics power management control (e.g., allowing support for more than two frequency ranges, allowing turbo mode, allowing CPU to go to C-states when it is not fully utilized [details of C-states are discussed in Chapter 6], configuring C-state latency, setting interrupt response time limits, enabling graphics render standby), enabling overclocking features (e.g., setting graphics overclocking frequency), and so on.

  • Graphics driver: Graphics drivers incorporate various performance optimizations, particularly for hardware-accelerated video encoding and processing tasks. Appropriate and updated graphics drivers would make a difference in attaining the best performance.

  • Operating system: Operating systems typically perform many optimizations, improving the performance of the run-time environments. They also control priorities of processes and threads. For example, Dalvik and ART (Android RunTime) are the old and new run times, respectively, that execute the application instructions inside Android. While Dalvik is a just-in-time (JIT) run time that compiles code only when it is needed, ART, which was introduced in Android 4.4 KitKat, is an ahead-of-time (AOT) run time that compiles code before it is actually needed. Comparisons between Dalvik and ART on Android 4.4 have shown that the latter brings enhanced performance and battery efficiency; ART became the default run time for devices running Android version 5.0 (Lollipop).

  • Power settings: In addition to thermal design power (TDP), Intel has introduced a new specification called scenario design power (SDP), starting with the third-generation Core and Pentium Y-processors. While TDP specifies power dissipation under worst-case real-world workloads and conditions, SDP specifies power dissipation under a specific usage scenario. SDP can be used for benchmarking and evaluation of power characteristics against specific target design requirements and system cooling capabilities. Generally, processors with higher TDP (or SDP) give higher performance. Therefore, depending on the need, a user can choose to obtain a system with higher TDP. However, on a given platform, the operating system usually offers different power setting modes, such as high performance, balanced, or power saver. These modes control how aggressively the system will go to various levels of idle states. These modes have a noticeable impact on performance, especially for video encoding and processing applications.
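The arithmetic behind the system clock resolution item above can be checked with a short sketch. It uses the figures given in the text (a 15.625 msec default tick, 30 fps video, and a 1 msec alternative); the function names are hypothetical, and the "worst case" assumes, as a simplification, that a task ready just after a tick waits one full tick before being scheduled.

```python
def ticks_per_second(resolution_ms):
    """Number of timer interrupts per second for a given timer resolution."""
    return 1000.0 / resolution_ms


def frame_budget_ms(fps):
    """Time available to finish all operations for one frame."""
    return 1000.0 / fps


def worst_case_wait_fraction(resolution_ms, fps):
    """Share of the frame budget a task may lose waiting for the next tick."""
    return resolution_ms / frame_budget_ms(fps)


if __name__ == "__main__":
    for resolution in (15.625, 1.0):
        share = worst_case_wait_fraction(resolution, 30)
        print(f"{resolution} ms tick: {ticks_per_second(resolution):.0f} ticks/s, "
              f"up to {share:.1%} of a 30 fps frame budget lost per wait")
```

Under this simplified model, a single wait on the default tick can consume almost half of the 33 msec frame budget, and chains of dependent tasks may incur several such waits, whereas a 1 msec resolution keeps each wait to a few percent at the cost of more frequent timer interrupts.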

The Nature of Workloads

The nature of a workload can influence the performance and can help pinpoint possible bottlenecks. For example, for video coding applications, the following common influential factors should be considered:

  • Compute-bound tasks: A task is “compute bound” if it would complete earlier on a faster processor. It is also considered compute bound if the task is parallelizable and can have an earlier finish time with an increased number of processors. This means the task spends the majority of its time using the processor for computation rather than on I/O or memory operations. Depending on the parameters used, many video coding tasks, such as motion estimation and prediction, mode decision, transform and quantization, in-loop deblocking, and so on, may be compute bound. Integrated processor graphics, where certain compute-intensive tasks are performed using fixed-function hardware, greatly helps improve the performance of compute-bound tasks.

  • I/O-bound tasks: A task is “I/O bound” if it would complete earlier with an increase in speed of the I/O subsystem or the I/O throughput. Usually, disk speed limits the performance of I/O-bound tasks. Reading raw video data from files for input to a video encoder, especially reading higher resolution uncompressed video data, is often I/O bound.

  • Memory-bound tasks: A task is “memory bound” if its rate of progress is limited by the amount of memory available and the speed of that memory access. For example, storing multiple reference frames in memory for video encoding is likely to be memory bound. The same task may be transformed from compute bound to memory bound on higher frequency processors, owing to the ability of faster processing.

  • Inter-process communication: Owing to dependences, tasks running on different processes in parallel often need to communicate with each other. This is quite common in parallel video encoding tasks. Depending on the configuration of the parallel platform, interprocess communication may materialize using message passing, using shared memory, or other techniques. Excessive interprocess communication adversely affects the performance and increasingly dominates the balance between the computation and the communication as the number of processes grows. In practice, to achieve improved scalability, parallel video encoder designers need to minimize the communication cost, even at the expense of increased computation or memory operations.

  • Task scheduling: The scheduling of tasks running in parallel has a huge impact on overall performance, particularly on heterogeneous computing platforms. Heterogeneous multi-core processors with the same instruction set architecture (ISA) are typically composed of small (e.g., in-order) power-efficient cores and big (e.g., out-of-order) high-performance cores. In general, small cores can achieve good performance if the workload inherently has high levels of instruction level parallelism (ILP). On the other hand, big cores provide good performance if the workload exhibits high levels of memory-level parallelism (MLP) or requires the ILP to be extracted dynamically. Therefore, scheduling decisions on such platforms can be significantly improved by taking into account how well a small or big core can exploit the ILP and MLP characteristics of a workload. On the other hand, making wrong scheduling decisions can lead to suboptimal performance and excess energy or power consumption. Techniques are available in the literature to understand which workload-to-core mapping is likely to provide the best performance.Footnote 4

  • Latency: Latency usually results from communication delay of a remote memory access and involves network delays, cache miss penalty, and delays caused by contentions in split transactions. Latency hiding can be accomplished through four complementary approachesFootnote 5: (i) using prefetching techniques that bring instructions or data close to the processor before they are actually needed, (ii) using coherent caches supported by hardware to reduce cache misses, (iii) using relaxed memory consistency models that allow buffering and pipelining of memory references, and (iv) using multiple-context support that allows a processor to switch from one context to another when a long-latency operation is encountered. Responsiveness of a system depends on latency. For real-time video communication applications such as video conferencing, latency is an important performance factor, as it significantly impacts the user experience.

  • Throughput: Throughput is a measure of how many tasks a system can execute per unit of time. This is also known as the system throughput. The number of tasks the CPU can handle per unit time is the CPU throughput. As system throughput is derived from the CPU (and other resource) throughput, when multiple tasks are interleaved for CPU execution, CPU throughput is higher than the system throughput. This is due to the system overheads caused by the I/O, compiler, and the operating system, because of which the CPU is kept idle for a fraction of the time. In real-time video communication applications, the smoothness of the video depends on the system throughput. Thus, it is important to optimize all stages in the system, so that inefficiency in one stage does not hinder overall performance.

Encoding Tools and Parameters

It should be noted that not only do the various algorithmic tasks affect the performance, but some video encoding tools and parameters are also important factors. Most of these tools emerged as quality-improvement tools or as tools to provide robustness against transmission errors. Fortunately, however, they usually offer opportunities for performance optimization through parallelization. The tools that are not parallelization friendly can take advantage of algorithmic and code optimization techniques, as described in the following sections. Here are a few important tools and parameters.

Independent data units

To facilitate parallelization and performance gain, implementations of video coding algorithms usually exploit frame-level or group of frame-level independence or divide video frames into independent data units such as slices, slice groups, tiles, or wavefronts.

At the frame level, usually there is little parallelism owing to motion compensation dependences. Even if parallelized, because of the varying frame complexities, the encoding and decoding times generally fluctuate a lot, thus creating an imbalance in resource utilization. Also, owing to dependency structure, the overall latency may increase with frame-level parallelization.

A video frame consists of one or more slices. A slice is a group of macroblocks usually processed in raster-scan order. Figure 5-2 shows a typical video frame partitioned into several slices or groups of slices.

Figure 5-2. Partitioning of a video frame into slices and slice groups

Slices were introduced mainly to prevent loss of quality in the case of transmission errors. As slices are defined as independent data units, loss of a slice is localized and may not impact other slices unless they use the lost slice as a reference. Exploiting the same property of independence, slices can be processed in parallel for increased performance. In an experiment using a typical AVC encoder, it was found that four slices per frame can yield a 5 to 15 percent performance gain compared to a single slice per frame, depending on the encoding parameters. However, employing slices for parallelism may incur significant coding efficiency losses. This is because, to keep the data units independent, spatial redundancy reduction opportunities may be wasted. Such loss in coding efficiency may be manifested as a loss in visual quality. For example, in the same experiment with the AVC encoder, four slices per frame resulted in a visual quality loss of ∼0.2 to ∼0.4 dB compared to a single slice per frame, depending on the encoding parameters. Further, a decoder relying solely on performance gains from parallel processing of multiple slices may not obtain such gains if it receives a video sequence with a single slice per frame.

The concept of slice groups was also introduced as an error-robustness feature. Macroblocks belonging to a slice group are typically mixed with macroblocks from other slice groups during transmission, so that loss of network packets minimally affects the individual slices in a slice group. However, owing to the independence of slice groups, they are good candidates for parallelization as well.

In standards after H.264, the picture can be divided into rectangular tiles—that is, groups of coding tree blocks separated by vertical and horizontal boundaries. Tile boundaries, similarly to slice boundaries, break parse and prediction dependences so that a tile can be processed independently, but the in-loop filters such as the deblocking filters can still cross tile boundaries. Tiles have better coding efficiency compared to slices. This is because tiles allow picture partition shapes that contain samples with a potential higher correlation than slices, and tiles do not have the slice header overhead. But, similar to slices, the coding efficiency loss increases with the number of tiles, owing to the breaking of dependences along partition boundaries and the resetting of CABAC probabilities at the beginning of each partition.

In the H.265 standard, wavefronts are introduced to process rows of coding tree blocks in parallel, each row starting with the CABAC probabilities available after processing the second block of the row above. This creates a different type of dependency, but still provides an advantage compared to slices and tiles, in that no coding dependences are broken at row boundaries. Figure 5-3 shows an example wavefront.

Figure 5-3. Wavefronts amenable to parallel processing; for the starting macroblock of a row, CABAC probabilities are propagated from the second block of the previous macroblock row

The CABAC probabilities are propagated from the second block of the previous row without altering the raster-scan order. This reduces the coding efficiency losses and results in only small rate-distortion differences compared to nonparallel bitstreams. However, the wavefront dependencies mean that all the rows cannot start processing at the same time. This introduces parallelization inefficiencies, a situation that is more prominent with more parallel processors.

The ramping inefficiencies of wavefront parallel processing can, however, be mitigated by overlapping the execution of consecutive pictures.Footnote 6 Experimental results reported by Chi et al. show that on a 12-core system running at 3.33 GHz, for decoding of 3840×2160 video sequences, overlapped wavefronts provide a speedup by a factor of nearly 11, while regular wavefronts and tiles provide speedups of 9.3 and 8.7, respectively.
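The ramp-up effect of regular (non-overlapped) wavefronts can be illustrated with a small model. The following Python sketch is only an illustration under simplifying assumptions that are not from the text: every block takes one unit of time, a row may process its next block only when the row above is two blocks ahead (or finished), and free workers are always given to the uppermost ready rows. The function names and the 64×64 block size used in the demo are hypothetical.

```python
def wavefront_steps(rows, cols, workers):
    """Return the number of unit time steps needed to process a grid of
    rows x cols blocks with wavefront dependencies and a worker limit."""
    done = [0] * rows          # blocks already finished in each row
    steps = 0
    while done[-1] < cols:
        # Snapshot of the rows that can advance by one block in this step.
        ready = []
        for r in range(rows):
            if done[r] == cols:
                continue       # row already finished
            # A row may process its next block once the row above has
            # finished two more blocks than this row (or has finished).
            if r == 0 or done[r - 1] >= min(done[r] + 2, cols):
                ready.append(r)
        for r in ready[:workers]:   # upper rows get workers first
            done[r] += 1
        steps += 1
    return steps


def efficiency(rows, cols, workers):
    """Speedup over a single worker, divided by the number of workers."""
    speedup = (rows * cols) / wavefront_steps(rows, cols, workers)
    return speedup / workers


if __name__ == "__main__":
    # Assumed example: a 3840x2160 picture with 64x64 blocks is 60 x 34 blocks.
    for workers in (1, 2, 4, 8, 12, 34):
        print(workers, round(efficiency(34, 60, workers), 3))
```

With as many workers as rows, the model needs cols + 2 × (rows − 1) steps, so the idle time during ramp-up and ramp-down grows with the number of workers, which is the inefficiency that overlapping consecutive pictures aims to hide.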

GOP structure

The encoding of intra-coded (I) pictures, predicted (P) pictures, and bi-predicted (B) pictures requires different amounts of computation and consequently has different finish times. The pattern of their combination, commonly known as the group of pictures (GOP) structure, is thus an important factor affecting the encoding speed. In standards before H.264, I-pictures were the fastest and B-pictures were the slowest to encode, owing to the added motion estimation and related complexities. However, in the H.264 and later standards, I-pictures may also take a long time because of Intra prediction.

Depending on the video contents, the use of B-pictures in the H.264 standard may decrease the bit rate by up to 10 percent for the same quality, but their impact on performance varies from one video sequence to another, as the memory access frequency varies from -16 to +12 percent.Footnote 7 Figure 5-4 shows the results of another experiment comparing the quality achieved by using no B-picture, one B-picture, and two B-pictures. In this case, using more B-pictures yields better quality. As a rule of thumb, B-pictures may make the coding process slower for a single processing unit, but they can be more effectively parallelized, as a B-picture typically is not dependent on another B-picture unless it is used as a reference—for instance, in a pyramid structure.

Figure 5-4. Effect of B-pictures on quality for a 1280×720 H.264 encoded video sequence named park run

Bit rate control

Using a constant quantization parameter for each picture in a group of pictures is generally faster than trying to control the quantization parameter based on an available bit budget and picture complexity. Extra compute must be done for such control. Additionally, bit rate control mechanisms in video encoders need to determine the impact of choosing certain quantization parameters on the resulting number of bits as they try to maintain the bit rate and try not to overflow or underflow the decoder buffer. This involves a feedback path from the entropy coding unit back to the bit rate control unit, where bit rate control model parameters are recomputed with the updated information of bit usage. Often, this process may go through multiple passes of entropy coding or computing model parameters. Although the process is inherently sequential, algorithmic optimization of bit rate control can be done to improve performance for applications operating within a limited bandwidth of video transmission. For example, in a multi-pass rate control algorithm, trying to reduce the number of passes will improve the performance. An algorithm may also try to collect the statistics and analyze the complexity in the first pass and then perform actual entropy coding in subsequent passes until the bit rate constraints are met.
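The feedback path described above can be sketched with a toy model. The following Python example is not an actual rate control algorithm from any encoder; it assumes a hypothetical bit-production model in which the number of bits halves for every increase of 6 in the quantization parameter (QP), and it simply nudges the QP by one step after each picture depending on whether the running bit budget is overspent or underspent. All names and constants are illustrative.

```python
def toy_bits(complexity, qp):
    """Hypothetical stand-in for encoding a picture: the bits produced
    halve for every increase of 6 in the quantization parameter."""
    return complexity * 2.0 ** (-qp / 6.0)


def rate_control(complexities, target_bits, qp=30, qp_min=10, qp_max=51):
    """Return the QP used and the bits produced for each picture."""
    qps, produced = [], []
    budget_error = 0.0                 # bits spent above (+) or below (-) target
    for complexity in complexities:
        bits = toy_bits(complexity, qp)    # "encode" with the current QP
        qps.append(qp)
        produced.append(bits)
        # Feedback: update the running error and adjust the next QP.
        budget_error += bits - target_bits
        if budget_error > 0:
            qp = min(qp + 1, qp_max)   # overspent: quantize more coarsely
        elif budget_error < 0:
            qp = max(qp - 1, qp_min)   # underspent: quantize more finely
    return qps, produced


if __name__ == "__main__":
    qps, produced = rate_control([3_200_000] * 20, target_bits=50_000)
    print(qps)
```

Even in this toy form, the extra bookkeeping per picture and the dependence of each QP on the bits produced by the previous picture show why rate control adds computation and resists parallelization compared to a constant QP.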

Multiple reference pictures

It is easy to find situations where one reference picture may yield a better block matching and consequent lower cost of motion prediction than another reference picture. For example, in motion predictions involving occluded areas, a regular pattern of using the immediate previous or the immediate future picture may not yield the best match for certain macroblocks. It may be necessary to search in a different reference picture where that macroblock was visible. Sometimes, more than one reference picture gives a better motion prediction compared to a single reference picture. This is the case, for example, during irregular object motion that does not align with particular grids of the reference pictures. Figure 5-5 shows an example of multiple reference pictures being used.

Figure 5-5. Motion compensated prediction with multiple reference pictures

To accommodate the need for multiple predictions, in the H.264 and later standards, the multiple reference pictures feature was introduced, resulting in improved visual quality. However, there is a significant performance cost incurred when performing searches in multiple reference pictures. Note that if the searches in various reference pictures can be done in parallel, the performance penalty can be alleviated to some extent while still providing higher visual quality compared to single-reference motion prediction.

R-D Lagrangian optimization

For the encoding of video sequences using the H.264 and later standards, Lagrangian optimization techniques are typically used for choice of the macroblock mode and estimation of motion vectors. The mode of each macroblock is chosen out of all possible modes by minimizing a rate-distortion cost function, where distortion may be represented by the sum of the squared differences between the original and the reconstructed signals of the same macroblock, and the rate is that required to encode the macroblock with the entropy coder. Similarly, motion vectors can be efficiently estimated by minimizing a rate-distortion cost function, where distortion is usually represented by the sum of squared differences between the current macroblock and the motion compensated macroblock, and the rate is that required to transmit the motion information consisting of the motion vector and the corresponding reference frame number. The Lagrangian parameters in both minimization problems are dependent on the quantization parameter, which in turn is dependent on the target bit rate.

Clearly, both of these minimizations require large amounts of computation. While loop parallelization, vectorization, and other techniques can be applied for performance optimization, early exits from the loops can also be made if the algorithm chooses to do so, at the risk of possible non-optimal macroblock mode and motion vectors that may impact the visual quality at particular target bit rates. These parallelization approaches are discussed in the next section.
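As an illustration of the mode decision described above, the following Python sketch minimizes the cost J = D + λR over a set of candidate modes, with D computed as the sum of squared differences. The mode names, sample values, bit counts, and λ values are invented for the example; a real encoder would obtain the reconstruction and the rate from its own transform, quantization, and entropy coding stages.

```python
def ssd(original, reconstructed):
    """Sum of squared differences between two equally sized sample lists."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed))


def choose_mode(original, candidates, lam):
    """Pick the mode with the smallest rate-distortion cost D + lam * R.

    candidates maps a mode name to (reconstructed samples, rate in bits).
    """
    best_mode, best_cost = None, float("inf")
    for mode, (reconstructed, bits) in candidates.items():
        cost = ssd(original, reconstructed) + lam * bits
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost


if __name__ == "__main__":
    original = [10, 12, 14, 16]
    candidates = {
        "intra_dc": ([13, 13, 13, 13], 6),      # cheap to code, less accurate
        "inter_16x16": ([10, 12, 14, 15], 20),  # accurate, more bits
    }
    for lam in (0.5, 2.0):
        print(lam, choose_mode(original, candidates, lam))
```

A small λ (high target bit rate) favors the accurate but expensive mode, while a larger λ favors the cheaper mode. Every candidate has to be evaluated, which is the source of the computational cost noted above.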

Frame/field mode for interlaced video

For interlaced video, choice of frame/field mode at the macroblock or picture level significantly affects performance. On the other hand, the interlaced video quality is generally improved by using tools such as macroblock-adaptive or picture-adaptive frame/field coding. It is possible to enhance performance by using only a certain pattern of frame and field coding, but this may compromise the visual quality.

Adaptive deblocking filter

Using in-loop deblocking filters on reconstructed pictures reduces blocky artifacts. Deblocked pictures, therefore, serve as a better-quality reference for intra- and inter-picture predictions, and result in overall better visual quality for the same bit rate. The strength of the deblocking filters may vary and can be adapted at three levels: at the slice level, based on individual characteristics of a video sequence; at the block-edge level, based on intra- versus inter-mode decision, motion differences, and the presence of residuals in the two participating neighboring blocks; and at the pixel level, based on an analysis to distinguish between the true edges and the edges created by the blocky artifact. True edges should be left unfiltered, while the edges from quantization should be smoothed out.

In general, deblocking results in bit rate savings of around 6 to 9 percent at medium qualitiesFootnote 8; equivalently at the same bit rate, the subjective picture quality improvements are more remarkable. Deblocking filters add a massive number of operations per frame and substantially slow down the coding process. Also, it is difficult to parallelize this task because it is not confined to the independent data units, such as slices. This is another example of a tradeoff between visual quality and performance.

Video Complexity and Formats

Video complexity is an important factor that influences the encoding speed. More complex scenes in a video generally take longer to encode, as more information remains to be coded after quantization. Complex scenes include scenes with fine texture details, arbitrary shapes, high motion, random unpredictable motion, occluded areas, and so on. For example, scenes with trees, moving water bodies, fire, smoke, and the like are generally complex, and are often less efficiently compressed, impacting encoding speed as well. On the other hand, easy scenes consisting of single-tone backgrounds and one or two foreground objects, such as head and shoulder-type scenes, are generally prone to better prediction, where matching prediction units can be found early and the encoding can be accelerated. These easy scenes are often generated from applications such as a videophone, video conferencing, news broadcasts, and so on. Frequent scene changes require many frames to be independently encoded, resulting in less frequent use of prediction of the frame data. If the same video quality is attempted, only lower compression can be achieved. With more data to process, performance will be affected.

Video source and target formats are also important considerations. Apart from the professional video contents generated by film and TV studios, typical sources of video include smartphones, point-and-shoot cameras, consumer camcorders, and DVRs/PVRs. For consumption, these video contents are generally converted to target formats appropriate for various devices, such as Apple iPads, Microsoft XBoxes, Sony PSx consoles, and the like, or for uploading to the Internet. Such conversion may or may not use video processing operations such as scaling, denoising, and so on. Thus, depending on the target usage, the complexity of operations will vary, exerting different speed requirements and exhibiting different performance results.

GPU-based Acceleration Opportunities

Applications and system-level software can take advantage of hardware acceleration opportunities, in particular GPU-based accelerations, to speed up the video encoding and processing tasks. Either partial or full hardware acceleration can be used. For example, in a transcoding application, either the decoding or the encoding part or both, along with necessary video processing tasks, can be hardware accelerated for better performance. By employing GPU-based hardware acceleration, typically an order of magnitude faster than real-time performance can be achieved, even for complex videos.

Furthermore, hardware-based security solutions can be used for seamless integration with hardware-accelerated encoding and processing for overall enhancement of the encoding speed of premium video contents. In traditional security solutions, security software would occasionally interrupt and slow down long encoding sessions running on the CPU. However, by employing hardware-based security, improvements can be achieved in both performance and security.

Performance Optimization Approaches

The main video encoding tasks are amenable to performance optimization, usually at the expense of visual quality or power consumption. Some of the techniques may have only trivial impact on power consumption and some may have little quality impact, yet they improve the performance. Other techniques may result in either quality or power impacts while improving performance.

Algorithmic optimizations contribute significantly to speeding up the processing involved in video encoding or decoding. If the algorithm runs on multi-core or multiprocessor environments, quite a few parallelization approaches can be employed. Furthermore, compiler and code optimization generally yield an additional degree of performance improvement. Besides these techniques, finding and removing the performance bottlenecks assists performance optimization in important ways. In the context of video coding, common performance optimization techniques include the following.

Algorithmic Optimization

Video coding algorithms typically focus on improving quality at the expense of performance. Such techniques include the use of B-pictures, multiple reference pictures, two-pass bit rate control, R-D Lagrangian optimization, adaptive deblocking filter, and so on. On the other hand, performance optimization using algorithmic approaches attempts to improve performance in two ways. The first way is by using fast algorithms, typically at the expense of higher complexity, higher power consumption, or lower quality. Joint optimization approaches of performance and complexity are also available in the literature.Footnote 9 A second way is to design algorithms that exploit the available parallelization opportunities with little or no quality loss.Footnote 10

Fast Algorithms

Many fast algorithms for various video coding tasks are available in the literature, especially for the tasks that take longer to finish. For example, numerous fast motion estimation algorithms try to achieve an order of magnitude higher speed compared to a full-search algorithm, with a potential sacrifice in quality. Recent fast motion estimation algorithms, however, exploit the statistical distribution of motion vectors and only search around the most likely motion vector candidates to achieve not only fast performance but almost no quality loss as well. Similarly, fast DCT algorithmsFootnote 11 depend on smart factorization and smart code optimization techniques. Some algorithms exploit the fact that the overall accuracy of the DCT and inverse DCT is not affected by the rounding off and truncations intrinsic to the quantization process.Footnote 12 Fast algorithms for other video coding tasks try to reduce the search space, to exit early from loops, to exploit inherent video properties, to perform activity analysis, and so on, with a view toward achieving better performance. There are several ways to improve the encoding speed using algorithmic optimization.

Fast Transforms

Fast transforms use factorization and other algorithmic maneuvers to reduce the computational complexity, in terms of the number of arithmetic operations needed to compute the transform. The Fast Fourier Transform (FFT) is a prime example: it takes only O(N log N) arithmetic operations, instead of the O(N^2) operations required by the original N-point Discrete Fourier Transform (DFT) algorithm. For large data sets, the resulting time difference is huge; in fact, the advent of the FFT made it practical to calculate the Fourier Transform on the fly and enabled many practical applications. Furthermore, instead of floating-point operations, fast transforms tend to use integer operations that can be more efficiently optimized. Typically, fast transforms such as the DCT do not introduce errors, so there is no additional impact on the visual quality of the results. However, possible improvements in power consumption because of fewer arithmetic operations are usually not significant, either.
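The difference between the two operation counts can be seen in a minimal sketch. The Python code below, which is illustrative only and assumes the input length is a power of two, implements the direct DFT (N complex multiplications for each of the N outputs) next to a recursive radix-2 FFT (N/2 twiddle-factor multiplications at each of the log2 N levels). The function names are chosen for this example.

```python
import cmath


def dft(x):
    """Direct N-point DFT: N multiplications for each of the N outputs."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]


def fft(x):
    """Recursive radix-2 FFT; the length of x must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])            # transform of the even-indexed samples
    odd = fft(x[1::2])             # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor
        out[k] = even[k] + t       # butterfly: sum
        out[k + n // 2] = even[k] - t   # butterfly: difference
    return out


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 0, 0, 0, 0]
    slow, fast = dft(samples), fft(samples)
    print(max(abs(a - b) for a, b in zip(slow, fast)))
```

Both functions compute the same transform up to floating-point rounding; only the number of arithmetic operations differs.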

Fast DCT or its variants are universally used in the video coding standards. In the H.264 and later standards, transform is generally performed together with quantization to avoid loss in arithmetic precision. Nonetheless, as fast transform is performed on a large set of video data, data parallelism approaches can easily be employed to parallelize the transform and improve the performance. A data parallel approach is illustrated in the following example.

Let’s consider the butterfly operations in the first stage of DCT (see Figure 2.17), which can be expressed as:

Considering each input u_k to be a 16-bit integer, sets of four such inputs can be rearranged into 64-bit wide vector registers, as shown in Figure 5-6. The rearrangement is necessary to maintain the correspondence of data elements on which operations are performed. This will provide 64-bit wide additions and subtractions in parallel, effectively speeding up this section of operations by a factor of 4. Similarly, wider vector registers can be exploited for further improved performance.

Figure 5-6. Data rearrangement in 8-point DCT to facilitate data parallelism
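The rearrangement idea can be sketched in a few lines of Python. The example assumes the usual first stage of an 8-point DCT flow graph, which forms the sums and differences of mirrored input pairs, u_k + u_(7−k) and u_k − u_(7−k) for k = 0 to 3; the exact lane ordering used in Figure 5-6 may differ. A tuple of four values stands in for a 64-bit register holding four 16-bit lanes, and the helper names are hypothetical.

```python
def to_int16(value):
    """Wrap a Python integer to signed 16 bits, as a 16-bit lane would."""
    return (value + 0x8000) % 0x10000 - 0x8000


def vadd(a, b):
    """Lane-wise addition of two 4-lane vectors (one vector instruction)."""
    return tuple(to_int16(x + y) for x, y in zip(a, b))


def vsub(a, b):
    """Lane-wise subtraction of two 4-lane vectors (one vector instruction)."""
    return tuple(to_int16(x - y) for x, y in zip(a, b))


def butterfly_stage1(u):
    """First-stage butterflies of an 8-point transform on 4-lane vectors."""
    lo = (u[0], u[1], u[2], u[3])
    hi = (u[7], u[6], u[5], u[4])   # reversed so that lane k holds u[7 - k]
    return vadd(lo, hi), vsub(lo, hi)


if __name__ == "__main__":
    sums, diffs = butterfly_stage1([1, 2, 3, 4, 5, 6, 7, 8])
    print(sums, diffs)
```

The reversal of the upper four inputs is the rearrangement step: once lane k of one register holds u_k and lane k of the other holds u_(7−k), the eight scalar operations collapse into one vector addition and one vector subtraction.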

Fast Intra Prediction

In the H.264 and later standards, in addition to the transform, Intra prediction is used in spatial redundancy reduction. However, the Intra frame encoding process has several data-dependent and computationally intensive coding methodologies that limit the overall encoding speed. It causes not only a high degree of computational complexity but also a fairly large delay, especially for real-time video applications. To resolve these issues, based on the DCT properties and spatial activity analysis, Elarabi and BayoumiFootnote 13 proposed a high-throughput, fast, and precise Intra mode selection and direction-prediction algorithm that significantly reduces the computational complexity and the processing run time required for the Intra frame prediction process. The algorithm achieves ∼56 percent better Intra prediction run time compared to the standard AVC implementation (JM 18.2), and ∼35 to 39 percent better Intra prediction run time compared to other fast Intra prediction techniques. At the same time, it achieves a PSNR within 1.8 percent (0.72 dB) of the standard implementation JM 18.2, which is also ∼18 to 22 percent better than other fast Intra prediction algorithms. In another example, using a zigzag pattern of calculating the 4×4 DC prediction mode, Alam et al.Footnote 14 have improved both the PSNR (up to 1.2 dB) and the run time (up to ∼25 percent) over the standard implementation.

Fast Motion Estimation

Block matching motion estimation is the most common technique used in inter-picture motion prediction and temporal redundancy reduction. It performs a search to find the best matching block in the reference picture with the current block in the current picture. The estimation process is typically conducted in two parts: estimation with integer pixel-level precision and with fractional pixel-level precision. Often, fractional pixel-level motion search is done with half-pixel and quarter-pixel precision around the best integer pixel position, and the resulting motion vectors are appropriately scaled to maintain the precision.

Motion estimation is the most time-consuming process in the coding framework. It typically takes ∼60 to 90 percent of the compute time required by the whole encoding process, depending on the configuration and the algorithm. Thus, a fast implementation of motion estimation is very important for real-time video applications.

There are many ways to speed up the motion estimation process. These include:

  • Fewer locations can be searched to find the matching block. However, the problem of how to determine which locations to search has been an active area of research for more than two decades, producing numerous fast motion estimation algorithms. If the right locations are not involved, it is easy to fall into local minima and miss the global minimum in the search space. This would likely result in nonoptimal motion vectors. Consequently, a higher cost would be incurred in terms of coding efficiency if the block is predicted from a reference block using these motion vectors, compared to when the block is simply coded as Intra. Thus, the block may end up being coded as an Intra block, and fail to take advantage of existing temporal redundancy.

Recent algorithms typically search around the most likely candidates of motion vectors to find the matching block. Predicted motion vectors are formed based on the motion vectors of the neighboring macroblocks, on the trend of the inter-picture motion of an object, or on the motion statistics. Some search algorithms use different search zones with varying degrees of importance. For example, an algorithm may start the search around the predicted motion vector and, if necessary, continue the search around the co-located macroblock in the reference picture. Experimentally determined thresholds are commonly used to control the flow of the search. The reference software implementations of the H.264 and later standards use fast-search algorithms that exhibit these characteristics.

  • Instead of matching the entire block, partial information from the blocks may be matched for each search location. For example, every other pixel in the current block can be matched with corresponding pixels in the reference block.

  • A search can be terminated early based on certain conditions and thresholds that are usually determined experimentally. An example of such early termination can be found in the adaptive motion estimation technique proposed by Zhang et al.,Footnote 15 which improves the speed by ∼25 percent for the macroblocks in motion, while improving the performance by ∼3 percent even for stationary macroblocks by checking only five locations. The average PSNR loss is insignificant at ∼0.1 dB.

  • Instead of waiting for the reconstructed picture to be available, the source pictures can be used as references, saving the need for reconstruction at the encoder. Although this technique provides significant performance gain, it has the disadvantage that the prediction error is propagated from one frame to the next, resulting in significant loss in visual quality.

  • Motion estimation is easily parallelizable in a data-parallel manner. As the same block-matching operation, such as the sum of absolute differences (SAD), is applied to all the matching candidates, and the matching candidates are independent of each other, single instruction multiple data (SIMD) processing can easily be employed. Further, motion estimation for each block in the current picture can be done in parallel as long as an appropriate search window for each block is available from the reference picture. Combining both approaches, a single program multiple data (SPMD)-type of parallelization can be used for each picture.

  • Using a hierarchy of scaled reference pictures, it is possible to conduct the fractional and integer pixel parts separately in parallel, and then combine the results.

  • In bi-directional motion estimation, forward and backward estimations can be done in parallel.
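Two of the ideas in the list above, block matching with the SAD and early termination, can be combined in a short sketch. The following Python code is a simplified integer-pixel full search over a small window, not the algorithm of any particular encoder: the SAD accumulation for a candidate is abandoned as soon as it can no longer beat the best cost found so far. Frames are plain lists of rows, and the function names and the synthetic test frame are assumptions made for the example.

```python
def sad(cur, ref, bx, by, dx, dy, size, best=float("inf")):
    """Sum of absolute differences between the current block at (bx, by)
    and the reference block displaced by (dx, dy), with early exit."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        if total >= best:          # cannot improve on the best candidate
            return total
    return total


def full_search(cur, ref, bx, by, size, search_range):
    """Return ((dx, dy), cost) of the best integer-pixel match."""
    height, width = len(ref), len(ref[0])
    best_mv = (0, 0)
    best_cost = sad(cur, ref, bx, by, 0, 0, size)   # start at zero motion
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates that fall outside the reference picture.
            if not (0 <= by + dy and by + dy + size <= height
                    and 0 <= bx + dx and bx + dx + size <= width):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size, best_cost)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


if __name__ == "__main__":
    # Synthetic 16x16 reference with a unique value at every position.
    ref = [[y * 16 + x for x in range(16)] for y in range(16)]
    # Current picture: the reference content moved by 2 pixels in x and 1 in y.
    cur = [[ref[min(y + 1, 15)][min(x + 2, 15)] for x in range(16)]
           for y in range(16)]
    print(full_search(cur, ref, 4, 4, 4, 3))
```

Because each candidate position is evaluated independently, the two loops over dx and dy are natural targets for the SIMD and multi-block parallelization described in the list; the early exit saves work on a single processor but makes the per-candidate cost uneven.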

Fast Mode Decision

The H.264 and later standards allow the use of variable block sizes, which opens the opportunity to achieve significant gains in coding efficiency. However, it also results in very high computational complexity, as mode decision becomes another important and time-consuming process. To improve the mode decision performance, Wu et al.Footnote 16 proposed a fast inter-mode decision algorithm based on spatial homogeneity and the temporal stationarity characteristics of video objects, so that only a few modes are selected as candidate modes. The spatial homogeneity of a macroblock is decided based on its edge intensity, while the temporal stationarity is determined by the difference between the current macroblock and its co-located counterpart in the reference frame. This algorithm reduces the encoding time by 30 percent, on average, with a negligible PSNR loss of 0.03 dB or, equivalently, a bit rate increment of 0.6 percent.

Fast Entropy Coding

Entropy coding such as CABAC is inherently a sequential task and is not amenable to parallelization. It often becomes the performance bottleneck for video encoding. Thus, performance optimization of the CABAC engine can enhance the overall encoding throughput. In one example,Footnote 17 as much as ∼34 percent of throughput enhancement is achieved by pre-normalization, hybrid path coverage, and bypass bin splitting. Context modeling is also improved by using a state dual-transition scheme to reduce the critical path, allowing real-time ultra-HDTV video encoding on an example 65 nm video encoder chip running at 330 MHz.

Parallelization Approaches

Parallelization is critical for enabling multi-threaded encoding or decoding applications adapted to today’s multi-core architectures. Independent data units can easily scale with the parallel units, whereas dependences limit the scalability and parallelization efficiency. Since several independent data units can be found in video data structures, their parallelization is straightforward. However, not all data units and tasks are independent. When there are dependences among some data units or tasks, there are two ways to handle them: by communicating the appropriate data units to the right processors, or by using redundant data structures. It is important to note that interprocessor communication is an added overhead compared to sequential (non-parallel, or scalar) processing. Therefore, parallelization approaches are typically watchful of the communication costs, sometimes at the expense of storing redundant data. In general, a careful balance is needed among the computation, communication, storage requirements, and resource utilization for efficient parallelization.

Data Partitioning

The H.264 standard categorizes the syntax elements into up to three different partitions for a priority-based transmission. For example, headers, motion vectors, and other prediction information are usually transmitted with higher priority than the details of the syntax elements representing the video content. Such data partitioning was primarily designed to provide robustness against transmission errors, and was not intended for parallelization. Indeed, parallel processing of the few bytes of headers and many bytes of detailed video data would not be efficient. However, video data can be partitioned in several different ways, making it suitable for parallelization and improved performance. Both uncompressed and compressed video data can be partitioned into independent sections, so both video encoding and decoding operations can benefit from data partitioning.

Data partitioning plays an important role in the parallelization of video encoding. Temporal partitioning divides a video sequence into a number of independent subsequences, which are processed concurrently in a pipelined fashion. At least a few subsequences must be available to fill the pipeline stages. This type of partitioning is thus suitable for off-line video encoding.Footnote 18 Spatial partitioning divides a frame of video into various sections that are encoded simultaneously. Since only one frame is inputted at a time, this type of partitioning is suitable for online and low-delay encoding applications that process video on a frame-by-frame basis. It is clear that parallel encoding of the video subsequences deals with coarser grains of data that can be further partitioned into smaller grains like a section of a single frame, such as slices, slice groups, tiles, or wavefronts.
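Spatial partitioning as described above can be sketched as follows. This is a minimal illustration, assuming a frame is a list of pixel rows and using a hypothetical `encode_slice` stand-in (it merely halves each sample) in place of a real slice encoder. Note that in CPython, threads running pure Python code do not execute truly in parallel, so the sketch shows the structure of the approach, not its speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(frame, num_slices):
    """Spatial partitioning: split a frame (a list of rows) into
    contiguous horizontal slices of nearly equal height."""
    base, extra = divmod(len(frame), num_slices)
    slices, start = [], 0
    for i in range(num_slices):
        end = start + base + (1 if i < extra else 0)
        slices.append(frame[start:end])
        start = end
    return slices


def encode_slice(rows):
    """Hypothetical stand-in for a slice encoder: halves each sample."""
    return [[pixel >> 1 for pixel in row] for row in rows]


def encode_frame_parallel(frame, num_slices=4):
    """Encode the slices of one frame concurrently and reassemble the
    results in their original order."""
    with ThreadPoolExecutor(max_workers=num_slices) as pool:
        encoded = list(pool.map(encode_slice, split_rows(frame, num_slices)))
    return [row for part in encoded for row in part]
```

Temporal partitioning would follow the same pattern one level up, with `split_rows` replaced by a function that cuts the sequence into independent subsequences.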

Task Parallelization

The task parallelization approach for video encoding was introduced as early as 1991 for compact disc-interactive applications.Footnote 19 This introductory approach took advantage of a multiple instruction multiple data (MIMD) parallel object-oriented computer. The video encoder was divided into tasks and one task was assigned to one or more processors of the 100-node message-passing parallel computer, where a node consisted of a data processor, memory, a communications processor, and I/O interfaces. This approach loosely used task parallelization, where some processors were running tasks with different algorithms, but others were running tasks with the same algorithm at a given time. At a higher level, the tasks were divided into two phases: a motion-estimation phase for prediction and interpolation where motion vectors were searched in each frame, and video compression where it was decided which of these motion vectors (if any) would be used.

The parallelization of the motion estimation phase was not task parallel by itself; it involved assigning each processor its own frame along with the associated reference frames. This process inevitably required copying the reference frames onto several appropriate processors, thus creating a performance overhead. Also, many frames had to have been read before all processors had some tasks to execute. The video compression phase did not have independent frames, so several parts of a frame were processed in parallel. A compression unit made up of a group of processors repeatedly received sets of consecutive blocks to encode. The tasks in the compression unit were mode decision, DCT, quantization, and variable length coding. The resulting bitstream was sent to an output manager running on a separate processor, which combined the pieces from all the compression units and sent the results to the host computer. The compression units reconstructed their own parts of the resulting bitstream to obtain the reference frames.

Note that the quantization parameter depends on the data reduction in all blocks processed previously, and one processor alone cannot compute it. Therefore, a special processor must be dedicated to computation of the quantization parameter, sending the parameter to appropriate compression units and collecting the size of the compressed data from each of the compression units for further calculation. An additional complication arises from the fact that motion vectors are usually differentially coded based on the previous motion vector. But the compression units working independently do not have access to the previous motion vector. To resolve this, compression units must send the last motion vector used in the bitstream to the compression unit that is assigned the next blocks. Figure 5-7 shows the communication structure of the task parallelization approach.

Figure 5-7. Communication structure in task parallelization

This idea can be used in video encoding in general, regardless of the video coding standards or the algorithms used. However, the idea can be further improved to reduce the communication overhead. For example, in a system, the processors can identify themselves in the environment and can attach their processor numbers as tags to the data they process. These tags can be subsequently removed by the appropriate destination processors, which can easily rearrange the data as needed. It is important to understand that appropriate task scheduling is necessary in the task parallelization approach, as many tasks are dependent on other tasks, owing to the frame-level dependences.
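The hand-off of the last motion vector between compression units can be sketched as below. The sketch simulates the units sequentially and assumes the simplest form of differential coding, where each vector is coded against its immediate predecessor and the first vector is predicted from (0, 0); the function names are illustrative only.

```python
def encode_mvs(mvs, previous=(0, 0)):
    """Differentially code motion vectors against the preceding vector."""
    deltas = []
    for mv in mvs:
        deltas.append((mv[0] - previous[0], mv[1] - previous[1]))
        previous = mv
    return deltas


def encode_in_units(mvs, unit_size):
    """Split the vectors into consecutive chunks, one per compression
    unit. Each unit receives the last vector of the preceding chunk,
    which is the only inter-unit message needed for this step."""
    output = []
    handoff = (0, 0)  # assumed predictor for the very first vector
    for start in range(0, len(mvs), unit_size):
        chunk = mvs[start:start + unit_size]
        output.extend(encode_mvs(chunk, handoff))
        handoff = chunk[-1]  # message sent on to the next unit
    return output
```

With the hand-off in place, the chunked result is identical to coding the whole list on one processor, so the decoder needs no knowledge of how the work was divided.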


Pipelining

Pipelines are cascades of processing stages where each stage performs certain fixed functions over a stream of data flowing from one end to the other. Pipelines can be linear or dynamic (nonlinear). Linear pipelines are simple cascaded stages with streamlined connections, while in dynamic pipelines feedback and/or feed-forward connection paths may exist from one stage to another. Linear pipelines can be further divided into synchronous and asynchronous pipelines. In asynchronous pipelines, the data flow between adjacent stages is controlled by a handshaking protocol, where a stage S_i sends a ready signal to the next stage S_(i+1) when it is ready to transmit data. Once the data is received by stage S_(i+1), it sends an acknowledge signal back to S_i. In synchronous pipelines, clocked latches are used to interface between the stages. Upon arrival of a clock pulse, all latches transfer data to the next stage simultaneously. For a k-stage linear pipeline, a multiple of k clock cycles are needed for the data to flow through the pipeline.Footnote 20 The number of clock cycles between two initiations of a pipeline is called the latency of the pipeline. The pipeline efficiency is determined by the percentage of time that each pipeline stage is used, which is called the stage utilization.

Video encoding tasks can form a three-stage dynamic pipeline, as shown in Figure 5-7. The first stage consists of the motion-estimation units; the second stage has several compression units in parallel, and the third stage is the output manager. The bit rate and quantization control unit and the reference frame manager can be considered as two delay stages having feedback connections with the second-stage components.
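To make the notions of pipeline fill time and stage utilization concrete, the following sketch simulates a synchronous linear pipeline. It assumes the simplest case, which is not the dynamic video encoding pipeline above: every stage takes one clock, a new item is initiated on every clock (a latency of one), and there are no feedback paths. Under those assumptions, n items pass through k stages in k + n - 1 clocks.

```python
def run_pipeline(num_stages, num_items):
    """Simulate a synchronous linear pipeline that accepts one new item
    per clock. Returns (total_cycles, utilization of each stage)."""
    if num_items <= 0:
        return 0, [0.0] * num_stages
    latches = [None] * num_stages   # item currently held by each stage
    busy = [0] * num_stages         # clocks in which each stage held work
    fed = completed = cycles = 0
    while completed < num_items:
        # Clock pulse: all latches transfer their data one stage forward.
        incoming = fed if fed < num_items else None
        if incoming is not None:
            fed += 1
        latches = [incoming] + latches[:-1]
        cycles += 1
        for stage, item in enumerate(latches):
            if item is not None:
                busy[stage] += 1
        if latches[-1] is not None:
            completed += 1  # the last stage finishes this item this clock
    return cycles, [b / cycles for b in busy]
```

For three stages and five items, the simulation takes seven clocks and each stage is busy for five of them; the utilization approaches 100 percent only when the number of items is large compared with the number of stages.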

Data Parallelization

If data can be partitioned into independent units, they can be processed in parallel with minimum communication overhead. Video data possess this characteristic. There are a few common data parallelization execution modes, including single instruction multiple data (SIMD), single program multiple data (SPMD), multiple instruction multiple data (MIMD), and so on.

SIMD is a processor-supported technique that allows an operation to be performed on multiple data points simultaneously. It provides data-level parallelism, which is more efficient than scalar processing. For example, some loop operations are independent in successive iterations, so a set of instructions can operate on different sets of data. Before starting execution of the next instruction, typically synchronization is needed among the execution units that are performing the same instruction on the multiple data sets.

SIMD is particularly applicable to image and video applications where typically the same operation is performed on a large number of data points. For example, in brightness adjustment, the same value is added to (or subtracted from) all the pixels in a frame. In practice, these operations are so common that most modern CPU designs include special instruction sets for SIMD to improve the performance for multimedia use. Figure 5-8 shows an example of SIMD technique where two source arrays of eight 16-bit short integers A and B are added simultaneously element by element to produce the result in the destination array C, where the corresponding element-wise sums are written. Using the SIMD technique, a single add instruction operates on 128-bit wide data in one clock cycle.

Figure 5-8. An example of SIMD technique
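The packed addition of eight 16-bit values can be emulated in software to show what a single 128-bit add has to guarantee, namely that a carry never crosses from one lane into its neighbor. The sketch below uses an arbitrary-precision Python integer as a stand-in for a 128-bit register; the helper names are illustrative, and real code would use the processor's SIMD instructions instead of this emulation.

```python
LANES = 8
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1

# 128-bit constants selecting the top bit, or the low 15 bits, of every lane.
HIGH_BITS = sum(0x8000 << (LANE_BITS * i) for i in range(LANES))
LOW_BITS = sum(0x7FFF << (LANE_BITS * i) for i in range(LANES))


def pack(values):
    """Pack eight 16-bit values into one 128-bit integer (lane 0 lowest)."""
    assert len(values) == LANES
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (LANE_BITS * i)
    return word


def unpack(word):
    """Split a 128-bit integer back into eight signed 16-bit values."""
    lanes = [(word >> (LANE_BITS * i)) & LANE_MASK for i in range(LANES)]
    return [v - 0x10000 if v & 0x8000 else v for v in lanes]


def packed_add(a, b):
    """Add all eight lanes in one operation, with per-lane wraparound."""
    partial = (a & LOW_BITS) + (b & LOW_BITS)  # low 15 bits: carries stay in-lane
    return partial ^ ((a ^ b) & HIGH_BITS)     # fold in the top bits, carry discarded
```

For example, `unpack(packed_add(pack(A), pack(B)))` returns the element-wise sums of two eight-element lists `A` and `B`, wrapping around at the 16-bit limits in the same way a non-saturating packed add does.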

Procedure- or task-level parallelization is generally performed in MIMD execution mode, of which SPMD is a special case. In SPMD, a program is split into smaller independent procedures or tasks, and the tasks are run simultaneously on multiple processors with potentially different input data. Synchronization is typically needed at the task level, as opposed to at the instruction level within a task. Implementations of SPMD execution mode are commonly found on distributed memory computer architectures where synchronization is done using message passing. For a video encoding application, such an SPMD approach is presented by Akramullah et al.Footnote 21

Instruction Parallelization

Compilers translate the high-level implementation of video algorithms into low-level machine instructions. However, there are some instructions that do not depend on the previous instructions to complete; thus, they can be scheduled to be executed concurrently. The potential overlap among the instructions forms the basis of instruction parallelization, since the instructions can be evaluated in parallel. For example, consider the following code:

  1. R4 = R1 + R2
  2. R5 = R1 - R3
  3. R6 = R4 + R5
  4. R7 = R4 - R5

In this example, there is no dependence between instructions 1 and 2, or between 3 and 4, but instructions 3 and 4 depend on the completion of instructions 1 and 2. Thus, instructions 1 and 2 can be executed in parallel with each other, and so can instructions 3 and 4 once the first pair has completed. Instruction parallelization is usually achieved by compiler-based optimization and by hardware techniques. However, unlimited instruction parallelization is not possible; the parallelization is typically limited by data dependency, procedural dependency, and resource conflicts.
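The dependence analysis for the four instructions can be sketched as a small scheduler. This is a simplified illustration: it tracks only read-after-write dependences, assumes every instruction takes one step, and ignores resource limits; the tuple encoding of the instructions is an assumption made for the example.

```python
# Each instruction: (number, destination register, source registers).
PROGRAM = [
    (1, "R4", ("R1", "R2")),   # R4 = R1 + R2
    (2, "R5", ("R1", "R3")),   # R5 = R1 - R3
    (3, "R6", ("R4", "R5")),   # R6 = R4 + R5
    (4, "R7", ("R4", "R5")),   # R7 = R4 - R5
]


def issue_groups(program):
    """Group instructions into steps so that each instruction runs one
    step after the latest producer of any of its source registers."""
    ready_step = {}   # register -> step in which its value is produced
    groups = []
    for number, dest, sources in program:
        step = max(
            (ready_step[s] + 1 for s in sources if s in ready_step),
            default=0,
        )
        while len(groups) <= step:
            groups.append([])
        groups[step].append(number)
        ready_step[dest] = step
    return groups
```

Applied to `PROGRAM`, the function places instructions 1 and 2 in the first step and instructions 3 and 4 in the second, so the four instructions complete in two steps, not four.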

Instructions in reduced instruction set computer (RISC) processors have four stages that can be overlapped to achieve an average performance close to one instruction per cycle. These stages are instruction fetch, decode, execute, and result write-back. It is common to simultaneously fetch and decode two instructions A and B, but if instruction B has a read-after-write dependency on instruction A, the execution stage of B must wait until the write is completed for A. Mainly owing to inter-instruction dependences, more than one instruction per cycle is not achievable in scalar processors that execute one instruction at a time. However, superscalar processors exploit instruction parallelization to execute more than one unrelated instruction at a time; for example, z=x+y and c=a*b can be executed together. In these processors, hardware is used to detect the independent instructions and execute them in parallel.

As an alternative to superscalar processors, very long instruction word (VLIW) processor architecture takes advantage of instruction parallelization and allows programs to explicitly specify the instructions to execute in parallel. These architectures employ an aggressive compiler to schedule multiple operations in one VLIW per cycle. In such platforms, the compiler has the responsibility of finding and scheduling the parallel instructions. In practical VLIW processors such as the Equator BSP-15, the integrated caches are small—the 32 KB data cache and 32 KB instruction cache typically act as bridges between the higher speed processor core and relatively lower speed memory. It is very important to stream in the data uninterrupted so as to avoid the wait times.

To better understand how to take advantage of instruction parallelism in video coding, let’s consider an example video encoder implementation on a VLIW platform.Footnote 22 Figure 5-9 shows a block diagram of the general structure of the encoding system.

Figure 5-9. A block diagram of a video encoder on a VLIW platform

Here, the macroblocks are processed in a pipelined fashion while they go through the different encoding tasks in the various pipeline stages of the encoder core. A direct memory access (DMA) controller, commonly known as the data streamer, helps prefetch the necessary data. A double buffering technique is used to continually feed the pipeline stages. This technique uses two buffers in an alternating fashion – when the data in one buffer is actively used, the next set of data is loaded onto the second buffer. When processing of the active buffer’s data is done, the second buffer becomes the new active buffer and processing of its data starts, while the buffer with used-up data is refilled with new data. Such design is useful in avoiding potential performance bottlenecks.
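The double buffering technique can be sketched with a background worker standing in for the data streamer. The `load` and `process` functions are hypothetical placeholders for a DMA transfer and a pipeline stage; the point of the sketch is the alternation between the active buffer and the buffer being refilled.

```python
from concurrent.futures import ThreadPoolExecutor


def load(block_id):
    """Hypothetical data-streamer transfer: fetch one macroblock's data."""
    return [block_id * 16 + i for i in range(16)]


def process(data):
    """Hypothetical pipeline stage: reduce the block to a single value."""
    return sum(data)


def run_double_buffered(num_blocks):
    """Process block n from the active buffer while block n + 1 is being
    loaded into the other buffer in the background."""
    results = []
    if num_blocks == 0:
        return results
    buffers = [None, None]
    with ThreadPoolExecutor(max_workers=1) as streamer:
        buffers[0] = load(0)                          # prime the first buffer
        for n in range(num_blocks):
            active, idle = n % 2, (n + 1) % 2
            pending = None
            if n + 1 < num_blocks:
                pending = streamer.submit(load, n + 1)  # refill the idle buffer
            results.append(process(buffers[active]))    # overlaps with the load
            if pending is not None:
                buffers[idle] = pending.result()        # wait, then swap roles
    return results
```

The results are the same as loading and processing each block in turn; only the timing differs, since the load of the next block is hidden behind the processing of the current one.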

Fetching appropriate information into the cache is extremely important; care needs to be taken so that both the data and the instruction caches are maximally utilized. To minimize cache misses, instructions for each stage in the pipeline must fit into the instruction cache, while the data must fit into the data cache. It is possible to rearrange the program to coax the compiler to generate instructions that fit into the instruction cache. Similarly, careful consideration of data prefetch would keep the data cache full. For example, the quantized DCT coefficients can be stored in a way so as to help data prefetching in some Intra prediction modes, where only seven coefficients (either from the top row or from the left column) are needed at a given time. The coefficients have a dynamic range of [-2048, 2047], requiring 12 bits each in two's-complement form, but are usually represented as signed 16-bit entities. Seven such coefficients would fit into two 64-bit registers, where one 16-bit slot will be unoccupied. Note that a 16-bit element relevant for this pipeline stage, such as the quantizer scale or the DC scaler, can be packed together with the quantized coefficients to fill in the unoccupied slot in the register, thereby achieving better cache utilization.
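The packing of seven coefficients and one extra 16-bit element into two 64-bit words can be sketched with Python's `struct` module. The little-endian lane order and the position of the extra element in the last slot are assumptions made for the example; an actual implementation would follow the register layout of the target processor.

```python
import struct


def pack_coefficients(coeffs, extra):
    """Pack seven signed 16-bit coefficients plus one extra signed
    16-bit element into two 64-bit words (little-endian lane order)."""
    if len(coeffs) != 7:
        raise ValueError("expected exactly seven coefficients")
    raw = struct.pack("<8h", *coeffs, extra)   # 8 x 16 bits = 128 bits
    return struct.unpack("<2Q", raw)           # two unsigned 64-bit words


def unpack_coefficients(words):
    """Inverse of pack_coefficients: returns (coeffs, extra)."""
    values = struct.unpack("<8h", struct.pack("<2Q", *words))
    return list(values[:7]), values[7]
```

Here `extra` plays the role of the quantizer scale or the DC scaler, so that a single pair of 64-bit loads brings in everything the stage needs.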


Multithreading

A thread is represented by a program context comprising a program counter, a register set, and the context status. In a multithreaded parallel computation model, regardless of whether it is run on a SIMD, multiprocessor, or multicomputer, or has distributed or shared memory, a basic unit is composed of multiple threads of computation running simultaneously, each handling a different context on a context-switching basis. The basic structure is as follows:Footnote 23 the computation starts with a sequential thread, followed by supervisory scheduling where computation threads begin working in parallel. In case of distributed memory architectures where one or more threads typically run on each processor, interprocessor communication occurs as needed and may overlap among all the processors. Finally, the multiple threads synchronize prior to beginning the next unit of parallel work.

Multithreading improves the overall execution performance owing to the facts that a thread, even if stalled, does not prevent other threads from using available resources, and that multiple threads working on the same data can share the cache for better cache usage. However, threads usually work on independent data sets and often interfere with each other when trying to share resources. This typically results in cache misses. In addition, multithreading has increased complexity in terms of synchronization, priorities, and pre-emption handling requirements.

Simultaneously executing instructions from multiple threads is known as simultaneous multithreading in general, or Intel Hyper-Threading Technology on Intel processors. To reduce the number of dependent instructions in the pipeline, hyper-threading takes advantage of virtual or logical processor cores. For each physical core, the operating system addresses two logical processors and shares the workload and execution resources when possible.

As performance optimization using specialized media instructions alone is not sufficient for real-time encoding performance, exploiting thread-level parallelism to improve the performance of video encoders has become attractive and popular. Consequently, nowadays multithreading is frequently used for video encoder speed optimization. Asynchronously running threads can dispatch the frame data to multiple execution units in both CPU-based software and GPU-accelerated implementations. It is also possible to distribute various threads of execution between the CPU and the GPU.

Multithreading is often used together with task parallelization, data parallelization, or with their combinations, where each thread operates on different tasks or data sets. An interesting discussion on multithreading as used in video encoding can be found in Gerber et al.,Footnote 24 which exploits frame-level and slice-level parallelism using multithreading techniques.


Vectorization

A vector consists of multiple elements of the same scalar data type. The vector length refers to the number of elements of the vectors that are processed together, typically 2, 4, 8, or 16 elements.

For example, 128-bit wide vector registers can process eight 16-bit short integers. In this case, vector length is 8. Ideally, vector lengths are chosen by the developer or by the compiler to match the underlying vector register widths.

Vectorization is the process of converting procedural loops that iterate over multiple pairs of data items so that a separate processing unit is assigned to each pair. Each processing unit belongs to a vector lane. The number of vector lanes equals the vector length, so 2, 4, 8, or 16 data items can be processed simultaneously using as many vector lanes. For example, consider an array A of 1024 elements that is added to an array B, with the result written to an array C, where B and C are of the same size as A. To implement this addition, scalar code would use a loop of 1024 iterations. However, if 8 vector lanes are available in the processing units, vectors of 8 elements of the arrays can be processed together, so that only 1024/8 = 128 iterations are needed. Vectorization is different from thread-level parallelism: it tries to improve performance by using as many vector lanes as possible. Vector lanes provide additional parallelism on top of each thread running on a single processor core. The objective of vectorization is to maximize the use of available vector registers per core.
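The reduction in iteration count can be checked with a short sketch. The inner list operation in `add_vectorized` only stands in for a single vector instruction; Python itself does not execute it on vector lanes, so the sketch illustrates the loop structure, not the speedup.

```python
def add_scalar(a, b):
    """Scalar loop: one element per iteration."""
    c, iterations = [0] * len(a), 0
    for i in range(len(a)):
        c[i] = a[i] + b[i]
        iterations += 1
    return c, iterations


def add_vectorized(a, b, lanes=8):
    """Strip-mined loop: each iteration handles up to `lanes` elements,
    standing in for one vector instruction."""
    c, iterations = [0] * len(a), 0
    for i in range(0, len(a), lanes):
        c[i:i + lanes] = [x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes])]
        iterations += 1
    return c, iterations
```

For two 1024-element arrays, `add_scalar` runs 1024 iterations and `add_vectorized` with eight lanes runs 128, and both produce the same result.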

Technically, the historic vector-processing architectures are considered separate from SIMD architectures, based on the fact that vector machines used to process the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously. However, today, numerous computational units with SIMD processing capabilities are available at the hardware level, and vector processors are essentially synonymous with SIMD processors. Over the past couple of decades, progressively wider vector registers have become available for vectorization in each processor core: for example, the 64-bit MMX registers in Pentium to support MMX extensions, 128-bit XMM registers in Pentium 4 to support SSE and SSE2 extensions, 256-bit YMM registers in second-generation Core processors to support AVX (and, in later generations, AVX2) extensions, and 512-bit ZMM registers in Xeon Phi co-processors to support MIC extensions. For data-parallelism friendly applications such as video encoding, these wide vector registers are useful.

Conventional programming languages are constrained by their inherent serial nature and don’t support the computation capabilities offered by SIMD processors. Therefore, extensions to conventional programming languages are needed to tap these capabilities. Vectorization of serial code and vector programming models have been developed for this purpose. For example, OpenMP 4.0 supports vector programming models for C/C++ and FORTRAN, and provides language extensions to simplify vector programming, thereby enabling developers to extract more performance from the SIMD processors. Intel Cilk Plus is another example that supports similar language extensions.

The auto-vectorization process tries to vectorize a program given its serial constraints, but ends up underutilizing the available computation capabilities. However, as both vector widths and core counts are increasing, explicit methods are developed by Intel to address the trends. With the availability of integrated graphics and co-processors in the modern CPUs, generalized programming models with explicit vector programming capabilities are being added to compilers such as the Intel compiler, GCC, and LLVM, as well as into standards such as OpenMP 4.0. The approach is similar to multithreading, which addresses the availability of multiple cores and parallelizes programs on these cores. Vectorization additionally addresses the availability of increased vector width by explicit vector programming.

Vectorization is useful in video encoding performance optimization, especially for the CPU-based software implementations. Vectors with lengths of 16 elements of pixel data can provide up to 16-fold speed improvement within critical loops—for example, for motion estimation, prediction, transform, and quantization operations. In applications such as video transcoding, some video processing tasks such as noise reduction can take advantage of the regular, easily vectorizable structure of video data and achieve speed improvement.

Compiler and Code Optimization

There are several compiler-generated and manual code optimization techniques that can result in improved performance. Almost all of these techniques offer performance improvement without affecting visual quality. However, depending on the needs of the application, the program’s critical path often needs to be optimized. In this section, a few common compiler and code optimization techniques are briefly described. The benefits of these techniques for GPU-accelerated video encoder implementations are usually limited and confined to the application and SDK levels, where the primary encoding tasks are actually done by the hardware units. Nevertheless, some of these techniques have been successfully used in speed optimizations of CPU-based software implementations,Footnote 25 resulting in significant performance gains.

Compiler optimization

Most compilers come with optional optimization flags to offer tradeoffs between compiled code size and fast execution speed. For fast speed, compilers typically perform the following:

  • Store variables in registers: Compilers would store frequently used variables and subexpressions in registers, which are fast resources. They would also automatically allocate registers for these variables.

  • Employ loop optimizations: Compilers can automatically perform various loop optimizations, including complete or partial loop unrolling, loop segmentation, and so on. Loop optimizations provide significant performance improvements in typical video applications.

  • Omit frame pointer on the call stack: Often, frame pointers are not strictly necessary on the call stack and can safely be omitted. This usually slightly improves performance.

  • Improve floating-point consistency: The consistency can be improved, for example, by disabling optimizations that could change floating-point precision. This is a tradeoff between different types of performance optimizations.

  • Reduce overhead of function calls: This can be done, for example, by replacing some function calls with the compiler’s intrinsic functions.

  • Trade off register-space saving with memory transaction: One way to realize such a tradeoff is by reloading pointer variables from memory after each function call. This is another example of choosing between different types of performance optimizations.

Code optimization

Optimizing every part of the software code is not worth the effort. It is more practical to focus on the parts where code optimization will reduce execution time the most. For this reason, profiling and analysis of execution time for various tasks in an application is often necessary.

However, the following techniques often provide significant performance improvement, especially when compilers fail to effectively use the system resources.

  • Reduction of redundant operations: Careful programming is the key to compact codes. Without loss of functionality, often redundant operations in codes can be reduced or eliminated by carefully reviewing the code.

  • Data type optimization: Choosing appropriate data types for the program’s critical path is important for performance optimization. The data types directly derived from the task definition may not yield optimum performance for various functional units. For example, using scaled floating-point constants and assigning precomputed constants to registers would give better performance than directly using mixed-mode operations of integer and floating-point variables, as defined by most DCT and IDCT algorithms. In some cases, such as quantization, the introduction of temporary variables stored in registers can provide a noticeable performance gain.

  • Loop unrolling: Loop unrolling is the transformation of a loop that results in a larger loop body but fewer iterations. In addition to the automatic compiler optimizer, manual loop unrolling is frequently performed to ensure the right amount of unrolling, as over-unrolling may adversely affect performance. With the CPU registers used more effectively, this process minimizes both the number of load/store instructions and the data hazards arising, albeit infrequently, from inefficient instruction scheduling by the compiler. There are two types of loop unrolling: internal and external. Internal unrolling consists of collapsing some iterations of the innermost loop into larger and more complex statements. These statements require higher numbers of machine instructions, but can be more efficiently scheduled by the compiler optimizer. External loop unrolling consists of moving iterations from outer loops to inner loops through the use of more registers to minimize the number of memory accesses. In video encoding applications, motion estimation and motion compensated prediction are good candidates to take advantage of loop unrolling.

  • Arithmetic operations optimization: Divisions and multiplications are usually considered the most cycle-expensive operations. However, in most RISC processors, 32-bit based multiplications take more cycles than 64-bit based multiplications in terms of instruction execution latency and instruction throughput. In addition, floating-point divisions are less cycle-expensive compared to mixed-integer and floating-point divisions. Therefore, it is important to use fewer of these arithmetic operations, especially inside a loop.
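Internal loop unrolling, as described in the list above, can be illustrated on a SAD accumulation loop. The sketch is written in Python only to show the shape of the transformation; the performance benefit applies to compiled code, where the larger loop body gives the instruction scheduler more independent operations per iteration. The unroll factor of four is an arbitrary choice for the example.

```python
def sad_rolled(a, b):
    """Reference loop: one absolute difference per iteration."""
    total = 0
    for i in range(len(a)):
        total += abs(a[i] - b[i])
    return total


def sad_unrolled4(a, b):
    """Same computation unrolled by four, with a cleanup loop for the
    elements left over when len(a) is not a multiple of four."""
    n = len(a)
    main = n - n % 4
    total = 0
    for i in range(0, main, 4):
        total += (abs(a[i] - b[i]) + abs(a[i + 1] - b[i + 1])
                  + abs(a[i + 2] - b[i + 2]) + abs(a[i + 3] - b[i + 3]))
    for i in range(main, n):
        total += abs(a[i] - b[i])
    return total
```

The unrolled version runs a quarter as many iterations of the main loop and returns the same sum; the cleanup loop is what keeps it correct for arbitrary lengths.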


Overclocking

Although not recommended, it is possible to operate a processor faster than its rated clock frequency by modifying the system parameters. This process is known as overclocking. Although speed can be increased, for stability purposes it may be necessary to operate at a higher voltage as well. Thus, most overclocking techniques increase power consumption and consequently generate more heat, which must be dissipated if the processor is to remain functional. This increases the fan noise and/or the cooling complexity. Conversely, some manufacturers underclock the processors of battery-powered equipment to improve battery life, or implement systems that reduce the frequency when operating on battery power. Overclocking may also be applied to a chipset, a discrete graphics card, or memory.

Overclocking allows operating beyond the capabilities of current-generation system components. Because of the increased cooling requirements, the risk of reduced reliability, and the potential damage to the component, overclocking is mainly practiced by enthusiasts and hobbyists rather than professional users.

Successful overclocking needs a good understanding of power management. As we will see in Chapter 6, the process of power management is complex in modern processors. The processor hardware and the operating system collaborate to manage the power. In the process, they dynamically adjust the processor core frequencies as appropriate for the current workload. In such circumstances, pushing a certain core to 100 percent frequency may adversely affect the power consumption. In Figure 5-10, the concept is clarified with an example where a typical workload is running on a four-core (eight logical cores) Intel second-generation Core processor.

Figure 5-10. Frequency and power distribution showing the impact of pushing a single core to 100 percent frequency. (Courtesy: J. Feit et al., Intel Corporation, VPG Tech. Summit, 2011)

In a multi-core processor, if one CPU core is pushed to 100 percent frequency while others are idle, it generally results in higher power consumption. In the example of Figure 5-10, as much as ∼10 Watts more power is consumed with a single core running at 100 percent frequency compared to when all eight cores are in use and the average frequency distribution is ∼12.5 percent spread across all cores.

Recent Intel processors with integrated graphics allow the hardware-accelerated video encoder to automatically reach the highest frequency state for as long as necessary, and then keep it in idle state when the task is done. Details of this mechanism are discussed in Chapter 6. In a power-constrained environment using modern processors, it is best to leave the frequency adjustment to the hardware and the operating system.

Performance Bottlenecks

Performance bottlenecks occur when system performance is limited by one or more components or stages of the system. Typically, a single stage causes the entire system to slow down. Bottlenecks can be caused by hardware limitations, inefficient software configurations, or both. Although a system may reach a certain peak performance for a short period of time, its sustained throughput can be no higher than that of its slowest component. Ideally, a system should have no performance bottleneck, so that the available resources are optimally utilized.
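
The idea that sustained throughput is bounded by the slowest stage can be sketched in a few lines. The stage names and capacities below are hypothetical and serve only to illustrate the relationship; they are not measurements.

```python
def sustained_throughput(stage_fps):
    """Return (slowest stage, its rate) for a pipeline of serial stages.

    `stage_fps` maps each stage name to the rate it could sustain alone,
    in frames per second. The pipeline as a whole cannot sustain more
    than its slowest stage.
    """
    bottleneck = min(stage_fps, key=stage_fps.get)
    return bottleneck, stage_fps[bottleneck]


# Hypothetical per-stage capacities of a transcode pipeline (frames per second).
stages = {"decode": 240.0, "scale": 180.0, "encode": 65.0, "mux": 900.0}

name, fps = sustained_throughput(stages)

# Utilization of each stage when the pipeline runs at the bottleneck rate.
utilization = {stage: fps / capacity for stage, capacity in stages.items()}

print(f"Bottleneck: {name} at {fps:.0f} fps")
for stage, value in utilization.items():
    print(f"  {stage:<7} utilization {value:.0%}")
```

In this sketch only the bottleneck stage is fully utilized; every other stage sits partly idle, which is why underutilized resources are a useful hint when searching for a bottleneck.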

To identify performance bottlenecks, resource utilization needs to be carefully inspected. When one or more resources are underutilized, it is usually an indication of a bottleneck somewhere in the system. Bottleneck identification is an incremental process whereby fixing one bottleneck may lead to discovery of another. Bottlenecks should be identified in a sequential manner, during which only one parameter at a time is identified and varied, and the impact of that single change is captured. Varying more than one parameter at a time could conceal the effect of the change. Once a bottleneck has been eliminated, it is essential to measure the performance again to ensure that a new bottleneck has not been introduced.
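
The one-parameter-at-a-time procedure described above can be sketched as follows. The `toy_encoder_fps` function is a made-up stand-in for a real measurement, and the parameter names are hypothetical; in practice the measurement would run the actual workload.

```python
def vary_one_at_a_time(baseline, candidates, measure):
    """Change exactly one parameter from the baseline per trial.

    Returns the baseline score and a list of
    (parameter, value, score change versus baseline) tuples.
    """
    base_score = measure(baseline)
    results = []
    for param, values in candidates.items():
        for value in values:
            if value == baseline[param]:
                continue  # identical to the baseline, nothing to learn
            trial = {**baseline, param: value}  # all other parameters fixed
            results.append((param, value, measure(trial) - base_score))
    return base_score, results


def toy_encoder_fps(cfg):
    """Made-up performance model, used only so the sketch is runnable."""
    compute_fps = 15.0 * cfg["threads"]
    io_fps = 0.5 * cfg["read_mb_per_s"]
    return min(compute_fps, io_fps)


baseline = {"threads": 4, "read_mb_per_s": 100}
candidates = {"threads": [4, 8], "read_mb_per_s": [100, 200]}

base_score, results = vary_one_at_a_time(baseline, candidates, toy_encoder_fps)
print(f"baseline: {base_score} fps")
for param, value, delta in results:
    print(f"{param} -> {value}: {delta:+.1f} fps")
```

In this toy model, doubling the thread count changes nothing because input bandwidth is the limiting factor, while doubling the read bandwidth helps only until the compute stage becomes the new limit. This mirrors the incremental nature of bottleneck identification.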

Performance-related issues can be found and addressed by carefully examining and analyzing various execution profiles, including:

  • Execution history, such as the performance call graphs

  • Execution statistics at various levels, including packages, classes, and methods

  • Execution flow, such as method invocation statistics

It may be necessary to instrument the code with performance indicators for such profiling. Most contemporary operating systems, however, provide performance profiling tools for run-time and static-performance analysis.
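
As an illustration of instrumenting code with performance indicators, the following minimal sketch accumulates wall-clock time and call counts per named stage. The stage names and the dummy work inside `process_frame` are hypothetical placeholders for real encoder stages.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_seconds = defaultdict(float)  # accumulated wall-clock time per stage
stage_calls = defaultdict(int)      # number of times each stage ran


@contextmanager
def timed(stage):
    """Record the elapsed time of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start
        stage_calls[stage] += 1


def process_frame(frame_index):
    """Placeholder for per-frame work; the loops merely consume CPU time."""
    with timed("motion_estimation"):
        sum(i * i for i in range(2000))
    with timed("entropy_coding"):
        sum(range(500))


for n in range(30):
    with timed("frame_total"):
        process_frame(n)

# Report stages from most to least expensive.
for stage in sorted(stage_seconds, key=stage_seconds.get, reverse=True):
    average_us = 1e6 * stage_seconds[stage] / stage_calls[stage]
    print(f"{stage:<18} calls={stage_calls[stage]:3d} avg={average_us:8.1f} us")
```

Manual instrumentation of this kind adds its own overhead, so it is best limited to coarse stages. Finer-grained questions are usually better answered by a profiling tool.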

For identification, analysis, and mitigation of performance bottlenecks in an application, the Intel Performance Bottleneck Analyzer [Footnote 26] framework can be used. It automatically finds and prioritizes architectural bottlenecks for the Intel Core and Atom processors. It combines the latest performance-monitoring techniques with knowledge of static assembly code to identify the bottlenecks. Some difficult and ambiguous cases are prioritized and tagged for further analysis. The tool recreates the most critical paths of instruction execution through a binary. These paths are then analyzed, searching for well-known code-generation issues based on numerous historic performance-monitoring events.

Performance Measurement and Tuning

Performance measurement is needed to verify if the achieved performance meets the design expectations. Furthermore, such measurement allows determination of the actual execution speed of tasks, identification and alleviation of performance bottlenecks, and performance tuning and optimization. It also permits comparison of two tasks—for instance, comparing two video encoding solutions in terms of performance. Thus, it plays an important role in determining the tradeoffs among performance, quality, power use, and amount of compression in various video applications.

Various approaches are available for tuning the system performance of a given application. For instance, compile-time approaches include inserting compiler directives into the code to steer code optimization, using program profilers to modify the object code in multiple passes through the compiler, and so on. Run-time approaches include collecting program traces and event monitoring.


As configurable system parameters affect the overall performance, it is necessary to fix these parameters to certain values to obtain stable, reliable, and repeatable performance measurements. For example, the BIOS settings, the performance optimization options in the operating system, the options in the Intel graphics common user interface (CUI) [Footnote 27], and so on must be selected before performance measurements are taken. In the BIOS settings, the following should be considered: the PCIe latency, clock gating, ACPI settings, CPU configuration, CPU and graphics power-management control, C-state latency, interrupt response-time limits, graphics render standby status, overclocking status, and so on.
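
On the application side, repeatability can be supported by a small harness that records the fixed configuration together with the timings, discards warm-up runs, and reports the spread across repeated runs. The sketch below is only one possible arrangement; the settings dictionary is a hypothetical record kept by the tester, not a real BIOS or driver interface.

```python
import statistics
import time


def measure_repeatably(task, settings, warmup=2, repeats=7):
    """Time `task` several times under one fixed, recorded configuration."""
    for _ in range(warmup):  # warm-up runs are executed but not timed
        task()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return {
        "settings": dict(settings),  # snapshot of the configuration used
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "samples": samples,
    }


# Hypothetical record of the settings that were fixed before measuring.
fixed_settings = {
    "overclocking": "disabled",
    "c_states": "enabled",
    "os_power_plan": "high performance",
}

report = measure_repeatably(
    lambda: sum(i * i for i in range(50_000)), fixed_settings
)
print(f"median {report['median_s'] * 1e3:.2f} ms "
      f"(min {report['min_s'] * 1e3:.2f}, max {report['max_s'] * 1e3:.2f})")
```

A large gap between the minimum and maximum samples suggests that some parameter is still varying between runs, and results taken under different recorded settings should not be compared directly.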

As we noted in the preceding discussion, workload characteristics can influence the performance. Therefore, another important consideration is the workload parameters. However, it is generally impractical to collect and analyze all possible compile-time and run-time performance metrics. Further, the choice of workloads and relevant parameters for performance measurement is often determined by the particular usage and how an application may use the workload. Therefore, it is important to consider practical usage models so as to select some test cases as key performance indicators. Such selection is useful, for instance, when two video encoding solutions are compared that have performance differences but are otherwise competitive.

Performance Metrics

Several run-time performance metrics are useful in different applications. For example, knowledge of the processor and memory utilization patterns can guide the code optimization. A critical-path analysis of programs can reveal the bottlenecks. Removing the bottlenecks or shortening the critical path can significantly improve overall system performance. In the literature, system performance is often reported in terms of cycles per instruction (CPI), millions of instructions per second (MIPS), or millions of floating-point operations per second (MFLOPS). Additionally, memory performance is reported in terms of the memory cycle, that is, the time needed to complete one memory reference, which is typically a multiple of the processor cycle.
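
The relationships among these metrics follow directly from their definitions, as the following sketch shows. The counter values are hypothetical and are chosen only to keep the arithmetic easy to follow.

```python
def cycles_per_instruction(cycles, instructions):
    """CPI: average number of clock cycles spent per retired instruction."""
    return cycles / instructions


def mips(instructions, seconds):
    """Millions of instructions executed per second."""
    return instructions / (seconds * 1e6)


def mflops(floating_point_ops, seconds):
    """Millions of floating-point operations executed per second."""
    return floating_point_ops / (seconds * 1e6)


def mips_from_clock(clock_hz, cpi):
    """Equivalent MIPS computed from the clock frequency and the CPI."""
    return clock_hz / (cpi * 1e6)


# Hypothetical counter readings for a 2-second run on a 3 GHz core.
clock_hz = 3.0e9
seconds = 2.0
cycles = clock_hz * seconds   # 6e9 cycles elapsed
instructions = 4.0e9
floating_point_ops = 1.0e9

cpi = cycles_per_instruction(cycles, instructions)
print(f"CPI    = {cpi:.2f}")
print(f"MIPS   = {mips(instructions, seconds):.0f}")
print(f"MIPS   = {mips_from_clock(clock_hz, cpi):.0f} (from clock and CPI)")
print(f"MFLOPS = {mflops(floating_point_ops, seconds):.0f}")
```

Both MIPS computations agree because the instruction count divided by the elapsed time equals the clock frequency divided by the CPI.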

However, in practice, performance tuning of applications such as video coding often requires measuring other metrics, such as the CPU and GPU utilization, processing or encoding speed in frames per second (FPS), and memory bandwidth in megabytes per second. In hardware-accelerated video applications, sustained hardware performance in terms of clocks per macroblock (CPM) can indicate potential performance variability arising from the graphics drivers and the video applications, so that appropriate tuning can be made at the right level for the best performance. Other metrics that are typically useful for debugging purposes include cache hit ratio, page fault rate, load index, synchronization frequency, memory access pattern, memory read and write frequency, operating system and compiler overhead, inter-process communication overhead, and so on.
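
The video-specific metrics above can be derived from a few raw measurements. The sketch below assumes 16×16 macroblocks, a constant hardware clock, and an engine that is busy for the whole measured interval; under those assumptions it estimates clocks per macroblock as elapsed clock cycles divided by macroblocks processed. All input numbers are hypothetical.

```python
import math


def macroblocks_per_frame(width, height, mb_size=16):
    """Number of macroblocks per frame, rounding partial blocks up."""
    return math.ceil(width / mb_size) * math.ceil(height / mb_size)


def frames_per_second(frames, seconds):
    """Processing or encoding speed in frames per second."""
    return frames / seconds


def clocks_per_macroblock(clock_hz, seconds, frames, width, height):
    """Estimated CPM: elapsed clock cycles divided by macroblocks processed.

    Assumes a constant clock and a fully busy engine during the interval.
    """
    total_macroblocks = frames * macroblocks_per_frame(width, height)
    return clock_hz * seconds / total_macroblocks


def bandwidth_mb_per_s(bytes_moved, seconds):
    """Memory bandwidth in megabytes (10^6 bytes) per second."""
    return bytes_moved / (seconds * 1e6)


# Hypothetical run: 600 frames of 1920x1080 video processed in 5 seconds
# on hardware clocked at 1 GHz, reading each 4:2:0 frame once from memory.
frames, width, height, seconds, clock_hz = 600, 1920, 1080, 5.0, 1.0e9
bytes_per_frame = width * height * 3 // 2  # 8-bit 4:2:0: 1.5 bytes per pixel

fps = frames_per_second(frames, seconds)
cpm = clocks_per_macroblock(clock_hz, seconds, frames, width, height)
read_bw = bandwidth_mb_per_s(frames * bytes_per_frame, seconds)

print(f"{macroblocks_per_frame(width, height)} macroblocks per frame")
print(f"{fps:.0f} FPS, {cpm:.1f} clocks per macroblock, {read_bw:.1f} MB/s read")
```

Because clocks per macroblock is normalized by both resolution and clock frequency, tracking it across driver or application versions can help expose variability that a raw FPS figure would hide.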

Tools and Applications

The importance of performance measurement can be judged by the large number of available tools. Some performance-analysis tools support sampling and compiler-based instrumentation for application profiling, sometimes with context-sensitive call graph capability. Others support nonintrusive and low-overhead hardware-event-based sampling and profiling. Yet others utilize the hardware-performance counters offered by modern microprocessors. Some tools can diagnose performance problems related to data locality, cache utilization, and thread interactions. In this section, we briefly discuss a couple of popular tools suitable for performance measurement of video applications, particularly the GPU-accelerated applications. Other popular tools, such as Windows Perfmon, Windows Xperf, and Intel Graphics Performance Analyzer, are briefly described in Chapter 6.

VTune Amplifier

The VTune Amplifier XE 2013 is a popular performance profiler developed by Intel [Footnote 28]. It supports performance profiling for various programming languages, including C, C++, FORTRAN, Assembly, Java, OpenCL, and OpenMP 4.0. It collects a rich set of performance data for hotspots, call trees, threading, locks and waits, DirectX, memory bandwidth, and so on, and provides the data needed to meet a wide variety of performance tuning needs.

Hotspot analysis provides a sorted list of the functions using high CPU time, indicating the locations where performance tuning will yield the biggest benefit. The tool also supports tuning of multiple threads with locks and waits analysis. It enables users to determine the causes of slow performance in parallel programs by quickly finding common problems, such as a thread waiting too long on a lock while the cores are underutilized during the wait. Profiles like hotspots and locks and waits use a software data collector that works on both Intel and compatible processors. The tool also provides advanced hotspot analysis that uses the on-chip Performance Monitoring Unit (PMU) on Intel processors to collect data by hardware event sampling with very low overhead and an increased resolution of 1 ms, making it suitable for identifying small and quick functions as well. Additionally, the tool supports advanced hardware event profiles, such as memory bandwidth analysis, memory access, and branch mispredictions, to help find tuning opportunities. An optional stack sample collection is supported in the latest version to identify the calling sequence. Furthermore, profiling a remote system and profiling without restarting the application are also supported.
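
The notion of a hotspot list, that is, functions ranked by the CPU time they consume, can be illustrated with the `cProfile` and `pstats` modules from the Python standard library. This is only an analogy for the kind of report described above: it does not use VTune, it relies on deterministic tracing rather than hardware event sampling, and the two kernel functions are hypothetical. The `get_stats_profile` call requires Python 3.9 or later.

```python
import cProfile
import pstats


def heavy_kernel():
    """Hypothetical expensive routine (stands in for a costly stage)."""
    return sum(i * i for i in range(50_000))


def light_kernel():
    """Hypothetical cheap routine."""
    return sum(range(1_000))


def workload():
    for _ in range(5):
        heavy_kernel()
        light_kernel()


profiler = cProfile.Profile()
profiler.runcall(workload)  # profile a single call of the workload

# Per-function statistics, keyed by function name.
function_profiles = pstats.Stats(profiler).get_stats_profile().func_profiles

# Rank functions by cumulative time, most expensive first.
hotspots = sorted(
    function_profiles.items(), key=lambda item: item[1].cumtime, reverse=True
)
for name, profile in hotspots[:5]:
    print(f"{name:<20} cumulative={profile.cumtime:.4f}s self={profile.tottime:.4f}s")
```

The functions at the top of such a list are the ones where tuning effort is most likely to pay off.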


GPUView

Matthew Fisher and Steve Pronovost originally developed GPUView, which is a tool for determining the performance of the GPU and the CPU. Later, this tool was incorporated into the Windows Performance Toolkit, and it can be downloaded as part of the Windows SDK [Footnote 29]. It looks at performance with regard to direct memory access (DMA) buffer processing and all other video processing on the video hardware. For GPU-accelerated DirectX applications, GPUView is a powerful tool for understanding the relationship between the work done on the CPU and the work done on the GPU. It uses the Event Tracing for Windows (ETW) mechanism for measuring and analyzing detailed system and application performance and resource usage. The data-collection process involves enabling trace capture, running the test application scenario for which performance analysis is needed, and stopping the capture, which saves the data in an event trace log (ETL) file. The ETL file can be analyzed on the same or a different machine using GPUView, which presents the ETL information in a graphical format, as shown in Figure 5-11.

Figure 5-11. A screenshot from GPUView showing activity in different threads

GPUView is very useful in the analysis and debugging of hardware-accelerated video applications. For example, if a video playback application is observed to drop video frames, the user experience will be negatively affected. In such cases, careful examination of the event traces using GPUView can help identify the issue. Figure 5-12 illustrates an example event trace of a normal video playback, where the workload is evenly distributed at regular intervals. The blue vertical lines show the regular vsync events and the red vertical lines show the present events.

Figure 5-12. Event trace of a regular video playback

Figure 5-13 shows event traces of the same video playback application, but when it drops video frames as the frame presentation deadline expires. The profile looks quite different from the regular pattern seen in Figure 5-12. In the zoomed-in version, the present event lines are visible, and they make it apparent that long delays occur from time to time when the application sends video data packets to the GPU for decoding. Thus, GPUView makes it easier to identify and address the root cause of such an issue.

Figure 5-13. Event trace of video playback with frame drops
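
The visual inspection described above, namely looking for irregular gaps between present events, can also be expressed as a simple check on timestamps. The sketch below is not a GPUView feature; it assumes that present-event timestamps have been obtained by some means, and the function name, tolerance, and sample data are hypothetical.

```python
def find_late_presents(present_times_ms, frame_interval_ms, tolerance=0.5):
    """Find gaps between consecutive present events that are too long.

    Returns a list of (index, gap in ms, estimated frames missed) for every
    present event that arrives more than (1 + tolerance) frame intervals
    after the previous one.
    """
    late = []
    for i in range(1, len(present_times_ms)):
        gap = present_times_ms[i] - present_times_ms[i - 1]
        if gap > frame_interval_ms * (1 + tolerance):
            missed = round(gap / frame_interval_ms) - 1
            late.append((i, gap, missed))
    return late


# Hypothetical present timestamps (ms) for 50 fps playback (20 ms interval).
regular = [0, 20, 40, 60, 80, 100]
with_drops = [0, 20, 40, 60, 120, 140, 160, 200, 220]

print("regular playback:", find_late_presents(regular, 20))
for index, gap, missed in find_late_presents(with_drops, 20):
    print(f"present #{index}: gap {gap} ms, about {missed} frame(s) missed")
```

A check of this kind only flags where the delays occur. Finding out why the application was late still requires examining the surrounding CPU and GPU activity in the trace.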


Summary

In this chapter we discussed the CPU clock speed and the extent of the possible increase in clock speed. We noted that the focus in modern processor design has shifted from purely increasing clock speed toward a more useful combination of power and performance. We then highlighted the motivation for achieving high performance for video coding applications, and the tradeoffs necessary to achieve such performance.

We then delved into a discussion of resource utilization and the factors influencing encoding speed. This was followed by a discussion of various performance-optimization approaches, including algorithmic optimization, compiler and code optimization, and several parallelization techniques. Note that some of these parallelization techniques can be combined to obtain even higher performance, particularly in video coding applications. We also discussed overclocking and common performance bottlenecks in video coding applications. Finally, we presented various performance-measurement considerations, tools, applications, methods, and metrics.