The ubiquity of video has been made possible by the establishment of a common representation of video signals through international standards, with the goal of achieving common formats and interoperability among products from different vendors.

In the late 1980s, the need for standardization of digital video was recognized, and special expert groups from the computer, communications, and video industries came together to formulate practical, low-cost, and easily implementable standards for digital video storage and transmission. To determine these standards, the expert groups reviewed a variety of video data compression techniques, data structures, and algorithms, eventually agreeing upon a few common technologies, which are described in this chapter in some detail.

In the digital video field, many international standards exist to address various industry needs. For example, standard video formats are essential for exchanging digital video between products and applications. Since the amount of data necessary to represent digital video is huge, the data needs to be exchanged in compressed form, which necessitates video data compression standards. Depending on the industry, standards have addressed different aspects of end-user applications. For example, display resolutions are standardized in the computer industry, digital studio formats in the television industry, and network protocols in the telecommunication industry. As various usages of digital video have emerged to bring these industries ever closer together, standardization efforts have concurrently addressed those cross-industry needs and requirements. In this chapter we discuss the important milestones in international video coding standards, as well as other video coding algorithms popular in the industry.

Overview of International Video Coding Standards

International video coding standards are defined by committees of experts from organizations such as the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU). The goal of this standardization is to have common video formats across the industry, and to achieve interoperability among different vendors and manufacturers of video codec hardware and software.

The standardization of algorithms started with image compression schemes such as JBIG (ITU-T Rec. T.82 and ISO/IEC 11544, March 1993) for binary images used in fax and other applications, and the more general JPEG (ITU-T Rec. T.81 and ISO/IEC 10918-1), which includes color images as well. The JPEG standardization activities started in 1986, and the standard was ratified in 1992 by ITU-T and in 1994 by ISO. The main standardization activities for video compression algorithms started in the 1980s with ITU-T H.261, ratified in 1988, which was the first milestone standard for visual telecommunication. Following that effort, standardization activities increased with the rapid advancement in the television, film, computer, communication, and signal processing fields, and with the advent of new usages requiring contributions from all these diverse industries. These efforts subsequently produced the MPEG-1, H.263, MPEG-2, MPEG-4 Part 2, AVC/H.264, and HEVC/H.265 algorithms. In the following sections, we briefly describe the major international standards related to image and video coding.

JPEG

JPEG is a continuous-tone still image compression standard, designed for applications like desktop publishing, graphic arts, color facsimile, newspaper wirephoto transmission, medical imaging, and the like. The baseline JPEG algorithm uses a DCT-based coding scheme where the input is divided into 8×8 blocks of pixels. Each block undergoes a two-dimensional forward DCT, followed by uniform quantization. The resulting quantized coefficients are scanned in a zigzag order to form a one-dimensional sequence where high-frequency coefficients, likely to be zero-valued, are placed later in the sequence to facilitate run-length coding. After run-length coding, the resulting symbols undergo more efficient entropy coding.

The first DCT coefficient, often called the DC coefficient as it is a measure of the average value of the pixel block, is differentially coded with respect to the previous block. The run lengths of the AC coefficients, which have non-zero frequency, are coded using variable-length Huffman codes, which assign shorter codewords to more probable symbols. Figure 3-1 shows the JPEG codec block diagram.

Figure 3-1. Baseline JPEG codec block diagram
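
The scan-and-code steps described above can be sketched in a few lines of Python. This is a simplified illustration rather than the normative JPEG procedure: the function names are hypothetical, the "EOB" marker stands in for the end-of-block symbol, and details such as the Huffman tables and the limit on run lengths are left out.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):              # s = row + col selects one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # Even diagonals run from bottom-left to top-right, odd ones the other way.
        for r in (reversed(rows) if s % 2 == 0 else rows):
            order.append((r, s - r))
    return order


def run_length_ac(block):
    """Zigzag-scan a quantized block; return the DC term and (zero_run, value) pairs."""
    scanned = [block[r][c] for r, c in zigzag_order(len(block))]
    pairs, run = [], 0
    for value in scanned[1:]:               # index 0 is the DC coefficient
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    pairs.append("EOB")                     # simplified: always close with a marker
    return scanned[0], pairs


def dc_differences(dc_values):
    """Code each DC term as the difference from the previous block's DC term."""
    previous, diffs = 0, []
    for dc in dc_values:
        diffs.append(dc - previous)
        previous = dc
    return diffs
```

For a block whose only non-zero quantized coefficients are 12 at (0, 0), 5 at (0, 1), and -3 at (2, 0), `run_length_ac` returns the DC term 12 and the AC sequence `[(0, 5), (1, -3), "EOB"]`; the trailing zeros cost nothing beyond the end marker.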

H.261

H.261 is the first important and practical digital video coding standard adopted by the industry. It is the video coding standard for audiovisual services at p×64 kbps, where p = 1, ..., 30, primarily aimed at providing videophone and video-conferencing services over ISDN, and unifying all aspects of video transmission for such applications in a single standard. The objectives include delivering video in real time, typically at 15–30 fps, with minimum delay (less than 150 ms). Although it was a successful standard providing industry-wide interoperability, a common format, and compression techniques, H.261 is obsolete and is rarely used today.

In H.261, it is mandatory for all codecs to operate at quarter-CIF (QCIF) video format, while the use of CIF is optional. Since the uncompressed video bit rates for CIF and QCIF at 29.97 fps are approximately 36.46 Mbps and 9.11 Mbps, respectively (assuming 4:2:0 sampling with 8 bits per sample), it is extremely difficult to transport these video signals using an ISDN channel while providing reasonable video quality. To accomplish this goal, H.261 divides a video into a hierarchical block structure comprising pictures, groups of blocks (GOB), macroblocks (MB), and blocks. A macroblock consists of four 8×8 luma blocks and two 8×8 chroma blocks; a 3×11 array of macroblocks, in turn, constitutes a GOB. A QCIF picture has three GOBs, while a CIF picture has 12. The hierarchical data structure is shown in Figure 3-2.

Figure 3-2. Data structures of the H.261 video multiplex coder
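
The arithmetic behind the raw bit rates and the GOB counts can be checked with a short sketch. It assumes 4:2:0 sampling at 8 bits per sample and 16×16 macroblocks grouped 33 to a GOB (the 3×11 array mentioned above); the function names are illustrative only.

```python
def raw_bit_rate_mbps(width, height, fps, bits_per_sample=8):
    """Uncompressed bit rate in Mbps for a 4:2:0 picture format (assumed)."""
    # In 4:2:0, the two chroma planes together add half as many samples as luma.
    samples_per_frame = width * height * 3 // 2
    return samples_per_frame * bits_per_sample * fps / 1e6


def gob_count(width, height, mb_size=16, mbs_per_gob=33):
    """Number of groups of blocks when each GOB is a 3 x 11 array of macroblocks."""
    macroblocks = (width // mb_size) * (height // mb_size)
    return macroblocks // mbs_per_gob


for name, (w, h) in {"CIF": (352, 288), "QCIF": (176, 144)}.items():
    print(name, round(raw_bit_rate_mbps(w, h, 29.97), 2), "Mbps,", gob_count(w, h), "GOBs")
```

Since a CIF picture has four times as many samples as a QCIF picture, both its raw bit rate and its GOB count are four times larger.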

The H.261 source coding algorithm is a hybrid of intra-frame and inter-frame coding, exploiting spatial and temporal redundancies. Intra-frame coding is similar to baseline JPEG, where block-based 8×8 DCT is performed, and the DCT coefficients are quantized. The quantized coefficients undergo entropy coding using variable-length Huffman codes, which achieves bit-rate reduction using statistical properties of the signal.

Inter-frame coding involves motion-compensated inter-frame prediction and removes the temporal redundancy between pictures. Prediction is performed only in the forward direction; there is no notion of bi-directional prediction. While the motion compensation is performed with integer-pel accuracy, a loop filter can be switched into the encoder to improve picture quality by removing coded high-frequency noise when necessary. Figure 3-3 shows a block diagram of the H.261 source encoder.

Figure 3-3. Block diagram of the source encoder of ITU-T H.261

MPEG-1

In the 1990s, instigated by the market success of compact disc digital audio, CD-ROMs made remarkable inroads into the data storage domain. This prompted the inception of the MPEG-1 standard, targeted and optimized for applications requiring 1.2 to 1.5 Mbps with video home system (VHS)-quality video. One of the initial motivations was to fit compressed video into widely available CD-ROMs; however, a surprisingly large number of new applications have emerged to take advantage of the highly compressed video with reasonable video quality provided by the standard algorithm. MPEG-1 remains one of the most successful developments in the history of video coding standards. Arguably, however, the most well-known part of the MPEG-1 standard is the MP3 audio format that it introduced. The intended applications for MPEG-1 include CD-ROM storage, multimedia on computers, and so on. The MPEG-1 standard was ratified as ISO/IEC 11172 in 1991. The standard consists of the following five parts:

  1. Systems: Deals with storage and synchronization of video, audio, and other data.

  2. Video: Defines standard algorithms for compressed video data.

  3. Audio: Defines standard algorithms for compressed audio data.

  4. Conformance: Defines tests to check correctness of the implementation of the standard.

  5. Reference Software: Software associated with the standard as an example for correct implementation of the encoding and decoding algorithms.

The MPEG-1 bitstream syntax is flexible and consists of six layers, each performing a different logical or signal-processing function. Figure 3-4 depicts the various layers arranged in an onion structure.

Figure 3-4. Onion structure of MPEG-1 bitstream syntax

MPEG-1 is designed for coding progressive video sequences, and the recommended picture size is 360×240 (or 352×288, a.k.a. CIF) at about 1.5 Mbps. However, it is not restricted to this format, and can be applied to higher bit rates and larger image sizes. The intended chroma format is 4:2:0 with 8 bits of pixel depth. The standard mandates real-time decoding and supports features to facilitate interactivity with stored bitstreams. It only specifies syntax for the bitstream and the decoding process, allowing sufficient flexibility for the encoder implementation. Encoders are usually designed to meet specific usage needs, but they are expected to provide sufficient tradeoffs between coding efficiency and complexity.

The main goal of the MPEG-1 video algorithm, as in any other standard, is to achieve the highest possible video quality for a given bit rate. Toward this goal, the MPEG-1 compression approach is similar to that of H.261: it is also a hybrid of intra- and inter-frame redundancy-reduction techniques. For intra-frame coding, the frame is divided into 8×8 pixel blocks, which are transformed to the frequency domain using an 8×8 DCT, quantized, and zigzag scanned; the run lengths of the resulting coefficients are then coded using variable-length Huffman codes.

Temporal redundancy is reduced by computing a difference signal, namely the prediction error, between the original frame and its motion-compensated prediction constructed from a reconstructed reference frame. However, temporal redundancy reduction in MPEG-1 is different from H.261 in a couple of significant ways:

  • MPEG-1 permits bi-directional temporal prediction, providing higher compression for a given picture quality than would be attainable using forward-only prediction. For bi-directional prediction, some frames are encoded using a past frame, a future frame (in display order), or both as the prediction reference. A block of pixels can be predicted from a block in the past reference frame, from a block in the future reference frame, or from the average of two blocks, one from each reference frame. In bi-directional prediction, higher compression is achieved at the expense of greater encoder complexity and additional coding delay. However, it is still very useful for storage and other off-line applications.

  • Further, MPEG-1 introduces half-pel (a.k.a. half-pixel) accuracy for motion compensation and eliminates the loop filter. The half-pel accuracy partly compensates for the benefit provided by the H.261 loop filter in that high-frequency coded noise does not propagate and coding efficiency is not sacrificed.
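
The three prediction choices available to a bi-directionally predicted block can be illustrated with a minimal sketch. It assumes that blocks are given as flat lists of pixel values and that the encoder simply picks the candidate with the smallest sum of absolute differences; actual encoders are free to use other decision rules, and all names here are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel lists."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def best_bidirectional_prediction(current, past, future):
    """Pick forward, backward, or interpolated prediction for one block."""
    # Interpolated prediction: rounded average of the two reference blocks.
    average = [(p + f + 1) // 2 for p, f in zip(past, future)]
    candidates = {"forward": past, "backward": future, "interpolated": average}
    mode = min(candidates, key=lambda m: sad(current, candidates[m]))
    return mode, candidates[mode]


current = [10, 20, 30, 40]
past = [8, 18, 28, 38]       # matching block in the past reference frame
future = [12, 22, 32, 42]    # matching block in the future reference frame
mode, prediction = best_bidirectional_prediction(current, past, future)
print(mode, sad(current, prediction))
```

In this toy case the block brightens steadily over time, so the average of the two references predicts it exactly while either single reference leaves a residual.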

The video sequence layer specifies parameters such as the size of the video frames, frame rate, bit rate, and so on. The group of pictures (GOP) layer provides support for random access, fast search, and editing. The first frame of a GOP must be intra-coded (I-frame), where compression is achieved only in the spatial dimension using DCT, quantization, and variable-length coding. The I-frame is followed by an arrangement of forward-predictive coded frames (P-frames) and bi-directionally predictive coded frames (B-frames). I-frames provide the ability for random access to the bitstream and for fast search (or VCR-like trick play, such as fast-forward and fast-rewind), as they are coded independently and serve as entry points for further decoding.

The picture layer deals with a particular frame and contains information about the frame type (I, P, or B) and the display order of the frame. The bits corresponding to the motion vectors and the quantized DCT coefficients are packaged in the slice layer, the macroblock layer, and the block layer. A slice is a contiguous segment of macroblocks. In the event of a bit error, the slice layer helps resynchronize the bitstream during decoding. The macroblock layer contains the associated motion vector bits and is followed by the block layer, which consists of the coded quantized DCT coefficients. Figure 3-5 shows the MPEG picture structure in coding and display order, which applies to both MPEG-1 and MPEG-2.

Figure 3-5. I/P/B frame structure, prediction relationships, coding order, and display order
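
Because a B-frame depends on a reference frame that follows it in display order, the encoder has to emit that reference first. The reordering can be sketched as follows, assuming frames are labeled with a type letter and a display index (a labeling convention chosen here for illustration).

```python
def coding_order(display_frames):
    """Reorder frame labels such as "I0" or "B1" from display to coding order."""
    coded, waiting_b = [], []
    for frame in display_frames:
        if frame[0] == "B":
            waiting_b.append(frame)     # B-frames wait for their future reference
        else:
            coded.append(frame)         # the I- or P-frame reference is coded first
            coded.extend(waiting_b)     # then the B-frames that depend on it
            waiting_b = []
    coded.extend(waiting_b)             # trailing B-frames with no later reference
    return coded


print(coding_order(["I0", "B1", "B2", "P3", "B4", "B5", "P6"]))
# → ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```

The decoder performs the inverse reordering before display, which is one source of the additional delay associated with B-frames.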

MPEG-2

MPEG-2 was defined as the standard for generic coding of moving pictures and associated audio. The standard was specified by a joint technical committee of the ISO/IEC and ITU-T, and was ratified in 1993, both as the ISO/IEC international standard 13818 and as the ITU-T Recommendation H.262.

With a view toward resolving the existing issues in MPEG-1, the standardization activity in MPEG-2 focused on the following considerations:

  • Extend the number of audio compression channels from 2 channels to 5.1 channels.

  • Add standardization support for interlaced video for broadcast applications.

  • Provide more standard profiles, beyond the Constrained Parameters Bitstream available in MPEG-1, in order to support higher-resolution video contents.

  • Extend support for color sampling from 4:2:0, to include 4:2:2 and 4:4:4.

For MPEG standards, the standards committee addressed video and audio compression, as well as system considerations for multiplexing the compressed audio-visual data. In MPEG-2 applications, the compressed video and audio elementary streams are multiplexed to construct a program stream; several program streams are packetized and combined to form a transport stream before transmission. However, in the following discussion, we will focus on MPEG-2 video compression.

MPEG-2 is targeted at a variety of applications at a bit rate of 2 Mbps or more, with a quality ranging from good-quality NTSC to HDTV. It is widely used as the format of digital television signals for terrestrial, cable, and direct-broadcast satellite TV systems; other typical applications include digital videocassette recorders (VCR), digital video discs (DVD), and the like. As a generic standard supporting a variety of applications generally ranging from 2 Mbps to 40 Mbps, MPEG-2 targets a compression ratio in the range of 30 to 40. To provide application independence, MPEG-2 supports a variety of video formats with resolutions ranging from source input format (SIF) to HDTV. Table 3-1 shows some typical video formats used in MPEG-2 applications.

Table 3-1. Typical MPEG-2 Parameters

The aim of MPEG-2 is to provide better picture quality while keeping the provisions for random access to the coded bitstream. However, this is a rather difficult task to accomplish. Owing to the high compression demanded by the target bit rates, good picture quality cannot be achieved by intra-frame coding alone. Conversely, the random-access requirement is best satisfied with pure intra-frame coding. This dilemma necessitates a delicate balance between intra- and inter-picture coding, which leads to the definition of I, P, and B pictures, similar to MPEG-1. I-frames are the least compressed, and contain approximately the full information of the picture in a quantized form in the frequency domain, providing robustness against errors. The P-frames are predicted from past I- or P-frames, while the B-frames offer the greatest compression by using past and future I- or P-frames for motion compensation. However, B-frames are the most vulnerable to channel errors.

An MPEG-2 encoder first selects an appropriate spatial resolution for the signal, followed by a block-matching motion estimation to find the displacement of a macroblock (16×16 or 16×8 pixel area) in the current frame relative to a macroblock obtained from a previous or future reference frame, or from their average. The search for the best matching block is based on the mean absolute difference (MAD) distortion criterion; the best match is the candidate for which the accumulated absolute pixel differences over the macroblock are minimized. The motion estimation process then defines a motion vector representing the displacement of the current block’s location from the best matched block’s location. To reduce temporal redundancy, motion compensation is used both for causal prediction of the current picture from a previous reference picture and for non-causal, interpolative prediction from past and future reference pictures. The prediction of a picture is constructed based on the motion vectors.
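
Block matching with the MAD criterion can be sketched as an exhaustive search over a small window. The standard does not prescribe a search strategy, so the full search, the window size, and the function names below are assumptions made for illustration; frames are plain lists of rows of luma values.

```python
def mad(current, cx, cy, reference, rx, ry, size):
    """Mean absolute difference between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(current[cy + dy][cx + dx] - reference[ry + dy][rx + dx])
    return total / (size * size)


def full_search(current, reference, x, y, size=16, search_range=4):
    """Return the motion vector (mx, my) with the lowest MAD, and that MAD."""
    height, width = len(reference), len(reference[0])
    best_mv, best_cost = (0, 0), float("inf")
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            rx, ry = x + mx, y + my
            # Skip candidates that fall partly outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = mad(current, x, y, reference, rx, ry, size)
            if cost < best_cost:
                best_mv, best_cost = (mx, my), cost
    return best_mv, best_cost
```

The returned vector points from the block's position in the current frame to the best matching block in the reference frame. Practical encoders usually replace the exhaustive loop with faster search patterns, since the cost of a full search grows with the square of the search range.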

To reduce spatial redundancy, the difference signal (that is, the prediction error) is further compressed using the block transform coding technique that employs the two-dimensional orthonormal 8×8 DCT to remove spatial correlation. The resulting transform coefficients are quantized in an irreversible process that discards less important information, and are then ordered in an alternating or zigzag scanning pattern. In MPEG-2, adaptive quantization is used at the macroblock layer, allowing smooth bit-rate control and perceptually uniform video quality. Finally, the motion vectors are combined with the residual quantized coefficients, and are transmitted using variable-length Huffman codes. The Huffman coding tables are pre-determined and optimized for a limited range of compression ratios appropriate for some target applications. Figure 3-6 shows an MPEG-2 video encoding block diagram.

Figure 3-6. The MPEG-2 video encoding block diagram
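
The transform and quantization stages can be illustrated with a small, unoptimized sketch of the two-dimensional orthonormal DCT followed by uniform quantization with a single step size. The single step size is a simplification (MPEG-2 uses quantization matrices and a per-macroblock scale), and the helper names are hypothetical.

```python
import math


def dct_matrix(n=8):
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def dct2(block):
    """Forward 2-D DCT: C * X * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """Inverse 2-D DCT: C^T * Y * C (C is orthonormal, so C^-1 = C^T)."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def quantize(coeffs, step):
    """Irreversible step: map each coefficient to an integer level."""
    return [[round(v / step) for v in row] for row in coeffs]


def dequantize(levels, step):
    return [[level * step for level in row] for row in levels]
```

Without quantization, `idct2(dct2(block))` recovers the block up to floating-point error; the loss comes entirely from `quantize`, which cannot be undone by `dequantize`. For a flat 8×8 block of value 100, all the energy ends up in the DC coefficient, which equals 800.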

The bitstream syntax in MPEG-2 is divided into subsets known as profiles, which specify constraints on the syntax. Profiles are further divided into levels, which are sets of constraints imposed on parameters in the bitstream. There are five profiles defined in MPEG-2:

  • Main: Aims at the maximum quality of standard definition pictures.

  • Simple: Is directed to memory savings by not interpolating pictures.

  • SNR scalable: Aims to provide better signal-to-noise ratio on demand by using more than one layer of quantization.

  • Spatially scalable: Aims to provide variable resolution on demand by using additional layers of weighted and reconstructed reference pictures.

  • High: Intended to support 4:2:2 chroma format and full scalability.

Within each profile, up to four levels are defined:

  • Low: Provides compatibility with H.261 or MPEG-1.

  • Main: Corresponds to conventional TV.

  • High 1440: Roughly corresponds to HDTV, with 1,440 samples per line.

  • High: Roughly corresponds to HDTV, with 1,920 samples per line.

The Main profile, Main level (MP @ ML) reflects the initial focus of MPEG-2 with regard to entertainment applications. The permitted profile-level combinations are: Simple profile with Main level, Main profile with all levels, SNR scalable profile with Low and Main levels, Spatially scalable profile with High 1440 level, and High profile with all levels except Low level.
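
The permitted combinations listed above form a small lookup table, sketched below. The dictionary and function names are illustrative only; the contents mirror the combinations stated in the text.

```python
LEVELS = ("Low", "Main", "High 1440", "High")

# Permitted levels per profile, as enumerated in the text above.
PERMITTED_LEVELS = {
    "Simple": {"Main"},
    "Main": set(LEVELS),
    "SNR scalable": {"Low", "Main"},
    "Spatially scalable": {"High 1440"},
    "High": {"Main", "High 1440", "High"},
}


def is_permitted(profile, level):
    """Return True if the profile-level combination is allowed."""
    return level in PERMITTED_LEVELS.get(profile, set())


print(is_permitted("Main", "Main"))                  # MP @ ML
print(is_permitted("High", "Low"))
print(sum(len(levels) for levels in PERMITTED_LEVELS.values()))
```

Of the 20 possible profile-level pairs, 11 are permitted.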

The bitstream syntax can also be divided as follows:

  • Non-scalable syntax: A super-set of MPEG-1, featuring extra compression tools for interlaced video signals along with variable bit rate, alternate scan, concealment motion vectors, intra-DCT format, and so on.

  • Scalable syntax: A base layer similar to the non-scalable syntax and one or more enhancement layers with the ability to enable the reconstruction of useful video.

The structure of the compressed bitstream is shown in Figure 3-7. The layers are similar to those of MPEG-1. A compressed video sequence starts with a sequence header containing picture resolutions, picture rate, bit rate, and so on. There is a sequence extension header in MPEG-2 containing video format, color primaries, display resolution, and so on. The sequence extension header may be followed by an optional GOP header having the time code, which is subsequently followed by a frame header containing temporal reference, frame type, video buffering verifier (VBV) delay, and so on. The frame header can be succeeded by a picture coding extension containing interlacing, DCT type and quantizer-scale type information, which is usually followed by a slice header to facilitate resynchronization. Inside a slice, several macroblocks are grouped together, where the macroblock address and type, motion vector, coded block pattern, and so on are placed before the actual VLC-coded quantized DCT coefficients for all the blocks in a macroblock. The slices can start at any macroblock location, and they are not restricted to the beginning of macroblock rows.

Figure 3-7. Structure of MPEG-2 video bitstream syntax
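
Each of the headers described above begins with a byte-aligned start code: the three-byte prefix 0x000001 followed by one identifying byte. The sketch below walks a byte string and names the headers it finds. The start-code values are taken from the MPEG-2 video specification rather than from this chapter, and the function names and the synthetic test stream are made up for illustration.

```python
START_CODE_PREFIX = b"\x00\x00\x01"

NAMED_CODES = {
    0x00: "picture",
    0xB3: "sequence header",
    0xB5: "extension",
    0xB7: "sequence end",
    0xB8: "GOP header",
}


def start_code_name(code):
    """Map the byte after the prefix to a header name."""
    if 0x01 <= code <= 0xAF:
        return "slice"                  # the value encodes the slice's vertical position
    return NAMED_CODES.get(code, "other")


def scan_start_codes(data):
    """Return (offset, name) for every start code found in the byte string."""
    found = []
    pos = data.find(START_CODE_PREFIX)
    while pos != -1 and pos + 3 < len(data):
        found.append((pos, start_code_name(data[pos + 3])))
        pos = data.find(START_CODE_PREFIX, pos + 3)
    return found


# Synthetic stream: headers separated by single placeholder payload bytes.
stream = bytes.fromhex("000001b3aa000001b5bb000001b800000100cc00000101dd000001b7")
for offset, name in scan_start_codes(stream):
    print(offset, name)
```

The same scan is what lets a decoder resynchronize at the next slice header after a bit error, since start codes cannot be emulated by the coded data within a conforming stream.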

Since the MPEG-2 base layer is a super-set of MPEG-1, standard-compliant decoders can decode MPEG-1 bitstreams, which provides backward compatibility. Furthermore, MPEG-2 is capable of selecting the optimum mode for motion-compensated prediction, such that the current frame or field can be predicted either from the entire reference frame or from the top or bottom field of the reference frame, thereby finding a better relationship of the fields. MPEG-2 also adds the alternate scanning pattern, which suits interlaced video better than the zigzag scanning pattern. In addition, a choice is offered between linear and nonlinear quantization tables, and up to 11 bits of DC precision is supported for intra macroblocks. These are improvements on MPEG-1, which does not support nonlinear quantization tables and provides only 8 bits of intra-DC precision. At the same bit rate, MPEG-2 yields better quality than MPEG-1, especially for interlaced video sources. Moreover, MPEG-2 is more flexible for parameter variation at a given bit rate, which helps achieve smoother buffer control. However, these benefits and improvements come at the expense of increased complexity.

H.263

H.263, defined by ITU-T, is aimed at low-bit-rate video coding but does not specify a constraint on video bit rate; such constraints are given by the terminal or the network. The objective of H.263 is to provide significantly better picture quality than its predecessor, H.261. Conceptually, H.263 is network independent and can be used for a wide range of applications, but its target applications are visual telephony and multimedia on low-bit-rate networks like PSTN, ISDN, and wireless networks. Some important considerations for H.263 include small overhead, low complexity resulting in low cost, interoperability with existing video communication standards (e.g., H.261, H.320), robustness to channel errors, and quality of service (QoS) parameters. Based on these considerations, an efficient algorithm was developed, which gives manufacturers the flexibility to make tradeoffs between picture quality and complexity. Compared to H.261, it provides the same subjective image quality at less than half the bit rate.

Similar to other standards, H.263 uses inter-picture prediction to reduce temporal redundancy and transform coding of the residual prediction error to reduce spatial redundancy. The transform coding is based on 8×8 DCT. The transformed signal is quantized with a scalar quantizer, and the resulting symbol is variable length coded before transmission. At the decoder, the received signal is inverse quantized and inverse transformed to reconstruct the prediction error signal, which is added to the prediction, thus creating the reconstructed picture. The reconstructed picture is stored in a frame buffer to serve as a reference for the prediction of the next picture. The encoder consists of an embedded decoder where the same decoding operation is performed to ensure the same reconstruction at both the encoder and the decoder.
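
The role of the embedded decoder can be shown with a deliberately reduced sketch: a one-dimensional signal, prediction from the previous reconstructed sample (standing in for motion-compensated prediction from the previous reconstructed picture), and a plain uniform quantizer with no transform. These simplifications and all names are assumptions for illustration, not part of H.263.

```python
def encode(samples, step):
    """Closed-loop predictive coding; returns levels and the encoder's references."""
    levels, references = [], []
    reference = 0                           # both sides start from the same state
    for sample in samples:
        residual = sample - reference       # prediction error
        level = round(residual / step)      # quantized residual to transmit
        levels.append(level)
        # Embedded decoder: rebuild exactly what the receiver will rebuild.
        reference = reference + level * step
        references.append(reference)
    return levels, references


def decode(levels, step):
    """Reconstruct samples from the transmitted levels."""
    output, reference = [], 0
    for level in levels:
        reference = reference + level * step
        output.append(reference)
    return output


samples = [9, 13, 19, 17, 25]
levels, encoder_refs = encode(samples, step=4)
decoded = decode(levels, step=4)
print(levels, decoded, decoded == encoder_refs)
```

Because the encoder predicts from its own reconstruction rather than from the original samples, the decoder's output matches the encoder's references exactly and the quantization error does not accumulate from one sample to the next.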

H.263 supports five standard resolutions: sub-QCIF (128×96), QCIF (176×144), CIF (352×288), 4CIF (704×576), and 16CIF (1408×1152), covering a large range of spatial resolutions. Support for both sub-QCIF and QCIF formats in the decoder is mandatory, and either one of these formats must be supported by the encoder. This requirement is a compromise between high resolution and low cost.

A picture is divided into 16×16 macroblocks, consisting of four 8×8 luma blocks and two spatially aligned 8×8 chroma blocks. One or more macroblock rows are combined into a group of blocks (GOB) to enable quick resynchronization in the event of transmission errors. Compared to H.261, the GOB structure is simplified; GOB headers are optional and may be used based on the tradeoff between error resilience and coding efficiency.

For improved inter-picture prediction, the H.263 decoder has a block motion compensation capability, while its use in the encoder is optional. One motion vector is transmitted per macroblock. Half-pel precision is used for motion compensation, in contrast to H.261, where full-pel precision and a loop filter are used. The motion vectors, together with the transform coefficients, are transmitted after variable-length coding. The bit rate of the coded video may be controlled by preprocessing or by varying the following encoder parameters: quantizer scale size, mode selections, and picture rate.

In addition to the core coding algorithm described above, H.263 includes four negotiable coding options, described below. The first three options are used to improve inter-picture prediction, while the fourth is related to lossless entropy coding. The coding options increase the complexity of the encoder but improve picture quality, thereby allowing a tradeoff between picture quality and complexity.

  • Unrestricted motion vector (UMV) mode: In the UMV mode, motion vectors are allowed to point outside the coded picture area, enabling a much better prediction, particularly when a reference macroblock is partly located outside the picture area and part of it is not available for prediction. Those unavailable pixels would normally be predicted using the edge pixels instead. However, this mode allows utilization of the complete reference macroblock, producing a gain in quality, especially for the smaller picture formats when there is motion near the picture boundaries. Note that, for the sub-QCIF format, about 50 percent of all the macroblocks are located at or near the boundary.

  • Advanced prediction (AP) mode: In this optional mode, the overlapping block motion compensation (OBMC) is used for luma, resulting in a reduction in blocking artifacts and improvement in subjective quality. For some macroblocks, four 8×8 motion vectors are used instead of a 16×16 vector, providing better prediction at the expense of more bits.

  • PB-frames (PB) mode: The principal purpose of the PB-frames mode is to increase the frame rate without significantly increasing the bit rate. A PB-frame consists of two pictures coded as one unit. The P-picture is predicted from the last decoded P-picture, and the B-picture is predicted both from the last and from the current P-pictures. Although the names “P-picture” and “B-picture” are adopted from MPEG, B-pictures in H.263 serve an entirely different purpose. The quality of the B-pictures is intentionally kept low, in particular to minimize the overhead of bi-directional prediction, since such overhead is significant for low-bit-rate applications. B-pictures use only 15 to 20 percent of the allocated bit rate, but result in a better subjective impression of smooth motion.

  • Syntax-based arithmetic coding (SAC) mode: H.263 is optimized for very low bit rates. As such, it allows the use of optional syntax-based arithmetic coding mode, which replaces the Huffman codes with arithmetic codes for variable-length coding. While Huffman codes must use an integral number of bits, arithmetic coding removes this restriction, thus producing a lossless coding with reduced bit rate.
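
The observation in the UMV item that smaller formats have a larger share of macroblocks at the picture boundary can be checked with a short calculation. The sketch counts only macroblocks in the outermost ring of the picture, which is one possible reading of "at or near the boundary"; the function name is hypothetical.

```python
def boundary_fraction(width, height, mb_size=16):
    """Fraction of macroblocks that touch the picture boundary."""
    cols, rows = width // mb_size, height // mb_size
    interior = max(cols - 2, 0) * max(rows - 2, 0)   # macroblocks not on the edge
    return (cols * rows - interior) / (cols * rows)


for name, (w, h) in {"sub-QCIF": (128, 96), "QCIF": (176, 144), "CIF": (352, 288)}.items():
    print(name, round(boundary_fraction(w, h), 3))
```

A sub-QCIF picture is only 8×6 macroblocks, so 24 of its 48 macroblocks lie on the edge; the share drops to roughly a third for QCIF and a fifth for CIF, which is why the gain from unrestricted motion vectors is most visible in the smaller formats.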

The video bitstream of H.263 is arranged in a hierarchical structure composed of the following layers: picture layer, group of blocks layer, macroblock layer, and block layer. Each coded picture consists of a picture header followed by coded picture data arranged as groups of blocks. Once the transmission of the pictures is completed, an end-of-sequence (EOS) code and, if needed, stuffing bits (ESTUF) are transmitted. There are some optional elements in the bitstream. For example, the temporal reference of B-pictures (TRB) and the quantizer parameter (DBQUANT) are only available if the picture type (PTYPE) indicates a PB-frame. For P-pictures, a quantizer parameter PQUANT is transmitted.

The GOB layer consists of a GOB header followed by the macroblock data. The first GOB header in each picture is skipped, while for other GOBs, a header is optional and is used based on available bandwidth. Group stuffing (GSTUF) may be necessary for a GOB start code (GBSC). Group number (GN), GOB frame ID (GFID), and GOB quantizer (GQUANT) can be present in the GOB header.

Each macroblock consists of a macroblock header followed by the coded block data. A coded macroblock is indicated by a flag called COD; for I-pictures, all the macroblocks are coded. A macroblock type and coded block pattern for chroma (MCBPC) are present when indicated by COD or when PTYPE indicates an I-picture. A macroblock mode for B-pictures (MODB) is present for non-intra macroblocks for PB-frames. The luma coded block pattern (CBPY), and the codes for the differential quantizer (DQUANT) and motion vector data (MVD or MVD2-4 for advanced prediction), may be present according to MCBPC. The CBP and motion vector data for B-blocks (CBPB and MVDB) are present only if indicated by the B-picture macroblock mode (MODB). As mentioned before, in the normal mode a macroblock consists of four luma and two chroma blocks; however, in PB-frames mode a macroblock can be thought of as containing 12 blocks. The block structure is made up of intra DC followed by the transform coefficients (TCOEF). For intra macroblocks, intra DC is sent for every P-block in the macroblock. Figure 3-8 shows the structure of various H.263 layers.

Figure 3-8. Structure of various layers in H.263 bitstream

MPEG-4 (Part 2)

MPEG-4, formally the standard ISO/IEC 14496, was ratified by ISO/IEC in March 1999 as the standard for multimedia data representation and coding. In addition to video and audio coding and multiplexing, MPEG-4 addresses coding of various two- or three-dimensional synthetic media and flexible representation of audio-visual scene and composition. As the usage of multimedia developed and diversified, the scope of MPEG-4 was extended from its initial focus on very low bit-rate coding of limited audio-visual materials to encompass new multimedia functionalities.

Unlike pixel-based treatment of video in MPEG-1 or MPEG-2, MPEG-4 supports content-based communication, access, and manipulation of digital audio-visual objects, for real-time or non-real-time interactive or non-interactive applications. MPEG-4 offers extended functionalities and improves upon the coding efficiency provided by previous standards. For instance, it supports variable pixel depth, object-based transmission, and a variety of networks including wireless networks and the Internet. Multimedia authoring and editing capabilities are particularly attractive features of MPEG-4, with the promise of replacing existing word processors. In a sense, H.263 and MPEG-2 are embedded in MPEG-4, ensuring support for applications such as digital TV and videophone, while it is also used for web-based media streaming.

MPEG-4 distinguishes itself from earlier video coding standards in that it introduces object-based representation and coding methodology of real or virtual audio-visual (AV) objects. Each AV object has its local 3D+T coordinate system serving as a handle for the manipulation of time and space. Either the encoder or the end-user can place an AV object in a scene by specifying a co-ordinate transformation from the object’s local co-ordinate system into a common, global 3D+T co-ordinate system, known as the scene co-ordinate system. The composition feature of MPEG-4 makes it possible to perform bitstream editing and authoring in compressed domain.

One or more AV objects, including their spatio-temporal relationships, are transmitted from an encoder to a decoder. At the encoder, the AV objects are compressed, error-protected, multiplexed, and transmitted downstream. At the decoder, these objects are demultiplexed, error corrected, decompressed, composited, and presented to an end user. The end user is given an opportunity to interact with the presentation. Interaction information can be used locally or can be transmitted upstream to the encoder.

The transmitted stream can either be a control stream containing connection setup, the profile (subset of encoding tools), and class definition information, or be a data stream containing all other information. Control information is critical, and therefore it must be transmitted over reliable channels; but the data streams can be transmitted over various channels with different quality of service.

Part 2 of the standard deals with video compression. As the need to support various profiles and levels was growing, Part 10 of the standard was introduced to handle such demand, which soon became more important and commonplace in the industry than Part 2. However, MPEG-4 Part 10 can be considered an independent standardization effort as it does not provide backward compatibility with MPEG-4 Part 2. MPEG-4 Part 10, also known as advanced video coding (AVC), is discussed in the next section.

MPEG-4 Part 2 is an object-based hybrid natural and synthetic coding standard. (For simplicity, we will refer to MPEG-4 Part 2 simply as MPEG-4 in the following discussion.) The structure of the MPEG-4 video is hierarchical in nature. At the top layer is a video session (VS) composed of one or more video objects (VO). A VO may consist of one or more video object layers (VOL). Each VOL consists of an ordered time sequence of snapshots, called video object planes (VOP). The group of video object planes (GOV) layer is an optional layer between the VOL and the VOP layer. The bitstream can have any number of the GOV headers, and the frequency of the GOV header is an encoder issue. Since the GOV header indicates the absolute time, it may be used for random access and error-recovery purposes.

The video encoder is composed of a number of encoders and corresponding decoders, each dedicated to a separate video object. The reconstructed video objects are composited together and presented to the user. The user interaction with the objects such as scaling, dragging, and linking can be handled either in the encoder or in the decoder.

In order to describe arbitrarily shaped VOPs, MPEG-4 defines a VOP by means of a bounding rectangle called a VOP window. The video object is circumscribed by the tightest VOP window, such that a minimum number of image macroblocks are coded. Each VO consists of three main functions: shape coding, motion compensation, and texture coding. In the event of a rectangular VOP, the MPEG-4 encoder structure is similar to that of the MPEG-2 encoder, and shape coding can be skipped. Figure 3-9 shows the structure of a video object encoder.

Figure 3-9. Video object encoder structure in MPEG-4

The shape information of a VOP is referred to as the alpha plane in MPEG-4. The alpha plane has the same format as the luma, and its data indicates whether or not the relevant pixels are within a video object. The shape coder compresses the alpha plane. Binary alpha planes are encoded by modified content-based arithmetic encoding (CAE), while gray-scale alpha planes are encoded by motion-compensated DCT, similar to texture coding. The macroblocks that lie completely outside the object (transparent macroblocks) are not processed for motion or texture coding; no overhead is required to indicate this mode, since the transparency information can be obtained from shape coding.

Motion estimation and compensation are used to reduce temporal redundancies. A padding technique is applied on the reference VOP that allows polygon matching instead of block matching for rectangular images. Padding methods aim at extending arbitrarily shaped image segments to a regular block grid by filling in the missing data corresponding to signal extrapolation such that common block-based coding techniques can be applied. In addition to the basic motion compensation technique, unrestricted motion compensation, advanced prediction mode, and bi-directional motion compensation are supported by MPEG-4 video to obtain a significant improvement in quality at the expense of very little increased complexity.

The intra and residual data after motion compensation of VOPs are coded using a block-based DCT scheme, similar to previous standards. Macroblocks that lie completely inside the VOP are coded using a technique identical to H.263; the region outside the VOP within the contour macroblocks (i.e., macroblocks with an object edge) can either be padded for regular DCT transformation or can use shape adaptive DCT (SA-DCT). Transparent blocks are skipped and are not coded in the bitstream.

MPEG-4 supports scalable coding of video objects in spatial and temporal domains, and provides error resilience across various media. Four major tools, namely video packet re-synchronization, data partitioning, header extension code, and reversible VLC, provide loss-resilience properties such as resynchronization, error detection, data recovery, and error concealment.


The Advanced Video Coding (AVC), also known as the ITU-T H.264 standard (ISO/IEC 14496-10), is currently the most common video compression format used in the industry for video recording and distribution. It is also known as MPEG-4 Part 10. The AVC standard was ratified in 2003 by the Joint Video Team (JVT) of the ITU-T Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG) standardization organizations. One of the reasons the AVC standard is so well known is that it is one of the three compression standards for Blu-ray (the others being MPEG-2 and VC-1), and it is also widely used by Internet streaming applications like YouTube and iTunes, software applications like Flash Player, software frameworks like Silverlight, and various HDTV broadcasts over terrestrial, cable, and satellite channels.

The AVC video coding standard has the same basic functional elements as previous standards MPEG-4 Part 2, MPEG-2, H.263, MPEG-1, and H.261. It uses a lossy predictive, block-based hybrid DPCM coding technique. This involves transform for reduction of spatial correlation, quantization for bit-rate control, motion-compensated prediction for reduction of temporal correlation, and entropy encoding for reduction of statistical correlation. However, with a goal of achieving better coding performance than previous standards, AVC incorporates changes in the details of each functional element by including in-picture prediction, a new 4×4 transform, multiple reference pictures, variable block sizes, quarter-pel precision for motion compensation, a deblocking filter, and improved entropy coding. AVC also introduces coding concepts such as generalized B-slices, which support not only a bidirectional forward-backward prediction pair but also forward-forward and backward-backward prediction pairs. There are several other tools, including direct modes and weighted prediction, defined by AVC to obtain a very good prediction of the source signal so that the error signal has minimum energy. These tools help AVC perform significantly better than prior standards for a variety of applications. For example, compared to MPEG-2, AVC typically obtains the same quality at half the bit rate, especially for high-resolution contents coded at high bit rates.

However, the improved coding efficiency comes at the expense of additional complexity in the encoder and decoder. To compensate, AVC uses some methods to reduce the implementation complexity; for example, a multiplier-free integer transform is introduced, in which the multiplication operations for the transform and quantization are combined. Further, to facilitate applications over noisy channels and in error-prone environments such as wireless networks, AVC provides several tools for resilience against network errors. These include flexible macroblock ordering (FMO), switched slices, redundant slices, and data partitioning.

The coded AVC bitstream has two layers, the network abstraction layer (NAL) and the video coding layer (VCL). The NAL abstracts the VCL data to help transmission on a variety of communication channels or storage media. The NAL specifies both byte-stream and packet-based formats for NAL units. The byte-stream format defines unique start codes for the applications that deliver the NAL unit stream as an ordered stream of bytes or bits, encapsulated in network packets such as MPEG-2 transport streams. Previous standards contained header information about slice, picture, and sequence at the start of each element, where loss of these critical elements in a lossy environment would render the rest of the element data useless. AVC resolves this problem by keeping the sequence and picture parameter settings in the non-VCL NAL units that are transmitted with greater error protection. The VCL unit contains the core video coded data, consisting of video sequence, picture, slice, and macroblock.
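To make the byte-stream format more concrete, the following minimal sketch splits a byte stream on the three-byte start code prefix 0x000001 and reads the one-byte NAL unit header. The function names are hypothetical, the sample bytes are made up for illustration, and emulation-prevention bytes inside the payload are deliberately not handled.

```python
# Illustrative sketch only: helper names are hypothetical and the payload bytes
# in the usage example are invented. Emulation-prevention bytes are ignored.

NAL_TYPE_NAMES = {1: "non-IDR slice", 5: "IDR slice", 7: "SPS", 8: "PPS"}

def split_nal_units(stream: bytes):
    """Split a byte stream into NAL units on the 0x000001 start code prefix."""
    units = []
    start = None
    i = 0
    while i + 3 <= len(stream):
        if stream[i:i + 3] == b"\x00\x00\x01":
            if start is not None:
                # Zero bytes before a start code are padding, not payload.
                units.append(stream[start:i].rstrip(b"\x00"))
            start = i + 3
            i += 3
        else:
            i += 1
    if start is not None:
        units.append(stream[start:].rstrip(b"\x00"))
    return units

def parse_nal_header(unit: bytes):
    """Decode the first byte of a NAL unit into its three header fields."""
    first = unit[0]
    return {
        "forbidden_zero_bit": first >> 7,
        "nal_ref_idc": (first >> 5) & 0x03,
        "nal_unit_type": first & 0x1F,
    }

if __name__ == "__main__":
    demo = (b"\x00\x00\x00\x01\x67\x42\x00\x1e"   # parameter set (type 7)
            b"\x00\x00\x01\x68\xce\x38\x80"       # parameter set (type 8)
            b"\x00\x00\x01\x65\x88\x84")          # coded slice (type 5)
    for unit in split_nal_units(demo):
        header = parse_nal_header(unit)
        print(NAL_TYPE_NAMES.get(header["nal_unit_type"], "other"), header)
```

Because the parameter sets travel in their own NAL units, a receiver can identify them from the header alone and treat them with extra care.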

Profile and Level

A profile is a set of features of the coding algorithm that are identified to meet certain requirements of the applications. This means that some features of the coding algorithm are not supported in some profiles. The standard defines 21 sets of capabilities, targeting specific classes of applications.

For non-scalable two-dimensional video applications, the following are the important profiles:

  • Constrained Baseline Profile: Aimed at low-cost mobile and video communication applications, the Constrained Baseline Profile uses the subset of features that are in common with the Baseline, Main, and High Profiles.

  • Baseline Profile: This profile is targeted for low-cost applications that require additional error resiliency. As such, on top of the features supported in the Constrained Baseline Profile, it has three features for enhanced robustness. However, in practice, Constrained Baseline Profile is more commonly used than Baseline Profile. The bitstreams for these two profiles share the same profile identifier code value.

  • Extended Profile: This is intended for video streaming. It has higher compression capability and more robustness than Baseline Profile, and it supports server stream switching.

  • Main Profile: Main profile is used for standard-definition digital TV broadcasts, but not for HDTV broadcasts, for which High Profile is primarily used.

  • High Profile: It is the principal profile for HDTV broadcast and for disc storage, such as the Blu-ray Disc storage format.

  • Progressive High Profile: This profile is similar to High profile, except that it does not support the field coding tools. It is intended for applications and displays using progressive scanned video.

  • High 10 Profile: Mainly for premium contents with 10-bit per sample decoded picture precision, this profile adds 10-bit precision support to the High Profile.

  • High 4:2:2 Profile: This profile is aimed at professional applications that use interlaced video. On top of the High 10 Profile, it adds support for the 4:2:2 chroma subsampling format.

  • High 4:4:4 Predictive Profile: Further to the High 4:2:2 Profile, this profile supports up to 4:4:4 chroma sampling and up to 14 bits per sample precision. It additionally supports lossless region coding and the coding of each picture as three separate color planes.

In addition to the above profiles, the Scalable Video Coding (SVC) extension defines five more scalable profiles: Scalable Constrained Baseline, Scalable Baseline, Scalable High, Scalable Constrained High, and Scalable High Intra profiles. Also, the Multi-View Coding (MVC) extension adds three more profiles for three-dimensional video—namely Stereo High, Multiview High, and Multiview Depth High profiles. Furthermore, four intra-frame-only profiles are defined for professional editing applications: High 10 Intra, High 4:2:2 Intra, High 4:4:4 Intra, and CAVLC 4:4:4 Intra profiles.

Levels are constraints that specify the degree of decoder performance needed for a profile; for example, a level designates the maximum picture resolution, bit rate, frame rate, and so on that the decoder must adhere to within a profile. Table 3-2 shows some examples of level restrictions; for a full description, see the standard specification (Footnote 2).

Table 3-2. Examples of Level Restrictions in AVC

Picture Structure

The video sequence has frame pictures or field pictures. The pictures usually comprise three sample arrays, one luma and two chroma sample arrays (RGB arrays are supported in High 4:4:4 Profile only). AVC supports either progressive-scan or interlaced-scan, which may be mixed in the same sequence. Baseline Profile is limited to progressive scan.

Pictures are divided into slices. A slice is a sequence of a flexible number of macroblocks within a picture. Multiple slices can form slice groups; a macroblock-to-slice-group mapping determines which slice group includes a particular macroblock. In the 4:2:0 format, each macroblock has one 16×16 luma and two 8×8 chroma sample arrays, while in the 4:2:2 and 4:4:4 formats, the chroma sample arrays are 8×16 and 16×16, respectively. A macroblock may be coded as a single 16×16 partition or divided into smaller partitions with various shapes such as 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4. These partitions are used for prediction purposes. Figure 3-10 shows the different partitions.

Figure 3-10. Macroblock and block partitions in AVC

Coding Algorithm

In the AVC algorithm, the encoder may select between intra and inter coding for various partitions of each picture. Intra coding (I) provides random access points in the bitstream where decoding can begin and continue correctly. Intra coding uses various spatial prediction modes to reduce spatial redundancy within a picture. In addition, AVC defines inter coding that uses motion vectors for block-based inter-picture prediction to reduce temporal redundancy. Inter coding is of two types: predictive (P) and bi-predictive (B). Inter coding is more efficient as it uses inter prediction of each block of pixels relative to some previously decoded pictures. Prediction is obtained from a deblocked version of previously reconstructed pictures that are used as references for the prediction. The deblocking filter is used in order to reduce the blocking artifacts at the block boundaries. Motion vectors and intra prediction modes may be specified for a variety of block sizes in the picture. Further compression is achieved by applying a transform to the prediction residual to remove spatial correlation in the block before it is quantized. The intra prediction modes, the motion vectors, and the quantized transform coefficient information are encoded using an entropy code such as context-adaptive variable length codes (CAVLC) or context-adaptive binary arithmetic codes (CABAC). A block diagram of the AVC coding algorithm, showing the encoder and decoder blocks, is presented in Figure 3-11.

Figure 3-11. The AVC codec block diagram

Intra Prediction

In the previous standards, intra-coded macroblocks were coded independently without any reference, and it was necessary to use intra macroblocks whenever a good prediction was not available for a predicted macroblock. As intra macroblocks use more bits than the predicted ones, this was often less efficient for compression. To alleviate this drawback (that is, to reduce the number of bits needed to code an intra picture), AVC introduced intra prediction, whereby a prediction block is formed based on previously reconstructed blocks belonging to the same picture. Fewer bits are needed to code the residual signal between the current and the predicted blocks, compared to coding the current block itself.

The size of an intra prediction block for the luma samples may be 4×4, 8×8, or 16×16. There are several intra prediction modes, out of which one mode is selected and coded in the bitstream. AVC defines a total of nine intra prediction modes for 4×4 and 8×8 luma blocks, four modes for a 16×16 luma block, and four modes for each chroma block. Figure 3-12 shows an example of intra prediction modes for a 4×4 block. In this example, [a, b, . . ., p] are the predicted samples of the current block, which are predicted from already decoded left and above blocks with samples [A, B, . . ., M]; the arrows show the direction of the prediction, with each direction indicated as an intra prediction mode in the coded bitstream. For mode 0 (vertical), the prediction is formed by extrapolation of the samples above, namely [A, B, C, D]. Similarly, for mode 1 (horizontal), the left samples [I, J, K, L] are extrapolated. For mode 2 (DC prediction), the average of the above and left samples is used as the prediction. For mode 3 (diagonal down left), mode 4 (diagonal down right), mode 5 (vertical right), mode 6 (horizontal down), mode 7 (vertical left), and mode 8 (horizontal up), the predicted samples are formed from a weighted average of the prediction samples A through M.

Figure 3-12. An example of intra prediction modes for a 4×4 block (only a few modes are shown as examples)
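The first three 4×4 modes described above can be sketched in a few lines. This is a simplified illustration, not an encoder implementation: the function names are hypothetical, only modes 0 to 2 are covered, both neighbors are assumed to be available, and the mode is chosen by the sum of absolute differences (SAD) alone, whereas a practical encoder would also weigh the cost of signaling the mode.

```python
# Simplified sketch of 4x4 intra prediction modes 0 (vertical), 1 (horizontal),
# and 2 (DC). 'above' holds samples [A, B, C, D]; 'left' holds [I, J, K, L].
# Function names are hypothetical; directional modes 3 to 8 are omitted.

def predict_4x4(mode, above, left):
    if mode == 0:                      # vertical: copy the row above downwards
        return [list(above) for _ in range(4)]
    if mode == 1:                      # horizontal: copy the left column across
        return [[left[r]] * 4 for r in range(4)]
    if mode == 2:                      # DC: rounded mean of the eight neighbors
        dc = (sum(above) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("only modes 0, 1, and 2 are sketched here")

def best_mode(block, above, left):
    """Pick the mode with the smallest SAD against the source block."""
    def sad(pred):
        return sum(abs(block[r][c] - pred[r][c])
                   for r in range(4) for c in range(4))
    return min((0, 1, 2), key=lambda m: sad(predict_4x4(m, above, left)))

if __name__ == "__main__":
    above = [10, 20, 30, 40]
    left = [12, 14, 16, 18]
    block = [[10, 20, 30, 40] for _ in range(4)]   # vertical stripes
    print(best_mode(block, above, left))
    print(predict_4x4(2, above, left)[0])
```

For a block of vertical stripes that continue the row above, the vertical mode reproduces the block exactly, so the residual to be coded is zero.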

Inter Prediction

Inter prediction reduces temporal correlation by using motion estimation and compensation. As mentioned before, AVC partitions the picture into several shapes from 16×16 down to 4×4 for such predictions. The motion compensation results in reduced information in the residual signal, although for the smaller partitions, an overhead of bits is incurred for motion vectors and for signaling the partition type.

Inter prediction can be applied to blocks as small as 4×4 luma samples with up to a quarter-pixel (a.k.a. quarter-pel) motion vector accuracy. Sub-pel motion compensation gives better compression efficiency than using integer-pel alone; while quarter-pel is better than half-pel, it involves more complex computation. For luma, the half-pel samples are generated first and are interpolated from neighboring integer-pel samples using a six-tap finite impulse response (FIR) filter with weights (1, -5, 20, 20, -5, 1)/32. With the half-pel samples available, quarter-pel samples are produced using bilinear interpolation between neighboring half- or integer-pel samples. For 4:2:0 chroma, eighth-pel samples correspond to quarter-pel luma, and are obtained from linear interpolation of integer-pel chroma samples. Sub-pel motion vectors are differentially coded relative to predictions from neighboring motion vectors. Figure 3-13 shows the location of sub-pel predictions relative to full-pel.

Figure 3-13. Locations of sub-pel prediction
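As a rough illustration of the luma interpolation described above, the sketch below applies the six-tap weights along a single row of integer-pel samples and then averages two neighbors to obtain a quarter-pel value. It is one-dimensional only, assumes 8-bit samples, and uses hypothetical function names; positions that need both horizontal and vertical filtering are not covered.

```python
# One-dimensional sketch of half-pel and quarter-pel luma interpolation.
# Assumes 8-bit samples; function names are hypothetical.

TAPS = (1, -5, 20, 20, -5, 1)

def clip8(value):
    """Clamp to the 8-bit sample range."""
    return max(0, min(255, value))

def half_pel(row, x):
    """Half-pel sample between row[x] and row[x + 1].

    Needs two integer samples to the left of x and three to the right.
    """
    acc = sum(t * row[x - 2 + k] for k, t in enumerate(TAPS))
    return clip8((acc + 16) >> 5)      # divide by 32 with rounding

def quarter_pel(p, q):
    """Quarter-pel sample as the rounded average of two neighboring samples."""
    return (p + q + 1) >> 1

if __name__ == "__main__":
    edge = [0, 0, 0, 255, 255, 255]
    h = half_pel(edge, 2)              # halfway across the step edge
    print(h, quarter_pel(edge[2], h))
```

The negative taps are what distinguish this filter from plain bilinear averaging: they give a sharper response around edges at the cost of more arithmetic per sample.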

For inter prediction, reference pictures can be used from a list of previously reconstructed pictures, which are stored in the picture buffer. The distance of a reference picture from the current picture in display order determines whether it is a short-term or a long-term reference. Long-term references help increase the motion search range by using multiple decoded pictures. As a limited size of picture buffer is used, some pictures may be marked as unused for reference, and may be deleted from the reference list in a controlled manner to keep the memory size practical.

Transform and Quantization

The AVC algorithm uses block-based transform for spatial redundancy removal, as the residual signal from intra or inter prediction is divided into 4×4 or 8×8 (High profile only) blocks, which are converted to transform domain before they are quantized. The use of 4×4 integer transform in AVC results in reduced ringing artifacts compared to those produced by previous standards using fixed 8×8 DCT. Also, multiplications are not necessary at this smaller size. AVC introduced the concept of hierarchical transform structure, in which the DC components of neighboring 4×4 luma transforms are grouped together to form a 4×4 block, which is transformed again using a Hadamard transform for further improvement in compression efficiency.

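Both 4×4 and 8×8 transforms in AVC are integer transforms based on DCT. The integer transform, post-scaling, and quantization are grouped together in the encoder, while in the decoder the sequence is inverse quantization, pre-scaling, and inverse integer transform. For a deeper understanding of the process, consider the matrix H below.

\[
H = \begin{bmatrix} a & a & a & a \\ b & c & -c & -b \\ a & -a & -a & a \\ c & -b & b & -c \end{bmatrix}
\]

A 4×4 DCT can be done using this matrix and the formula \(X = H F H^{T}\), where \(H^{T}\) is the transpose of the matrix H, F is the input 4×4 data block, and X is the resulting 4×4 transformed block. For DCT, the variables a, b, and c are as follows:

\[
a = \frac{1}{2}, \qquad b = \sqrt{\frac{1}{2}}\,\cos\!\left(\frac{\pi}{8}\right), \qquad c = \sqrt{\frac{1}{2}}\,\cos\!\left(\frac{3\pi}{8}\right)
\]

The AVC algorithm simplifies these coefficients with approximations, and still maintains the orthogonality property by using:

\[
a = \frac{1}{2}, \qquad b = \sqrt{\frac{2}{5}}, \qquad c = \frac{b}{2}
\]

Further simplification is made to avoid multiplication by combining the transform with the quantization step, using a scaled transform \(X = (\hat{H} F \hat{H}^{T}) \otimes SF\), where

\[
\hat{H} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix},
\qquad
SF = \begin{bmatrix} a^{2} & \frac{ab}{2} & a^{2} & \frac{ab}{2} \\ \frac{ab}{2} & \frac{b^{2}}{4} & \frac{ab}{2} & \frac{b^{2}}{4} \\ a^{2} & \frac{ab}{2} & a^{2} & \frac{ab}{2} \\ \frac{ab}{2} & \frac{b^{2}}{4} & \frac{ab}{2} & \frac{b^{2}}{4} \end{bmatrix}
\]

and SF is a 4×4 matrix representing the scaling factors needed for orthonormality, and ⊗ represents element-by-element multiplication. The transformed and quantized signal Y with components \(Y_{i,j}\) is obtained by appropriate quantization using one of the 52 available quantizer levels (a.k.a. quantization step size, Qstep) as follows:

\[
Y_{i,j} = \operatorname{round}\!\left(\frac{X_{i,j}}{Q_{step}}\right) = \operatorname{round}\!\left(\frac{(\hat{H} F \hat{H}^{T})_{i,j}\; SF_{i,j}}{Q_{step}}\right)
\]

In the decoder, the received signal Y is scaled with Qstep and the corresponding inverse scaling factors \(SF_{v}\), as the inverse quantization and a part of the inverse transform, to obtain the rescaled block X′ with components \(X'_{i,j}\):

\[
X'_{i,j} = Y_{i,j}\; Q_{step}\; (SF_{v})_{i,j},
\qquad
SF_{v} = \begin{bmatrix} a^{2} & ab & a^{2} & ab \\ ab & b^{2} & ab & b^{2} \\ a^{2} & ab & a^{2} & ab \\ ab & b^{2} & ab & b^{2} \end{bmatrix}
\]

The 4×4 reconstructed block is \(F' = \hat{H}_{v}^{T}\, X'\, \hat{H}_{v}\), where the integer inverse transform matrix is given by:

\[
\hat{H}_{v} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \frac{1}{2} & -\frac{1}{2} & -1 \\ 1 & -1 & -1 & 1 \\ \frac{1}{2} & -1 & 1 & -\frac{1}{2} \end{bmatrix}
\]

In addition, in the hierarchical transform approach for 16×16 intra mode, the 4×4 luma intra DC coefficients are further transformed using a Hadamard transform:

\[
H_{D} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}
\]

In 4:2:0 color sampling, for the chroma DC coefficients, the transform matrix is as follows:

\[
H_{C} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\]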
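The interplay of the integer core transform, the scaling factors, and the quantization step can be checked numerically. The sketch below is a floating-point illustration under the approximations a = 1/2 and b = √(2/5), with scaling matrices built as outer products of (a, b/2, a, b/2) on the encoder side and (a, b, a, b) on the decoder side. The names are hypothetical, and a real codec would fold the scaling into integer multiplier tables rather than use floating point.

```python
import math

# Floating-point illustration of the scaled 4x4 transform with quantization.
# Names are hypothetical; real codecs use integer multiplier tables instead.

H_HAT = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]
H_V = [[1, 1, 1, 1], [1, 0.5, -0.5, -1], [1, -1, -1, 1], [0.5, -1, 1, -0.5]]
A, B = 0.5, math.sqrt(2 / 5)

def outer(v):
    return [[v[i] * v[j] for j in range(4)] for i in range(4)]

SF = outer((A, B / 2, A, B / 2))   # encoder-side scaling factors
SF_V = outer((A, B, A, B))         # decoder-side scaling factors

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def forward(block, qstep):
    """Integer core transform, then scaling and quantization in one step."""
    w = matmul(matmul(H_HAT, block), transpose(H_HAT))   # integers only
    return [[round(w[i][j] * SF[i][j] / qstep) for j in range(4)]
            for i in range(4)]

def inverse(levels, qstep):
    """Rescale the quantized levels, then apply the inverse transform."""
    x = [[levels[i][j] * qstep * SF_V[i][j] for j in range(4)]
         for i in range(4)]
    return matmul(matmul(transpose(H_V), x), H_V)

if __name__ == "__main__":
    block = [[4 * r + c for c in range(4)] for r in range(4)]
    for qstep in (0.001, 4.0):
        rec = inverse(forward(block, qstep), qstep)
        err = max(abs(rec[r][c] - block[r][c])
                  for r in range(4) for c in range(4))
        print(qstep, err)
```

With a very small step size the round trip is almost lossless, which confirms that the two scaling matrices undo each other; a larger step size shows the expected quantization loss.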

In order to increase the compression gain provided by run-length coding, two scanning orders are defined to arrange the quantized coefficients before entropy coding, namely zigzag scan and field scan, as shown in Figure 3-14. While zigzag scanning is suitable for progressively scanned sources, alternate field scanning helps interlaced contents.

Figure 3-14. Zigzag and alternate field scanning orders for 4×4 blocks
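The benefit of scanning for run-length coding is easy to see with a small example. The sketch below generates the zigzag order for a 4×4 block by walking its anti-diagonals in alternating directions and then counts the trailing zeros of a scanned block. Only the zigzag order is shown, and the function names are hypothetical.

```python
# Sketch of zigzag scanning for a 4x4 block of quantized coefficients.
# Function names are hypothetical; the alternate field scan is not shown.

def zigzag_order(n=4):
    """Return raster indices (row * n + col) in zigzag order."""
    order = []
    for s in range(2 * n - 1):                 # s = row + col of a diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:                         # even diagonals run upwards
            rows = reversed(rows)
        order.extend(r * n + (s - r) for r in rows)
    return order

def scan(block):
    """Flatten a 4x4 block into a 16-element list in zigzag order."""
    flat = [value for row in block for value in row]
    return [flat[i] for i in zigzag_order(4)]

def trailing_zeros(coeffs):
    count = 0
    for value in reversed(coeffs):
        if value != 0:
            break
        count += 1
    return count

if __name__ == "__main__":
    block = [[7, 3, 0, 0],
             [2, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
    scanned = scan(block)
    print(scanned, trailing_zeros(scanned))
```

Because the nonzero coefficients cluster near the top-left corner after the transform, the scan groups them at the front and leaves one long run of zeros at the end.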

Entropy Coding

Earlier standards provided entropy coding using fixed tables of variable-length codes, where the tables were predefined by the standard based on the probability distributions of a set of generic videos. It was not possible to optimize those Huffman tables for specific video sources. In contrast, AVC uses different VLCs to find a more appropriate code for each source symbol based on the context characteristics. Syntax elements other than the residual data are encoded using the Exponential-Golomb codes. The residual data is rearranged through zigzag or alternate field scanning, and then coded using context-adaptive variable length codes (CAVLC) or, optionally for Main and High profiles, using context-adaptive binary arithmetic codes (CABAC). Compared to CAVLC, CABAC provides higher coding efficiency at the expense of greater complexity.
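For readers unfamiliar with Exponential-Golomb codes, here is a minimal sketch of the zero-order unsigned and signed variants, operating on strings of '0' and '1' characters for readability. The function names are hypothetical, and a real bitstream writer would pack bits rather than build strings.

```python
# Minimal sketch of zero-order Exponential-Golomb coding on bit strings.
# Function names are hypothetical; a real codec would pack bits into bytes.

def ue_encode(k):
    """Unsigned code: (k + 1) in binary, preceded by one '0' per extra bit."""
    if k < 0:
        raise ValueError("unsigned Exp-Golomb needs a non-negative value")
    bits = bin(k + 1)[2:]
    return "0" * (len(bits) - 1) + bits

def se_encode(v):
    """Signed code: map 1, -1, 2, -2, ... to 1, 2, 3, 4, ... then encode."""
    return ue_encode(2 * v - 1 if v > 0 else -2 * v)

def ue_decode(bitstring, pos=0):
    """Decode one unsigned value at 'pos'; return (value, next position)."""
    zeros = 0
    while bitstring[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bitstring[pos + zeros:end], 2) - 1, end

if __name__ == "__main__":
    for k in range(5):
        print(k, ue_encode(k))
    print(se_encode(-1), se_encode(1))
```

Small values get short codewords, which suits syntax elements whose most likely value is zero or close to it, and no code table needs to be stored.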

CABAC uses an adaptive binary arithmetic coder, which updates the probability estimation after coding each symbol, and thus adapts to the context. The CABAC entropy coding has three main steps:

  • Binarization: Before arithmetic coding, a non-binary symbol such as a transform coefficient or motion vector is uniquely mapped to a binary sequence. This mapping is similar to converting a data symbol into a variable-length code, but in this case the binary code is further encoded by the arithmetic coder prior to transmission.

  • Context modeling: A probability model for the binarized symbol, called the context model, is selected based on previously encoded syntax elements.

  • Binary arithmetic coding: In this step, an arithmetic coder encodes each element according to the selected context model, and subsequently updates the model.
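The three steps above can be mimicked with a toy model. The sketch below is not the table-driven engine defined by the standard: it assumes a plain unary binarization, selects a context by bin position (a simplification), tracks probabilities with simple counts, and estimates the ideal arithmetic-coding cost as -log2(p) per bin instead of producing actual output bits. All names are hypothetical.

```python
import math

# Toy illustration of binarization, context modeling, and adaptive coding cost.
# This is NOT the CABAC engine of the standard; all names are hypothetical.

def unary_binarize(value):
    """Map a non-negative integer to 'value' ones followed by a zero."""
    return [1] * value + [0]

class ContextModel:
    """Adaptive probability estimate for one binary decision."""
    def __init__(self):
        self.counts = [1, 1]           # smoothed counts of zeros and ones

    def prob(self, bin_value):
        return self.counts[bin_value] / sum(self.counts)

    def update(self, bin_value):
        self.counts[bin_value] += 1

def estimate_bits(symbols, max_contexts=3):
    """Ideal cost in bits of coding the symbols with adaptive contexts."""
    contexts = {}
    total = 0.0
    for symbol in symbols:
        for index, bin_value in enumerate(unary_binarize(symbol)):
            key = min(index, max_contexts - 1)     # context chosen by bin index
            model = contexts.setdefault(key, ContextModel())
            total += -math.log2(model.prob(bin_value))   # coding cost
            model.update(bin_value)                      # adapt after coding
    return total

if __name__ == "__main__":
    mostly_zero = [0] * 40 + [1, 0, 2, 0, 0, 1]
    bins = sum(len(unary_binarize(s)) for s in mostly_zero)
    print(bins, round(estimate_bits(mostly_zero), 2))
```

For a skewed source the estimated cost falls well below one bit per bin, which is the effect that lets an adaptive arithmetic coder outperform a code with fixed, whole-bit codewords.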

Flexible Interlaced Coding

In order to provide enhanced interlaced coding capabilities, AVC supports macroblock-adaptive frame-field (MBAFF) coding and picture-adaptive frame-field (PAFF) coding techniques. In MBAFF, a macroblock pair structure is used for pictures coded as frames, allowing 16×16 macroblocks in field mode. This is in contrast to MPEG-2, where field mode processing in a frame-coded picture could only support 16×8 half-macroblocks. PAFF allows pictures coded as complete frames, with both fields combined, to be mixed with pictures coded as individual single fields.

In-Loop Deblocking

Visible and annoying blocking artifacts are produced owing to block-based transform in intra and inter prediction coding, and the quantization of the transform coefficients, especially for higher quantization scales. In an effort to mitigate such artifacts at the block boundaries, AVC provides deblocking filters, which also prevent propagation of accumulated coded noise.

A deblocking filter is not new; it was introduced in H.261 as an optional tool, and had some success in reducing temporal propagation of coded noise, as integer-pel accuracy in motion compensation alone was insufficient in reducing such noise. However, in MPEG-1 and MPEG-2, a deblocking filter was not used owing to its high complexity. Instead, the half-pel accurate motion compensation, where the half-pels were obtained by bilinear filtering of integer-pel samples, played the role of smoothing out the coded noise.

However, despite the complexity, AVC uses a deblocking filter to obtain higher coding efficiency. As it is part of the prediction loop, with the removal of the blocking artifacts from the predicted pictures, a much closer prediction is obtained, leading to a reduced-energy error signal. The deblocking filter is applied to horizontal or vertical edges of 4×4 blocks. The luma filtering is performed on four 16-sample edges, and the chroma filtering is performed on two 8-sample edges. Figure 3-15 shows the deblocking boundaries.

Figure 3-15. Deblocking along vertical and horizontal boundaries in macroblock

Error Resilience

AVC provides features for enhanced resilience to channel errors, which include NAL units, redundant slices, data partitioning, flexible macroblock ordering, and so on. Some of these features are as follows:

  • Network Abstraction Layer (NAL): By defining NAL units, AVC allows the same video syntax to be used in many network environments. In previous standards, header information was part of a syntax element, thereby exposing the entire syntax element to be rendered useless in case of erroneous reception of a single packet containing the header. In contrast, in AVC, self-contained packets are generated by decoupling information relevant to more than one slice from the media stream. The high-level crucial parameters, namely the Sequence Parameter Set (SPS) and Picture Parameter Set (PPS), are kept in NAL units with a higher level of error protection. An active SPS remains unchanged throughout a coded video sequence, and an active PPS remains unchanged within a coded picture.

  • Flexible macroblock ordering (FMO): FMO is also known as slice groups. Along with arbitrary slice ordering (ASO), this technique re-orders the macroblocks in pictures, so that losing a packet does not affect the entire picture. Missing macroblocks can be regenerated by interpolating from neighboring reconstructed macroblocks.

  • Data partitioning (DP): This is a feature providing the ability to separate syntax elements according to their importance into different packets of data. It enables the application to have unequal error protection (UEP).

  • Redundant slices (RS): This is an error-resilience feature in AVC that allows an encoder to send an extra representation of slice data, typically at lower fidelity. In case the primary slice is lost or corrupted by channel error, this representation can be used instead.


The High Efficiency Video Coding (HEVC), or the H.265 standard (ISO/IEC 23008-2), is the most recent joint video coding standard, ratified in 2013 by the ITU-T Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG) standardization organizations. It follows the earlier standard known as AVC or H.264, which was defined by the same two organizations, and was developed by their Joint Collaborative Team on Video Coding (JCT-VC) with a goal of addressing the growing popularity of ever higher resolution videos: high-definition (HD, 1920 × 1080), ultra-high definition (UHD, 4k × 2k), and beyond. In particular, HEVC addresses two key issues: increased video resolution and increased use of parallel processing architectures. As such, the HEVC algorithm has a design target of achieving twice the compression efficiency achievable by AVC.

Picture Partitioning and Structure

In earlier standards, the macroblock was the basic coding building block, containing a 16×16 luma block and, typically, two 8×8 chroma blocks for 4:2:0 color sampling. In HEVC, the analogous structure is the coding tree unit (CTU), also known as the largest coding unit (LCU), containing a luma coding tree block (CTB), corresponding chroma CTBs, and syntax elements. In a CTU, the luma block size can be 16×16, 32×32, or 64×64, as specified in the bitstream sequence parameter set. CTUs can be further partitioned into smaller square blocks using a tree structure and quad-tree signaling.

The quad-tree specifies the coding units (CU), which form the basis for both prediction and transform. The coding units in a coding tree block are traversed and encoded in Z-order. Figure 3-16 shows an example of ordering in a 64×64 CTB.

Figure 3-16. An example of ordering of coding units in a 64×64 coding tree block
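The Z-order traversal of a quad-tree is naturally recursive. The following sketch is a hypothetical illustration in which the split decision is supplied by the caller as a function, and the minimum coding unit size of 8 is an assumption made for the example rather than a value stated in the text.

```python
# Sketch of Z-order traversal of coding units inside one coding tree block.
# 'split' is a caller-supplied, hypothetical decision function; the minimum
# size of 8 is an assumption made for this illustration.

def z_order(x, y, size, split, min_size=8):
    """Yield (x, y, size) for each coding unit in Z-order."""
    if size > min_size and split(x, y, size):
        half = size // 2
        for dy in (0, half):           # top row of quadrants first
            for dx in (0, half):       # left quadrant before right quadrant
                yield from z_order(x + dx, y + dy, half, split, min_size)
    else:
        yield (x, y, size)

if __name__ == "__main__":
    # Split the 64x64 root, then split only its top-right 32x32 quadrant.
    def split(x, y, size):
        return size == 64 or (size == 32 and (x, y) == (32, 0))

    for unit in z_order(0, 0, 64, split):
        print(unit)
```

Each split visits the top-left, top-right, bottom-left, and bottom-right quadrants in that order, and the same rule is applied again inside any quadrant that is split further.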

A coding unit has one luma and two chroma coding blocks (CB), which can be further split in size and can be predicted from corresponding prediction blocks (PB), depending on the prediction type. HEVC supports variable PB sizes, ranging from 64×64 down to 4×4 samples. The prediction residual is coded using the transform unit (TU) tree structure. The luma or chroma coding block residual may be identical to the corresponding transform block (TB) or may be further split into smaller transform blocks. Transform blocks can only have square sizes 4×4, 8×8, 16×16, and 32×32. For the 4×4 transform of intra-picture prediction residuals, in addition to the regular DCT-based integer transform, an integer transform based on a form of discrete sine transform (DST) is also specified as an alternative. This quad-tree structure is generally considered the biggest contributor to the coding efficiency gain of HEVC over AVC.

HEVC simplifies coding and does not support any interlaced tool, as interlaced scanning is no longer used in displays and as interlaced video is becoming substantially less common for distribution. However, interlaced video can still be coded as a sequence of field pictures. Metadata syntax is available in HEVC to allow encoders to indicate that interlace-scanned video has been sent by coding one of the following:

  • Each field (i.e., the even or odd numbered lines of each video frame) of interlaced video as a separate picture

  • Each interlaced frame as an HEVC coded picture

This provides an efficient method of coding interlaced video without inconveniencing the decoders with a need to support a special decoding process for it.

Profiles and Levels

There are three profiles defined by HEVC: the Main profile, the Main 10 profile, and the Still picture profile, of which currently Main is the most commonly used. It requires 4:2:0 color format and imposes a few restrictions; for instance, bit depth should be 8, groups of LCUs forming rectangular tiles must be at least 256×64, and so on (tiles are elaborated later in this chapter in regard to parallel processing tools). Many levels are specified, ranging from 1 to 6.2. A Level-6.2 bitstream could support as large a resolution as 8192×4320 at 120 fps (Footnote 3).

Intra Prediction

In addition to the planar and the DC prediction modes, intra prediction supports 33 directional modes, compared to eight directional modes in H.264/AVC. Figure 3-17 shows the directional intra prediction modes.

Figure 3-17. Directional intra prediction modes in HEVC

Intra prediction in a coding unit exactly follows the TU tree such that when an intra coding unit is coded using an N×N partition mode, the TU tree is forcibly split at least once, ensuring a match between the intra coding unit and the TU tree. This means that the intra operation is always performed for sizes 32×32, 16×16, 8×8, or 4×4. Similar to AVC, intra prediction requires two one-dimensional arrays that contain the upper and left neighboring samples, as well as an upper-left sample. The arrays are twice as long as the intra block size, extending below and to the right of the block. Figure 3-18 shows an example array for an 8×8 block.

Figure 3-18. Luma intra samples and prediction structure in HEVC

Inter Prediction

For inter-picture prediction, HEVC provides two reference lists, L0 and L1, each with a capacity to hold 16 reference frames, of which a maximum of eight pictures can be unique. This implies that a list holding more than eight entries must repeat some reference pictures. Such repetition facilitates predictions from the same picture with different weights.

Motion Vector Prediction

Motion vector prediction in HEVC is quite complex, as it builds a list of candidate motion vectors and selects one of the candidates using an index into the list that is coded in the bitstream. There are two modes for motion vector prediction: merge and advanced motion vector prediction (AMVP). For each prediction unit (PU), the encoder decides which mode to use and indicates it in the bitstream with a flag. The AMVP process uses a delta motion vector coding and can produce any desired value of motion vector. HEVC subsamples the temporal motion vectors on a 16×16 grid. This means that a decoder only needs to allocate space for two motion vectors (L0 and L1) in the temporal motion vector buffer for a 16×16 pixel area.
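The storage implied by the 16×16 subsampling can be estimated with a short calculation. The following is a minimal sketch rather than anything defined by the standard; the function name and the 1920×1080 example are assumptions chosen for illustration.

```python
import math

def temporal_mv_count(width, height, grid=16, lists=2):
    """Motion vectors kept in the temporal buffer for one picture.

    One vector per reference list (L0 and L1) is stored for every
    grid x grid pixel area; partial areas at the picture edge still
    need an entry, hence the rounding up.
    """
    cols = math.ceil(width / grid)
    rows = math.ceil(height / grid)
    return cols * rows * lists

# A 1920x1080 picture covers 120 x 68 grid cells, two vectors each.
print(temporal_mv_count(1920, 1080))
```

Storing vectors on a finer grid would multiply this count; for instance, passing grid=4 to the same function gives sixteen times as many entries for pictures whose dimensions are multiples of 16.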

Motion Compensation

HEVC specifies motion vectors in a quarter-pel granularity, but uses an eight-tap filter for luma, and a four-tap eighth-pel filter for chroma. This is an improvement over the six-tap luma and bilinear (two-tap) chroma filters defined in AVC. Owing to the longer length of the eight-tap filter, three or four extra pixels must be read on each side of every block. For example, for an 8×4 block, a 15×11 pixel area needs to be read into memory, and the relative impact is greater for smaller blocks. Therefore, HEVC limits the smallest prediction unit to be uni-directional and larger than 4×4. HEVC supports weighted prediction for both uni- and bi-directional PUs. However, the weights are always explicitly transmitted in the slice header; unlike AVC, there is no implicit weighted prediction.
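The memory-bandwidth argument above can be made concrete with a small calculation. This is a sketch under the assumption that the motion vector is fractional in both directions (the worst case), so that an N-tap filter needs N - 1 extra rows and columns; the function names are illustrative only.

```python
def reference_fetch_area(block_w, block_h, taps=8):
    """Width and height of the reference area read for one block.

    An N-tap interpolation filter needs N - 1 extra samples per
    dimension (three on one side and four on the other for 8 taps).
    """
    extra = taps - 1
    return block_w + extra, block_h + extra

def fetch_overhead(block_w, block_h, taps=8):
    """Ratio of samples fetched to samples actually predicted."""
    w, h = reference_fetch_area(block_w, block_h, taps)
    return (w * h) / (block_w * block_h)

print(reference_fetch_area(8, 4))   # the 8x4 example from the text
print(fetch_overhead(8, 4))         # small block: large overhead
print(fetch_overhead(64, 64))       # large block: modest overhead
```

The overhead ratio shrinks quickly as the block grows, which is the motivation for restricting the smallest prediction units.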

Entropy Coding

In HEVC, entropy coding is performed using Context-Adaptive Binary Arithmetic Coding (CABAC) at the CTU level. The CABAC algorithm in HEVC improves upon that of AVC with a few minor enhancements. There are about half as many context-state variables as in AVC, and the initialization process is much simpler. The bitstream syntax is designed such that bypass-coded bins are grouped together as much as possible. CABAC decoding is inherently a sequential operation; therefore, parallelization or fast hardware implementation is difficult. However, it is possible to decode more than one bypass-coded bin at a time. This, together with the bypass-bin grouping, greatly facilitates parallel implementation in hardware decoders.

In-loop Deblocking and SAO

In HEVC, two filters could be applied on reconstructed pixel values: the in-loop deblocking (ILD) filter and the sample adaptive offset (SAO) filter. Either or both of the filters can be optionally applied across the tile- and slice-boundaries. The in-loop deblocking filter in HEVC is similar to that of H.264/AVC, while the SAO is a new filter and is applied following the in-loop deblock filter.

Unlike AVC, which deblocks at every 4×4 grid edge, in HEVC, deblocking is performed on the 8×8 grid only. All vertical edges in the picture are deblocked first, followed by all horizontal edges. The filter itself is similar to the one in AVC, but in the case of HEVC, only the boundary strengths 2, 1, and 0 are supported. With an 8-pixel separation between the edges, there is no dependency between them, enabling a highly parallelized implementation. For example, the vertical edge can be filtered with one thread per 8-pixel column in the picture. Chroma is only deblocked when one of the PUs on either side of a particular edge is intra-coded.

As a secondary filter after deblocking, the SAO performs a non-linear amplitude mapping by using a lookup table at the CTB level. It operates once on each pixel of the CTB, a total of 6,144 times for each 64×64 CTB in 4:2:0 format (64×64 + 32×32 + 32×32 = 6,144). For each CTB, a filter type and four offset values, ranging from -7 to 7 for 8-bit video for example, are coded in the bitstream. The encoder chooses these parameters with a view toward better matching the reconstructed and the source pictures.
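The per-CTB operation count quoted above follows from the chroma subsampling. The sketch below reproduces that arithmetic and shows how a single signed offset might be added to a sample; the function names are hypothetical, and the classification step that decides which of the four offsets applies to a given sample is deliberately left out.

```python
# Chroma subsampling factors (horizontal, vertical) per color format.
SUBSAMPLING = {"4:2:0": (2, 2), "4:2:2": (2, 1), "4:4:4": (1, 1)}

def sao_samples_per_ctb(ctb_size=64, chroma_format="4:2:0"):
    """Count the luma and chroma samples SAO visits once in one CTB."""
    sx, sy = SUBSAMPLING[chroma_format]
    luma = ctb_size * ctb_size
    chroma = 2 * (ctb_size // sx) * (ctb_size // sy)  # Cb and Cr planes
    return luma + chroma

def apply_sao_offset(sample, offset, bit_depth=8):
    """Add one signed offset and clip to the valid sample range."""
    max_value = (1 << bit_depth) - 1
    return min(max(sample + offset, 0), max_value)

print(sao_samples_per_ctb())      # 64x64 + 32x32 + 32x32
print(apply_sao_offset(253, 7))   # clipped at the 8-bit maximum
```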

Parallel Processing Syntax and Tools

There are three new features in the HEVC standard to support enhanced parallel processing capability or to modify the structure of slice data for purposes of packetization.


Tiles

There is an option to partition a picture into rectangular regions called tiles. Tiles are independently decodable regions of a picture that are encoded with some shared header information. Tiles are specified mainly to increase the parallel processing capabilities, although some error resilience may also be attributed to them. Tiles provide coarse-grain parallelism at picture and sub-picture level, and no sophisticated synchronization of threads is necessary for their use. The use of tiles would require reduced-size line buffers, which is advantageous for high-resolution video decoding on cache-constrained hardware and cheaper CPUs.

Wavefront Parallel Processing

When wavefront parallel processing (WPP) is enabled, a slice is divided into rows of CTUs. The first row is processed in a regular manner; but processing of the second row can be started only after a few decisions have been made in the first row. Similarly, processing of the third row can begin as soon as a few decisions have been made in the second row, and so on. The context models of the entropy coder in each row are inferred from those in the preceding row, with a small fixed processing lag. WPP provides fine-grained parallel processing within a slice. Often, WPP provides better compression efficiency compared to tiles, while avoiding potential visual artifacts resulting from the use of tiles.
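The row-to-row dependency in WPP can be pictured as a staggered schedule. The sketch below assumes one thread per CTU row, one CTU processed per time step, and a lag of two CTUs between consecutive rows; the lag value is an assumption for illustration, since the text only states that it is small and fixed.

```python
def wavefront_schedule(rows, cols, lag=2):
    """Earliest time step at which each CTU can be processed.

    Row r may start only after `lag` CTUs of row r - 1 are done,
    so every row begins `lag` steps after the row above it.
    """
    return [[r * lag + c for c in range(cols)] for r in range(rows)]

def wavefront_steps(rows, cols, lag=2):
    """Total time steps when every row has its own thread."""
    return (rows - 1) * lag + cols

for row in wavefront_schedule(3, 5):
    print(row)
print(wavefront_steps(3, 5), "steps instead of", 3 * 5)
```

The diagonal front visible in the printed schedule is what gives the technique its name; the saving over sequential processing grows with the number of rows that can run concurrently.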

Slice Segments and Dependent Slices

A sequence of coding tree blocks is called a slice. A picture constituting a video frame can be split into any number of slices, or the whole picture can be just one slice. In turn, each slice is split up into one or more slice segments, each in its own NAL unit. Only the first slice segment of a slice contains the full slice header, and the rest of the segments are referred to as dependent slice segments. As such, a decoder must have access to the first slice segment for successful decoding. Such division of slices allows low-delay transmission of pictures without paying the coding efficiency penalty that would otherwise have been incurred owing to many slice headers. For example, a camera can send out a slice segment belonging to the first CTB row so that a playback device on the other side of the network can start decoding before the camera sends out the next CTB row. This is useful in low-latency applications such as video conferencing.

International Standards for Video Quality

Several standards for video quality have been specified by the ITU-T and ITU-R visual quality experts groups. Although they are not coding standards, they are worth mentioning as they relate to the subject of this book. Further, as the IEEE standard 1180 relates to the accuracy of computation of the common IDCT technology used in all the aforementioned standards, it is also briefly described here.

VQEG Standards

In 1997, a small group of video quality experts from the ITU-T and ITU-R study groups formed the Visual Quality Experts Group (VQEG), with a view toward advancing the field of video quality assessment. This group investigated new and advanced subjective assessment methods and objective quality metrics and measurement techniques.

VQEG took a systematic approach to validation testing that typically includes several video databases for which objective models are needed to predict the subjective visual quality, and they defined the test plans and procedures for performing objective model validation. The initial standard was published in 2000 by the ITU-T Study Group 9 as Recommendation J.144, but none of the various methods studied outperformed the well-known peak signal to noise ratio (PSNR). In 2003, an updated version of J.144 was published, where four objective models were recommended for cable television services. A mirror standard by the ITU-R Study Group 6 was published as ITU-R Recommendation BT.1683 for baseband television services.

The most recent study aimed at digital video and multimedia applications, known as Multimedia Phase I (MM-I), was completed in 2008. The MM-I set of tests was used to validate full-reference (FR), reduced reference (RR), and no reference (NR) objective models. (These models are discussed in Chapter 4 in some detail.) Subjective video quality was assessed using a single-stimulus presentation method and the absolute category rating (ACR) scale, where the video sequences including the source video are presented for subjective evaluation one at a time without identifying the videos, and are rated independently on the ITU five-grade quality scale. A mean opinion score (MOS) and a difference mean opinion score (DMOS) were computed, where the DMOS was the average of the arithmetic difference of the scores of processed videos compared to the scores of the source videos, in order to remove any hidden reference. The software used to control and run the VQEG multimedia tests is known as AcrVQWin.Footnote 4 (Details of subjective quality assessment methods and techniques are captured in ITU-T Recommendations P.910, P.912, and so on, and are discussed in Chapter 4.)
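The MOS and DMOS computations described above reduce to simple averages. The following sketch follows the description in the text (the DMOS as the average per-viewer difference between the processed video and its hidden source reference); the function names and the sample ratings are made up for illustration and do not come from any VQEG test plan.

```python
from statistics import mean

def mos(scores):
    """Mean opinion score: average of the viewers' ACR ratings (1 to 5)."""
    return mean(scores)

def dmos(processed_scores, source_scores):
    """Average per-viewer difference between processed and source ratings.

    Both lists must be ordered by viewer so that each difference
    compares ratings given by the same person.
    """
    if len(processed_scores) != len(source_scores):
        raise ValueError("each viewer must rate both videos")
    return mean(p - s for p, s in zip(processed_scores, source_scores))

# Hypothetical ratings from four viewers.
processed = [3, 4, 3, 2]
source = [5, 5, 4, 4]
print(mos(processed), mos(source), dmos(processed, source))
```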

IEEE Standard 1180-1990

Primarily intended for use in visual telephony and similar applications where the 8x8 inverse discrete cosine transform (IDCT) results are used in a reconstruction loop, the IEEE Standard 1180-1990Footnote 5 specifies the numerical characteristics of the 8x8 IDCT. The specifications ensure the compatibility between different implementations of the IDCT. A mismatch error may arise from the different IDCT implementations in the encoders and decoders from different manufacturers; owing to the reconstruction loop in the system, the mismatch error may propagate over the duration of the video. Instead of restricting all manufacturers to a single strict implementation, the IEEE standard allows a small mismatch for a specific period of time while the video is refreshed periodically using the intra-coded frame—for example, for ITU-T visual telephony applications, the duration is 132 frames.

The standard specifies that the mismatch errors shall meet the following requirements:

  • For any pixel location, the peak error (ppe) shall not exceed 1 in magnitude.

  • For any pixel location, the mean square error (pmse) shall not exceed 0.06.

  • Overall, the mean square error (omse) shall not exceed 0.02.

  • For any pixel location, the mean error (pme) shall not exceed 0.015 in magnitude.

  • Overall, the mean error (ome) shall not exceed 0.0015 in magnitude.

  • For all-zero input, the proposed IDCT shall generate all-zero output.
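The five numerical limits listed above can be checked mechanically once the outputs of a reference IDCT and the IDCT under test are available. The sketch below is not the standard's test procedure (which also prescribes how the random input blocks are generated); it only computes the error statistics over a given set of 8×8 output blocks and compares them with the limits. All names are illustrative, and the all-zero-input requirement would need a separate check.

```python
LIMITS = {"ppe": 1, "pmse": 0.06, "omse": 0.02, "pme": 0.015, "ome": 0.0015}

def idct_mismatch_stats(reference_blocks, test_blocks):
    """Error statistics between two equally long lists of 8x8 blocks."""
    n = len(reference_blocks)
    err_sum = [[0.0] * 8 for _ in range(8)]  # per-pixel sum of errors
    sq_sum = [[0.0] * 8 for _ in range(8)]   # per-pixel sum of squared errors
    peak = 0
    for ref, tst in zip(reference_blocks, test_blocks):
        for r in range(8):
            for c in range(8):
                e = tst[r][c] - ref[r][c]
                peak = max(peak, abs(e))
                err_sum[r][c] += e
                sq_sum[r][c] += e * e
    cells = [(r, c) for r in range(8) for c in range(8)]
    return {
        "ppe": peak,
        "pmse": max(sq_sum[r][c] / n for r, c in cells),
        "omse": sum(sq_sum[r][c] for r, c in cells) / (64 * n),
        "pme": max(abs(err_sum[r][c] / n) for r, c in cells),
        "ome": abs(sum(err_sum[r][c] for r, c in cells) / (64 * n)),
    }

def meets_limits(stats):
    """True when every statistic is within its limit."""
    return all(stats[name] <= limit for name, limit in LIMITS.items())

# Toy data: 100 blocks, with a single off-by-one sample in the test output.
ref = [[[0] * 8 for _ in range(8)] for _ in range(100)]
tst = [[row[:] for row in block] for block in ref]
tst[0][0][0] = 1
print(meets_limits(idct_mismatch_stats(ref, tst)))
```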

Overview of Other Industry Standards and Formats

In addition to the international standards defined by the ISO, the ITU, or the Institute of Electrical and Electronics Engineers (IEEE), there are standards well known in the industry. A few of these standards are described below.


VC-1

Initially developed as a proprietary video format by Microsoft, the well-known VC-1 format was formally released as the SMPTE 421M video codec standard in 2006, defined by the Society of Motion Picture and Television Engineers (SMPTE). It is supported by Blu-ray, the now-obsolete HD-DVD, Microsoft Windows Media, the Microsoft Silverlight framework, the Microsoft Xbox 360 and Sony PlayStation 3 video game consoles, as well as various Windows-based video applications. Hardware decoding of the VC-1 format has been available in Intel integrated processor graphics since the second-generation Intel (R) Core (TM) processor (2011) and in Raspberry Pi (2012).

VC-1 uses the conventional DCT-based design similar to the international standards, and supports both progressive and interlaced video. The specification defines three profiles: Simple, Main, and Advanced, and up to five levels. Major tools supported by each profile are shown in Table 3-3.

Table 3-3. VC-1 Profiles and Levels


VP8

With the acquisition of On2 Technologies, Google became the owner of the VP8 video compression format. In November 2011, the VP8 data format and decoding guide was published as RFC 6386Footnote 6 by the Internet Engineering Task Force (IETF). The VP8 codec library software, libvpx, is also released by Google under a BSD license. VP8 is currently supported by Opera, Firefox, and Chrome browsers, and various hardware and software-based video codecs, including the Intel integrated processor graphics hardware.

Like many modern video compression schemes, VP8 is based on decomposition of frames into square subblocks of pixels, prediction of such subblocks using previously constructed blocks, and adjustment of such predictions using a discrete cosine transform (DCT), or in one special case, a Walsh-Hadamard transform (WHT). The system aims to reduce data rate through exploiting the temporal coherence of video signals by specifying the location of a visually similar portion of a prior frame, and the spatial coherence by taking advantage of the frequency segregation provided by DCT and WHT and the tolerance of the human visual system to moderate losses of fidelity in the reconstituted signal. Further, VP8 augments these basic concepts with, among other things, sophisticated usage of contextual probabilities, resulting in a significant reduction in data rate at a given quality.

The VP8 algorithm exclusively specifies fixed-precision integer operations, preventing the reconstructed signal from any drift that might have been caused by truncation of fractions. This helps verify the correctness of the decoder implementation and helps avoid inconsistencies between decoder implementations. VP8 works with 8-bit YUV 4:2:0 image formats, internally divisible into 16×16 macroblocks and 4×4 subblocks, with a provision to support a secondary YUV color format. There is also support of an optional upscaling of internal reconstruction buffer prior to output so that a reduced-resolution encoding can be done, while the decoding is performed at full resolution. Intra or key frames are defined to provide random access while inter frames are predicted from any prior frame up to and including the most recent key frame; no bi-directional prediction is used. In general, the VP8 codec uses three different reference frames for inter-frame prediction: the previous frame, the golden frame, and the altref frame to provide temporal scalability.

VP8 codecs apply data partitioning to the encoded data. Each encoded VP8 frame is divided into two or more partitions, comprising an uncompressed section followed by compressed header information and per-macroblock information specifying how each macroblock is predicted. The first partition contains prediction mode parameters and motion vectors for all macroblocks. The remaining partitions all contain the quantized DCT or WHT coefficients for the residuals. There can be one, two, four, or eight DCT/WHT partitions per frame, depending on encoder settings. Details of the algorithm can be found in RFC 6386.

An RTP payload specificationFootnote 7 applicable to the transmission of video streams encoded using the VP8 video codec has been proposed by Google. The RTP payload format can be used both in low-bit-rate peer-to-peer and high-bit-rate video conferencing applications. The RTP payload format takes the frame partition boundaries into consideration to improve robustness against packet loss and to facilitate error concealment. It also uses advanced reference frame structure to enable efficient error recovery and temporal scalability. Besides, marking of the non-reference frames is done to enable servers or media-aware networks to discard appropriate data as needed.

The IETF Internet Draft standard for browser application programming interface (API), called the Web Real Time Communication (WebRTC),Footnote 8 specifies that if VP8 is supported, then the bilinear and the none reconstruction filters, a frame rate of at least 10 frames per second, and resolutions ranging from 320×240 to 1280×720 must be supported. Google Chrome, Mozilla Firefox, and Opera browsers support the WebRTC APIs, intended for browser-based applications including video chatting. The Google Chrome operating system also supports WebRTC.


VP9

The video compression standard VP9 is a successor to the VP8 and is also an open standard developed by Google. The latest specification was released in February 2013 and is currently available as an Internet-DraftFootnote 9 from the IETF; the final specification is not ratified yet. The VP9 video codec is developed specifically to meet the demand for video consumption over the Internet, including professional and amateur-produced video-on-demand and conversational video content. The WebM media container format provides royalty-free, open video compression for HTML5 video, by primarily using the VP9 codec, which replaces the initially supported codec VP8.

The VP9 draft includes a number of enhancements and new coding tools that have been added to the VP8 codec to improve the coding efficiency. The new tools described in the draft include larger prediction block sizes up to 64×64, various forms of compound inter prediction, more intra prediction modes, one-eighth-pixel motion vectors, 8-tap switchable sub-pixel interpolation filters, improved motion reference generation and motion vector coding, improved entropy coding including frame-level entropy adaptation for various symbols, improved loop filtering, the incorporation of the Asymmetric Discrete Sine Transform (ADST), larger 16×16 and 32×32 DCTs, and improved frame-level segmentation. However, VP9 is currently under development and the final version of the VP9 specification may differ considerably from the draft specification, of which some features are described here.

Picture Partitioning

VP9 partitions the picture into 64 × 64 superblocks (SB), which are processed in raster-scan order, from left to right and top to bottom. Similar to HEVC, superblocks can be subdivided down to a minimum of 4×4 using a recursive quad-tree, although 8×8 block sizes are the most typical unit for mode information. In contrast to HEVC, however, the slice structure is absent in VP9.

It is desirable to be able to carry out encoding or decoding tasks in parallel, or to use multi-threading in order to effectively utilize available resources, especially on resource-constrained personal devices like smartphones. To this end, VP9 offers frame-level parallelism via the frame_parallel_mode flag and two- or four-column based tiling, while allowing loop filtering to be performed across tile boundaries. Tiling in VP9 is done in the vertical direction only, while each tile has an integral number of blocks. There is no data dependency across adjacent tiles, and any tile in a frame can be processed in any order. At the start of every tile except the last one, a 4-byte size is transmitted, indicating the size of the next tile. This allows a multi-threaded decoder to start a particular decoding thread by skipping ahead to the appropriate tile. There are four tiles per frame, facilitating data parallelization in hardware and software implementations.
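The skip-ahead behavior enabled by the 4-byte tile sizes can be sketched as follows. This is an illustration of the idea rather than a VP9 parser: it assumes the buffer holds only the tile data of one frame, that each size field is a big-endian unsigned integer giving the length of the tile that immediately follows it, and that the last tile runs to the end of the buffer. The function name is hypothetical.

```python
import struct

def locate_tiles(data, num_tiles):
    """Return (offset, size) of each tile payload inside `data`."""
    tiles = []
    pos = 0
    for index in range(num_tiles):
        if index < num_tiles - 1:
            if pos + 4 > len(data):
                raise ValueError("truncated tile size field")
            (size,) = struct.unpack_from(">I", data, pos)
            pos += 4  # step over the size field itself
        else:
            size = len(data) - pos  # last tile carries no size field
        if pos + size > len(data):
            raise ValueError("tile extends past end of buffer")
        tiles.append((pos, size))
        pos += size
    return tiles

# Three toy tiles: b"AAA", b"BB", and b"C".
frame = struct.pack(">I", 3) + b"AAA" + struct.pack(">I", 2) + b"BB" + b"C"
print(locate_tiles(frame, 3))
```

Once the offsets are known, each (offset, size) pair can be handed to a separate decoding thread, since the text notes that tiles carry no data dependency on their neighbors.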

Bitstream Features

The VP9 bitstream is usually available within a container format such as WebM, which is a subset of the Matroska Media Container. The container format is needed for random access capabilities, as VP9 does not provide start codes for this purpose. VP9 bitstreams start with a key frame containing all intra-coded blocks, which is also a decoder reset point. Unlike VP8, there is no data partitioning in VP9; all data types are interleaved in superblock coding order. This change is made to facilitate hardware implementations. However, similar to VP8, VP9 also compresses a bitstream using an 8-bit non-adaptive arithmetic coding (a.k.a. bool-coding), for which the probability model is fixed and all the symbol probabilities are known a priori before the frame decoding starts. Each probability has a known default value and is stored as a 1 byte data in the frame context. The decoder maintains four such contexts, and the bitstream signals which one to use for the frame decode. Once a frame is decoded, based upon the occurrence of certain symbols in the decoded frame, a context can be updated with new probability distributions for use with future frames, thus providing limited context adaptability.

Each coded frame has three sections:

  • Uncompressed header: A few bytes containing picture size, loop-filter strength, etc.

  • Compressed header: Bool-coded header data containing the probabilities for the frame, expressed in terms of differences from default probability values.

  • Compressed frame data: Bool-coded frame data needed to reconstruct the frame, including partition information, intra modes, motion vectors, and transform coefficients.

In addition to providing high compression efficiency with reasonable complexity, the VP9 bitstream includes features designed to support a variety of specific use-cases involving delivery and consumption of video over the Internet. For example, for communication of conversational video with low latency over an unreliable network, it is imperative to support a coding mode where decoding can continue without corruption even when arbitrary frames are lost. Specifically, the arithmetic decoder should be able to continue decoding of symbols correctly even though frame buffers have been corrupted, leading to encoder-decoder mismatch.

VP9 supports a frame level error_resilient_mode flag to allow coding modes where a manageable drift between the encoder and decoder is possible until a key frame is available or an available reference picture is selected to correct the error. In particular, the following restrictions apply under error resilient mode while a modest performance drop is expected:

  • At the beginning of each frame, the entropy coding context probabilities are reset to defaults, preventing propagation of forward or backward updates.

  • For the motion vector reference selection, the co-located motion vector from a previously encoded reference frame can no longer be included in the reference candidate list.

  • For the motion vector reference selection, sorting of the initial list of motion vector reference candidates based on searching the reference frame buffer is disabled.

The VP9 bitstream does not offer any security functions. Integrity and confidentiality must be ensured by functions outside the bitstream, although VP9 is independent of external objects and related security vulnerabilities.

Residual Coding

If a block is not a skipped block (indicated at 8×8 granularity), a residual signal is coded and transmitted for it. Similar to HEVC, VP9 also supports different sizes (32×32, 16×16, 8×8, and 4×4) for an integer transform approximated from the DCT. However, depending on specific characteristics of the intra residues, either or both the vertical and the horizontal transform pass can be ADST instead. The transform size is coded in the bitstream such that a 32×16 block using an 8×8 transform has a luma residual made up of a 4×2 grid of 8×8 transform blocks, and two 16×8 chroma residuals, each consisting of a 2×1 grid of 8×8 transform blocks.
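The 32×16 example above can be reproduced with a few integer divisions. The sketch below assumes 4:2:0 subsampling, block dimensions that are multiples of the transform size, and the same transform size for luma and chroma, as in the example; a real decoder may choose the chroma transform size differently, and the function name is illustrative.

```python
def transform_grid(block_w, block_h, tx_size, subsample_x=2, subsample_y=2):
    """Grid of transform blocks covering a block's luma and chroma residuals.

    Returns the luma grid (columns, rows), the size of each chroma
    residual, and the chroma grid (columns, rows).
    """
    luma_grid = (block_w // tx_size, block_h // tx_size)
    chroma_w = block_w // subsample_x
    chroma_h = block_h // subsample_y
    chroma_grid = (chroma_w // tx_size, chroma_h // tx_size)
    return {
        "luma_grid": luma_grid,
        "chroma_size": (chroma_w, chroma_h),
        "chroma_grid": chroma_grid,
    }

print(transform_grid(32, 16, 8))
```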

Transform coefficients are scanned starting at the upper left corner, following a “curved zig-zag” pattern toward the higher frequencies, while transform blocks with mixed DCT/DST use a scan pattern skewed accordingly.Footnote 10 However, the scan pattern is not straightforward and requires a table lookup. Furthermore, each transform coefficient is coded using bool-coding and has several probabilities associated with it, resulting from various parameters such as position in the block, size of the transform, value of neighboring coefficients, and the like.

Inverse quantization is simply a multiplication by one of the four scaling factors for luma and chroma DC and AC coefficients, which remain the same for a frame; block-level QP adjustment is not allowed. Additionally, VP9 offers a lossless mode at frame level using 4×4 Walsh-Hadamard transform.

Intra Prediction

Intra prediction in VP9 is similar to the intra prediction method in AVC and HEVC, and is performed on the same partitions as the transform blocks. For example, a 16×8 block with 8×8 transforms will result in two 8×8 luma prediction operations. There are 10 different prediction modes: the DC, the TM (True Motion), vertical, horizontal, and six angular predictions approximately corresponding to the 27, 45, 63, 117, 135, and 153 degree angles. Like other codecs, intra prediction requires two one-dimensional arrays that contain the reconstructed left and upper pixels of the neighboring blocks. For block sizes above 4×4, the second half of the horizontal array contains the same value as the last pixel of the first half. An example is given in Figure 3-19.

Figure 3-19. Luma intra samples, mode D27_PRED

Inter Prediction

Inter prediction in VP9 uses eighth-pixel motion compensation, offering twice the precision of most other standards. For motion compensation, VP9 primarily uses one motion vector per block, but optionally allows a compound prediction with two motion vectors per block resulting in two prediction samples that are averaged together. Compound prediction is only enabled in non-displayable frames, which are used as reference frames.Footnote 11 VP9 allows these non-displayable frames to be piggy-backed with a displayable frame, together forming a superframe to be used in the container.

VP9 defines a family of three 8-tap filters, selectable at either the frame or block level in the bitstream:

  • 8-tap Regular: An 8-tap Lagrangian interpolation filter

  • 8-tap Sharp: A DCT-based interpolation filter, used mostly around sharper edges

  • 8-tap Smooth (non-interpolating): A smoothing non-interpolating filter, in the sense that the prediction at integer pixel-aligned locations is a smoothed version of the reference frame pixels

A motion vector points to one of three possible reference frames, known as the Last, the Golden, and the AltRef frames. The reference frame is applied at 8×8 granularity; for example, two 4×8 blocks, each with its own motion vector, will always point to the same reference frame.

In VP9, motion vectors are predicted from a sorted list of candidate reference motion vectors. The candidates are built using up to eight surrounding blocks that share the same reference picture, followed by a temporal predictor of co-located motion vector from the previous frame. If this search process does not fill the list, the surrounding blocks are searched again but this time the reference doesn’t have to match. If this list is still not full, then (0, 0) vectors are inferred.

Associated with a block, one of the four motion vector modes is coded:

  • NEW_MV: This mode uses the first entry of the prediction list along with a delta motion vector which is transmitted in the bitstream.

  • NEAREST_MV: This mode uses the first entry of the prediction list as is.

  • NEAR_MV: This mode uses the second entry of the prediction list as is.

  • ZERO_MV: This mode uses (0, 0) as the motion vector value.
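The candidate list and the four modes can be sketched in a few lines. This is a simplified illustration, not the VP9 derivation process: the list length of two is inferred from the modes referring to a first and a second entry, the removal of duplicate candidates is an assumption, and the sorting of candidates is omitted. All function and parameter names are hypothetical.

```python
ZERO = (0, 0)

def build_candidate_list(same_ref_neighbors, temporal, other_ref_neighbors, size=2):
    """Fill the prediction list in the search order described in the text."""
    ordered = list(same_ref_neighbors)
    if temporal is not None:
        ordered.append(temporal)          # co-located vector, previous frame
    ordered.extend(other_ref_neighbors)   # second pass: reference may differ
    candidates = []
    for mv in ordered:
        if mv not in candidates:          # assumed: duplicates are skipped
            candidates.append(mv)
        if len(candidates) == size:
            break
    while len(candidates) < size:
        candidates.append(ZERO)           # pad with inferred (0, 0) vectors
    return candidates

def resolve_motion_vector(mode, candidates, delta=ZERO):
    """Map a coded mode (and an optional delta) to the final motion vector."""
    if mode == "NEW_MV":
        return (candidates[0][0] + delta[0], candidates[0][1] + delta[1])
    if mode == "NEAREST_MV":
        return candidates[0]
    if mode == "NEAR_MV":
        return candidates[1]
    if mode == "ZERO_MV":
        return ZERO
    raise ValueError("unknown mode: " + mode)

cands = build_candidate_list([(4, -2), (4, -2)], (1, 1), [])
print(cands)
print(resolve_motion_vector("NEW_MV", cands, delta=(1, 3)))
```

Only NEW_MV spends bits on a delta vector; the other three modes reuse an existing prediction, which is why they are cheap to signal.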

A VP9 decoder maintains a list of eight reference pictures at all times, of which three are used by a frame for inter prediction. The predicted frame can optionally insert itself into any of these eight slots, evicting the existing frame. VP9 supports reference frame scaling; a new inter frame can be coded with a different resolution than the previous frame, while the reference data is scaled up or down as needed. The scaling filters are 8-tap filters with 16th-pixel accuracy. This feature is useful in variable bandwidth environments, such as video conferencing over the Internet, as it allows for quick and seamless on-the-fly bit-rate adjustment.
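The eight-slot reference list can be modeled with a small class. This is a sketch only: the bit-mask argument used to choose which slots a decoded frame overwrites, and all class and method names, are assumptions introduced for illustration.

```python
class ReferencePool:
    """Eight reference slots, three of which are active for each frame."""

    NUM_SLOTS = 8

    def __init__(self):
        self.slots = [None] * self.NUM_SLOTS

    def refresh(self, frame, slot_mask):
        """Store `frame` in every slot whose bit is set in `slot_mask`."""
        for index in range(self.NUM_SLOTS):
            if slot_mask & (1 << index):
                self.slots[index] = frame  # evicts the previous occupant

    def active_references(self, last_idx, golden_idx, altref_idx):
        """Pick the three slots used as Last, Golden, and AltRef."""
        return {
            "Last": self.slots[last_idx],
            "Golden": self.slots[golden_idx],
            "AltRef": self.slots[altref_idx],
        }

pool = ReferencePool()
pool.refresh("key_frame", 0b11111111)   # a key frame fills every slot
pool.refresh("frame_1", 0b00000001)     # the next frame replaces slot 0 only
print(pool.active_references(0, 1, 2))
```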

Loop Filter

VP9 introduces a variety of new prediction block and transform sizes that require additional loop filtering options to handle a large number of combinations of boundary types. VP9 also incorporates a flatness detector in the loop filter that detects flat regions and varies the filter strength and size accordingly.

The VP9 loop filter is applied to a decoded picture. The loop filter operates on a superblock, smoothing the vertical edges followed by the horizontal edges. The superblocks are processed in raster-scan order, regardless of any tile structure that may be signaled. This is different from the HEVC loop filter, where all vertical edges of the frame are filtered before any horizontal edges. There are four different filters used in VP9 loop filtering: 16-wide, 8-wide, 4-wide, and 2-wide, where on each side of the edge eight, four, two, and one pixels are processed, respectively. Each of the filters is applied according to a threshold sent in the frame header. A filter is attempted with the conditions that the pixels on either side of the edge should be relatively smooth, and there must be distinct brightness difference on either side of the edge. Upon satisfying these conditions, a filter is used to smooth the edge. If the condition is not met, the next smaller filter is attempted. Block sizes 8×8 or 4×4 start with the 8-wide or smaller filter.


Segmentation

In general, the segmentation mechanism in VP9 provides a flexible set of tools that can be used in a targeted manner to improve perceptual quality of certain areas for a given compression ratio. It is an optional VP9 feature that allows a block to specify a segment ID, 0 to 7, to which it belongs. The frame header can convey any of the following features, applicable to all blocks with the same segment ID:

  • AltQ: Blocks belonging to a segment with the AltQ feature may use a different inverse quantization scale factor than blocks in other segments. This is useful in many rate-control scenarios, especially for non-uniform bit distribution in foreground and background areas.

  • AltLF: Blocks belonging to a segment with the AltLF feature may use a different smoothing strength for loop filtering. This is useful for application-specific targeted smoothing of a particular set of blocks.

  • Mode: Blocks belonging to a segment with an active mode feature are assumed to have the same coding mode. For example, if skip mode is active in a segment, none of the blocks will have residual information, which is useful for static areas of frames.

  • Ref: Blocks belonging to a segment that have the Ref feature enabled are assumed to point to a particular reference frame (Last, Golden, or AltRef). It is not necessary to adopt the customary transmission of the reference frame information.

  • EOB: Blocks belonging to a segment with the coefficient end of block (EOB) marker coding feature may use the same EOB marker coding for all blocks belonging to the segment. This eliminates the need to decode EOB markers separately.

  • Transform size: The block transform size can also be indicated for all blocks in a segment, which may be the same for a segment, but allows different transform sizes to be used in the same frame.

In order to minimize the signaling overhead, the segmentation map is differentially coded across frames. Segmentation is independent of tiling.
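The way per-segment features override frame-level settings can be sketched as a dictionary lookup. The parameter names below (q_index, lf_level, skip, ref) are hypothetical stand-ins for the AltQ, AltLF, Mode, and Ref features, and the values are invented for illustration.

```python
def block_parameters(segment_id, frame_defaults, segment_features):
    """Effective coding parameters for a block in the given segment.

    Features enabled for the segment replace the frame-level defaults;
    everything else is inherited unchanged.
    """
    if not 0 <= segment_id <= 7:
        raise ValueError("segment ID must be in the range 0 to 7")
    params = dict(frame_defaults)
    params.update(segment_features.get(segment_id, {}))
    return params

frame_defaults = {"q_index": 60, "lf_level": 20, "skip": False}
segment_features = {
    1: {"q_index": 90},                  # coarser quantizer for background
    2: {"skip": True, "ref": "Golden"},  # static area, no residual coded
}

print(block_parameters(0, frame_defaults, segment_features))
print(block_parameters(1, frame_defaults, segment_features))
print(block_parameters(2, frame_defaults, segment_features))
```

Because the features are conveyed once in the frame header, a block only needs to carry its segment ID to inherit all of them.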


Summary

This chapter presented brief overviews of major video coding standards available in the industry. With a view toward guaranteeing interoperability, ease of implementation, and industry-wide common format, these standards specified or preferred certain techniques over others. Owing to the discussions provided earlier in this chapter, these predilections would be easy to understand. Another goal of the video coding standards was to address all aspects of practical video transmission, storage, or broadcast within a single standard. This was accomplished in standards from H.261 to MPEG-2. MPEG-4 and later standards not only carried forward the legacy of success but improved upon the earlier techniques and algorithms.

Although in this chapter we did not attempt to compare the coding efficiencies provided by various standards’ algorithms, such studies are available in the literature; for example, those making an interesting comparison between MPEG-2, AVC, WMV-9, and AVS.Footnote 12 Over the years such comparisons, in particular the determination of bit-rate savings of a later-generation standard compared to the previous generation, have become popular, as demonstrated by Grois et al.Footnote 13 in their comparison of HEVC, VP9, and AVC standards.