1 Introduction

The H.264/AVC standard [1] allows more compression-efficient coding compared to its predecessors. The main coding improvement stems from the advanced prediction techniques. Particularly, some intra prediction methods are introduced. They apply the extrapolation of reconstructed pixel samples from neighboring blocks to approximate actual pixel samples from a given block. Architectures of H.264/AVC encoders must support at least the DC intra prediction for both luma and chroma. However, the best compression efficiency can be achieved when all predictions are checked in terms of the rate-distortion (RD) criterion. This leads to a high computational burden of video encoders. As many multimedia applications require real-time coding, hardware acceleration of video encoders is indispensible, especially for High Definition Television (HDTV).

There are some challenges in design of intra prediction architectures. Firstly, the selected prediction mode should be optimal or near optimal. Secondly, the consumption of hardware resources must be as small as possible. Thirdly, the latency between processing of successive blocks must be minimized to achieve high-speed encoding. The last problem addresses the throughput of both the intra predictor and the encoder reconstruction loop. In particular, prediction samples within a block can be computed only when reconstructed samples of blocks adjacent to the left and top side are ready for the reference.

In literature, there are some architectures of the H.264/AVC encoder intra prediction [214]. However, none of them supports all intra prediction modes available in High Profile. Particularly, they do not support either 8 × 8 or Plane predictions, and some designs are dedicated only for a selected block size [10, 11]. The architectures apply separate design units to generate prediction for blocks of different sizes, which increases resource consumption. Although view-parallel MB-interleaved scheduling and open-closed loop intra prediction decomposed into two pipeline stages [14] enable a high processing speed, they have some limitations. Particularly, the first method is dedicated for multi-view or B frames, whereas the second involves quality losses for blocks predicted from original pixels. Intra-frame encoder chips described in [24, 68, 10, 12, 14] achieve high processing speed. However, they apply basic cost functions based on transformed residuals for mode selection, and their dataflow is optimized only for intra modes. The advanced mode selection based on actual rates (estimated according to binarization schemas of quantized coefficients and associated control data) and distortions (estimated as squared errors of reconstructed samples) would deteriorate significantly the throughput, especially when the reconstruction loop is shared with Inter modes. This limitation is caused by the fact that all prediction modes must be processed by the long-delay path including transformations, the (de)quantization, the reconstruction, the error estimation, and the cost evaluation.

In this paper, the hardware architecture of intra predictor for the encoder is proposed. The design supports all prediction modes for all chroma formats. 4 × 4 and 8 × 8 modes can be computed for several quantization parameters. This feature allows more efficient compression when the advanced mode selection based on the rate-distortion criterion is used. The same logic is used for the generation of all prediction modes, leading to the reduction of hardware resources. The architecture applies three main techniques in order to maximize the processing speed. Firstly, the 16 prediction samples can be generated in one clock cycle. Secondly, predictions for 4 × 4 blocks are interleaved with remaining intra modes. Thirdly, the generation order of 4 × 4 blocks is modified to decrease the impact of their prediction dependencies on the latency. The obtained throughput allows the encoder to support 1080p 30 fps video at 200 MHz when intra and inter modes are checked based on the rate-distortion analysis. The following sections describe the proposed architecture, and the last one provides implementations results.

2 Architecture Design

The 4 × 4 block processing order defined by the H.264/AVC is modified to start prediction of the next block before the reconstruction of the previous one is available. The 8 × 8, 16 × 16, and chroma prediction modes are fully independent of 4 × 4 ones. Therefore, it is possible to calculate them while waiting for the reconstruction of the previously-processed 4 × 4 block. In the proposed module, predictions of different types are scheduled to minimize waiting for reconstructed samples when the reconstruction loop delay is long. Final block scheduling used in the design is as follows: B4(0) → B8(0) → B4(1) → L16(H) → L16(V) → B4(2) → B4(4) → B8(1) → B4(3) → B4(5) → L16(DC) → B4(8) → B4(6) → L16(PL) → B4(9) → B4(7) → C(H) → C(V) → B4(10) → B4(12) → B8(2) → B4(11) → B4(13) → C(DC) → C(PL) → B4(14) → B8(3) → B4(15); Note that B4, B8, L16, and C labels correspond to 4 × 4, 8 × 8, 16 × 16, and chroma blocks. Parentheses contain block number or prediction mode (H-Horizontal, V-Vertical, DC-DC, and PL-plane). If the encoder checks several QPs, the generation of 4 × 4 and 8 × 8 partitions is repeated for each QP.

Figure 1 presents the block diagram of the intra predictor with distinguished buffers. The buffers include two sets of registers and the on-chip dual-port RAM module. The first set keeps 25 reference samples from left and upper macroblocks. Additionally, there are three corner registers for each color component. They prevent the loss of corner samples when reconstructed data for the left macroblock are written into the memory. The memory keeps reconstructed samples neighboring with the currently processed macroblock and inside the macroblock for up to seven QP values. The raster order of macroblocks involves keeping the whole picture line in the RAM to provide reconstructed samples from upper macroblocks. In particular, most of the memory space (6 KB) is assigned to the picture line including three color components (2 KB for each). Since both 4 × 4 and 8 × 8 predictions are computed in the interleaved manner, reconstructed samples for the two modes (and seven QPs) must be stored. They are kept in an additional 2 KB memory space. Whenever the next 4 × 4 or 8 × 8 block is started, samples neighboring with the block are fetched from the RAM and stored in registers called LEFT and UPPER, as shown in Fig. 1. In order to achieve fast transfer, four adjacent samples can be read and written through memory ports concurrently.

Figure 1
figure 1

Intra prediction architecture.

In the case of the 8 × 8 prediction modes, the first step is prefiltering of neighboring samples, which is performed using the intra prediction core. Samples needed to calculate the next prediction mode, or those that should be prefiltered, are selected and loaded into nine intermediate-level registers. The in-depth study of the standard leads to the conclusion that nine neighboring samples are enough to calculate prediction for all 4 × 4 and 8 × 8 modes with the exception of two Intra 8 × 8 directions (i.e. horizontal down and vertical right). For these cases, one additional prediction value should be computed. This is performed by buffering a prediction value in an auxiliary register (not shown in Fig. 1 for the simplicity) as the missing value is identical to some values calculated for the previous 4 × 4 block within the 8 × 8 one.

The prediction core consists of the two levels of adders and multiplexers. The first level of adders is responsible for computations according to Eq. (1), whereas the second level supports Eq. (2). As the result of the calculation, 15 different prediction values are obtained, out of which only up to 10 are valid for a 4 × 4 block in a given mode. These 10 are selected by the output multiplexer (MUX).

$$ \mathrm{pred}\left[\mathrm{x},\mathrm{y}\right]=\left(\mathrm{p}\left[{\mathrm{x}}_1,{\mathrm{y}}_1\right]+\mathrm{p}\left[{\mathrm{x}}_2,{\mathrm{y}}_2\right]+1\right)>>1 $$
(1)
$$ \mathrm{p}\mathrm{red}\left[\mathrm{x},\mathrm{y}\right]=\left(\left.\mathrm{p}\left[{\mathrm{x}}_1,{\mathrm{y}}_1\right]+2\times \mathrm{p}\left[{\mathrm{x}}_2,{\mathrm{y}}_2\right]+\right]+\mathrm{p}\left[{\mathrm{x}}_3,{\mathrm{y}}_3\right]+2\right)>>2 $$
(2)

The DC mode requires the reconfiguration of the adder structure, which is accomplished by multiplexers colored dark grey. Particularly, sums of sample pairs obtained at the first level are directed to adders at the second level to compute two sums of four samples. The two sums correspond to the left and upper side. The new configuration, together with the extra adder (outside the core), allows the calculation of the prediction of the whole 4 × 4 block in one clock cycle. As a consequence, the prediction for 4 × 4, 8 × 8, and 16 × 16 DC modes takes one, two, and four clock cycles, respectively. The results are written or accumulated in the DC_N × N register. An additional register is used to avoid overwriting the 16 × 16 mode result before releasing prediction samples.

3 Plane Parameter Generator

The plane mode is described by the following equations:

$$ H={\displaystyle \sum_{x\prime =0}^7\left(x\prime +1\right)\cdot \left(p\left[8+x\prime, -1\right]-p\left[6-x\prime, -1\right]\right)} $$
(3)
$$ V={\displaystyle \sum_{y\prime =0}^7\left(y\prime +1\right)\cdot \left(p\left[-1,8+y\prime \right]-p\left[-1,6-y\prime \right]\right)} $$
(4)
$$ a=16\cdot \left(p\left[-1,15\right]+p\left[15,-1\right]\right) $$
(5)
$$ b=\left(5\cdot H+32\right)>>6 $$
(6)
$$ c=\left(5\cdot V+32\right)>>6 $$
(7)
$$ \mathrm{pred}\left[\mathrm{x},\mathrm{y}\right]=\mathrm{Clip}\left(a+b\cdot \left(x-7\right)+c\cdot \left(y-7\right)+16\right)>>5 $$
(8)

Plane prediction mode parameters H, V, a, b, and c are calculated in a separate sub-module in parallel with the calculation of 16 × 16 luma (or chroma) vertical and horizontal prediction modes. This allows a significant complexity reduction of the calculations of plane mode parameters as the multiplication that appears in (3) and (4) can be replaced by the series of shifts, additions, and accumulations. The sub-module is shown in Fig. 2. It can compute the sum of four selected shifted values in one clock cycle. The shifting and the selection is controlled by a dedicated Plane-mode Finite State Machine (FSM) synchronized with the main FSM of the intra predictor. The shifting replaces the multiplication by powers of two, and the addition of results for different shifts enables the multiplication by all required values. The selection using input multiplexers is performed on the set of reconstructed samples of adjacent macroblocks. They are kept in the corner, left, and upper registers. The addition result can be either accumulated (ACC) or written to seed registers (seeds are plane-predicted samples before clipping). Accumulated values H and V are multiplied by 5 by adding shifted values according to Eqs. (6) and (7). If a subsampled chroma is processed in a given dimension, the accumulated value is multiplied by 34 (the equations are modified according to the H.264/AVC specification). The result is rounded and directed to an appropriate register (b or c). A separate circuit is employed to compute the a value according to (5). The circuit accumulates only two samples when they are available. Values a, b, and c are computed and registered in two versions for both chroma components, and they can also be selected by input multiplexers to compute seeds. Eight seeds for the first block (upper-most 4 × 4 block) are computed in the plane parameter generator (see Fig. 2). In order to calculate the prediction for the whole macroblock, the seeds need to be updated. In particular, the addition of +4b or +4c in successive clock cycles gives seed values of successive 4 × 4 blocks to the left or bottom side, respectively (see Fig. 3). All these updates, as well as proper prediction values, are calculated by the intra prediction core. Before processing blocks in one row, one clock cycle is utilized to compute seed for the first block in the row below (+4c). The result is written back to eight seed registers in the plane parameter generator.

Figure 2
figure 2

Architecture of the plane parameter generator.

Figure 3
figure 3

Computations of seeds in the prediction core.

Most multiplexers in the prediction core (white in Fig. 1) are used to reconfigure the core for the plane computations. The plane mode employs eight of nine intermediate-level registers (numbered from 0 to 7 in Fig. 1) to keep seeds for two adjacent columns. The seeds are clipped (clip modules) to limit the range. To get seed values for positions shifted by two to the right, 2b parameter value is added to values kept in registers, the results are rounded, and then clipped as well. This way, all prediction values within the 4 × 4 block are determined. Seeds obtained at the first level (increased by 2b to correspond to two right columns) are finally increased by 2b in the second level of adders. This operation yields eight seeds in the 4 × 4 block located to the right. If it is necessary to obtain the seeds located in the block below, b is replaced by c.

4 Implementation Results and Conclusion

The intra prediction unit for the encoder is described using VHDL. The design is validated through the comparison with data produced by the JM.12 reference model. The module is synthesized for ASIC and FPGA technologies: TSMC 0.13 μm and Altera Arria II GX. The FPGA synthesis is performed with Altera Quartus II whereas the ASIC synthesis uses Synopsys Design Compiler. The intra prediction module is integrated with other parts of the hardware video encoder, and the whole system operates in the Arria II GX device on a development board. The design can work at the frequency of 100 MHz for the speed grade equal to 5. Smaller speed grades (e.g. 4 for Arria II GX) allow better performance. The implementation in the TSMC 0.13 μm technology allows better performance as the maximal frequency can be set to 200 MHz. Table 1 provides the resource consumption for different configurations of supported modes. Note that luma 16 × 16 and chroma modes except the Plane prediction are always available. If all modes are supported, the design consumes 23658 gates. The exclusion of particular modes provides moderate savings for the sake of the resource sharing. In particular, the hardware cost of the Plane mode is about 4.8 K gates. Removing either Intra 4 × 4 or Intra 8 × 8 provides smaller savings of gates as these modes shares the same resources. The need for the filtering causes Intra 8 × 8 requires more resources as compared to Intra 4 × 4.

Table 1 Synthesis results for different configurations.
Table 2 Comparison of encoder intra predictors for ASIC technologies.

Figure 4 depicts the comparison of the number of clock cycles required for one macroblock versus the delay of the reconstruction loop for three configurations of intra 4 × 4 and 8 × 8 modes. Note that plots assume that intra 16 × 16 and chroma are always processed. Numbers of clock cycles in Figs. 4 and 5 are obtained in successive simulations by the introduction of a constant delay between the intra predictor output and its input for reconstructed samples. If the mode selection is based on SAD and the throughput of the reconstruction loop is 64 samples per clock cycle, the delay is about 10 clock cycles (e.g. 10 pipeline stages: residuals(1), forward transform(2), quantization (2), dequantization(2), inverse transform(2), reconstruction (1)). In this case, the number of clock cycles required for one macroblock is about 340/318 and 448 for the configuration supporting one (either 4 × 4 or 8 × 8) and both block sizes, respectively. The encoder used for validation employs the advanced mode selection based on the rate-distortion criterion. All inter and intra modes are multiplexed in the common reconstruction loop [15]. It is measured that in such conditions the delay is equal to 40, on average. The corresponding number of clock cycles per macroblock increases to about 392, 599, and 608 for intra 8 × 8, intra 4 × 4, and both, respectively. Figure 5 depicts the number of clock cycles required for one macroblock versus the delay of the reconstruction loop for checking different QPs. If their number increases, the impact of the delay becomes smaller, and the main limitation results from the generation rate of the intra predictor. The module can also support smaller resolutions than 1080p 30 fps. If the generation rate were eight samples per clock cycles, 384 additional clock cycles would be needed to produce all predictions. In general, the reordering and the interleaving enable the reduction of the number of clock cycles in dependence on the delay of the reconstruction loop. If the delay is close to zero, the gain is small. The reordering removes six of 16 iterations of the reconstruction loop for 4 × 4. For delays greater than 40, the interleaving reduces the number of clock cycles close to that needed only for 4 × 4 blocks (see Fig. 4).

Figure 4
figure 4

Number of clock cycles per macroblock vs. the delay of the reconstruction loop for different mode configurations.

Figure 5
figure 5

Number of clock cycles per macroblock vs. the delay of the reconstruction loop for different numbers of QP. All modes are supported.

The comparison of the synthesis results of the designed module with other works is presented in Table 2. The comparison shows that the proposed intra predictor is slightly larger than some other designs [24]. However, the proposed supports both the Plane mode and intra 8 × 8 whereas other designs support only one [3, 57] or neither of them [2, 4]. Moreover, the proposed architecture supports all chroma formats and multi-QP coding for 4 × 4 and 8 × 8 modes. Its sample-generation rate is equal to 16, which is the highest among the designs. Figure 4 shows that if we remove either 4 × 4 or 8 × 8 intra predictions as in [5, 6], the number of clock cycles per macroblock for low-delay reconstructions loop decreases close to the state-of-the-art. The architecture described in [9] needs only 160 cycles per macroblock. However, the design support only some prediction modes (all 8 × 8, three 16 × 16, and one chroma). Although the proposed architecture can work at 200 MHz, 1080p@30fps sequences can be encoded at lower frequencies. In particular, 448 (rate-distortion off) and 608 (rate-distortion on) clock cycles correspond to 110 and 168 MHz. In practice, it is indispensible to apply a little higher frequencies to take into account encoder-level control and inter mode processing. Apart from [6], the reference designs cooperate with off-chip memories to save pixel line from the upper macroblocks, and some of them [24] assume the controller external to the intra predictor. These features involve the use of additional resources not taken into account in Table 2.

The proposed architecture for the H.264/AVC encoder supports all intra prediction modes specified in High Profile of the H.264/AVC standard. Additionally, it supports coding for several QPs and all chroma formats. The fact that all High Profile intra prediction modes can be generated should result in more efficiently encoded H.264/AVC streams compared to other designs. The architecture is able to generate predictions in real time for 1080p@30fps processing at 200 MHz. The high throughput is achieved through the high degree of the parallelism. Particularly, the architecture calculates a prediction for the whole 4 × 4 block in one clock cycle for each mode. Moreover, the 4 × 4 block reordering and the interleaving with other block sizes enable prediction computations in parallel with the long-delay reconstruction loop used for the RD-based mode selection. The design reuses the same combinational unit for processing of 4 × 4, 8 × 8, 16 × 16 modes.