# A Fast H.264 Intra Frame Encoder with Serialized Execution of 4 × 4 and 16 × 16 Predictions and Early Termination

DOI: 10.1007/s11265-010-0574-6

- Cite this article as:
- Jung, JS., Jo, YJ. & Lee, HJ. J Sign Process Syst (2011) 64: 161. doi:10.1007/s11265-010-0574-6


## Abstract

This paper presents a fast H.264 intra frame encoder that processes a single macroblock of 1920 × 1080 video in 334 cycles on average, which is 20% faster than the previous best design. The speed-up is mainly achieved by early termination of either 4 × 4 intra prediction or 16 × 16 intra prediction. The executions of intra 4 × 4 and 16 × 16 predictions are serialized, and the second prediction is terminated early by using the cost of the first prediction as the stop criterion. A simple and efficient algorithm that makes use of spatial locality is proposed to select the mode that is processed first. To avoid the bubble cycles caused by this serialized execution of 4 × 4 and 16 × 16 predictions, the modified processing order presented in (Jung et al. 2008) is employed for intra 4 × 4 prediction in order to schedule dependent 4 × 4 blocks apart from each other. To further reduce the execution time of 4 × 4 prediction, neighboring pixels with the same value are grouped, and only one prediction mode in the group is evaluated. Experimental results show that the PSNR drop is 0.0619 dB and the bitrate increase is 0.842% when compared with the JM reference software. The additional hardware cost to support the proposed methods is less than eight thousand gates, which is very small when compared with the hardware size of a whole intra frame encoder.

### Keywords

H.264, Intra prediction, Intra frame encoder, Early termination, Mode selection

## 1 Introduction

The H.264/Advanced Video Coding (AVC) standard [1] introduces aggressive compression tools such as spatial prediction, adaptive block size motion compensation and 4 × 4 block based prediction. As a result, the H.264/AVC standard outperforms previous video coding standards in compression efficiency [2]. Intra frame prediction is one of those tools for compression enhancement that uses neighboring pixels to predict the current coding block. H.264/AVC compression with only intra frame prediction is especially suitable for low cost and low power applications such as a digital still camera or a video recorder, which cannot afford the complexity of inter frame prediction.

Extensive research efforts have been made to reduce the computational complexity of intra prediction [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. One of the most popular techniques for complexity reduction is an early decision of prediction modes among the nine prediction modes for the 4 × 4 block size and the four prediction modes for the 16 × 16 block size. A number of previous techniques utilize the fact that a prediction mode is strongly related to the edge, texture, or direction of the contents of the block. Therefore, the contents of the block are analyzed first, and then only a subset of prediction modes are computed according to the contents [4, 5, 6]. In other techniques for fast intra prediction, only one of the 4 × 4 and 16 × 16 predictions is evaluated [7, 8, 9]. The smoothness of a macroblock is estimated and then 16 × 16 prediction is chosen when the block is smooth whereas 4 × 4 prediction is performed otherwise.

The complexity reduction techniques based on early mode decision are widely used for the software implementation of intra prediction. However, they are seldom employed in hardware implementations, for two main reasons. The first reason is that a precise early decision often requires a complex algorithm which is too expensive to be implemented in hardware. Thus, a hardware-based mode decision using a relatively simple algorithm often makes an inaccurate selection. To avoid the performance drop caused by a wrong decision, an early termination algorithm is employed in [10]. The risk of a performance drop is somewhat reduced in the early termination scheme, which does not completely discard the unselected mode but terminates its computation only when the mode has a very small chance of being chosen as the final mode.

The second reason that prevents a hardware implementation from employing early mode decision is that the hardware utilization may be decreased when early mode decision is employed. For efficient utilization of hardware resources for intra prediction, one of the main obstacles is the dependence between consecutive 4 × 4 predictions. Intra prediction of a 4 × 4 block depends on the reconstructed pixels in its neighboring 4 × 4 blocks, and therefore, intra prediction hardware must remain idle while the reconstruction of the neighboring blocks is being completed. Hence, the execution of intra prediction and reconstruction are often serialized. In [3, 11, 12], the idle cycles (often called bubbles) are avoided by performing 16 × 16 intra prediction (denoted by I16 hereafter) during the bubbles of 4 × 4 intra prediction (denoted by I4 hereafter). This interleaved execution of I4 and I16 is reasonable when both I4 and I16 are always executed. However, this interleaved execution makes it almost impossible to employ the early mode decision between I4 and I16 because interleaved execution implies that both I4 and I16 must be executed in parallel.

This paper attempts to solve the hardware under-utilization problem that arises when an early mode decision/termination is employed and to achieve a speed-up of hardware-based intra prediction without a significant degradation of compression efficiency. To this end, I4 and I16 are executed in a serial manner, and the speed-up is achieved by early termination of whichever of I4 and I16 is processed second, with the termination criterion obtained from the cost of the mode processed first. The processing order is determined from the intra prediction modes of neighboring macroblocks. The serialized execution of I4 after I16 (or I16 after I4) rules out the widely-used technique that interleaves I4 and I16 to remove the bubble cycles of I4 [3, 11]. In order to reduce the bubble cycles even for the serialized execution of I4 and I16, this paper employs the modified processing order of I4 presented in [10]. In the hardware implementation of intra prediction in [10], the execution order of the 4 × 4 blocks is changed to avoid dependence between consecutive intra predictions. This allows independent intra predictions to be executed consecutively, which reduces the bubble cycles without interleaved execution with I16. An additional speed-up technique is also proposed for the case where predictor pixels have identical values (see Section 3.5). As a result, the average execution time for a single macroblock is reduced to 334 cycles for 1920 × 1080 video whereas the previous best design requires 417 cycles.

The rest of this paper is organized as follows. Section 2 briefly introduces previous hardware-based intra prediction techniques and Section 3 presents the proposed fast intra prediction. Section 4 explains the details of the pipeline schedule of the proposed intra prediction. Section 5 presents the hardware implementation of the proposed pipeline and Section 6 gives comparisons with previous works. Section 7 concludes the paper.

## 2 Previous Pipeline Schedules for Fast Intra Prediction

Figure 1(c) shows the processing order of the 16 4 × 4 blocks in a macroblock. In this figure, each small square represents a 4 × 4 block and a large square represents a 16 × 16 macroblock. The number inside the small square represents the processing order of the 4 × 4 block prediction defined in the H.264/AVC standard. There exists dependence between consecutively-executed 4 × 4 blocks such that the intra prediction of one block depends on that of the previous 4 × 4 block. For example, Block 1 depends on Block 0 because some of the predictor pixels of Block 1 belong to Block 0. In other words, the intra prediction of Block 1 needs the pixels in Block 0. Note that the H.264 standard requires these predictor pixels to be the reconstructed pixels from the result of intra prediction of Block 0. Thus, Block 1 needs to wait for the completion of both intra prediction and reconstruction of Block 0. This implies that there exists a period between the intra predictions of Block 0 and Block 1 and the reconstruction of Block 0 is performed during this period. This period is called a bubble in [11] as the hardware resource for intra prediction remains idle during this period.

In [10], additional speed-up is achieved by early termination of I4 using the cost of I16 as the stop criterion. This schedule is shown in Fig. 2(b). I4 and I16 are performed in parallel: the upper pipeline represents I4 whereas the lower pipeline represents I16 and Chroma execution. In the upper pipeline, B0, B1, and B15 represent Block 0, Block 1, and Block 15, respectively. In the lower pipeline, I16-M0~M3 represents the execution of the four prediction modes (from M0 to M3) for I16. The next box, Chroma-M0~M3/I16-Best, represents the parallel execution of the four prediction modes for Chroma prediction and the DCTQ for the best I16 mode. Note that the parallel execution of I4 and I16 requires the hardware resources to be doubled when compared with those for the pipeline in Fig. 2(a). As I16 requires less computation time than I4, I16 completes earlier than I4. By using the final cost of I16 as the stop criterion, the speed-up of I4 is achieved by early termination [10]. To this end, the cost of I16 is compared with the expected cost of I4 estimated from the intermediate result of I4, and I4 is terminated early whenever the estimated cost of I4 is larger than the cost of I16.

## 3 Proposed Fast Intra Prediction

This section presents a new fast intra prediction algorithm that overcomes the limited speed-up achieved by the algorithm in [10]. The limitation of that schedule lies in the fact that the early-termination rate of I4 is not high. When the I4 mode is chosen as the final mode, the execution of I4 cannot be terminated early because the cost of I4 is smaller than the cost of I16. In general, the I4 mode is chosen more frequently than the I16 mode. Therefore, the early termination in the pipeline shown in Fig. 2(b) achieves limited speed-up. In order to overcome this limitation, it is necessary to have a scheme that allows the frequently-chosen mode to be performed first so that the resulting cost is used for the early termination of the seldom-chosen mode. Section 3.1 presents the outline of the algorithm whereas the details of the algorithm are presented in Sections 3.2 to 3.5.

### 3.1 Flow of the Proposed Fast Intra Prediction

The advantage of the proposed schedule over the previous schedule in Fig. 2(b) is that either I4 or I16 can be selected as the candidate for early termination, whereas the previous schedule always selects I4 as the candidate, resulting in a limited speed-up in the case where I16 is chosen as the final mode. With an 8-pixel parallel hardware implementation, the previous pipeline processes I4 and I16 in parallel, with 4-pixel parallel hardware dedicated to each of I4 and I16. As I4 takes longer than I16, the hardware for I16 is often wasted when I4 is not terminated early. In contrast, the proposed pipeline processes I4 with the full 8-pixel parallel hardware, which is almost twice as fast as the previous pipeline in Fig. 2(b), where only a 4-pixel parallel architecture is used for I4 (the other 4-pixel architecture is dedicated to I16). After I4, the 8-pixel parallel hardware is dedicated to I16, and speed-up is achieved by early termination. Therefore, the proposed pipeline schedule achieves better hardware utilization than that in Fig. 2(b), leading to a shorter execution time. Similarly, when I16 is processed first, the proposed pipeline uses the 8-pixel parallel hardware for I16 and then for I4 once I16 is over. The hardware under-utilization is minimized with the proposed pipeline.

For further speed-up by reducing the number of prediction modes for I4, this paper also employs the modified three-step algorithm proposed in [3]. Recall that this algorithm always discards two of the nine prediction modes without much degradation of R-D performance. The plane mode for I16 and Chroma predictions is omitted to reduce the complexity of the hardware, as in [3]. In addition, this paper proposes two additional techniques: early termination among the I16 modes and additional prediction mode reduction for I4. The details of these two additional techniques are discussed in Sections 3.4 and 3.5, respectively.

### 3.2 Mode Selection Between I4 and I16

In the algorithm in Fig. 5, the first step is a selection between I4 and I16. The selection is made by observing the intra prediction modes of neighboring macroblocks. The prediction modes of the upper and left macroblocks are checked first, and I16 is selected if one of the two neighboring modes is I16. Otherwise, I4 is selected. The top-leftmost macroblock in a frame has no neighboring macroblocks. In this case, I4 is always selected because I4 is chosen as the best mode more frequently than I16 is.

Table 1. Accuracy of prediction between I4 and I16.

Test sequence | Accuracy (%) |
---|---|
Blue sky | 92.02 |
Tractor | 90.07 |
Pedestrian area | 79.24 |
Rush hour | 79.13 |
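The neighbor-based selection rule described above can be sketched in a few lines. This is only an illustration: the function name `select_first_mode`, the string labels, and the use of `None` for a missing neighbor are assumptions made for the example, not part of the paper.

```python
from typing import Optional

I4, I16 = "I4", "I16"  # hypothetical labels for the two prediction types


def select_first_mode(upper: Optional[str], left: Optional[str]) -> str:
    """Pick the prediction type to run first for the current macroblock.

    `upper` and `left` are the final modes of the neighboring macroblocks,
    or None when the neighbor does not exist (frame border).
    """
    # I16 goes first if either existing neighbor was coded as I16.
    if upper == I16 or left == I16:
        return I16
    # Otherwise, including the top-left macroblock with no neighbors, run I4 first.
    return I4
```

For instance, `select_first_mode(None, None)` covers the top-leftmost macroblock and yields I4.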

### 3.3 Early Termination of I4

The accumulated cost is updated as $C4_{accum}(N) = C4(N) + C4_{accum}(N-1)$, where $N$ represents the number of the current 4 × 4 block, $C4_{accum}(N)$ represents the cost accumulated for $N$ 4 × 4 blocks, and $C4(N)$ represents the cost of the 4 × 4 intra prediction of the $N$th block. The cost of 4 × 4 intra prediction is the sum of absolute transformed differences (SATD) of each 4 × 4 block. In the next step, the accumulated cost is compared with the early termination threshold, $Th(N)$. The selection of the threshold is discussed in detail in the next paragraph. If the current accumulated cost $C4_{accum}(N)$ is larger than the threshold $Th(N)$, the total cost of I4 is expected to be larger than that of I16, and I4 is terminated early. If the accumulated cost is smaller than the threshold, $N$ is incremented by one and the next iteration of the loop is performed. If $N$ reaches 16 without early termination, I4 is completed.
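The accumulate-and-compare loop can be sketched as follows. The function name, the argument names, and the convention that `n` counts the blocks processed so far (1 to 16) are assumptions made for the example. The threshold is passed in as a callable so that the sketch does not commit to a particular form of $Th(N)$, and the per-block costs are given as a list although an encoder would compute each SATD on demand.

```python
from typing import Callable, Sequence, Tuple


def run_i4_with_early_termination(
    block_costs: Sequence[int],
    threshold: Callable[[int], float],
) -> Tuple[bool, int, int]:
    """Return (completed, blocks_processed, accumulated_cost) for one macroblock."""
    accum = 0
    for n, cost in enumerate(block_costs, start=1):
        accum += cost  # C4_accum(N) = C4(N) + C4_accum(N-1)
        if accum > threshold(n):
            # I4 is expected to cost more than I16, so stop here.
            return False, n, accum
    # All blocks were processed without exceeding the threshold.
    return True, len(block_costs), accum
```

With sixteen blocks of cost 10 and a constant threshold of 25, the loop stops after the third block, where the accumulated cost first exceeds the threshold.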

The threshold, $Th(N)$, determines the amount of computation saved by early termination, so the selection of $Th(N)$ is important for an effective trade-off between computation saving and compression efficiency. The most rigorous threshold, which ensures no loss in compression efficiency, would be the total cost of I16: I4 terminates only when the intermediate cost of I4 is larger than the total cost of I16. At the other end, a flexible threshold is the intermediate cost of I16 for the corresponding 4 × 4 blocks, which forces early termination when the I4 intermediate cost is larger than the I16 intermediate cost of the equivalent 4 × 4 blocks. The threshold function used in [10] is a value between the two thresholds and is defined as follows [10]:

where $Cost_{I16}$ denotes the total cost of I16, $N$ is the index of the 4 × 4 block currently being processed, and $M(N)$ is a margin considering the cost variation of the remaining 4 × 4 blocks. From experimental results, $M(N)$ is defined as follows:

$M(0)$ of 0.75 is experimentally chosen for the early termination of I4 from the result of I16. The computation of (1) is not very complex as it can be implemented with a table look-up operation, one addition, and one multiplication.
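To make the role of the margin concrete, the sketch below uses one possible threshold that satisfies the constraints stated in the text: it lies between the proportional share of the reference cost and the total reference cost, and its margin shrinks as more blocks are processed. The specific forms of `margin` and `threshold` are assumptions chosen for illustration (a margin proportional to $M(0) \times (16 - N)$, normalized by 16); they are not the definitions from [10].

```python
BLOCKS_PER_MB = 16  # number of 4x4 luma blocks in a macroblock


def margin(n: int, m0: float) -> float:
    """Assumed margin: falls linearly from m0 at n = 0 to 0 at n = 16."""
    return m0 * (BLOCKS_PER_MB - n) / BLOCKS_PER_MB


def threshold(n: int, ref_cost: float, m0: float = 0.75) -> float:
    """Assumed threshold: proportional share of ref_cost plus the margin share."""
    return ref_cost * (n / BLOCKS_PER_MB + margin(n, m0))
```

Under this assumed form, a reference cost of 1600 gives a threshold of 1200 before any block is processed, 1400 after eight blocks, and 1600 (the total cost) after all sixteen, so the test becomes stricter as the remaining uncertainty shrinks.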

### 3.4 Early Termination of I16

The accumulated cost of the I16 prediction mode in progress is compared with the threshold $Th(N)$. When the accumulated cost is greater than the threshold, the prediction mode is terminated early. Then, the next prediction mode is performed. The outer loop in Fig. 6(b) represents the iteration over the three prediction modes. Note that the transform of DC coefficients in an I16 mode is not included in Fig. 6(b) for simplicity. Also note that the fourth prediction mode (Plane mode) is often excluded for I16 prediction [3, 10, 12]. The function given in (3) is used for the threshold,

where $M(N)$ is the same as (2) with the value of $M(0)$ chosen as 0.5 by experiments.

I16 is performed mode by mode. Consequently, the cost of one prediction mode is available before the start of the next mode and can be used as the stop criterion for the early termination of the next mode. Thus, the early termination of I16 is attempted from the second prediction mode using the cost of the first prediction mode as the stop criterion. For early termination to be effective, it is important to choose the prediction mode to be performed first. The processing order of the three I16 modes is decided as follows. First, the best I16 mode of the left macroblock is selected as the first mode to be performed. If the first mode is 0, then the second and third modes are 1 and 2, respectively. If the first mode is 1, the second and third modes are 0 and 2, respectively. If the first mode is 2, the second and third modes are 0 and 1, respectively. For the threshold, a function similar to (3) is used again and the value of *M(0)* is chosen as 0.25 in this case. Note that the reference cost (cost of I4) and *M(0)* are updated whenever the cost of one prediction mode is smaller than the cost of the previous best mode. For example, suppose that the best mode becomes I16 after evaluating the first mode, which uses the cost of I4 for *Th(N)*. Then, the second I16 mode uses the cost of the first I16 mode, instead of the cost of I4, as the stop criterion. Due to the change of the termination type, *M(0)* is also changed to 0.25.
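The ordering rule and the update of the reference cost can be sketched as below. The names `i16_mode_order` and `update_reference` are hypothetical, and `update_reference` assumes it is called only with the cost of a mode that ran to completion (a mode that was terminated early has no final cost to compare).

```python
from typing import List, Tuple

M0_FROM_I16 = 0.25  # M(0) when the reference cost comes from an earlier I16 mode


def i16_mode_order(left_best_mode: int) -> List[int]:
    """Order I16 modes 0, 1, 2 so that the left neighbor's best mode runs first."""
    if left_best_mode not in (0, 1, 2):
        raise ValueError("expected an I16 mode in {0, 1, 2}")
    # The remaining two modes keep their ascending order.
    rest = [m for m in (0, 1, 2) if m != left_best_mode]
    return [left_best_mode] + rest


def update_reference(ref_cost: float, m0: float, mode_cost: float) -> Tuple[float, float]:
    """Return the (reference cost, M(0)) pair to use for the next I16 mode."""
    if mode_cost < ref_cost:
        # The completed I16 mode is the new best: its cost becomes the reference.
        return mode_cost, M0_FROM_I16
    return ref_cost, m0
```

For example, if I4 gave a reference cost of 1000 with M(0) = 0.5 and the first I16 mode completes with cost 900, the second I16 mode is checked against 900 with M(0) = 0.25.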

### 3.5 Prediction Mode Reduction

The modified three-step algorithm in [3] reduces the number of I4 modes down to seven with less than 1% increase in the bit rate. As this algorithm is simple enough for hardware implementation, the intra prediction in this paper also employs the modified three-step algorithm.

When predictor pixels have identical values, the prediction modes that use only those pixels produce the same prediction, so a single *“representative mode”* is evaluated among the modes with the identical predictors. With only the representative mode to be predicted, the computational complexity of I4 is significantly reduced. The complexity reduction can be achieved even when not all predictor pixels are identical, as long as the predictor pixels used by a certain group of prediction modes are identical. For example, if predictor pixels from A to D, from I to L, and M are all equal, modes 0, 1, 2, 4, 5, 6, and 8 result in the same SATD. Thus, only one of the seven modes needs to be performed. Table 2 summarizes the relationship between the identical predictor pixels and the prediction modes that result in the same SATD. For example, the fourth row shows that identical predictor pixels from A to H lead to the same SATDs of modes 0, 3, and 7. Hereafter, the identical predictor group is denoted by IPG and the IPG-based mode selection algorithm is denoted by MS-IPG.

Table 2. Identical pixel group and I4 prediction modes with identical SATD.

Group number | Identical predictor group | I4 modes with identical SATD |
---|---|---|
0 | A, B, C, D, E, F, G, H, I, J, K, L, M | all modes |
1 | A, B, C, D, I, J, K, L, M | 0, 1, 2, 4, 5, 6, 8 |
2 | A, B, C, D, I, J, K, L | 0, 1, 2, 8 |
3 | A, B, C, D, E, F, G, H | 0, 3, 7 |
4 | I, J, K, L | 1, 8 |

The modified three-step algorithm requires I4 modes 0 and 1 to be always executed whereas mode 0 or 1 may be excluded in MS-IPG. Therefore, the modified three-step algorithm cannot be performed when mode 0 or 1 is excluded by MS-IPG. In this case, it is necessary to select the algorithm between the modified 3-step algorithm and the MS-IPG. As the modified 3-step algorithm excludes 2 prediction modes, the MS-IPG is more efficient only when the number of excluded modes is larger than 2. As shown in Table 2, the MS-IPG excludes more than two prediction modes when the group number is between 0 and 3. Therefore, the MS-IPG is chosen over the 3-step algorithm when the group number is less than 4. Otherwise, the modified 3-step algorithm is selected. In this manner, the number of I4 modes is always smaller than or equal to 7.
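The group lookup of Table 2 and the choice between MS-IPG and the modified three-step algorithm can be sketched as follows. The representation of predictors as a dictionary keyed by the letters A to M, the function names, and the decision to scan the groups in table order and stop at the first match are assumptions made for this illustration; combinations of groups are not handled.

```python
from typing import Dict, Optional, Tuple

# Table 2: (group number, predictor pixels that must be identical, modes sharing one SATD)
IPG_TABLE = [
    (0, "ABCDEFGHIJKLM", (0, 1, 2, 3, 4, 5, 6, 7, 8)),
    (1, "ABCDIJKLM", (0, 1, 2, 4, 5, 6, 8)),
    (2, "ABCDIJKL", (0, 1, 2, 8)),
    (3, "ABCDEFGH", (0, 3, 7)),
    (4, "IJKL", (1, 8)),
]


def find_ipg(pred: Dict[str, int]) -> Optional[Tuple[int, Tuple[int, ...]]]:
    """Return (group number, modes with identical SATD) for the first matching group."""
    for group, pixels, modes in IPG_TABLE:
        # A group matches when all of its predictor pixels share one value.
        if len({pred[p] for p in pixels}) == 1:
            return group, modes
    return None


def use_ms_ipg(pred: Dict[str, int]) -> bool:
    """MS-IPG is chosen over the modified three-step algorithm for groups 0 to 3."""
    hit = find_ipg(pred)
    return hit is not None and hit[0] < 4
```

When only the left predictors I to L are identical, the lookup returns group 4, so `use_ms_ipg` is false and the modified three-step algorithm is used instead.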

## 4 Pipelined Execution of the Proposed Intra Prediction

### 4.1 Intra 4 × 4 Prediction

To reduce the bubble cycles, this paper proposes two optimizations. The first optimization takes advantage of the fact that four pixels are reconstructed in 1 cycle. The reconstruction hardware is designed in such a way that the rightmost four pixels are generated in the first cycle (cycle 29). Note that only the rightmost four pixels are necessary for the generation of predictors for Block 1. Thus, after the first reconstruction cycle, the IP of Block 1 can begin its ‘St’ operation, and 3 cycles can be removed from the bubble cycles between Blocks 0 and 1.

The second optimization is based on the fact that the predictors for Mode 0 do not depend on the reconstructed pixels in Block 0 (the predictors for Mode 0 are constructed from the pixels in the upper 4 × 4 block). Thus, Mode 0 of Block 1 can begin before the completion of the ‘R’ step of Block 0. On the other hand, Mode 1 of Block 1 depends on the reconstructed pixels of Block 0. Therefore, the ‘St’ operation of Mode 1 can begin only after the first ‘R’ operation of Block 0, that is, at cycle 30, whereas the ‘St’ operation of Mode 0 can begin at cycle 28, 2 cycles earlier than Mode 1.

With the two optimizations, five bubble cycles can be removed. However, only 4 cycles are deliberately removed from the bubble, and the ‘St’ operation of Mode 0 begins at cycle 29 as shown in Fig. 7. The reason is to keep the control of Chroma scheduling simple: Chroma scheduling includes normal 4 × 4 block prediction for 2 cycles and the 2 × 2 DC Hadamard transform for 1 cycle. Therefore, the number of bubble cycles needs to be odd when the DC Hadamard transform is performed during the bubble and even otherwise. Ignoring these rules would require an additional buffer to store the temporary result of a 4 × 4 block to be used at the beginning of the next bubble. As a result, 14 bubble cycles are inserted between Blocks 0 and 1 as shown in Fig. 7 (from cycle 15 to cycle 28, represented by the gray area).

Note that Block 15 is also dependent on its left block (Block 14). Thus, the pipeline execution is almost the same as that shown in Fig. 7. The only difference is that it is not necessary to take the Chroma scheduling into consideration. Hence, 13 bubble cycles are inserted between Block 14 and Block 15.

Due to the upper-right dependence between Blocks 1 and 2, a bubble is also inserted between the two blocks. For the upper-right dependence, Modes 0, 1, 2, 4, 5, 6, and 8 do not depend on the previous block. Due to the restriction on the processing order imposed by the three-step algorithm, Modes 5, 6, and 8 cannot be scheduled first. Thus, Modes 0, 1, 2, and 4 can be scheduled before the reconstruction of the previous block. As a result, eight bubble cycles are removed. The first optimization applied for Blocks 0 and 1 cannot be adopted in this case because Mode 0 needs the bottom four pixels of Block 1 which are available only after the last reconstruction step (Cycle 60). Thus, ten bubble cycles (from 43 to 52) are inserted between Blocks 1 and 2 as shown in Fig. 7. Note that there are two other cases with upper-right dependence (between Blocks 7 and 8 and between Blocks 13 and 14). Thus, ten bubble cycles are also inserted for these two cases.

### 4.2 Pipeline Schedule of the Proposed Intra Prediction

In Fig. 9(a), B0 and B1 represent the I4 predictions of Blocks 0 and 1, respectively. The numbers inside the parentheses in the boxes represent the execution cycles. Note that the execution is twice as fast as that in Fig. 2(b). This is because the design in Fig. 9 employs an 8-pixel parallel data path entirely dedicated to I4, whereas the design in Fig. 2(b) shares the data path between I4 and I16 and, consequently, only a 4-pixel parallel data path is used for I4. B2-7 represents the consecutive execution of I4 predictions from B2 to B7 whereas B8-13 represents I4 predictions from B8 to B13. During the execution of I4, five bubbles are generated between B0 and B15. These bubbles are 14 cycles between B0 and B1, 10 cycles between B1 and B2, 10 cycles between B7 and B8, 10 cycles between B13 and B14, and 13 cycles between B14 and B15 (see the previous subsection for details about these bubble cycles).

To avoid a waste of hardware resources, these bubbles are interleaved with Chroma and I16 predictions. For example, the bubble between B0 and B1 is utilized for the execution of Mode 0 of the Chroma prediction (denoted by C-M0). In Fig. 9(a) and (b), C-M0, C-M1, and C-M2 represent intra predictions of Modes 0, 1, and 2 for Chroma data, respectively. It takes 17 cycles to process one mode of Chroma prediction. Each of Chroma U and V has four 4 × 4 blocks (64 pixels), so with the 8-pixel parallel implementation, 16 cycles are necessary to process the 128 Chroma pixels (64 pixels for each of U and V). One additional cycle is needed for the 2 × 2 transform of the DC coefficients of U and V, making 17 cycles per mode. Recall that the bubble between B0 and B1 is only 14 cycles. Thus, C-M0 is not completed during the first bubble, and the remaining computation for C-M0 is performed in the second bubble between B1 and B2. The fourth box from the left, denoted by C-M0/M1, represents the sequential execution of C-M0 (for 3 cycles) followed by C-M1 (for 7 cycles). The remaining execution of C-M1 is performed after the execution of B2-7. C-M2 is performed for 10 cycles during the bubble between B13 and B14. Then, the remaining 7 cycles of C-M2 are performed during the bubble between B14 and B15. Recall that the bubble between B14 and B15 is 13 cycles long. Thus, 6 cycles of this bubble remain after the completion of C-M2. These remaining 6 cycles are consumed by the execution of I16-M0 (the first mode of I16 prediction). The execution order of the modes of I16 may vary (see Section 3.4). Thus, I16-M0, I16-M1, and I16-M2 do not represent Modes 0, 1, and 2 of I16. Instead, they represent the I16 modes that are executed first, second, and third, respectively.
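The way Chroma and I16 work is packed into the I4 bubbles can be reproduced with a simple greedy fill. The sketch below is an illustration only: the function `fill_bubbles` and the bubble labels are hypothetical, the bubble lengths and the 17-cycle Chroma modes are the figures quoted in the text, and 34 cycles is the per-mode I16 duration stated in this section.

```python
from typing import Dict, List, Tuple

# Bubble lengths (cycles) between dependent I4 blocks, as quoted in the text.
BUBBLES = [("B0-B1", 14), ("B1-B2", 10), ("B7-B8", 10), ("B13-B14", 10), ("B14-B15", 13)]
# Work to hide inside the bubbles: three Chroma modes, then the first I16 mode.
TASKS = [("C-M0", 17), ("C-M1", 17), ("C-M2", 17), ("I16-M0", 34)]


def fill_bubbles(
    bubbles: List[Tuple[str, int]], tasks: List[Tuple[str, int]]
) -> Dict[str, List[Tuple[str, int]]]:
    """Greedily pack tasks into bubbles, splitting a task across bubbles when needed."""
    queue = [[name, cycles] for name, cycles in tasks]
    schedule: Dict[str, List[Tuple[str, int]]] = {}
    for label, size in bubbles:
        slots: List[Tuple[str, int]] = []
        free = size
        while free > 0 and queue:
            name, remaining = queue[0]
            used = min(free, remaining)
            slots.append((name, used))
            free -= used
            if used == remaining:
                queue.pop(0)  # task finished inside this bubble
            else:
                queue[0][1] = remaining - used  # carry the rest to the next bubble
        schedule[label] = slots
    return schedule
```

Calling `fill_bubbles(BUBBLES, TASKS)` places 14 cycles of C-M0 in the first bubble, 3 cycles of C-M0 and 7 of C-M1 in the second, and 7 cycles of C-M2 followed by 6 cycles of I16-M0 in the last, which matches the split described above.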

Each mode of I16 takes 34 cycles (I16 prediction has sixteen 4 × 4 blocks, which take 32 cycles to predict the 256 Luma pixels, and 2 additional cycles are required to generate the 4 × 4 transform of the I16 DC coefficients). Thus, the remaining part of I16-M0 is performed after B15. Then, the other modes of I16 prediction and the DCTQ/reconstruction for the best Chroma mode (C-Best) are performed. The execution of C-Best requires 32 cycles to re-predict and reconstruct eight 4 × 4 blocks. However, the prediction hardware must remain idle for 16 cycles during the C-Best computation because 4-pixel parallel hardware is used for reconstruction. To avoid these idle cycles, C-Best and I16 are performed in an interleaved manner (Interleaved I16 and C-Best in Fig. 9). Then, I16-Best in Fig. 9 represents the execution cycles for the decision of the best I16 mode. This step is necessary only when I16 is selected as the best macroblock mode. It requires 64 cycles for prediction. Note that the 64 cycles for I16 selection are not needed when I4 is selected.

In the pipeline schedule shown in Fig. 9(b), the first two modes of I16 (denoted by I16-M0/M1) are performed first. Then, I4 and Chroma prediction are performed in an interleaved manner, as in the schedule presented in Fig. 9(a). This implies that the third mode of I16 is not completed before the start of I4. Thus, the stop criterion for I4 early termination is the smaller cost of only the first two modes. It is possible for the cost of the third mode to be less than the smaller cost of the first two modes. Thus, the early termination rate of I4 may be slightly decreased by this delayed evaluation of the third mode, although experimental results show that this decrease is not significant. During the execution of I4, the execution order from B0 to B15 is the same as that shown in Fig. 9(a). Then, I16-M2 and C-Best are performed in an interleaved manner. The last two steps are the same as those in Fig. 9(a).

The main advantage of the serialized execution of I4 and I16 over the previous schedules is that an early mode decision between I4 and I16 can be effectively used to speed up the computation time of the unselected mode with early termination. If I4 is selected over I16, then the first schedule (Fig. 9(a)) is adopted so that I4 is performed first. Then, I16 is terminated early by using the cost of I4 as the stop criterion. On the other hand, if I16 is selected over I4, then I16 is performed first and I4 may be terminated early by using the result of I16. In a software-based implementation of an early mode decision for H.264 intra prediction, it is often the case that only the selected mode is executed. In this case, the unselected mode is simply discarded. The performance drop by this discard is often not very significant because a software-based implementation often uses a sophisticated algorithm to select the better mode. In the hardware-based implementation as in this paper, a complicated algorithm is not easy to design so that only a simple algorithm is allowed for the selection of the better mode. As a result, the discard of the unselected mode may often substantially degrade the compression efficiency. To avoid such degradation, it is desirable to use an early termination scheme which avoids the discard of the unselected mode from the beginning. Instead, the unselected mode is discarded by comparing the cost of the selected mode with the estimated cost of unselected mode and terminating the execution of the unselected mode only when the estimated cost is greater than the cost of the selected mode.

In both schedules, the optimal I4 order shown in Fig. 3 is adopted for further speed-up of I4 execution. One change made by the new pipeline schedules in comparison with that in Fig. 2(b) is that the new schedules do not exclude the two I4 modes (modes 3 and 7) for blocks 2, 8, and 14. This is because excluding these two modes for blocks 2, 8, and 14, as in Fig. 2(b), incurs a bitrate increase of 0.49%. The proposed schedules attempt to improve the compression efficiency by avoiding this performance loss and placing bubbles between up-right dependencies as well as between left dependencies. As a result, the new schedules place five bubbles instead of two as in Fig. 2(b).

### 4.3 Pipeline Schedule for MS-IPG

This subsection revisits the pipeline executions shown in Figs. 7 and 8 and explains issues when MS-IPG is employed. A decision whether to perform MS-IPG or the modified three-step is made right after R (at cycle 30 in Fig. 7). Since mode 0 is already performed at cycle 30 and mode 0 is always a candidate for the representative mode, mode 0 is always chosen as the representative mode.

For the up-right dependence shown in Fig. 7, four modes are processed before R. Therefore, three bubbles may be generated after mode 0 if MS-IPG is selected after R. To reduce these bubbles, MS-IPG is performed in a two-step manner. In the first step, the predictors of IPG 2 are compared at the beginning of Block 2. This is possible because these predictors are available from the beginning of Block 2. If the predictors of IPG 2 are all identical, the bubbles can be replaced by modes 4, 5, and 6. Otherwise, the first four modes (modes 0, 1, 2, and 4) are performed. In the second step, the predictors of IPG 3 are compared after R. Then, modes 3 and 7 are excluded if the up and up-right predictors are all identical. If the predictors differ, the three-step algorithm is selected. Note that the predictors of IPG 1 are also compared in the first step. However, the bubble replacement scheme cannot be applied in this case. For IPG 1, modes 3 and 7 are the candidate modes for the replacement. However, these modes have to be done after R because they require up-right predictors. As no I4 mode can be scheduled in these bubbles, I16 blocks are performed during the bubbles.

MS-IPG can also generate bubbles in the normal pipeline schedule (Fig. 8). A block may depend on predictors from the block scheduled two positions before it. For example, the reconstructed pixels of Block 2 are used as the left predictors of Block 4, as shown in Fig. 8. Even if Block 3 ends early because of MS-IPG, Block 4 cannot start immediately after Block 3 because it has to wait for R of Block 2. Therefore, bubble generation is unavoidable between Block 3 and Block 4. This bubble is also filled by I16 blocks.
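The stall between Block 3 and Block 4 can be reproduced with a toy in-order scheduler. The sketch below is only an illustration: the function name `schedule`, the per-block prediction lengths, and the reconstruction latency are hypothetical values chosen for the example, not cycle counts taken from Fig. 8.

```python
def schedule(order, pred_cycles, recon_latency, depends_on):
    """Toy in-order pipeline model (illustrative assumption, not the actual design).

    A block starts when the previous block's prediction has finished AND the
    reconstruction (R) of every block it depends on is complete.
    Returns (start_cycle, bubble_cycles) dictionaries keyed by block number.
    """
    start, bubbles, recon_done = {}, {}, {}
    prev_end = 0
    for blk in order:
        # Cycle at which all required reconstructed pixels are available.
        deps_ready = max((recon_done[d] for d in depends_on.get(blk, ())), default=0)
        start[blk] = max(prev_end, deps_ready)
        bubbles[blk] = start[blk] - prev_end      # idle cycles before this block
        prev_end = start[blk] + pred_cycles[blk]  # prediction of this block ends
        recon_done[blk] = prev_end + recon_latency
    return start, bubbles


# Hypothetical numbers: Block 4 needs the reconstructed pixels of Block 2.
deps = {4: [2]}
start_full, full = schedule([2, 3, 4], {2: 9, 3: 9, 4: 9}, recon_latency=12, depends_on=deps)
start_early, early = schedule([2, 3, 4], {2: 9, 3: 4, 4: 9}, recon_latency=12, depends_on=deps)
print(full[4], early[4])  # prints "3 8"
```

In both runs Block 4 starts at the same cycle, so the cycles freed by ending Block 3 early simply turn into a longer bubble. This is the gap that the design fills with I16 blocks.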

## 5 Hardware Implementation

Additional control logic is necessary to select the processing order of I16 modes and to determine early termination, which requires the computation of Eqs. 1, 2 and 3. Equation 1 or 3 is implemented with one adder and one multiplier. Equation 2 is implemented with a look-up table. As there are three kinds of early termination, three look-up tables are constructed. The processing order in Fig. 3 requires additional buffers to store the neighboring pixels of 4 × 4 blocks. For example, Block 4 needs pixels from Blocks 0, 1, and 2. Thus, the results of Blocks 0, 1, and 2 must be stored until the intra prediction of Block 4 begins. Reconstructed pixels are stored in different buffers (reorder buffers) in a manner that minimizes the number of buffers.

*M(0) × (16−N)* in Eq. 2 is denoted as *M′(N)* in Fig. 11. *M′(N)* is pre-computed for all *N* and stored in a table. I4_M, I16_M, and I16_M_Intra are the three tables for I4 termination, I16 termination according to the I4 cost, and I16 termination according to the I16 cost, respectively. The table, *N*, and the reference cost are selected according to the mode to be terminated (current mode) and the mode whose cost is used as the reference cost (current best mode). The remaining parts calculate the threshold for early termination. Figure 12 shows the reorder buffers (hrec0-5, vrec0-1) explained in the previous paragraph. The hrec buffers store the bottommost pixels of a 4 × 4 block, while the vrec buffers store the leftmost pixels. The numbers on the left are the block numbers in Fig. 3. Each arrow represents the lifetime of reconstructed pixels. To minimize the number of buffers, reconstructed pixels are stored as shown in Fig. 12.
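The table-based evaluation of *M(0) × (16−N)* can be sketched in software as follows. This is a minimal illustration under stated assumptions: the *M(0)* constants, the range of *N* (taken here as 0 to 16), and the helper names `build_table`, `select_table`, and `lookup` are hypothetical; only the three table names and the selection rule come from the text above.

```python
def build_table(m0, num_blocks=16):
    """Pre-compute M'(N) = M(0) * (16 - N) for every N (assumed range 0..16)."""
    return [m0 * (num_blocks - n) for n in range(num_blocks + 1)]


# One table per kind of early termination. The M(0) values are placeholders.
TABLES = {
    "I4_M": build_table(4),         # I4 termination
    "I16_M": build_table(6),        # I16 termination according to the I4 cost
    "I16_M_Intra": build_table(5),  # I16 termination according to the I16 cost
}


def select_table(current_mode, best_mode):
    """Pick the table from the mode being terminated and the current best mode."""
    if current_mode == "I4":
        return "I4_M"
    return "I16_M" if best_mode == "I4" else "I16_M_Intra"


def lookup(current_mode, best_mode, n):
    """Replace the multiplication in M(0) * (16 - N) with a table read."""
    return TABLES[select_table(current_mode, best_mode)][n]


print(lookup("I16", "I4", 10))  # 6 * (16 - 10) = 36 with the placeholder M(0)
```

Storing the products trades a multiplier for a small memory, which is consistent with the low gate count reported for the early termination module.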

Gate counts of hardware modules.

Components | Gate count |
---|---|
Boundary buffers | 11720 |
Intra prediction | 4185 |
Transform | 10150 |
Mode decision | 14207 |
FIFO | 4165 |
Q and IQ | 14414 |
Inverse transform | 7640 |
Reconstruction controller | 4488 |
Scheduler | 5792 |
Early termination | 806 |
Reorder buffers | 1745 |
Total | 79313 |

## 6 Comparison with Previous Works

Early termination rates for various image sizes.

Video type | I16 selection (%) | I4 termination (%) | I16 termination (%) |
---|---|---|---|
1920 × 1080 | 30.515 | 19.815 | 66.809 |
1280 × 720 | 9.479 | 4.607 | 75.823 |
352 × 288 | 9.659 | 6.055 | 72.656 |

Performance improvement by identical pixel group.

Video type | Number of excluded modes/MB |
---|---|
1920 × 1080 | 15.137 |
1280 × 720 | 3.484 |
352 × 288 | 8.679 |

Comparison with previous designs.

Design feature | This work | [3] | [10] | [11] |
---|---|---|---|---|
Gate count | 79 K | 94 K | 126 K | 85 K |
Target size | HD1080p | HD1080p | HD720p | 720 × 480 |
Maximum cycles | 464 cycles | 441 cycles | 624 cycles | 1060 cycles |
Average cycles for 1920 × 1080 | 334 cycles | 417 cycles | 475 cycles | 1017 cycles |
Average cycles for 1280 × 720 | 342 cycles | 409 cycles | 587 cycles | 1002 cycles |
Average cycles for 352 × 288 | 343 cycles | 407 cycles | 574 cycles | 1001 cycles |
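The roughly 20% speed-up over the design in [3] follows directly from the average cycle counts in the comparison table. The short script below recomputes the reduction for each video size; the dictionary layout and the function name `reduction_percent` are illustrative choices, while the cycle counts are copied from the table.

```python
# Average cycles per macroblock, copied from the comparison table.
AVG_CYCLES = {
    "1920x1080": {"this_work": 334, "ref3": 417},
    "1280x720": {"this_work": 342, "ref3": 409},
    "352x288": {"this_work": 343, "ref3": 407},
}


def reduction_percent(ours, theirs):
    """Percentage of cycles saved relative to the reference design."""
    return 100.0 * (theirs - ours) / theirs


for size, cycles in AVG_CYCLES.items():
    saved = reduction_percent(cycles["this_work"], cycles["ref3"])
    print(f"{size}: {saved:.1f}% fewer cycles than [3]")
# 1920x1080: 19.9% fewer cycles than [3]
# 1280x720: 16.4% fewer cycles than [3]
# 352x288: 15.7% fewer cycles than [3]
```

The saving is largest for 1920 × 1080 video, which is also the size with the highest I4 termination rate and the largest number of excluded modes per macroblock in the preceding tables.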

PSNR and bitrate changes.

Method | PSNR (dB) | Bitrate (%) |
---|---|---|
Plane mode skip | −0.0071 | 0.0974 |
DCT SATD | 0.0008 | 0.0331 |
3-Step | −0.0463 | 0.5547 |
MS-IPG | 0.0231 | −0.3991 |
Early terminations | −0.0324 | 0.5561 |
Final result | −0.0619 | 0.8422 |
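The per-technique rows of this table appear to be additive: summing them reproduces the final result to four decimal places. The check below uses the values from the table; the variable names are illustrative only.

```python
import math

# (PSNR change in dB, bitrate change in %) for each technique, from the table.
CONTRIBUTIONS = {
    "Plane mode skip": (-0.0071, 0.0974),
    "DCT SATD": (0.0008, 0.0331),
    "3-Step": (-0.0463, 0.5547),
    "MS-IPG": (0.0231, -0.3991),
    "Early terminations": (-0.0324, 0.5561),
}
FINAL = (-0.0619, 0.8422)

psnr_sum = sum(psnr for psnr, _ in CONTRIBUTIONS.values())
bitrate_sum = sum(rate for _, rate in CONTRIBUTIONS.values())

# Compare with a small tolerance to absorb floating-point rounding.
print(math.isclose(psnr_sum, FINAL[0], abs_tol=1e-9))     # True
print(math.isclose(bitrate_sum, FINAL[1], abs_tol=1e-9))  # True
```

Read this way, MS-IPG is the only technique that lowers the bitrate, and it offsets most of the increase caused by the 3-step algorithm.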

## 7 Conclusion

About 20% of the execution cycles for H.264 intra prediction are saved by the proposed pipeline schedule, early termination, and the mode selection based on IPG. Experimental results show that the proposed schedule with early termination is effective for various video sizes and quality levels. The mode selection based on IPG also provides substantial computation savings for large videos with large QPs. In spite of the significant reduction in computation time, the PSNR drop is 0.0619 dB and the bitrate increase is 0.842%. Although this paper mainly targets a specific hardware design, the proposed methodology can be applied to a wide range of platforms.

### Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.