Architecture design of the high-throughput compensator and interpolator for the H.265/HEVC encoder
Abstract
This paper presents the architecture of the high-throughput compensator and interpolator used in the motion estimation of the H.265/HEVC encoder. The architecture can process an 8×8 block in each clock cycle. The design allows a random order of checked coding blocks and motion vectors, which makes the architecture suitable for different search algorithms. The interpolator embeds 64 multiplierless reconfigurable filter cores to support computations for different fractional-pel positions. Synthesis results show that the design can operate at 200 and 400 MHz when implemented in an Arria II FPGA and in TSMC 90 nm technology, respectively. The computational scalability enables the proposed architecture to trade throughput for compression efficiency. When encoding 2160p@30fps video, the design clocked at 400 MHz can check about 100 motion vectors for each 8×8 block.
Keywords: Video coding · Interpolation · Motion estimation · H.265/HEVC · FPGA · Very large-scale integration (VLSI)

1 Introduction
Filter coefficients in the H.265/HEVC interpolator

Filter type   |  Reference sample index
              |  −3   −2   −1    0    1    2    3    4
Luma 1/4      |  −1    4  −10   58   17   −5    1
Luma 1/2      |  −1    4  −11   40   40  −11    4   −1
Luma 3/4      |        1   −5   17   58  −10    4   −1
Chroma 1/8    |            −2   58   10   −2
Chroma 1/4    |            −4   54   16   −2
Chroma 3/8    |            −6   46   28   −4
Chroma 1/2    |            −4   36   36   −4
Chroma 5/8    |            −4   28   46   −6
Chroma 3/4    |            −2   16   54   −4
Chroma 7/8    |            −2   10   58   −2
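The filtering defined by the coefficient table can be sketched as a 1D weighted sum over a window of reference samples. The snippet below is a minimal behavioral model (the function and dictionary names are illustrative, not from the paper); the coefficients are those listed above for the luma positions, padded with zeros so that every filter spans reference indices −3..4.

```python
# Behavioral model of 1-D fractional-pel filtering with the HEVC
# luma coefficients from the table above (names are illustrative).

LUMA_FILTERS = {
    # fractional position -> taps over reference indices -3..4
    "1/4": [-1, 4, -10, 58, 17, -5, 1, 0],
    "1/2": [-1, 4, -11, 40, 40, -11, 4, -1],
    "3/4": [0, 1, -5, 17, 58, -10, 4, -1],
}

def interp_1d(samples, pos):
    """Filter an 8-sample window (reference indices -3..4)."""
    taps = LUMA_FILTERS[pos]
    return sum(c * s for c, s in zip(taps, samples))

# Every filter's taps sum to 64, so a flat region stays flat after
# the later normalization by >> 6.
assert all(sum(t) == 64 for t in LUMA_FILTERS.values())
assert interp_1d([100] * 8, "1/2") == 100 * 64
```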
The design and implementation of digital filters are thoroughly explored topics, especially for small-tap filters. Moreover, digital signal processors are well suited for high-speed filtering since their architectures are adjusted to vector and matrix computations using many parallel multiply-and-accumulate units. Nevertheless, the computational resources of processors are often insufficient to support compression-efficient encoding, especially at high resolutions. Therefore, dedicated hardware accelerators are necessary.
Except for one design [5], architectures for motion estimation consist of two parts assigned to the integer-pel and fractional-pel search [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. This approach requires separate reference-pixel buffers for each part. The integer-pel part usually applies hierarchical strategies to extend the search range, which incurs quality losses. Most architectures use non-adaptive search patterns, and their resource consumption is large [6, 7, 8, 9, 10]. The architecture supporting Multipoint Diamond Search proposed in [11] requires fewer resources. However, it supports only 16×16 blocks, limiting the compression efficiency.
Some high-throughput interpolators have been proposed in the literature for H.264/AVC [5, 6, 7, 8, 9]. Their scheduling assumes two successive steps, one for the half-pel interpolation and another for the quarter-pel interpolation. This approach is natural in terms of the specification of quarter-pel computations, which refer to the results of half-pel computations. This dataflow cannot be applied directly in H.265/HEVC since quarter-pel samples are computed using separate filters. In particular, more filters are needed in the second step. Furthermore, the hardware cost increases due to a larger number of filter taps and the much higher throughputs required (more partitioning modes). Some interpolator architectures have been described in publications [12, 13, 14, 15]. They achieve throughputs suitable for video resolutions from 1080p to 4320p. Their scheduling assumes that interpolation is performed only around one point selected at the integer-pel stage for a given prediction unit. This limits the search flexibility and scalability. Moreover, three designs [12, 13, 14] assume that a single prediction-unit size is selected in advance; if more sizes have to be processed, the throughput decreases accordingly. One design [15] supports three prediction block sizes (16×16, 32×32, and 64×64); on the other hand, it consumes a large amount of hardware resources. In general, obtaining a compression-efficient and high-throughput implementation requires more hardware resources and increases power consumption. Therefore, there is a need for solutions optimizing these parameters.
This work presents a high-throughput architecture of the compensator and the interpolator dedicated to the H.265/HEVC encoder. As in the previous work dedicated to H.264/AVC [5], the modules can check one motion vector for an 8×8 block in each clock cycle. This feature makes the architecture suitable for different fast search algorithms, as the order of motion vectors does not affect the architecture. Moreover, the computational scalability enables a trade-off between throughput and compression efficiency. The present work has three novel contributions. Firstly, the fine-level search range is extended, whereas the memory cost is reduced. Secondly, the interpolation for a given prediction block can be performed in parallel with the integer-pel search. Thirdly, the proposed reconfigurable filter core for H.265/HEVC can process both luma and chroma.
The proposed dataflow reduces the size of on-chip memories 16-fold while preserving the random order of checked coding blocks and fractional-accuracy motion vectors. The interpolator embeds 64 multiplierless reconfigurable filter cores to support computations for different fractional-pel positions.
The rest of the paper is organized as follows: Sect. 2 reviews previous developments on the hardware design of adaptive motion estimation. Section 3 describes the new architecture of the motion estimation system for H.265/HEVC. Scheduling of the interpolation is presented in Sect. 4. Section 5 concentrates on the design of the reconfigurable filter core used in the interpolator. Section 6 provides implementation results. Finally, the paper is concluded in Sect. 7.
2 Design for adaptive motion estimation
Original pixels and intra predictions are buffered in the same memories as the interpolated reference area. As a consequence, the joint size of the memories is significant even when the search range is small. For example, each of the 64 memory modules needs 16 kbit when using one reference frame and a search range of [−8, 7] in both dimensions. In the case of the H.265/HEVC encoder, the memory capacity would be much greater since the processing of 16×16-pixel macroblocks is replaced by coding tree units with sizes up to 64×64 pixels. Taking into account the wider search range of [−32, 31] in both dimensions, the memory capacity could be increased 16 times. If the architecture is restricted to the luma component, the increase can be reduced to four times. However, such a design would still be inefficient in terms of silicon area and power consumption. The interpolator developed for the considered H.265/HEVC architecture is about three to four times more complex [4] than that used in H.264/AVC. This increase stems from the greater number and order of the interpolation filters applied.
3 New architecture
The dataflow is modified to reduce the memory capacity in the H.265/HEVC adaptive ME system described in the previous section. Instead of storing interpolated pixels in the memories, the architecture computes fractional-pel samples after reading integer-pel ones from the memories. For a given search range, this approach decreases the memory space assigned to luma samples 16-fold. The interpolated area of each chroma component occupies the same memory space as the luma one regardless of the chroma subsampling format. If the interpolation is performed after reading (as in the modified architecture), the reduction of the memory space for chroma is greater than for luma. In particular, the 4:2:0 video format enables a 64-fold reduction. The lower memory cost allows wider search ranges, subsequently removing the need for the hierarchical search.
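The reduction factors quoted above follow from simple counting, sketched below as a back-of-the-envelope check (a simplified model, not the paper's exact memory accounting): quarter-pel luma accuracy implies a 4×4 grid of interpolated sub-positions per integer-pel sample, and eighth-pel chroma accuracy implies an 8×8 grid.

```python
# Back-of-the-envelope check of the memory-reduction figures above
# (a simplified counting model, not the paper's exact accounting).

# Luma: quarter-pel accuracy means 4x4 interpolated sub-positions
# per integer-pel sample, so storing only integer-pel samples and
# interpolating after the read shrinks the luma buffer 16-fold.
luma_subpositions = 4 * 4
assert luma_subpositions == 16

# Chroma: eighth-pel accuracy means 8x8 sub-positions, so in 4:2:0
# the interpolated chroma area is 64 times larger than the
# integer-pel one, matching the 64-fold reduction stated above.
chroma_subpositions = 8 * 8
assert chroma_subpositions == 64
```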
In the adaptive computationally scalable motion estimation system applied to H.264/AVC [5], original and reference 8×8 blocks are read alternately from the same memories. This dataflow is inconvenient for high-resolution videos since the throughput is limited by memory access. In the proposed architecture, separate memories for original and reference pixels allow a doubled access rate. If MVs are checked for the same 8×8 blocks continuously, the memory with original samples should remain unchanged. This gives the opportunity to reduce power consumption.
4 Scheduling
Figure 10 depicts the 2D interpolation for luma and chroma. The architecture performs the horizontal and vertical processing in two consecutive phases. In the first phase for luma, four adjacent 8×8 blocks included in the 16×16 area are loaded into the 15×8 input register in four clock cycles. When each of the two block pairs is written, the horizontal interpolation is started. 1D results appear at the filter output with a two-cycle delay. They are fed back to the input registers. The vertical interpolation starts when both results are written into the input registers. 8×8 predictions appear at the interpolator output with a two-cycle delay. Three vertically neighboring fractional-pel positions can be computed in successive clock cycles.
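The two-phase (horizontal, then vertical) separable processing can be modeled as below. This is a minimal functional sketch, not the pipelined hardware dataflow: it uses the half-pel luma filter for both phases, a 15×15 reference area producing an 8×8 prediction, and omits the normalization shifts.

```python
# Minimal model of the two-phase separable 2D interpolation:
# a horizontal pass over every row, then a vertical pass over the
# intermediate result (normalization shifts omitted for clarity).

HALF_PEL = [-1, 4, -11, 40, 40, -11, 4, -1]  # 8-tap luma filter

def filter_1d(samples):
    """Slide the 8-tap filter over a line of samples."""
    return [sum(c * samples[i + k] for k, c in enumerate(HALF_PEL))
            for i in range(len(samples) - 7)]

def interp_2d(area):
    """Horizontal phase, transpose, vertical phase, transpose back."""
    horiz = [filter_1d(row) for row in area]            # phase 1
    cols = [list(col) for col in zip(*horiz)]           # transpose
    vert = [filter_1d(col) for col in cols]             # phase 2
    return [list(row) for row in zip(*vert)]            # transpose back

# A constant 15x15 reference area yields a constant 8x8 prediction
# scaled by 64*64 (both passes have a tap sum of 64).
out = interp_2d([[10] * 15 for _ in range(15)])
assert len(out) == 8 and len(out[0]) == 8
assert out[0][0] == 10 * 64 * 64
```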
The interpolation for chroma (4:2:0) differs from that for luma. Only one 8×8 block is received at the beginning. It is sufficient to compute the 4×4 fractional-pel output. The input block is obtained for the MVD equal to (−1, −1) and must be written to the registers with a shift by two horizontal positions. This operation adjusts the input domain to the active filter taps, as access to the two left register columns is required only for luma. An analogous operation is accomplished while feeding the registers before the vertical processing. Only four columns of the filter array are used due to the smaller size of the output block. Moreover, the bottom row of filters is also not utilized. Therefore, the architecture employs only a 4×7 array of filters to support the chroma interpolation.
The 1D and 2D interpolation introduces a delay between the receiving of input blocks and the releasing of fractional-pel results. In the meantime, data appearing in the main path of the compensator bypass the interpolator. Such data are shown in grey in Figs. 9 and 10. Since the integer-pel path in the compensator does not communicate with the interpolator, the corresponding clock cycles can be utilized to check some integer-accuracy motion vectors for other prediction blocks. Such interleaved processing requires a two-thread motion vector generator.
5 Reconfigurable filter structure
FPGA devices usually have DSP units dedicated to filter implementations. Such units embed multipliers, adder trees, and internal pipeline registers allowing operation at high frequencies. These resources enable a straightforward implementation of the fractional sample interpolation. Apart from high speed, implementations based on DSP units utilize few general-purpose resources such as logic elements. However, the number of DSP units in FPGA devices is limited, and they can be utilized in other modules of the H.265/HEVC encoder, e.g., the forward/inverse transform and the (de)quantization. Therefore, an implementation based on regular logic elements can better fit FPGA resources.
In ASIC implementations, the incorporation of regular multipliers is inefficient when coefficients are constant or limited to a narrow set of values. A better approach is to design adder/subtractor trees with the dataflow reconfigured by multiplexers. This approach can also be used in FPGA devices when too few DSP units are available. The multiplexers can change the actual filter coefficients by shifting and switching between the intermediate branches of the tree. For example, it is possible to add samples corresponding to the same coefficient values at the first stage (the property of symmetrical filters). In general, there are many possible filter implementations based on the adder/subtractor tree. However, preferred solutions should minimize the amount of utilized resources. The design of particular filters exploits some well-known optimization techniques. Firstly, filter coefficients equal to a power of two do not require multiplication circuits, as hardwired shifting can yield the result. Secondly, multiplications are equivalent to additions/subtractions of the same sample shifted up by different numbers of positions. Thirdly, an input sample or intermediate results can be directed to different adder/subtractor nodes to balance the tree. Fourthly, the reconfiguration of the filter coefficients can be realized by multiplexing. Fifthly, when two filters have the same coefficient values in inverted order, it is possible to share resources by changing the assignment of the input samples with multiplexers.
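The shift-and-add technique can be illustrated on the coefficients from the filter table. Each constant multiplication becomes a few hardwired shifts combined by adders/subtractors; the decompositions below are one possible choice, not necessarily the one used in the proposed filter core.

```python
# Multiplierless constant multiplication by shift-and-add, shown
# for four coefficients of the HEVC filters. The decompositions
# are illustrative; the paper's tree may differ.

def mul58(x):  # 58 = 64 - 4 - 2
    return (x << 6) - (x << 2) - (x << 1)

def mul40(x):  # 40 = 32 + 8
    return (x << 5) + (x << 3)

def mul17(x):  # 17 = 16 + 1
    return (x << 4) + x

def mul11(x):  # 11 = 8 + 2 + 1
    return (x << 3) + (x << 1) + x

# Exhaustive check over the 9-bit signed sample range.
for x in range(-256, 256):
    assert mul58(x) == 58 * x
    assert mul40(x) == 40 * x
    assert mul17(x) == 17 * x
    assert mul11(x) == 11 * x
```

Note that only two or three adders/subtractors per coefficient are needed, compared to a full multiplier; the shifts themselves are free in hardware (hardwired routing).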

NI multiplexers select between the non-inverted and inverted order of input samples. This selection allows the same dataflow for filter pairs with a mirrored order of coefficients. In particular, the I input is active when the processing element implements the 3/4 luma filter or the 5/8, 6/8, or 7/8 chroma filters. Otherwise, the N input is selected.

OE multiplexers are utilized only for chroma to select between odd (O) and even (E) numerators of fractional-pel positions. The O input is selected when the circuit is configured as the 1/8, 3/8, 5/8, or 7/8 filter. Otherwise, the E input is selected.

HQ multiplexers select between the half- and quarter-pel configurations. The H input is selected when the circuit is configured as the half-pel luma filter or the 3/8, 4/8, or 5/8 chroma filters. Otherwise, the Q input is selected.

LC multiplexers select between luma (L) and chroma (C) interpolation modes.

The SAT multiplexer saturates the final result to avoid overflow and underflow.
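The multiplexer-control rules above can be summarized as a small decoding function; the encoding of fractional positions (a numerator in quarters for luma, eighths for chroma) and the helper name are illustrative, not from the paper.

```python
# Sketch of the multiplexer-control decoding described above.
# Positions are given as numerators: quarters for luma (1..3),
# eighths for chroma (1..7). Names and encoding are illustrative.

def mux_controls(component, frac):
    """Return (NI, OE, HQ) selections for one filter configuration."""
    if component == "luma":
        ni = "I" if frac == 3 else "N"   # 3/4 is the mirrored filter
        oe = None                        # OE is used only for chroma
        hq = "H" if frac == 2 else "Q"   # 2/4 is the half-pel filter
    else:  # chroma
        ni = "I" if frac in (5, 6, 7) else "N"   # mirrored filters
        oe = "O" if frac % 2 == 1 else "E"       # odd/even numerator
        hq = "H" if frac in (3, 4, 5) else "Q"   # near-half positions
    return ni, oe, hq

# Half-pel luma: non-inverted order, H configuration.
assert mux_controls("luma", 2) == ("N", None, "H")
# 7/8 chroma: inverted order, odd numerator, Q configuration.
assert mux_controls("chroma", 7) == ("I", "O", "Q")
```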
Interpolation filters in H.265/HEVC increase the output bit width to avoid underflow and overflow. In the extreme case, the output requires seven additional bits compared to the input accuracy. Additionally, the sign bit must be taken into account. Therefore, 16 bits are needed to represent samples after the 1D interpolation, before the rounding operation. In the second phase of the 2D interpolation, performed on 16-bit signed data, the output accuracy is increased by seven bits to 23 bits. Consecutive stages of the adder/subtractor tree gradually increase the range to 23 bits at the 2D output. To keep the same output range for the 1D and 2D processing, eight-bit integer-pel samples are written to the input registers before the interpolation, shifted up by six bit positions.
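The seven-bit growth can be verified from the coefficients: the 1D output magnitude is bounded by the input magnitude times the sum of absolute tap values, which for the half-pel luma filter is 112 < 2^7. The snippet below is a simple check of the bit-width figures quoted above.

```python
# Check of the dynamic-range growth stated above: output magnitude
# <= input magnitude * sum(|taps|), and 112 < 2**7 gives the
# seven-bit growth per 1-D pass.
import math

taps = [-1, 4, -11, 40, 40, -11, 4, -1]  # half-pel luma filter
gain = sum(abs(c) for c in taps)
extra_bits = math.ceil(math.log2(gain))
assert gain == 112 and extra_bits == 7

# 8-bit input + 7 bits of growth + sign bit = 16-bit signed
# intermediate; the second (vertical) pass adds 7 more -> 23 bits.
assert 8 + extra_bits + 1 == 16
assert 16 + extra_bits == 23
```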
6 Implementation results
Synthesis results

Technology   | Units         | Compensator | Interpolator
Arria II GX  | Logic (ALUT)  | 4 247       | 28 757
             | Clock (MHz)   | 200         | 200
TSMC 90 nm   | Logic (gate)  | 42 519      | 277 074
             | Clock (MHz)   | 400         | 400
             | Memory (kB)   | 64 + 12     | –
The compensator incorporates 64 two-port memory modules to store reference pixels. Each module has a size of 1 kB. This capacity allows a search range of (−64, 63) × (−48, 48) for both luma and chroma. Wider ranges are possible at the cost of an increased memory size. The original pixels are stored in a separate memory with a capacity of 12 kB.
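The memory figures above can be cross-checked against the synthesis table with trivial arithmetic (a sanity check, not additional data from the paper).

```python
# Cross-check of the memory figures: 64 two-port modules of 1 kB
# each hold the reference pixels, plus a separate 12 kB memory for
# the original pixels, matching "64 + 12" in the synthesis table.
reference_kB = 64 * 1
original_kB = 12
assert reference_kB == 64
assert reference_kB + original_kB == 76
```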
Byun et al. [10] presented an H.265/HEVC integer-pel full search architecture supporting all prediction unit sizes at the range of (−32, 31) × (−32, 31). The design consumes 3.56 Mgates and 23 kB of memory. The hardware cost of the motion estimation system described in this paper is much smaller, even if the motion vector generator were several times more complex than that developed for H.264/AVC [5]. Moreover, the search range is wider. A low-power integer-pel design was proposed by Sanchez et al. [11]. Its resource consumption is relatively low (50k gates and 82 kbit of memory). However, it supports only 16×16 blocks and a narrow search range, which does not exploit the compression potential of H.265/HEVC.
Comparison with other FPGA architectures
Comparison with other ASIC architectures
Design                        | Diniz [13]  | Guo [14]    | He [15]     | Pastuszak [4] | This work
Technology (nm)               | TSMC 150    | SMIC 90     | 65          | TSMC 90       | TSMC 90
Clock (MHz)                   | 312         | 250         | 188         | 400           | 400
Resources (gate)              | 30 209      | 32 496      | 1 183 000   | 224 094       | 277 074
Parallelism (sample/clock)    | 12          | 8           | 16×12       | 64            | 64
1,000 × parallelism/resources | 0.3972      | 0.2462      | 0.1623      | 0.2856        | 0.2310
Throughput                    | 2160p@30fps | 2160p@60fps | 4320p@30fps | 1080p@30fps   | 2160p@30fps
Features                      | Luma+chroma | Luma        | Luma        | Luma+chroma   | Luma+chroma
Evaluation results: BD-rate for different numbers of checked integer-pel MVs

Sequence        | 50 MVs | 100 MVs
Blue sky        | 1.36   | 0.46
Ducks take off  | 0.32   | 0.04
Station2        | 1.65   | 0.26
Rush hour       | 4.02   | 0.38
Tractor         | 2.43   | 0.74
Average         | 1.96   | 0.38
Real-time H.265/HEVC encoding of 1080p videos can also be obtained using parallel processing on CPU and GPU platforms [17, 18]. On the other hand, the compression efficiency of such implementations is significantly lower compared to the HM reference software (0.7–0.92 dB).
7 Conclusion
The architecture of the compensator and the interpolator has been developed for the H.265/HEVC adaptive motion estimation. The design performs interpolation on reference pixels read from on-chip memories. This allows much wider search ranges and reduced memory sizes. The design achieves a high utilization of hardware resources thanks to the interleaving of integer- and fractional-pel computations. The design can check about 100 motion vectors for each 8×8 block when encoding 2160p@30fps video at 400 MHz. Future research will concentrate on the development of an efficient search strategy able to support different prediction block sizes.
References
 1. ITU-T Recommendation H.265 and ISO/IEC 23008-2 MPEG-H Part 2, High efficiency video coding (HEVC) (2013)
 2. HEVC software repository: HM-10.0 reference model. Available: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/branches/HM10.0dev/
 3. ITU-T Rec. H.264 and ISO/IEC 14496-10 MPEG-4 Part 10, Advanced video coding (AVC) (2005)
 4. Pastuszak, G., Trochimiuk, M.: Architecture design and efficiency evaluation for the high-throughput interpolation in the HEVC encoder. In: Euromicro Conference on Digital System Design (DSD), pp. 423–428 (2013)
 5. Pastuszak, G., Jakubowski, M.: Adaptive computationally-scalable motion estimation for the hardware H.264/AVC encoder. IEEE Trans. Circuits Syst. Video Technol. 23(5), 802–812 (2013)
 6. Chen, T.C., Chien, S.Y., Huang, Y.W., Tsai, C.H., Chen, C.Y., Chen, T.W., Chen, L.G.: Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder. IEEE Trans. Circuits Syst. Video Technol. 16(6), 673–688 (2006)
 7. Liu, Z., Song, Y., Shao, M., Li, S., Li, L., Ishiwata, S., Nakagawa, M., Goto, S., Ikenaga, T.: HDTV1080p H.264/AVC encoder chip design and performance analysis. IEEE J. Solid-State Circuits 44(2), 594–608 (2009)
 8. Yang, C., Goto, S., Ikenaga, T.: High performance VLSI architecture of fractional motion estimation in H.264 for HDTV. In: IEEE International Symposium on Circuits and Systems (ISCAS 2006), pp. 21–24 (2006)
 9. Oktem, S., Hamzaoglu, I.: An efficient hardware architecture for quarter-pixel accurate H.264 motion estimation. In: 10th Euromicro Conference on Digital System Design, pp. 1142–1143 (2007)
10. Byun, J., Jung, Y., Kim, J.: Design of integer motion estimator of HEVC for asymmetric motion-partitioning mode and 4K UHD. Electron. Lett. 49(18), 1142–1143 (2013)
11. Sanchez, G., Porto, M., Agostini, L.: A hardware friendly motion estimation algorithm for the emergent HEVC standard and its low power hardware design. In: IEEE International Conference on Image Processing, pp. 1991–1994 (2013)
12. Afonso, V., Maich, H., Agostini, L., Franco, D.: Low cost and high throughput FME interpolation for the HEVC emerging video coding standard. In: IEEE Fourth Latin American Symposium on Circuits and Systems (LASCAS) (2013)
13. Diniz, C.M., Shafique, M., Bampi, S., Henkel, J.: High-throughput interpolation hardware architecture with coarse-grained reconfigurable datapaths for HEVC. In: IEEE International Conference on Image Processing, pp. 2091–2095 (2013)
14. Guo, Z., Zhou, D., Goto, S.: An optimized MC interpolation architecture for HEVC. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1117–1120 (2012)
15. He, G., Zhou, D., Chen, Z., Zhang, T., Goto, S.: A 995Mpixels/s 0.2nJ/pixel fractional motion estimation architecture in HEVC for Ultra-HD. In: IEEE Asian Solid-State Circuits Conference, pp. 301–304 (2013)
16. Jakubowski, M., Pastuszak, G.: An adaptive computation-aware algorithm for multi-frame variable block-size motion estimation in H.264/AVC. In: International Conference on Signal Processing and Multimedia Applications (SIGMAP '09), pp. 122–125 (2009)
17. Wang, X., Song, L., Chen, M., Yang, J.: Paralleling variable block size motion estimation of HEVC on CPU plus GPU platform. In: IEEE International Conference on Multimedia and Expo Workshops (ICMEW) (2013)
18. Wang, X., Song, L., Chen, M., Yang, J.: Paralleling variable block size motion estimation of HEVC on multicore CPU plus GPU platform. In: IEEE International Conference on Image Processing (ICIP), pp. 1836–1839 (2013)
19. Xiph.org: Test media. Available online at http://media.xiph.org/video/derf/ (2011)
20. Bjontegaard, G.: Calculation of average PSNR differences between RD-curves. ITU-T VCEG-M33, VCEG 13th Meeting (2001)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.