Algorithm and architecture design of the motion estimation for the H.265/HEVC 4K-UHD encoder
Abstract
This paper presents the algorithm and the architecture of a high-throughput motion estimation system for the H.265/HEVC encoder. The design allows the processing of 2160p@30fps videos at the clock frequency of 400 MHz. The architecture embeds two parallel processing paths for the integer-pel and the fractional-pel motion estimation. The paths share the same memories. Access conflicts are avoided by the use of dual-port modules and register buffers for reused samples. In each clock cycle, the integer-pel and the fractional-pel path can evaluate one and four motion vectors for an 8 × 8 luma block, respectively. A separate interpolator for chroma additionally increases the throughput. The integer-pel path supports test zone search for 8 × 8 prediction blocks. The motion estimation for larger blocks is performed by reusing results of the 8 × 8 search. The search for rectangular PUs is performed only at the fractional-pel level and reuses partial costs computed for square PUs. As a consequence, a significant amount of computation is saved. Synthesis results show that the design can operate at 200 and 400 MHz when implemented in FPGA Arria II and TSMC 90 nm, respectively. The implemented algorithm is verified in the HM16 software. If 2160p@30fps videos are encoded with the low-delay configuration, BD-PSNR and BD-rate are equal to −0.026 dB and 1.64 %, respectively.
Keywords
Video coding, Motion estimation, Interpolation, H.265/HEVC, FPGA, Very large-scale integration (VLSI)
1 Introduction
Research and standardization efforts in video coding led to the specification of the H.265/HEVC standard [1, 2] in 2013. At the same quality of the reconstructed video, the standard provides an improvement in compression efficiency of about 35–50 % compared to its predecessor H.264/AVC [3]. However, the better compression efficiency is achieved at the price of increased computational complexity. Although the general structure of the encoder and the decoder remains the same, there are many changes in the algorithm. Instead of 16 × 16-pixel macroblocks, the new standard applies coding tree units (CTUs), which can be up to 64 × 64 pixels in size. Each CTU can be recursively split into square coding units (CUs) with the minimal size of 8 × 8 pixels. Each 2N × 2N CU can be partitioned into prediction units (PUs), where N can be equal to 4, 8, 16, or 32. There are eight allowable partition shapes: two square shapes (2N × 2N and N × N), two symmetric rectangular shapes (N × 2N and 2N × N), and four asymmetric rectangular shapes (2N × 3N/2, 2N × N/2, 3N/2 × 2N, and N/2 × 2N). Each inter PU has a separate motion vector (MV). Similar to H.264/AVC, H.265/HEVC allows MVs with quarter-pixel accuracy. There are new interpolation schemes to compute fractional-pel positions. In particular, 8-tap and 7-tap filters are used for the luma interpolation of half-pel and quarter-pel positions, respectively. Chroma samples are computed using 4-tap filters. Although the design and implementation of digital filters is a thoroughly explored issue, high-throughput video encoders require some effort to obtain efficient hardware solutions.
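To make the luma filtering concrete, the following minimal sketch applies the H.265/HEVC luma interpolation coefficients to one row of samples. The coefficient sets are those of the standard (8 taps for the half-pel position, 7 non-zero taps for the quarter-pel positions). The function name `interpolate_row`, the single-stage rounding, and the 8-bit default are simplifying assumptions made for illustration; they do not describe the filter cores of the proposed architecture.

```python
# H.265/HEVC luma interpolation coefficients, indexed by quarter-pel phase.
# Each set sums to 64, so a flat signal is reproduced exactly.
LUMA_TAPS = {
    1: (-1, 4, -10, 58, 17, -5, 1, 0),    # 1/4 position (7 non-zero taps)
    2: (-1, 4, -11, 40, 40, -11, 4, -1),  # 1/2 position (8 taps)
    3: (0, 1, -5, 17, 58, -10, 4, -1),    # 3/4 position (mirror of 1/4)
}


def interpolate_row(samples, frac, bit_depth=8):
    """Interpolate one row at phase `frac` (1, 2, or 3 quarter-pels).

    `samples` must contain 3 extra samples to the left and 4 to the right
    of the output range, so len(samples) - 7 values are returned.
    Assumption: one-dimensional filtering with a single rounding stage.
    """
    taps = LUMA_TAPS[frac]
    max_val = (1 << bit_depth) - 1
    out = []
    for i in range(len(samples) - 7):
        acc = sum(c * s for c, s in zip(taps, samples[i:i + 8]))
        value = (acc + 32) >> 6                   # round and divide by 64
        out.append(min(max(value, 0), max_val))   # clip to the sample range
    return out
```

For a 1D interpolation of eight output samples, a row of 15 input samples is therefore needed, which is why two horizontally adjacent 8 × 8 reference blocks must be available at the filter inputs.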
With the exception of our previous designs [4, 5], architectures for the motion estimation (ME) consist of two parts assigned to the integer-pel and fractional-pel search [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. This approach requires separate reference pixel buffers for each part. The integer-pel part usually applies hierarchical strategies to extend the search range, which involves quality losses. Most architectures use non-adaptive search patterns, and their resource consumption is large [6, 7, 8, 9, 10]. The architecture supporting Multipoint Diamond Search proposed in [11] requires fewer resources; however, it only supports 16 × 16 blocks, which limits the compression efficiency.
Some high-throughput interpolators have been proposed in the literature for H.264/AVC [5, 6, 7, 8, 9]. Their scheduling assumes two successive steps, one for the half-pel interpolation and another for the quarter-pel interpolation. This approach is natural in terms of the specification of quarter-pel computations, which refer to results of half-pel computations. This dataflow cannot be applied directly in H.265/HEVC since quarter-pel samples are computed using separate filters. In particular, more filters are needed in the second step. Furthermore, the hardware cost increases due to a larger number of filter taps and much higher throughputs required (more partitioning modes). Some interpolator architectures designed for H.265/HEVC have been described in the literature [12, 13, 14, 15]. They achieve throughputs suitable for video resolutions from 1080p to 4320p. All the designs neglect the interpolation for merge modes. Three designs [12, 13, 14] are based on the assumption that the size of prediction units is selected at the integer-pel ME. If the processing of more sizes has to be performed, the throughput is decreased accordingly. One design [15] supports three prediction block sizes (16 × 16, 32 × 32, and 64 × 64); however, it consumes a large amount of hardware resources. Generally, a compression-efficient and high-throughput implementation requires more hardware resources and increases power consumption. Therefore, there is a need for solutions in which these parameters are optimized.
This study presents a high-throughput ME architecture dedicated to the H.265/HEVC encoder. Similar to the previous works [4, 5], the architecture can check one integer-pel motion vector for an 8 × 8 block in each clock cycle. As an arbitrary order of motion vectors is allowed, the architecture supports the test zone search (TZS) algorithm used in the HM software. A significant amount of computation is saved for 16 × 16 and 32 × 32 prediction blocks by the exploitation of results of the 8 × 8-block search. In the case of rectangular PUs, only MVs checked for the fractional-pel ME of 2N × 2N PUs are evaluated, which additionally reduces the complexity at small quality losses.
The present study has four novel contributions at the architecture level. Firstly, the use of dual-port memories and register buffers for reused data allows shared and conflict-free access from two processing paths corresponding to the integer-pel and the fractional-pel search. Secondly, the extension of the interpolation to 9 × 9 blocks allows the evaluation of four 8 × 8 fractional-pel blocks at a small increase in the resource cost. Thirdly, the architecture enables the two-dimensional continuous interpolation of 9 × 9 blocks with reconfigurable and dedicated filter cores. Fourthly, the separate chroma interpolator additionally increases the throughput and design flexibility.
The rest of the paper is organized as follows: Sect. 2 reviews previous developments on the hardware design of the adaptive motion estimation. Section 3 describes the new architecture of the H.265/HEVC motion estimation system. The applied scheduling is described in Sect. 4. Section 5 presents the motion estimation algorithm executed by the proposed architecture. Section 6 provides implementation results. Finally, the paper is concluded in Sect. 7.
2 Design for adaptive motion estimation
The adaptive computationally scalable motion estimation algorithm allows video encoders to achieve close to optimal efficiencies in real-time conditions [16]. The algorithm can employ different search strategies to adapt to local motion activity, and the number of checked search points is set by the encoder controller for each macroblock. The algorithm can achieve results close to optimum even if the number of search points assigned to macroblocks is strongly limited and varies with time.
In order to process blocks of 8 × 8 samples, the interpolator embeds 64 reconfigurable filters [4]. The reconfiguration allows the computation of four fractional-pel positions (e.g., 0, 1/4, 1/2, and 3/4) for both luma and chroma samples. The number of filters corresponds to the size of blocks processed in the main path. Although the interpolation parallelism is high, the throughput is limited by the reading of 8 × 8 blocks at the input. In particular, two and four 8 × 8 blocks must be read to obtain the 1D and 2D interpolation of three fractional-pel positions for one block, respectively. More clock cycles are utilized when the interpolation is performed in two dimensions. If 100 cycles are available for each 8 × 8-pixel block (2160p@30fps), the interpolation around two integer-pel MVs can be performed for luma. Particularly, one 1D luma interpolation with the cross pattern takes 16 cycles, whereas 2D interpolation for nine positions takes 27 cycles. Two corresponding chroma blocks are interpolated in 10 cycles for one position. In total, 46 out of 100 cycles are utilized for the luma and chroma interpolation. The remaining 54 cycles are available for the integer-pel search interleaved with memory reads for the fractional-pel estimation. Although the throughput is significantly improved compared to other designs [12, 13, 14], it is still insufficient to evaluate the greater number of PU sizes. Additional interpolations are indispensable to support merge modes.
3 New architecture
In the architecture described in the previous section [4], the integer-pel and the fractional-pel search share the same processing path with the interleaved processing. As a consequence, the number of clock cycles assigned to the integer-pel estimation is decreased almost by half, which has a negative impact on the compression efficiency. The main bottleneck is introduced by the memory read port able to provide one 8 × 8 block in each clock cycle. In order to resolve the problem, the new architecture incorporates dual-port memory modules instead of two-port ones. The main advantage of dual ports is that they can operate in either the read or the write mode. In the architecture, the first port is assigned to the integer-pel path, whereas the second is used as the input to the interpolator. The interpolator incorporates the register buffer at the input stage to reuse samples from the second path. Since the interpolator does not read data in each clock cycle, some cycles can still be utilized to write the reference pixels for the following CTUs. The same approach is applied to the memory storing original samples.
9 × 9 blocks released from the interpolator must be written to a buffer to wait for the end of the preselection process. Some predictions corresponding to preselected MV candidates should be kept until they are forwarded to the reconstruction loop and the rate-distortion optimization. The buffer is outside the ME system and will be used to integrate with other encoder modules [17].
The horizontal interpolator computes the 9 × 16 sample array in four clock cycles and then forwards it to the vertical stage. The vertical interpolator can be implemented as the 9 × 9 array of reconfigurable filters which determine a 9 × 9 block for one fractional-pel MV in each clock cycle. However, the hardware cost of 81 reconfigurable filters is significant. To save resources, the vertical stage incorporates 54 dedicated and nine reconfigurable filters. Each of three fractional-pel interpolations (1/4, 1/2, and 3/4) is performed with 18 dedicated filters. Separate bypass paths transfer 18 samples not interpolated vertically. Each bypass path includes the rounding adder and the range limiter. Nine reconfigurable filters perform all interpolations for the rightmost column in three cycles. The fourth cycle is utilized to transfer nine samples through the bypass path. The remaining eight columns are horizontally rotated between registers feeding dedicated filters and the bypass path. In particular, the register content is moved by two columns in each clock cycle. Each register column is assigned to one of the three groups of dedicated filters or to the bypass path. As a consequence, the 9 × 9 blocks released from the vertical stage consist of samples interpolated for four fractional-pel MVs. Thus, SADs must be accumulated in parallel for 16 fractional-pel MVs in four clock cycles. One multiplexer at the interpolator output is used to restore locations of four 2 × 9 sample groups. Another multiplexer vertically transposes positions in the rightmost column if the result for the 3/4 interpolation is released.
The design of reconfigurable filters is well suited to FPGA devices since multiplexers are embedded in the same logic cell as the following adder/subtractor. The luma and chroma filter cores embed 12 and 10 adders/subtractors, respectively. The previous architecture required 22 adders/subtractors for the filter supporting both luma and chroma and 17 for luma. Therefore, a significant reduction of resources is achieved when the filter is limited to the luma processing.
Figures 10 and 11 depict architectures of dedicated filters used at the vertical stage for the half-pel and quarter-pel interpolation, respectively. The half-pel filter embeds 10 adders/subtractors, whereas the quarter-pel filter consumes one more. Dedicated filters embed the rounding adder in the tree. The output multiplexer accomplishes the clipping (CLIP) of the final result to avoid overflow and underflow.
4 Scheduling
The fractional-pel ME needs 16 clock cycles to evaluate 64 MVs around one integer-pel MV. Thus, with about 100 cycles available, the search can be performed around six MVs for a given 8 × 8 block. However, some cycles are required to interpolate MVs identified for the merge mode. In particular, four cycles are utilized to obtain the 8 × 8 interpolation for one MV. If the merge MV falls in the range of the regular fractional-pel ME, no additional cycles are required. It is assumed that 48 cycles are allocated to the regular fractional-pel ME around three integer-pel MVs (8 × 8, 16 × 16, and 32 × 32 PUs). The remaining 52 cycles are utilized to process 13 merge mode candidates determined for different CU divisions. The regular fractional-pel ME for a given PU is skipped if its range matches that for a larger PU. Saved cycles are utilized to evaluate more merge mode MVs. Since the availability of most merge MVs depends on the mode decision for preceding CUs/PUs, merge mode candidates are evaluated at the same stage as the reconstruction loop and the CU/PU mode decision.
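The cycle allocation described above can be checked with a few lines of arithmetic. The sketch below only restates the figures given in the text (100 cycles per 8 × 8 block, 16 cycles per regular search, four MVs per cycle, four cycles per merge candidate); the function name `fractional_budget` and its `skipped` parameter are hypothetical and introduced for illustration.

```python
CYCLES_PER_BLOCK = 100   # approximate budget per 8x8 block (400 MHz, 2160p@30fps)
MVS_PER_CYCLE = 4        # fractional-pel MVs evaluated in each clock cycle
CYCLES_PER_SEARCH = 16   # cycles of one regular fractional-pel search
CYCLES_PER_MERGE = 4     # cycles to interpolate one 8x8 merge candidate


def fractional_budget(regular_searches, skipped=0):
    """Return (cycles spent on regular searches, merge candidates that fit).

    `skipped` counts regular searches omitted because their range matches
    the range of a larger PU; their cycles are given to merge candidates.
    """
    performed = regular_searches - skipped
    regular_cycles = performed * CYCLES_PER_SEARCH
    merge_slots = (CYCLES_PER_BLOCK - regular_cycles) // CYCLES_PER_MERGE
    return regular_cycles, merge_slots


mvs_per_search = CYCLES_PER_SEARCH * MVS_PER_CYCLE        # 64 MVs
max_searches = CYCLES_PER_BLOCK // CYCLES_PER_SEARCH      # 6 searches
default_split = fractional_budget(3)                      # (48, 13)
```

With one of the three regular searches skipped, `fractional_budget(3, skipped=1)` yields 32 cycles for the regular search and room for 17 merge candidates, which illustrates how the saved cycles are reused.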
Interpolation filters specified in H.265/HEVC refer to up to eight luma samples located in a row/column at neighboring pixel positions. Therefore, the 2D interpolation of one sample requires access to the 8 × 8 reference block. If four blocks are accessed, the output can be extended to the 9 × 9 block. Provided that 8 × 8 blocks appear at the interpolator input, four cycles are taken to load the input registers. The location of the blocks can be identified by specific MVs, as shown in Fig. 5. For convenience, the following description will refer to motion vector differences (MVDs) relative to the integer-pel position around which the fractional-pel search is executed. If two horizontally adjacent 8 × 8 blocks are obtained for MVDs equal to (−4, 0) and (4, 0), the interpolator can compute MVDs equal to (1/4, 0), (1/2, 0), (3/4, 0), (−1/4, 0), (−1/2, 0), and (−3/4, 0). The same rule applies to the vertical processing. Four reference blocks required for the 2D interpolation have the following MVDs: (−4, −4), (4, −4), (−4, 4), and (4, 4).
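The benefit of extending the interpolated output to 9 × 9 samples can be illustrated as follows: a 9 × 9 block contains four overlapping 8 × 8 windows, so four candidate predictions can be compared with the original block at once. The sketch below computes the four SADs in software. The function names and the indexing of windows by a (row, column) offset are assumptions made for illustration; the mapping of each window to a particular fractional-pel MV depends on the interpolation phase and is not modeled here.

```python
def sad_8x8(pred, orig, row_off, col_off):
    """SAD between an 8x8 original block and one 8x8 window of `pred`."""
    return sum(
        abs(pred[r + row_off][c + col_off] - orig[r][c])
        for r in range(8)
        for c in range(8)
    )


def four_window_sads(pred_9x9, orig_8x8):
    """Return the SADs of the four 8x8 windows inside a 9x9 block.

    Keys are (row_offset, col_offset) of the window's top-left sample.
    """
    return {
        (dy, dx): sad_8x8(pred_9x9, orig_8x8, dy, dx)
        for dy in (0, 1)
        for dx in (0, 1)
    }
```

In hardware the four sums are accumulated in parallel; the loop form above is only meant to show which samples contribute to each cost.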
To perform the luma interpolation, four 8 × 8 reference blocks are taken from the input and written to the first ring buffer. The buffer consists of four register groups (FRB[0]–FRB[3]), each of which keeps four 16-sample rows. In each clock cycle, the rows are vertically rotated between register groups. Row indices are indicated in Fig. 14. Each reference block is simultaneously written to two register groups. Since each row is composed of samples taken from two reference blocks, two groups are half-filled with new samples in one cycle. Due to the rotation, the first/third block is written to FRB[0] and FRB[1], whereas the second/fourth block is written to FRB[1] and FRB[2]. If the 3/4 interpolation is performed, samples written to FRB[3] registers are horizontally transposed. The FRB[3] registers feed horizontal filters. The filtering result is obtained with the delay of two clock cycles. Horizontally interpolated samples corresponding to four rows are written to horizontal registers (HR) in each clock cycle. Every fourth clock cycle, 12 rows kept in HR and four rows available at filter outputs are forwarded to the second ring buffer (SRB). The buffer feeds 63 vertical filters and 18 bypass paths. The SRB is composed of nine columns. Eight of them are horizontally rotated by two positions in each clock cycle. Each two of six columns are assigned to a group of 18 dedicated filters supporting one particular type of the interpolation (either 1/2, 1/4, or 3/4). Two columns are assigned to 18 bypass paths. Similar to the horizontal stage, the filtering result is obtained with the delay of two clock cycles. The rotation in the second ring buffer allows the processing of eight columns with each filter type. On the other hand, multiplexers are required at outputs to restore appropriate locations of columns in the 9 × 9 block. One of nine columns is not rotated, and it feeds nine reconfigurable filters. The filters are reconfigured in each clock cycle to support one particular type of the interpolation. For the 3/4 interpolation, samples kept in SRB[4] are vertically transposed.
5 Search strategy
The proposed ME architecture can check an 8 × 8 prediction for one integer-pel MV and four fractional-pel MVs in each clock cycle. In practice, the number of evaluated MVs is limited and depends on the clock frequency and the video resolution. If the motion estimation operates at the frequency of 400 MHz and processes 2160p@30fps videos, the number of integer-pel MVs per 8 × 8 block in the original image is about 100. This number should be allocated to all evaluated PUs corresponding to the block. Taking into account wider search ranges required for the 2160p@30fps resolution, numbers of MVs allocated to particular PUs can be too small to achieve a high compression efficiency. In the case of the fractional-pel ME, the available number of clock cycles can also limit the efficiency. Other limitations stem from the encoder dataflow, which introduces the delay between the ME and the final mode decision (based on the rate-distortion optimization). The delay causes some MV predictions to be unknown at the ME. Thus, costs of evaluated MVs cannot be estimated reliably. Moreover, the determination of predictions for merge modes must follow the mode decision for preceding blocks.

The search range is set to (−64, 63) × (−64, 63).

Test zone search is performed only for 8 × 8 PUs. It is interrupted when the number of checked MVs reaches the limit specified for a given resolution. The limit corresponds to the number of clock cycles assigned in the hardware architecture (e.g., 92 for 2160p@30fps). If some 8 × 8 PUs within the 32 × 32 unit do not utilize all allowable cycles, the remaining cycles are added to continue the interrupted search. This reallocation makes losses in the compression efficiency negligible.

The integer-pel motion estimation for 16 × 16 PUs is performed by utilizing results from the 8 × 8 search. Four MV candidates are taken from MVs found for 8 × 8 blocks included in a given PU.

The integer-pel motion estimation for 32 × 32 PUs is performed by utilizing results from the 16 × 16 search. MV candidates are determined according to the rule applied in the 16 × 16 search.

Rectangular PUs are evaluated within the range of the fractional-pel estimation corresponding to 2N × 2N PUs. Although this simplification significantly reduces the ME complexity, it has a small impact on the average compression efficiency (0.3 %).

MV costs are estimated based on results of the 8 × 8 search if a neighbor belongs to the same CTU. In this case, MV differences are computed with the assumption that neighbors are 8 × 8 blocks. In the remaining cases, actual MV predictors are taken from adjacent CTUs.

Only merge mode candidates are evaluated for 64 × 64 PUs and their rectangular partitions. The exclusion of the 64 × 64 search decreases the compression efficiency by 0.8 % (−0.02 dB), on average.

At least three merge mode candidates are evaluated for each PU if the video resolution is 2160p@30fps. More candidates can be processed if any of the three following conditions is true: First, merge MVs fall in the range of the fractional-pel search for the same or a larger PU. Second, the fractional-pel search for a given PU matches that for a larger PU. Third, the resolution is lower than 2160p@30fps. The conditions stem from the scheduling and allow a better utilization of available clock cycles. In particular, more merge modes are evaluated to avoid the redundant processing and/or no-operation cycles.

The final MV is not selected with the sum of absolute transformed differences (SATD) used in the HM software. Instead, candidate MVs are selected based on SAD at the fractional-pel stage. Four candidates are selected for square PUs. The remaining (rectangular) PUs have one candidate MV. It is assumed that corresponding predictions are used in the mode selection based on the rate-distortion analysis. This approach decreases the compression efficiency by 0.3 % compared to the use of SATD.
The reuse of results of the 8 × 8 search saves a significant amount of computation. Particularly, eight integer-pel MVs are evaluated for larger PUs including a given 8 × 8 block. Moreover, MVs for the larger PUs are reused for smaller ones.
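The candidate reuse for larger PUs can be sketched as follows. The text states only that the four MV candidates of a 16 × 16 PU are taken from the MVs found for its 8 × 8 blocks and that the same rule is applied to 32 × 32 PUs. How the candidate costs are obtained is not detailed here, so the sketch assumes a hypothetical callback `sad8(block, mv)` that returns the 8 × 8 SAD of a block for a given MV, and it assumes that the cost of a larger PU is the sum of the SADs of the 8 × 8 blocks it covers.

```python
def select_parent_mv(child_mvs, covered_blocks, sad8):
    """Pick the MV of a larger PU from the MVs selected for its children.

    child_mvs      -- MVs selected for the four child PUs
    covered_blocks -- identifiers of all 8x8 blocks inside the larger PU
    sad8           -- hypothetical callback: sad8(block, mv) -> 8x8 SAD
    """
    # Duplicate MVs are evaluated once; the original order is kept.
    candidates = list(dict.fromkeys(child_mvs))

    def cost(mv):
        # Assumption: the PU cost is the sum of its 8x8 partial SADs.
        return sum(sad8(block, mv) for block in covered_blocks)

    return min(candidates, key=cost)
```

Under these assumptions, a 16 × 16 PU is resolved with at most four cost evaluations, and a 32 × 32 PU is resolved by calling the same function with the four 16 × 16 results and its sixteen 8 × 8 blocks, which gives at most eight evaluations for the larger PUs containing a given 8 × 8 block.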
Evaluation results for BD-PSNR and BD-rate
Class  Resolution  BD-PSNR (dB)  BD-rate (%)

A  2560 × 1600  −0.033  1.45 
B  1920 × 1080  −0.031  1.56 
C  832 × 480  −0.103  2.92 
D  416 × 240  −0.148  4.15 
E  1280 × 720  −0.069  2.98 
4 K  3840 × 2160  −0.026  1.64 
Average  −0.068  2.45 
6 Implementation results
Resource consumption for FPGA and ASIC technologies
Module  Arria II GX (ALUT)  TSMC 90 nm (gate)  Memory (kB)  Power (mW) 

MV generator  3598 (7.71 %)  27,836 (6.59 %)  –  3.1 
Luma predictor  3412 (7.32 %)  28,982 (6.86 %)  48  164.7 
Luma interpolator  24,202 (51.89 %)  240,072 (56.80 %)  –  30.2 
Cost estimator  12,143 (26.03 %)  97,124 (22.98 %)  4  27.1 
Chroma predictor  546 (1.17 %)  3264 (0.77 %)  24  65.4 
Chroma interpolator  2742 (5.88 %)  25,386 (6.00 %)  –  2.5 
Total  46,643  422,664  76  293.0 
For the ASIC technology, the design can operate at the frequency of 400 MHz. This performance enables the encoder to allocate about 100 clock cycles per each 8 × 8 block if the resolution is 2160p@30fps. The estimated power consumption of the ASIC implementation is equal to 293 mW. The high power consumption is caused by memories keeping reference and original pixels. The FPGA implementation can operate at 200 MHz. As a consequence, the throughput is decreased by half.
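The figure of about 100 clock cycles per 8 × 8 block follows directly from the clock frequency and the number of 8 × 8 blocks that must be processed per second. The short sketch below reproduces this arithmetic; the function name `cycles_per_8x8_block` is hypothetical.

```python
def cycles_per_8x8_block(clock_hz, width, height, fps):
    """Average number of clock cycles available for each 8x8 luma block."""
    blocks_per_second = (width // 8) * (height // 8) * fps
    return clock_hz / blocks_per_second


asic_budget = cycles_per_8x8_block(400_000_000, 3840, 2160, 30)   # about 102.9
fpga_budget = cycles_per_8x8_block(200_000_000, 3840, 2160, 30)   # about 51.4
```

At 200 MHz the per-block budget for 2160p@30fps is halved, whereas 1080p@60fps at 200 MHz gives the same budget as 2160p@30fps at 400 MHz.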
The luma and chroma paths incorporate 64 dual-port and 16 two-port memory modules, respectively. The modules store reference pixels. Each module in the luma path is 0.75 kB in size. In the case of the chroma path, the size is 1.5 kB. The joint capacity of 72 kB allows the search range of (−64, 63) × (−64, 63) for both luma and chroma. Wider ranges are possible at the cost of the increased memory size. The original luma samples are stored in a separate dual-port memory with a capacity of 4 kB. This capacity is sufficient to keep samples for one CTU. Since the ME system is pipelined based on 32 × 32 units, the assignment of memory subspaces is swapped between four processing stages (the writing, the integer-pel ME, the fractional-pel ME, and the merge mode evaluation).
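The memory figures quoted above can be cross-checked with simple arithmetic. The constants below restate the numbers given in the text; reading the 4 kB entry of the resource table as the original-sample memory is an assumption.

```python
LUMA_MODULES, LUMA_MODULE_KB = 64, 0.75      # dual-port reference memories
CHROMA_MODULES, CHROMA_MODULE_KB = 16, 1.5   # two-port reference memories
ORIGINAL_LUMA_KB = 4                         # one CTU of original luma samples

luma_kb = LUMA_MODULES * LUMA_MODULE_KB          # 48 kB
chroma_kb = CHROMA_MODULES * CHROMA_MODULE_KB    # 24 kB
reference_kb = luma_kb + chroma_kb               # 72 kB of reference pixels
total_kb = reference_kb + ORIGINAL_LUMA_KB       # 76 kB in total
```

The 48 kB and 24 kB subtotals agree with the luma and chroma predictor rows of the resource table, and the 76 kB total agrees with its last row.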
Byun et al. [10] presented the H.265/HEVC integer-pel full search architecture supporting all prediction unit sizes with the range of (−32, 31) × (−32, 31). The design consumes 3.56 M gates and 23 kB memories. The logic cost of the motion estimation system described in this paper is much smaller (422.7 k gates), although its memories are larger (76 kB). Moreover, the search range is wider [(−64, 63) × (−64, 63)]. The low-power integer-pel design was proposed by Sanchez et al. [11]. Its resource consumption is relatively low (50 k gates and 82 kbit memories). However, it supports only 16 × 16 blocks and a narrow search range, which does not exploit the compression potential of H.265/HEVC.
Comparison with other FPGA architectures
Design  Afonso [12]  Pastuszak [4]  This study 

Technology  Stratix III  Arria II GX  Arria II GX 
Clock (MHz)  403  200  200 
Resources (ALUT)  4077 + 16547  28,757  26,944 
Parallelism (sample/clock)  27 (1D)  64 (1D)  260 (2D) 
1000 × parallelism/resources  1.31  2.26  9.64 
Throughput  2160p@30fps  1080p@60fps  1080p@60fps 
Dynamic power (mW)  379  182  171 
Features  Luma  Luma and chroma  Luma and chroma 
Comparison with other ASIC architectures
Design  Diniz [13]  Guo [14]  He [15]  Pastuszak [4]  This study 

Technology (nm)  TSMC 150  SMIC 90  65  TSMC 90  TSMC 90 
Clock (MHz)  312  250  188  400  400 
Resources (gate)  30,209  32,496  1,183,000  277,074  265,458 
Parallelism (sample/clock)  12 (1D)  8 (1D)  16 × 12 (2D)  64 (1D)  260 (2D) 
1000 × parallelism/resources  0.3972  0.2462  0.1623  0.2310  0.9794 
Throughput  2160p@30fps  2160p@60fps  4320p@30fps  2160p@30fps  2160p@30fps 
Power (mW)  –  –  198.6  11.4  30.2 
Features  Luma and chroma  Luma  Luma  Luma and chroma  Luma and chroma 
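The efficiency metric used in the comparison (1000 × parallelism/resources) can be recomputed from the parallelism and gate counts listed in the ASIC table. The sketch below does so; the dictionary layout and the function name `efficiency` are illustrative assumptions, while the numbers are copied from the table.

```python
# (samples per clock, gate count) as listed in the ASIC comparison table.
ASIC_DESIGNS = {
    "Diniz [13]": (12, 30_209),
    "Guo [14]": (8, 32_496),
    "He [15]": (16 * 12, 1_183_000),
    "Pastuszak [4]": (64, 277_074),
    "This study": (260, 265_458),
}


def efficiency(parallelism, gates):
    """Return 1000 x parallelism / resources, as used in the table."""
    return 1000.0 * parallelism / gates


ratios = {name: efficiency(p, g) for name, (p, g) in ASIC_DESIGNS.items()}
best_design = max(ratios, key=ratios.get)
```

The recomputed ratios agree with the tabulated values to four decimal places, and the proposed interpolator has the highest ratio among the listed ASIC designs.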
Most referenced designs support only the luma interpolation [12, 14, 15]. The FPGA implementation proposed by Afonso et al. [12] achieves a high frequency due to deep pipelining and the better device. The proposed architecture can also be modified to operate at higher frequencies by the insertion of registers. This modification would not increase the logic resources since at least one flip-flop is embedded in each ALUT. However, the power consumption would be increased. Moreover, the gain in the frequency would not compensate for the increased latency of the deeply pipelined processing path composed of the luma predictor, the interpolator, and the cost estimator. The latency of the path affects timing constraints corresponding to the final mode decision and the availability of corresponding MVs. Thus, it would be difficult to determine merge mode candidates and MV costs for the highest throughput.
Although the hardware cost of the interpolator is decreased compared to the previous one [4], the proposed ME system is more complex. Particularly, the compensator in the previous architecture consumes 42.5 k gates, whereas the inter luma/chroma predictor and the cost estimator in the new one require 129.4 k gates. There are two main reasons for the increase. First, separate processing paths for the integer-pel and the fractional-pel search are used. Second, four costs are simultaneously evaluated in the fractional-pel path. Since most logic resources are contributed by interpolators (265.5 k gates), the increased complexity in the remaining modules is relatively small in terms of the whole ME system. The throughput is increased by the factor of 1.85 (100/54) and 3.1 (100/32) for the integer-pel and fractional-pel processing, respectively.
7 Conclusion
The ME architecture is developed for the H.265/HEVC encoder. The design embeds two parallel processing paths for the integer-pel and the fractional-pel motion estimation. The paths share the same dual-port memories. Internal buffers and the scheduling allow the writing of reference samples through the port assigned to the fractional-pel path. The architecture supports TZS for 8 × 8 prediction blocks. The motion estimation for larger blocks is performed by utilizing results of the 8 × 8 search. The search for rectangular PUs is performed only at the fractional-pel level and reuses partial costs computed for 2N × 2N PUs. The design achieves the best ratio of the throughput to hardware resources among the compared designs. The design can check about 100 integer-pel MVs for each 8 × 8 input block when encoding 2160p@30fps video at 400 MHz. In future work, the proposed ME system will be integrated with the intra encoder [17] to support inter modes.
Notes
Acknowledgments
This research was supported in part by PLGrid Infrastructure.
References
 1. ITU-T Recommendation H.265 and ISO/IEC 23008-2 MPEG-H Part 2, High efficiency video coding (HEVC) (2013)
 2. HEVC software repository: HM16.0 reference model. https://hevc.hhi.fraunhofer.de/trac/hevc/browser/tags/HM16.0 (2015). Accessed 29 June 2015
 3. ITU-T Rec. H.264 and ISO/IEC 14496-10 MPEG-4 Part 10, Advanced video coding (AVC) (2005)
 4. Pastuszak, G., Trochimiuk, M.: Architecture design of the high-throughput compensator and interpolator for the H.265/HEVC encoder. J. Real Time Image Process. Online first articles (2014)
 5. Pastuszak, G., Jakubowski, M.: Adaptive computationally-scalable motion estimation for the hardware H.264/AVC encoder. IEEE Trans. Circuits Syst. Video Technol. 23(5), 802–812 (2013)
 6. Chen, T.-C., Chien, S.-Y., Huang, Y.-W., Tsai, C.-H., Chen, C.-Y., Chen, T.-W., Chen, L.-G.: Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder. IEEE Trans. Circuits Syst. Video Technol. 16(6), 673–688 (2006)
 7. Liu, Z., Song, Y., Shao, M., Li, S., Li, L., Ishiwata, S., Nakagawa, M., Goto, S., Ikenaga, T.: HDTV1080p H.264/AVC encoder chip design and performance analysis. IEEE J. Solid-State Circuits 44(2), 594–608 (2009)
 8. Yang, C., Goto, S., Ikenaga, T.: High performance VLSI architecture of fractional motion estimation in H.264 for HDTV. In: IEEE International Symposium on Circuits and Systems (ISCAS 2006), pp. 21–24 (2006)
 9. Oktem, S., Hamzaoglu, I.: An efficient hardware architecture for quarter-pixel accurate H.264 motion estimation. In: 10th Euromicro Conference on Digital System Design, pp. 1142–1143 (2007)
 10. Byun, J., Jung, Y., Kim, J.: Design of integer motion estimator of HEVC for asymmetric motion-partitioning mode and 4K-UHD. Electron. Lett. 49(18), 1142–1143 (2013)
 11. Sanchez, G., Porto, M., Agostini, L.: A hardware friendly motion estimation algorithm for the emergent HEVC standard and its low power hardware design. In: IEEE International Conference on Image Processing, pp. 1991–1994 (2013)
 12. Afonso, V., Maich, H., Agostini, L., Franco, D.: Low cost and high throughput FME interpolation for the HEVC emerging video coding standard. In: IEEE Fourth Latin American Symposium on Circuits and Systems (LASCAS) (2013)
 13. Diniz, C.M., Shafique, M., Bampi, S., Henkel, J.: High-throughput interpolation hardware architecture with coarse-grained reconfigurable datapaths for HEVC. In: IEEE International Conference on Image Processing, pp. 2091–2095 (2013)
 14. Guo, Z., Zhou, D., Goto, S.: An optimized MC interpolation architecture for HEVC. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1117–1120 (2012)
 15. He, G., Zhou, D., Chen, Z., Zhang, T., Goto, S.: A 995Mpixels/s 0.2nJ/pixel fractional motion estimation architecture in HEVC for Ultra-HD. In: IEEE Asian Solid-State Circuits Conference, pp. 301–304 (2013)
 16. Jakubowski, M., Pastuszak, G.: An adaptive computation-aware algorithm for multi-frame variable block-size motion estimation in H.264/AVC. In: International Conference on Signal Processing and Multimedia Applications (SIGMAP '09), pp. 122–125 (2009)
 17. Pastuszak, G., Abramowski, A.: Algorithm and architecture design of the H.265/HEVC intra encoder. IEEE Trans. Circuits Syst. Video Technol., pp. 1 (2015). doi: 10.1109/TCSVT.2015.2428571
 18. Bossen, F.: Common test conditions and software configurations, JCTVC-L1100. JCT-VC, Geneva (2013)
 19. Ultra video group, test sequences: (online). http://ultravideo.cs.tut.fi/#testsequences (2015). Accessed 29 June 2015
 20. Xiph.org: test media, http://media.xiph.org/video/derf/ (2011). Accessed 29 June 2015
 21. Bjontegaard, G.: Calculation of average PSNR differences between RD-curves. In: ITU-T VCEG-M33, VCEG 13th Meeting
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.