1 Introduction

The prime goal of the latest HEVC [1] standard is to provide the same perceptual quality as its predecessor H.264 [2] at approximately 50 % bit-rate reduction, for efficient transmission and storage of high-volume video data [3]. HEVC introduces a number of novel tools, including the extension of the coding unit (CU) size from \(16\times 16\) up to \(64\times 64\) pixels, variable-size transform units (TUs), and symmetric/asymmetric block partitioning, which together achieve the improved performance gain, but at the cost of more than four times the algorithmic complexity of a particular implementation [4, 5]. As a consequence, electronic devices with limited processing capacity cannot fully exploit the HEVC encoding and decoding features. This motivated us to reduce the computational time of the HEVC encoder by appropriate selection of inter-prediction modes. To this end, only the areas of interest (AOI) in a video are taken into account; these are determined from phase-correlation-based motion features and a saliency feature applied to the difference between successive image blocks. Unlike the HEVC test model (HM12.1) [6], the proposed method motion-estimates and motion-compensates the selected CUs with and without AOI using modes in the higher and lower depth levels, respectively. Thus, exhaustive exploration of all modes in each coding depth level is avoided (levels \(64\times 64\), \(32\times 32\), \(16\times 16\), and \(8\times 8\) are denoted as depth levels 0, 1, 2, and 3, respectively), which results in computational time reduction. To select a particular motion prediction mode, HM exhaustively checks the Lagrangian cost function (LCF) [7] over all modes in each coding depth level. The LCF \(\varTheta ({{k}_{n}})\) for mode selection (\(k_n\) is the nth mode) is defined by:

$$\begin{aligned} \varTheta ({{k}_{n}})=D({{k}_{n}})+\lambda \times R({{k}_{n}}) \end{aligned}$$
(1)

where \(\lambda \) is the Lagrangian multiplier, \( D({{k}_{n}})\) is the distortion, and \( R({{k}_{n}})\) is the resulting bit count, all determined per mode for each CU. To select the best partitioning mode in a coding depth level, HM checks at least 8 and at most 24 inter-prediction modes and picks the one with the lowest LCF. This process is time consuming due to the exploration of all modes in one or more coding depth levels. Moreover, an LCF-only mode decision cannot always provide the best rate-distortion (RD) performance at different operational coding points due to the advanced parameter settings in HEVC. Therefore, instead of relying merely on the LCF, the proposed technique executes a number of consecutive pre-processing stages that make the mode decision process more accurate and less time consuming (shown in Fig. 1).
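For illustration, the following minimal Python sketch shows how a mode would be chosen by Eq. (1); the candidate list, distortion values, rate values, and \(\lambda\) are hypothetical stand-ins rather than HM's actual interface:

```python
# Illustrative sketch of LCF-based mode selection (Eq. 1); the mode list,
# distortion, and rate values below are hypothetical, not HM's API.
LAMBDA = 0.85  # hypothetical Lagrangian multiplier for some QP

def lagrangian_cost(distortion, rate, lam=LAMBDA):
    """Theta(k_n) = D(k_n) + lambda * R(k_n)."""
    return distortion + lam * rate

def select_best_mode(candidates):
    """Pick the mode with the lowest Lagrangian cost.

    `candidates` is a list of (mode_name, distortion, rate) tuples,
    e.g. produced by trial-encoding a CU with each inter-mode.
    """
    return min(candidates, key=lambda c: lagrangian_cost(c[1], c[2]))

# Example: three hypothetical candidate modes for one CU.
modes = [("2Nx2N", 1200.0, 96.0), ("2NxN", 1100.0, 210.0), ("NxN", 1050.0, 320.0)]
print(select_best_mode(modes)[0])  # mode with minimum D + lambda*R
```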

In the literature, several researchers have contributed to reducing this time complexity [8, 9]. To terminate the exploration of modes at the lower level, Hou et al. [10] recommend an RD-cost-based threshold that restricts mode exploration to the higher level, resulting in 30 % time savings with 0.5 % quality loss. Vanne et al. [11] propose an efficient inter-mode decision scheme that identifies the candidate PU modes of symmetric and asymmetric motion partitions. Their tests reveal a reduction of HEVC encoder complexity by 31 %–51 % at the cost of a 0.2 %–1.3 % bit-rate increment. Pan et al. [12] introduce an early MERGE mode decision algorithm to reduce the computational complexity of the HEVC encoder. They achieve 35 % time savings with a bit-rate increment of 0.32 % and a quality loss of 0.11 dB peak signal-to-noise ratio (PSNR).

The energy concentration ratio (ECR) from phase correlation was extracted and employed for mode selection in HEVC by Podder et al. [13] (and used in [14] for the H.264 standard). They save computational time at the expense of 0.24 dB PSNR on average compared to the full-search approach of HM. Since the ECR captures only the residual error between the current block and the motion-compensated reference block, it cannot by itself provide the expected compression results. For a more accurate decision on ME and MC modes, in this paper we extract three motion features from phase correlation. Besides motion, visually attentive areas are also extracted by exploiting a saliency feature. Because the difference between two successive images conveys the actual displacement and salient information better than the current image alone, we apply the saliency feature to the image difference to capture the motion- and saliency-dominated areas perceived by the human visual system. Moreover, since static areas exhibit no difference in color or contrast, the motion and salient information produced from the image difference is the preferred means of determining visually observant areas. The features are combined through a weighted cost function that produces an AOI-based binary pattern for the current block, from which a subset of inter-modes is selected. From the selected subset, the final mode is determined by the lowest LCF. The proposed method not only reduces the computational time through appropriate selection of AOI-based ME and MC modes but also demonstrates similar subjective and objective image quality.

The remainder of this paper is structured as follows: Sect. 2 describes the key steps of the proposed method; experimental results and discussions are detailed in Sect. 3; and Sect. 4 concludes the paper.

2 Proposed Mode Selection Technique

Phase correlation provides relative displacement information between the current block and the reference block via the Fast Fourier Transform (FFT). We first extract three phase-correlation-based motion features, namely (i) the ECR (\(\alpha \)), (ii) the predicted motion vector (dx, dy), and (iii) the phase-correlation peak (\(\beta \)), and combine them with a saliency feature (\(\gamma \)) computed from the difference between successive image blocks. We develop a unified AOI-based cost function using the normalized motion features and the weighted average of the saliency values to determine a unified AOI feature for the current block. A binary AOI pattern is then configured from the unified AOI features by applying a threshold, and is used to select a subset of inter-prediction modes. The final mode from the selected subset is determined by the lowest LCF. The whole process is shown as a block diagram in Fig. 1.

Fig. 1. Block diagram of the proposed mode selection process.

2.1 Motion Features Extraction

The phase correlation is computed by applying the FFT to the current and reference blocks, taking the inverse FFT (IFFT) of the resulting phase difference, and finally applying the FFTSHIFT function as follows:

$$\begin{aligned} \varOmega =fftshift\left| ifft\left( {{e}^{j(\angle {{F}_{r}}-\angle {{F}_{c}})}} \right) \right| \ \end{aligned}$$
(2)

where \(F_c\) and \(F_r\) are the Fast Fourier transforms of the current block C and the reference block R, respectively, and \(\angle \) denotes the phase of the corresponding transformed block. Note that \(\varOmega \) is a two-dimensional matrix. We evaluate the phase-correlation peak (\(\beta \)) at the position \((dx + blocksize/2 + 1, dy + blocksize/2 +~1)\) as follows:

$$\begin{aligned} \beta =\varOmega \left( dx+blocksize/2+1,dy+blocksize/2+1 \right) \ \end{aligned}$$
(3)

where blocksize is 8 if an \(8 \times 8\)-pixel block is used for phase correlation. We then compute the predicted motion vector (dx, dy) by subtracting \((blocksize/2+1)\) from the (x, y) position at which \(\varOmega \) attains its maximum value, consistent with Eq. (3). We use the phase of the current block and the magnitude of the motion-compensated block in the reference frame, and finally calculate the matched reference block \((\varPsi )\) for the current block by:

$$\begin{aligned} \varPsi =\left| ifft\left( \left| {{F}_{r}} \right| {{e}^{j(\angle {{F}_{c}})}} \right) \right| \ \end{aligned}$$
(4)

Now the displacement error (§) is computed as § = C − \(\varPsi \). We then apply the discrete cosine transform (DCT) to the error § and calculate the ECR (i.e., \(\alpha \)) as the ratio of the low-frequency energy to the total energy of the error block:

$$\begin{aligned} \alpha =({{D}_{error\_low}}/{{D}_{error\_total}}) \end{aligned}$$
(5)

where \({{D}_{error\_low}}\) and \({{D}_{error\_total}}\) represent the energy of the top-left triangle and of the whole area of a particular block, respectively.
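The following Python sketch summarizes the feature extraction of Eqs. (2)–(5) for a single block pair, assuming \(8\times 8\) greyscale blocks as floating-point arrays; the exact size of the low-frequency triangle is our assumption, as the text does not specify it:

```python
import numpy as np
from scipy.fft import fft2, ifft2, fftshift, dctn

# Hedged sketch of the motion-feature extraction of Sect. 2.1 (Eqs. 2-5),
# assuming square greyscale blocks (e.g. 8x8) as float arrays.
def motion_features(cur, ref):
    bs = cur.shape[0]                      # blocksize, e.g. 8
    Fc, Fr = fft2(cur), fft2(ref)

    # Eq. (2): phase-correlation surface Omega.
    omega = fftshift(np.abs(ifft2(np.exp(1j * (np.angle(Fr) - np.angle(Fc))))))

    # Eq. (3): peak beta and predicted motion vector (dx, dy); after
    # fftshift the zero-displacement centre is at index bs/2 in 0-based
    # indexing (bs/2 + 1 in the paper's 1-based notation).
    x, y = np.unravel_index(np.argmax(omega), omega.shape)
    beta = omega[x, y]
    dx, dy = x - bs // 2, y - bs // 2

    # Eq. (4): matched reference block from reference magnitude and
    # current phase.
    psi = np.abs(ifft2(np.abs(Fr) * np.exp(1j * np.angle(Fc))))

    # Eq. (5): ECR = top-left-triangle (low-frequency) energy of the DCT
    # of the displacement error over its total energy. The triangle size
    # (i + j < bs/2) is an assumption for illustration.
    err = cur - psi
    energy = dctn(err, norm='ortho') ** 2
    ii, jj = np.indices((bs, bs))
    low = (ii + jj) < bs // 2
    alpha = energy[low].sum() / (energy.sum() + 1e-12)
    return alpha, beta, (dx, dy)
```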

2.2 Saliency Feature Extraction

The saliency map is employed as a tool (based on [15, 16]) and incorporated into our coding architecture. The exploited graph-based visual saliency (GBVS) technique yields a variance map of AOI-based human visual features, with values ranging from 0 to 1, for each \(8\times 8\)-pixel block. We extract the saliency feature, \(\gamma \), by averaging all values of a given \(8\times 8\) block. Our focus is to encode the AOI-based salient portions with more bits to achieve better compression and improved coding performance.
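As a rough sketch, assuming a GBVS saliency map in [0, 1] has already been produced for the frame difference by an external tool (the `gbvs` call below is a hypothetical stand-in), \(\gamma \) can be obtained per \(8\times 8\) block as:

```python
import numpy as np

# Minimal sketch of the per-block saliency feature gamma; `gbvs` is a
# stand-in for the external GBVS tool [15, 16], not this paper's code.
def block_saliency(saliency_map, block=8):
    """Average the saliency map over each 8x8 block, giving one
    gamma value per block."""
    h, w = saliency_map.shape
    gh, gw = h // block, w // block
    return saliency_map[:gh * block, :gw * block] \
        .reshape(gh, block, gw, block).mean(axis=(1, 3))

# Usage (hypothetical):
#   diff = frame_t.astype(float) - frame_t_minus_1
#   gamma = block_saliency(gbvs(diff))
```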

Fig. 2. Illustration of the motion features and saliency feature generated at different CUs of the 12th frame of the Tennis video; (b-d) are the phase-shifted plots for no motion (0.4), simple motion (0.7), and complex motion (0.8); (e-f) correspond to the respective values generated by the ECR and the saliency map for CUs at positions (3, 1), (3, 10), and (5, 7). For clear visualization we use a \(32\times 32\) block size.

2.3 Cost Function Based AOI Categorization

After evaluating the motion features extracted from phase correlation (i.e., \(\alpha \), \(\beta \), and (dx, dy)) and the variance map extracted from saliency (i.e., \(\gamma \)), we finally determine a cost function \(\eta (i,j)\) from these four features for the (i, j)th block by:

$$\begin{aligned} \eta (i,j)=\omega _1\alpha (i,j)+\omega _2(1-\beta )+\omega _3((\frac{|dx|}{\delta })+(\frac{|dy|}{\delta }))+\omega _4(\gamma ) \end{aligned}$$
(6)

where \(\delta \) is the maximum block size, and \(\omega _1\) to \(\omega _4\) are the weights with \(\sum _{i=1}^{4} \omega _i=1\). We derive the weights for each feature from the relative texture deviation of the current block against that of the whole frame, considering only the values 0.50, 0.25, 0.125, and 0.125. First, the four features are sorted by value; if the standard deviation (STD) of the block is smaller than that of the current frame, the highest weight (i.e., 0.50) is applied to the first feature in the sorted list and the lowest weight (i.e., 0.125) to the fourth; otherwise, the inverse weight order is applied. If the resulting value of the cost function (i.e., \(\eta \)) is greater than a predefined threshold, the block is tagged with '1', otherwise with '0', where binary '1' and '0' correspond to AOI and non-AOI, respectively. The rationale of the proposed weight selection strategy is that if the current block has a higher texture variation than the current frame, it should be encoded with more bits than the rest of the blocks to achieve similar or improved RD performance. The relationship of the quantitative motion and saliency features with the human visual features is depicted in Fig. 2. Figure 2 (b-d) shows the categories of motion peak (\(\beta \)) and their corresponding values provided by the ECR (Fig. 2 (e)) and the saliency feature (Fig. 2 (f)) for the Tennis video. Figure 3 illustrates the impact of the saliency map: in Fig. 3 (a), the red ellipse marks the table-tennis court edge (obviously a visually attentive area), which is precisely identified by the saliency feature (red ellipse in Fig. 3 (c)), whereas the three phase-correlation motion features could not capture the court edge since it has no motion (white ellipse in Fig. 3 (b)). Thus, the proposed combined strategy improves the RD performance by recognizing not only the AOI-based motion features of phase correlation but also the AOI-based visually attentive areas inside the videos.
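A minimal sketch of Eq. (6) and the weight assignment follows; the descending sort order of the features and the use of the pixel standard deviation as the texture measure are our reading of the description above, not confirmed details:

```python
import numpy as np

# Hedged sketch of the AOI cost function (Eq. 6) and the texture-based
# weight assignment. Sort order and texture measure are assumptions.
WEIGHTS = np.array([0.50, 0.25, 0.125, 0.125])
THRESHOLD = 0.15        # fixed threshold from Sect. 2.5

def aoi_tag(alpha, beta, dxdy, gamma, block_std, frame_std, delta=32.0):
    dx, dy = dxdy
    # The four terms of Eq. (6), before weighting; delta = 32 assumes
    # the maximum CU size used in the experiments (Sect. 3.1).
    terms = np.array([alpha,
                      1.0 - beta,
                      abs(dx) / delta + abs(dy) / delta,
                      gamma])
    order = np.argsort(terms)[::-1]       # assumed: largest feature first
    weights = WEIGHTS if block_std < frame_std else WEIGHTS[::-1]
    eta = float(np.sum(weights * terms[order]))
    return 1 if eta > THRESHOLD else 0    # 1 = AOI, 0 = non-AOI
```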

Fig. 3. Identification of motion and salient areas with and without the saliency-feature-based cost function (Color figure online).

2.4 Intermode Selection

To generate the binary matrix, we process each \(8\times 8\)-pixel block of a \(32\times 32\)-pixel block (i.e., CU) and produce a matrix of \(4\times 4\) binary values for each CU by applying the threshold. The cost-function-generated \(4\times 4\) binary matrix is then compared with a codebook of predefined binary pattern templates (BPTs) to select a subset of modes. Each template is constructed with a pattern of AOI and non-AOI blocks focusing on rectangular and regular object shapes at the \(32\times 32\) block level, as shown in Fig. 4. Both in Fig. 4 (for the \(32\times 32\) level) and Table 1 (b) (for the \(16\times 16\) level), the cells with black squares represent AOI (i.e., binary 1) and the rest are non-AOI (i.e., binary 0). We use a simple similarity metric, the Hamming distance (\(D_h\)), between the binary matrix of a CU generated by the cost function and the BPTs in Fig. 4, and select the best-matched BPT as the one with the minimum sum of absolute differences for the CU.

Fig. 4. Codebook of the proposed binary pattern templates for inter-mode subset selection, where template cells with black squares represent AOI (i.e., binary 1) and the rest are non-AOI (i.e., binary 0).

The Hamming distance \(D_h\) is determined as follows, where S is the binary motion prediction matrix of a CU comprising \(4 \times 4\) '1' or '0' entries and \(T_k\) is the k-th BPT:

$$\begin{aligned} {{D}_{h}}(S,{{T}_{k}})=\sum \limits _{x=1}^{4}{\sum \limits _{y=1}^{4}{\left| S(x,y)-{{T}_{k}}(x,y) \right| }} \end{aligned}$$
(7)

The best-matched j-th BPT is selected from all BPTs as follows:

$$\begin{aligned} B_j= \arg \min _{\forall {T_k} \in BPT} ({D_h}(S,{T_k})) \end{aligned}$$
(8)
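A small sketch of the template matching of Eqs. (7)–(8) is given below; the example templates are illustrative placeholders, not the actual codebook of Fig. 4:

```python
import numpy as np

# Minimal sketch of Eqs. (7)-(8): match a CU's 4x4 binary AOI matrix
# against the BPT codebook. These templates are illustrative only.
TEMPLATES = [
    np.zeros((4, 4), dtype=int),          # all non-AOI
    np.ones((4, 4), dtype=int),           # all AOI
    np.array([[1, 1, 0, 0]] * 4),         # left-half AOI (hypothetical)
]

def best_template(S):
    """Return the index of the BPT with minimum Hamming distance to S."""
    dists = [np.abs(S - T).sum() for T in TEMPLATES]   # Eq. (7)
    return int(np.argmin(dists))                        # Eq. (8)

# Usage: S is the thresholded 4x4 binary matrix of a CU; the chosen
# template index maps to a subset of inter-modes (Table 1), whose final
# winner is then picked by the lowest LCF (Eq. 9).
```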

The mode selection process from the BPTs at the \(32 \times 32\) and \(16 \times 16\) coding depth levels is illustrated in Table 1 (a) and (b), respectively. Once a particular template selects a subset of candidate modes at the \(32 \times 32\) level, the final mode is decided by the lowest Lagrangian cost function. If any of the \(16 \times 16\) coding depth level modes is selected at the \(32 \times 32\) level, we further explore smaller modes at the \(16\times 16\) level using the AOI pattern (i.e., the presence or absence of binary 1 and 0 as shown in Table 1 (b)).

Table 1. Selection technique of inter-modes at \(32 \times 32\) and \(16 \times 16\) coding depth levels.

Then the equation for the final mode (\(\xi \)) selection is given by:

$$\begin{aligned} \xi =\arg \min _{\forall {k_n}} (\varTheta ({{k}_{n}})) \end{aligned}$$
(9)

where \(\varTheta ({{k}_{n}})\) is the Lagrangian cost function for mode selection.

2.5 Threshold Determination

Owing to the imbalanced distribution of ECR values, Podder et al. [13] (different thresholds are also mentioned in [17, 18]) use a range of thresholds from 0.37 to 0.52 for different bit-rates. Those thresholds do not perform well in the proposed method because the distribution of cost-function values is more compact: for blocks with dominant motion and salience it exceeds 0.15 in almost all cases and for all types of sequences. Therefore, the proposed technique uses a fixed threshold of 0.15 for a wide range of bit-rates.

3 Experimental Results and Analysis

To verify the performance of the proposed algorithm, experimental results are presented for six standard-definition (SD) videos (Tennis, Tempete, Waterfall, Silent, Paris, Bridgeclose), four high-definition (HD) videos (Pedestrian, Bluesky, Rushhour, Parkrun), and two multiview (MV) videos (Exit and Ballroom). Each video sequence is encoded at a frame rate of 25 fps with a search range of \(\pm 64\). We compare the results of the proposed method with the HM of the HEVC standard, as HM outperforms the existing mode selection techniques in the literature. For both techniques we use the IPPP format with a group of pictures (GOP) size of 32 and two reference frames.

Table 2. Average time savings (%) by the proposed method (against HM) in terms of mode selection for each type of sequence: a theoretical analysis.

3.1 Experimental Setup

The experiments are conducted on a dedicated desktop machine (Intel Core i7-3770 CPU @ 3.4 GHz, 16 GB RAM, and 1 TB HDD) running a 64-bit Windows operating system. The proposed scheme and the exhaustive mode selection scheme of HEVC are both developed from the reference software HM (version 12.1) [6]. The RD performance of the two schemes is compared using a maximum CU size of \(32\times 32\), with both symmetric and asymmetric partitioning enabled from the \(32\times 32\) down to the \(8\times 8\) depth level, over a wide range of bit-rates (using QP = 20, 24, 28, 32, and 36). The BD-PSNR and BD-Bit Rate calculations follow the procedures described in [19].
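For reference, a minimal sketch of the BD-PSNR computation in the spirit of [19] is given below (the standard Bjøntegaard procedure: a cubic fit of PSNR against log-rate for each codec, with the gap averaged over the common rate interval); this is our illustration, not code from the paper:

```python
import numpy as np

# Hedged sketch of the Bjontegaard-delta PSNR (BD-PSNR) procedure [19].
def bd_psnr(rates_ref, psnr_ref, rates_test, psnr_test):
    lr_ref, lr_test = np.log10(rates_ref), np.log10(rates_test)
    p_ref = np.polyfit(lr_ref, psnr_ref, 3)     # cubic PSNR(log10 rate)
    p_test = np.polyfit(lr_test, psnr_test, 3)
    lo = max(lr_ref.min(), lr_test.min())       # common log-rate interval
    hi = min(lr_ref.max(), lr_test.max())
    # Integrate both fitted curves over the common interval and average
    # the difference: positive means the test codec has higher PSNR.
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (int_test - int_ref) / (hi - lo)     # average PSNR gap (dB)

# Usage: pass the five (rate, PSNR) points measured at QP = 20, 24, 28,
# 32, 36 for HM and for the proposed method as numpy arrays.
```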

3.2 Results and Discussions

For the theoretical justification of the computational time for all types of sequences, we first compare the average number of modes selected per CU by HM12.1 and by the proposed method. The results in Table 2 show that HM checks more options in all cases and therefore normally requires more computational time. From Table 2, the overall average percentage of encoding time saved by the proposed method is 49.75 %; the reason for this gain is the efficient selection of a subset of inter-modes using simple criteria. However, the pre-processing stages of the proposed method cannot be ignored: by calculation, we observe that over the twelve sequences, on average 6.71 % extra encoding time is required for the phase-correlation and saliency related pre-processing overheads (see Fig. 1). Thus, theoretically we anticipate saving 43.04 % of the computational time on average. The experimental evaluation reveals that over the twelve sequences and a wide range of bit-rates, the proposed method reduces the computational time by 42 % on average (range: 37 %–45 %), as shown in Fig. 5 (a). The time savings (TS) is defined as:

$$\begin{aligned} TS=\frac{(T_o-T_p)}{T_o}\times 100\,\% \end{aligned}$$
(10)

where \(T_o\) and \(T_p\) denote the total encoding time consumed by HM and the proposed method, respectively. For a comprehensive performance test, we analyze the computational time of both techniques by video category and find that the proposed method achieves on average 41 % encoding time savings compared to HM12.1, as shown in Fig. 5 (b). The figure also reveals that the proposed technique saves the most computational time for the SD video type (48 %).

Fig. 5. Illustration of the time savings achieved by the proposed method against HM.

Fig. 6. Comparative study of the RD performance of HM12.1 and the proposed method over a wide range of bit-rates.

To test the performance of the proposed method objectively, we first compare its RD performance against HM using three different sequence types (one SD, one HD, and one MV) over a wide range of bit-rates, as demonstrated in Fig. 6. The figure shows that the proposed method achieves RD performance similar to HM12.1 by focusing on the AOI-based CUs and partitioning them through efficient selection of appropriate block partitioning modes. Table 3 presents the performance comparison of the proposed method for the twelve divergent video sequences. The results reveal that, compared to the mode selection approach of HM, the proposed technique achieves almost identical RD performance (a small average reduction of 0.021 dB PSNR) with a negligible bit-rate increment of 0.14 %.

Table 3. Performance comparison of the proposed technique against HM using BD-Bit Rate and BD-PSNR.

Figure 7 (a) shows the original image of the Tennis video taken for the subjective quality test, and Fig. 7 (b) and Fig. 7 (c) illustrate the images reproduced by HM and by the proposed method, respectively. To compare the image quality, let us concentrate on the cuff and sleeve sections of the shirt in the three images, marked by the red, yellow, and white ellipses, respectively. It can be perceived that the three marked sections have almost identical image quality. As presented earlier, the proposed method requires less encoding time; hence, it can be concluded that the proposed technique achieves significant computational time savings compared to HM12.1 with similar image quality over a wide range of bit-rates. Consequently, the proposed implementation is expected to be well suited to real-time video coding applications, especially on electronic devices with limited processing power and battery capacity.

Fig. 7. Subjective quality assessment of HM12.1 and the proposed method on the Tennis video sequence. The images are taken from the 20th frame of the Tennis video at the same bit-rate (Color figure online).

4 Conclusion

In this work, a novel coding framework for improving HM performance is presented. It exploits an AOI-based mode selection technique comprising three motion features derived from phase correlation and a saliency feature of human visual attention. The motion features capture three different aspects of motion in each CU, while the saliency feature captures the visually attentive areas. An adaptive cost function is formulated to determine a subset of inter-modes using predefined AOI-based binary pattern templates. The Lagrangian optimization criterion is then employed on the selected subset of modes to fix the final mode. Compared to HM with its exhaustive mode selection strategy, the proposed scheme reduces the computational time by 42 % on average (range: 37 %–45 %) while providing similar image quality over a wide range of bit-rates.