1 Introduction

Multimedia applications delivered through digital broadcasting to various kinds of devices, such as mobile phones, laptops, personal digital assistants (PDAs), high definition television (HDTV) and standard definition television (SDTV), are increasingly important. Because the available bandwidth varies, they need better scalability in video coding. A scalable extension of H.264/advanced video coding (AVC) was standardized in 2007 as H.264 scalable video coding[1] to meet this need. The reference software for scalable video coding was developed by the joint video team (JVT), formed jointly by the moving picture experts group (MPEG) and the video coding experts group (VCEG)[2, 3].

H.264/scalable video coding (SVC) was standardized[3] to add spatial, temporal and signal to noise ratio (SNR) or quality scalability to H.264/AVC, and its performance in achieving high coding efficiency has been evaluated[4]. In spatial scalability, the picture with the lowest spatial resolution is considered the base layer and is encoded as an H.264/AVC compatible bit stream, whereas the picture with higher resolution, which is an upsampled residue between the original and the reconstructed signal of the base layer, is considered the enhancement layer. In temporal scalability, a hierarchical B picture approach is used for a particular spatial layer with zero structural delay. H.264/SVC uses I, P and B pictures, in which an I/P picture is the key picture and is encoded at regular intervals using only the previous key picture as reference. The B pictures encode the pictures between two key pictures. The group of pictures (GOP) size determines the number of temporal layers in a spatial layer, where a GOP is a key picture followed by all the temporally located pictures until the next key picture. SNR or quality scalability relates the spatial and temporal scalability and is based on different spatio-temporal reconstruction quality levels, namely coarse grain scalability (CGS) and medium grain scalability (MGS). CGS uses a single temporal layer per spatial layer, and MGS uses multiple temporal layers per spatial layer.

Although H.264/SVC achieves high scalability and high coding efficiency by adapting a single bit stream to various bit rates, transmission channel bandwidths and display capabilities, the computational complexity of the encoder is very high. Because of the hierarchical B picture approach in the temporal layers, the full search algorithm implemented in the joint scalable video model (JSVM) searches all the modes to find the best candidate mode for prediction. This is time consuming and complex for the encoder. To address this issue, many fast mode decision (FMD) algorithms have been proposed that reduce the complexity by eliminating redundant candidate modes in H.264/SVC. These works predict the redundant modes using the rate distortion cost (RDC) function and the correlation within the hierarchical B picture structure. They decrease the computational complexity effectively, but at the cost of degraded video quality, and they are not suitable for sequences with large motion.

Nowadays, many hand held devices, with their typical structural implementations, place increasing demands on video quality. It is in the enhancement layer that the quality has to be increased. However, conserving power is also an important issue for hand held devices, particularly for real time video applications. Overall, both the video quality and the reduction in computational complexity need to be considered when implementing any algorithm.

In this paper, we focus on reducing the number of candidate modes by using a probability model and mode correlation. The probability model creates a list of the modes most likely to be the best in the base layer, and mode correlation decides the best mode in the enhancement layer. The rest of this paper is organized as follows. Section 2 discusses the background and related work on fast mode decision algorithms implemented in SVC, the rate distortion cost procedure, the probability model and mode correlation. Section 3 presents the proposed algorithm for complexity reduction. Section 4 discusses the experimental results with a comparative analysis. Section 5 concludes this paper.

2 Background and related work

Three new inter-layer prediction modes, based on motion vectors, residuals and intra information from the base layer, were introduced to select the best coding mode in the enhancement layer. These inter-layer prediction modes improve coding efficiency while providing scalability. However, they require rate distortion optimization (RDO) to be performed many times, which involves very high computational complexity. In particular, the residual prediction mode requires the RDO process to be performed twice, which doubles the computational complexity compared with the normal RDO process of H.264/AVC. This complexity is reduced by an efficient architecture proposed in [5] that changes the processing order; here, the prediction mode of the reference macro block (MB) is used to predict the candidate modes. The procedure depends on whether the prediction mode of the reference MB is an INTRA or an INTER mode. If it is an INTRA mode, the INTRA 4 × 4 mode is checked against the 8 × 8 mode, and if the 4 × 4 mode is smaller, it is selected. If the 8 × 8 mode is smaller, the INTER 8 × 16 mode in the base layer is checked against the 16 × 8 mode, and the smaller one is selected. If it is an INTER mode, the prediction is based on motion vectors (MV). If the MV is greater than a threshold, the best mode is selected among three candidates, namely the upper MB mode, the left MB mode and the SKIP mode; otherwise, the reference MB is checked against the SKIP mode. This approach achieves faster encoding but with degradation in performance. In [6], the best mode is selected by checking both reference MBs. If both are encoded in SKIP mode, SKIP mode is selected; otherwise, the mode saved in the previous MB is selected. If either reference MB is in SKIP mode, the 16 × 16 mode is checked against the SKIP mode, and the smaller one is selected. This approach achieves better performance than that in [5] and improves rate distortion, but the encoding speed is not much faster.

In [7], a very high complexity reduction rate is achieved by removing some candidate modes in H.264/SVC. However, these works reduce the complexity at the expense of the video quality of the enhancement layer. In our previous work[8], an adaptive rate control scheme was proposed which reduces the bit rate and maintains PSNR. Here, the initial quantization parameter is estimated based on the Cauchy probability density function (PDF). This function dynamically adjusts the selection of the mode, which is slightly better than the full search algorithm. The low complexity algorithm in [9] reduces the candidate modes by using mode correlations between the base layer and the enhancement layers. For an MB to be coded as either INTRA or INTER type, two algorithms were proposed which remove redundant candidate modes, but the quality of the enhancement layer degrades. In [10], INTER modes are divided into a SKIP mode set and a non-SKIP mode set. All non-SKIP modes have almost the same RDC, unlike the SKIP mode. Here, the SKIP mode is checked first against the reference MB. If the reference MB is not encoded in SKIP mode, INTER 16 × 16 and 8 × 8 are checked to derive various fast mode decision techniques. The non-SKIP modes are further divided into three groups to select the desired mode based on a threshold and a model parameter. This probability model achieves fast mode selection in H.264/AVC but has not been implemented in H.264/SVC. An efficient mode reduction method is proposed in [11] to decrease the encoding complexity, but it is not suitable for sequences with large motion.

2.1 Rate distortion cost

The RDO used in JSVM estimates the RDC for all modes using the Lagrangian parameter (λ). The RDC is a cost function of rate and distortion, in which λ acts as a weighting parameter that adjusts the bit cost as

$$J(s,c,{\rm MODE}|QP,\lambda_{\rm MODE}) = SSD(s,c,{\rm MODE}|QP) + \lambda_{\rm MODE}\,R(s,c,{\rm MODE}|QP)$$
(1)

where $\lambda_{\rm MODE}$ is the weighting parameter or Lagrangian parameter, s is the original MB, c is the reconstructed MB, SSD is the sum of squared differences between s and c, QP is the quantization parameter, and R is the bit cost of the motion vectors, the header information and the discrete cosine transform (DCT) coefficients.
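As a minimal sketch of how (1) is evaluated for one candidate mode, the following Python fragment computes the SSD between an original and a reconstructed block and adds the weighted rate. The function names `ssd` and `rd_cost`, the toy 2 × 2 blocks and the values of the rate and of λ are illustrative assumptions, not values taken from JSVM.

```python
def ssd(original, reconstructed):
    """Sum of squared differences between two equally sized 2-D blocks."""
    return sum(
        (s - c) ** 2
        for row_s, row_c in zip(original, reconstructed)
        for s, c in zip(row_s, row_c)
    )


def rd_cost(original, reconstructed, rate_bits, lam):
    """Lagrangian cost J = SSD + lambda * R, following the form of (1)."""
    return ssd(original, reconstructed) + lam * rate_bits


# Toy 2x2 "macro block" (hypothetical pixel values, for illustration only).
s_block = [[10, 12], [14, 16]]
c_block = [[10, 11], [15, 16]]

# Hypothetical rate (bits) and Lagrangian parameter for this mode.
j_cost = rd_cost(s_block, c_block, rate_bits=20, lam=4.0)
print(j_cost)  # SSD = 2, rate term = 80
```

The mode decision then amounts to evaluating this cost for each candidate mode and keeping the mode with the smallest value, which is why reducing the number of candidates directly reduces the encoding time.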

2.2 Probability model

A desired mode list, based on the probability of each mode being the best mode given the co-located macro block, has been evaluated in [10]. There is a correlation between an MB to be coded and its neighboring MBs. With this correlation, a desired mode list is created from the modes with the highest probability of being the best mode. A reference MB set P is therefore described as

$$P = \left\{ {\matrix{{MB(j,x - 16,y - 16),\ MB(j,x - 16,y),} \hfill \cr {MB(j,x,y - 16),} \hfill \cr {MB(j,x + 16,y - 16),\ MB((j - 1)',x,y)} \hfill \cr } } \right\}$$
(2)

where MB(j, x, y) denotes the MB located in the j-th frame with upper left pixel (x, y), and MB((j−1)′, x, y) denotes the previous co-located MB, located in the (j−1)′-th frame at the same position as the current coding MB, with upper left pixel (x, y). The neighboring MB mode set Q is given by

$$Q = \{ {M_{MB}}|MB \in P\}$$
(3)

where $M_{MB}$ denotes the encoding mode of MB. The approximate probability[9] of a mode being the best mode is given by the probability model as

$$\matrix{{P(m = M|M \in Q) \approx } \hfill \cr {P(m \in Q){{\sum\limits_{MB \in P,\,{M_{MB}} = M} N (m = {M_{MB}})} \over {\sum\limits_{m \in Q} {\sum\limits_{MB \in P,\,{M_{MB}} = m} N } (m = {M_{MB}})}} = } \hfill \cr {K \times \sum\limits_{MB \in P,\,{M_{MB}} = M} N (m = {M_{MB}})} \hfill \cr }$$
(4)

where N(·) is the number of occurrences of an event and K is a constant. A mode that is not an element of Q has a low probability of being the best mode, and its probability is taken to be zero.
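The construction of P, Q and the probability in (4) can be illustrated with a short Python sketch. It assumes that the modes of already coded MBs are available in a dictionary keyed by (frame, x, y), that the previous co-located frame (j−1)′ can be indexed simply as j − 1, and that the constant K is chosen as the reciprocal of the total count so that the probabilities sum to one. All names (`reference_positions`, `mode_probabilities`, `desired_mode_list`) and the mode labels are hypothetical.

```python
from collections import Counter

MB_SIZE = 16  # macro block width and height in pixels


def reference_positions(j, x, y):
    """Positions of the reference MB set P in (2) for the MB at (j, x, y)."""
    return [
        (j, x - MB_SIZE, y - MB_SIZE),
        (j, x - MB_SIZE, y),
        (j, x, y - MB_SIZE),
        (j, x + MB_SIZE, y - MB_SIZE),
        (j - 1, x, y),  # assumption: (j-1)' is indexed as j - 1
    ]


def mode_probabilities(coded_modes, j, x, y):
    """Approximate probability of each mode in Q, in the spirit of (4).

    coded_modes maps (frame, x, y) to the mode of an already coded MB.
    Modes that do not occur in Q are absent, i.e. their probability is zero.
    """
    modes = [coded_modes[p] for p in reference_positions(j, x, y)
             if p in coded_modes]
    counts = Counter(modes)          # N(m = M_MB) for every mode m in Q
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode: n / total for mode, n in counts.items()}  # K = 1 / total


def desired_mode_list(coded_modes, j, x, y):
    """Modes of Q ordered from most to least probable (ties by name)."""
    probs = mode_probabilities(coded_modes, j, x, y)
    return sorted(probs, key=lambda m: (-probs[m], m))


# Hypothetical neighbourhood of the MB at frame 5, pixel (16, 16).
coded = {
    (5, 0, 0): "SKIP",
    (5, 0, 16): "SKIP",
    (5, 16, 0): "INTER_16x16",
    (5, 32, 0): "SKIP",
    (4, 16, 16): "INTER_16x16",
}
print(desired_mode_list(coded, 5, 16, 16))
```

In this toy neighbourhood SKIP occurs three times and INTER_16x16 twice, so SKIP is placed first in the desired mode list and every other mode is excluded.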

2.3 Mode correlation

There is a high degree of correlation between the mode selected in the reference frames and the best mode of the current frame[11], which is evaluated for a four-layer model as follows:

  1) When both the reference 0 frame and the reference 1 frame are in SKIP mode, the best mode is either SKIP or 16 × 16.

  2) When one of the reference frames is in SKIP mode, the best mode is SKIP, 16 × 16 or one of the other modes.

  3) When neither of the reference frames is in SKIP mode, the best mode is one of the surrounding modes of the reference MB, SKIP or 16 × 16.

  4) For the other layers, the best mode is one of the reference MB's modes, SKIP or 16 × 16.

Also, the correlation among the redundant candidate modes with reference frames increases for higher layers.
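The four rules above can be read as a function that maps the modes of the two reference MBs to a reduced candidate list. The Python sketch below is one possible reading under stated assumptions: how the rules map onto the individual temporal layers is not spelled out in the text, so a hypothetical flag `other_layer` selects rule 4, and "other modes" in rule 2 is interpreted as falling back to the full mode list. The mode labels and function names are illustrative.

```python
SKIP = "SKIP"
INTER_16X16 = "INTER_16x16"


def dedupe(modes):
    """Remove duplicates while keeping the first occurrence order."""
    seen = []
    for mode in modes:
        if mode not in seen:
            seen.append(mode)
    return seen


def candidate_modes(ref0, ref1, surrounding, all_modes, other_layer=False):
    """Candidate list derived from the modes of the two reference MBs.

    ref0, ref1  : modes of the co-located MBs in reference frames 0 and 1
    surrounding : modes of the MBs surrounding the reference MB
    all_modes   : full list of modes (assumed fallback for rule 2)
    other_layer : hypothetical flag selecting rule 4
    """
    base = [SKIP, INTER_16X16]
    if other_layer:
        return dedupe(base + [ref0, ref1])        # rule 4
    n_skip = (ref0 == SKIP) + (ref1 == SKIP)
    if n_skip == 2:
        return list(base)                         # rule 1
    if n_skip == 1:
        return dedupe(base + list(all_modes))     # rule 2
    return dedupe(base + list(surrounding))       # rule 3


ALL_MODES = [SKIP, INTER_16X16, "INTER_16x8", "INTER_8x16", "INTER_8x8"]

# Both reference MBs are SKIP: only two candidates remain.
print(candidate_modes(SKIP, SKIP, [], ALL_MODES))
# Neither is SKIP: the surrounding modes are added to SKIP and 16x16.
print(candidate_modes("INTER_16x8", "INTER_8x8",
                      ["INTER_16x8", "INTER_8x16"], ALL_MODES))
```

The saving comes from rules 1, 3 and 4, where the RDC has to be computed for only a few modes instead of the full list.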

3 Proposed algorithm

The objective of this work is to reduce the encoding time while maintaining almost the same PSNR level and bit rate, by selecting the desired candidate mode from all the modes. Our algorithm uses two strategies, one for the base layer and one for the enhancement layers.

If the co-located MB is inter coded in the base layer, we make use of the mode correlation among the temporal layers to select a desired candidate mode; otherwise, all the intra modes are checked for the best mode.

If the co-located MB is inter coded in the enhancement layer, we make use of the correlations between modes. On moving to the higher layers, the motion vector difference (MVD) becomes smaller, so we check the upper, left, right and bottom modes of the current MB. If this selection is not well suited, the probability model in (4) selects the best non-SKIP mode. This selection of the non-SKIP mode works on three levels. The first level checks whether the mode is 16 × 16 or SKIP. If it is 16 × 16, we go on to check the lower order modes 16 × 8 and 8 × 16, and then all sub blocks of 8 × 8, as shown in the algorithm.

The RDC of the mode selected for the current MB is compared with the minimum RDC among all the modes, as given in (5). If the condition holds, the current mode is taken as the best mode, and all the other modes are omitted.

$${J_{{\rm{curr}}}} < {J_{\min }}$$
(5)

where $J_{\rm curr}$ denotes the RDC of the current MB mode and $J_{\min}$ denotes the minimum RDC of all the modes obtained so far.

The threshold level ($J_{\rm TH}$) for each enhancement layer is obtained as the product of the minimum RDC among all the modes and a model parameter Γ, as given in (6). The values of the model parameter Γ for each layer were obtained after a series of simulations as

(6)

This threshold ($J_{\rm TH}$) serves as the minimum RDC ($J_{\min}$) for each enhancement layer.
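The role of the threshold can be sketched as an early termination loop: candidates are evaluated in order, and the search stops as soon as one of them has an RDC below the threshold formed as the product of Γ and a minimum RDC, as described for (6). This is only an illustration under assumptions. The function name `select_mode`, the reference minimum `j_min_ref`, the value Γ = 1.0 and the cost table are all hypothetical; the per-layer values of Γ used in the paper are not reproduced here.

```python
def select_mode(candidates, cost_of, j_min_ref, gamma):
    """Evaluate candidates in order and stop early below the threshold.

    candidates : modes ordered from most to least likely
    cost_of    : callable returning the RDC of a mode
    j_min_ref  : minimum RDC used to form the threshold (assumed given)
    gamma      : model parameter (illustrative value, not from the paper)
    Returns (best_mode, best_cost, number_of_modes_evaluated).
    """
    j_th = gamma * j_min_ref
    best_mode, best_cost = None, float("inf")
    evaluated = 0
    for mode in candidates:
        cost = cost_of(mode)
        evaluated += 1
        if cost < best_cost:
            best_mode, best_cost = mode, cost
        if cost < j_th:
            break  # accept the current mode and omit the remaining modes
    return best_mode, best_cost, evaluated


# Hypothetical RDC values for four candidate modes.
costs = {
    "SKIP": 900.0,
    "INTER_16x16": 480.0,
    "INTER_16x8": 470.0,
    "INTER_8x16": 520.0,
}
order = ["SKIP", "INTER_16x16", "INTER_16x8", "INTER_8x16"]

result = select_mode(order, costs.__getitem__, j_min_ref=500.0, gamma=1.0)
print(result)
```

In this toy run the search stops after two modes and returns INTER_16x16, although INTER_16x8 would have been marginally cheaper. This illustrates the trade-off between encoding time and a small loss in rate distortion performance.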

Algorithm 1.

figure Graphic1

4 Experimental results

The H.264/SVC reference software JSVM 9.19.15 is used for evaluating the results[12]. The low complexity macro block mode is set in the reference software to evaluate the results for JSVM. We use encoding time, PSNR and bit rate to evaluate the performance of the original and the test algorithms. Nine video sequences in common intermediate format (CIF), namely Bus, Foreman, Football, Mobile, City, Crew, Ice, Harbour and Soccer, are chosen. The layers of each sequence are set with the quantization parameter values 28, 34 and 40. The GOP size is set as 30 frames per second in our algorithm. The performance measures are defined as

$$\Delta {\rm time} = {{{\rm time}_{\rm proposed} - {\rm time}_{\rm JSVM}} \over {{\rm time}_{\rm JSVM}}} \times 100\%$$
(7)

$$\Delta {\rm PSNR} = {\rm PSNR}_{\rm proposed} - {\rm PSNR}_{\rm JSVM}$$
(8)

$$\Delta {\rm bitrate} = {{{\rm bitrate}_{\rm JSVM} - {\rm bitrate}_{\rm proposed}} \over {{\rm bitrate}_{\rm JSVM}}} \times 100\%$$
(9)
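The three measures in (7), (8) and (9) can be computed as in the following Python sketch. The function names and the input numbers are purely illustrative and are not measurements from Table 1.

```python
def delta_time(time_proposed, time_jsvm):
    """Relative change in encoding time in percent, as in (7)."""
    return (time_proposed - time_jsvm) / time_jsvm * 100.0


def delta_psnr(psnr_proposed, psnr_jsvm):
    """PSNR difference in dB, as in (8)."""
    return psnr_proposed - psnr_jsvm


def delta_bitrate(bitrate_proposed, bitrate_jsvm):
    """Relative bit rate difference in percent, as in (9)."""
    return (bitrate_jsvm - bitrate_proposed) / bitrate_jsvm * 100.0


# Illustrative inputs only (seconds, dB, kbit/s).
d_time = delta_time(time_proposed=60.0, time_jsvm=100.0)
d_psnr = delta_psnr(psnr_proposed=34.98, psnr_jsvm=35.00)
d_rate = delta_bitrate(bitrate_proposed=1010.0, bitrate_jsvm=1000.0)
print(round(d_time, 2), round(d_psnr, 2), round(d_rate, 2))
```

Note the sign conventions that follow from the definitions: a negative value of (7) means that the proposed algorithm is faster than JSVM, a negative value of (8) means a PSNR loss, and, because (9) subtracts the proposed bit rate from the JSVM bit rate, a negative value of (9) means that the proposed algorithm produces a higher bit rate.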

The simulation results are shown in Table 1. On average, the proposed method reduces the encoding time by 41.89% compared with JSVM, with a loss of 0.02 dB in PSNR and an increase of 0.05% in bit rate.

Table 1 Comparative measure between proposed algorithm and JSVM

Fig. 1 shows the average encoding time of each sequence encoded with JSVM and with the proposed algorithm. The difference in encoding time between the proposed algorithm and JSVM is the largest for the Foreman sequence, where the encoding time with our algorithm is less than half of that with JSVM. This is due to the correlation among modes between the current frame MB and the reference frame MB. The smallest difference in encoding time between the proposed algorithm and JSVM occurs for the Harbour sequence, where our algorithm is not much faster than JSVM. This is because the sequence involves more spatial and temporal detail. For large changes in the sequence, we apply the probability model. It takes somewhat more time to compute the RDC of all the modes in the non-SKIP mode set, but results in good PSNR quality at the same bit rate.

Fig. 1 Comparison of the encoding time between JSVM and the proposed algorithm

So, the probability model performs an exhaustive search over the modes to achieve better signal quality, but its encoding speed is somewhat slower than that of mode selection based on the correlation among modes between the current and the reference frames.

Fig. 2 shows the differences in PSNR between the proposed algorithm and JSVM. The Bus, Foreman and Mobile sequences have lower PSNR levels than with JSVM, while all the other sequences have similar PSNR levels. Fig. 3 shows the difference in bit rate between the proposed algorithm and JSVM. Our algorithm achieves a reduction in bit rate for the Foreman, Soccer and Bus sequences, with an overall bit rate reduction of 0.5% for these three sequences.

Fig. 2 Comparison of the average PSNR between JSVM and the proposed algorithm

Fig. 3 Comparison of the average bit rate between JSVM and the proposed algorithm

Fig. 4 compares the average measures of our proposed algorithm and JSVM. From the observations across the three measures, our algorithm reduces the computational complexity of deciding the best mode by searching all the modes, and it achieves higher coding efficiency by attaining the best reduction in encoding time with minimal PSNR loss.

Fig. 4 Comparison of the average measures between JSVM and the proposed algorithm

The maximum average reduction in encoding time is achieved for the Foreman, Bus, Soccer, Mobile and Football sequences with minimal loss in PSNR, and with a small increase in bit rate compared with the other chosen sequences. This is because sequences with large motion have large changes in the spatial and temporal layers, for which the MB modes are chosen based on mode correlations between the base layer and the enhancement layer.

The rate distortion curves for all 9 sequences are shown in Figs. 5–13. Compared with JSVM, these sequences show little difference in bit rate or PSNR. This shows that our algorithm can encode the sequences in less time with almost the same PSNR and bit rate.

Fig. 5 RD curve for Bus sequence

Fig. 6 RD curve for Foreman sequence

Fig. 7 RD curve for Football sequence

Fig. 8 RD curve for Mobile sequence

Fig. 9 RD curve for City sequence

Fig. 10 RD curve for Crew sequence

Fig. 11 RD curve for Ice sequence

Fig. 12 RD curve for Harbour sequence

Fig. 13 RD curve for Soccer sequence

The minimum average reduction in encoding time is obtained for the Harbour, City, Crew and Ice sequences, with improvement in PSNR as well as reduction in bit rate. This is due to the small changes in motion within these sequences, which have small changes at the spatial and temporal levels. In our algorithm, a smaller MVD leads to the probability model being used for mode selection.

In our algorithm, the maximum reduction in encoding time, obtained for fast motion sequences at the spatial and temporal levels, comes with minimal quality degradation and an increase in bit rate. The minimum reduction in encoding time, obtained for slow motion sequences at the spatial and temporal levels, comes with maximum quality degradation and a reduction in bit rate.

5 Conclusions and future work

A novel algorithm is proposed to decide the modes faster than JSVM by using mode correlation and a probability model. A desired mode list is created based on the probability model for the base layer. Enhancement layer mode selection is decided by the correlation of modes between the reference frame MB and the current frame MB. Our algorithm is implemented in the JSVM 9.19.15 reference software, and its performance is evaluated in terms of encoding time, PSNR and bit rate. The results show a 41.89% improvement in encoding time with a minimal loss of 0.02 dB in PSNR and a 0.05% increase in bit rate. The tradeoff between the maximum reduction in encoding time and the minimal loss in PSNR and increase in bit rate for fast motion sequences may be considered as future work.