1 Introduction

The advancement in the manufacturing of high-performance electronic devices and their display technologies, such as smart mobile phones, tablets, and televisions, has resulted in increased demand for ultrahigh-resolution video content delivery with low processing delay. Furthermore, most commercially available displays nowadays support spatial resolutions up to 8 K (7680 × 4320) [1]. Such high-resolution display capabilities can consume most of the available bandwidth in conventional networks. Hence, to deliver high-quality video effectively, it is necessary to use efficient video coding tools that support high-resolution video applications. The most recent video coding standard is the high-efficiency video coding (H.265|HEVC) standard [2].

The main target in developing the H.265|HEVC standard was to double the coding efficiency of the MPEG-4 Part 10, advanced video coding (H.264|AVC) standard, that is, to keep the same video quality at half the encoding bit rate [3]. Furthermore, H.265|HEVC extends the applications already supported by the H.264|AVC coding standard with more efficient video coding tools for high-resolution video and parallel processing applications [4].

However, the increase in coding efficiency comes at the expense of more inter-prediction and motion compensation activity in the encoding process, since the higher efficiency is mainly achieved by removing more temporal and spatial redundancy [5]. On the negative side, this high coding efficiency also comes at the cost of high computational complexity.

In other words, a highly compressed bitstream retains less redundant video information, so each encoded bit carries more information. Consequently, compressed video content becomes more sensitive to channel bit errors than bitstreams encoded with previous standards.

Therefore, transmitting a highly compressed video bitstream over an unreliable channel degrades the perceived visual quality at the video decoder; if errors hit sensitive encoded data such as slice headers, the decoding process can fail for the whole video sequence [6]. Figure 1 illustrates the effect of an unreliable wireless channel on the received video quality.

Fig. 1 Video transmission issue scenario

The H.265|HEVC codec is a hybrid video coding system, which means its compression techniques depend on removing temporal redundancy first and then spatial redundancy.

Thus, errors injected into the transmitted video bitstream propagate spatially and temporally through the perceived video quality. Consequently, even a single-bit error in the encoded bitstream can lead to severe visual quality degradation.

In the development of H.265|HEVC, the primary target of both standardisation organisations (i.e. ITU-T and ISO/IEC) was to increase the bit rate saving to more than 50% compared to the previous H.264|AVC video coding standard [7, 8]. This high bit rate saving was achieved by adding new coding features that support more efficient coding and make the codec friendlier to parallel processing applications [7]. However, developing error resilience tools is outside the scope of the video coding standard itself [6].

There are two main error control categories for reducing the effects of transmission errors on perceived visual quality. The first employs traditional data error control methods that use lossless channel coding tools for data recovery, such as Automatic Repeat reQuest (ARQ) schemes. However, such error recovery tools are less efficient for compressed video delivery because the compressed bitstream consists of variable-length codes, which makes recovering corrupted video content a very challenging task.

The second category comprises video error control techniques implemented within the video coding system. In this case, to minimise the effects of transmission errors efficiently at the decoder side, video error control can be divided into three approaches: forward error recovery, error concealment, and interactive error recovery.

In the forward error recovery approach, the video encoder takes full responsibility for inserting redundant error resilience codes, making the coded bitstream more robust against errors.

The second error control approach is error concealment, in which the decoder is responsible for concealing errors spatially and temporally. Spatial error concealment exploits correctly received information by interpolating from the surrounding macroblocks. When the whole macroblock is lost, the simplest and most common concealment technique replaces the lost macroblock with the spatially co-located macroblock of the previously decoded frame. Temporal error concealment techniques, in contrast, extrapolate the correctly received motion vectors of the current and previous frames [9].
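As an illustration, the following is a minimal sketch of the frame-copy concealment described above, assuming 16 × 16 macroblocks, numpy luma arrays, and a list of lost macroblock indices in raster order (all hypothetical inputs, not part of any particular codec implementation):

```python
import numpy as np

MB = 16  # assumed macroblock size in luma samples


def conceal_frame_copy(curr: np.ndarray, prev: np.ndarray, lost_mbs):
    """Replace each lost macroblock with the spatially co-located
    macroblock of the previously decoded frame (temporal frame copy)."""
    out = curr.copy()
    mbs_per_row = curr.shape[1] // MB
    for mb_idx in lost_mbs:
        r = (mb_idx // mbs_per_row) * MB
        c = (mb_idx % mbs_per_row) * MB
        out[r:r + MB, c:c + MB] = prev[r:r + MB, c:c + MB]
    return out
```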

The third video error control approach uses joint encoder–decoder error resilience techniques, in which the encoder and decoder work interactively to reduce the effects of channel errors on the perceived visual quality. In this category, a backward feedback channel from the decoder to the encoder keeps the encoder updated.

All lossy error resilience techniques take advantage of the human visual system's tolerance of distorted visual quality. The design of video error resilience tools depends on the video coding tools employed.

The developed error resilience tools should keep a balance between error robustness and video encoding efficiency, whilst aiming to maintain the video quality of service under error-prone conditions.

Developing error resilience tools at the encoder is one of the most efficient solutions to mitigate the effects of transmission errors in real-time video delivery.

This paper presents an adaptive error resilience algorithm to increase the perceived visual quality of the H.265|HEVC coding system under error-prone conditions.

The proposed work is based on a forward error recovery method (without a feedback channel) and an interactive error recovery method (with feedback updates from the decoder to the encoder).

The intra-encoding mode is applied at slice level instead of picture level to support low delay delivery applications and to keep a balance between coding efficiency and error resilience performance. The design stage of the proposed work considers the bit rate overhead, the processing delay caused by computational complexity, and the video start-up delay. The evaluation is further extended to investigate the effect of LTE network traffic load on the perceived visual quality at the decoder.

This paper is organised as follows. An overview of relevant literature that supports the proposed error resilience algorithm is presented in Sect. 2. Section 3 describes the proposed adaptive slice algorithm. The evaluation process and encoding configuration for testing the proposed algorithm are reported in Sect. 4. Section 5 discusses the obtained objective and frame by frame quality assessment results, together with computational complexity and processing time analysis. Finally, Sect. 6 presents the paper's conclusions and future work recommendations.

2 Background

2.1 Error resilience using region of interest extraction

This section presents a literature review of state-of-the-art video error resilience algorithms for low delay video delivery applications.

One of the main encoding requirements of low delay or conversational video applications is to keep the number of reference frames to a minimum. This low delay requirement can be met by avoiding future reference frames in the motion estimation process.

The first work to select a group of MBs to be encoded with intra-refresh in H.264|AVC video coding was by Hoaming Chen et al. [10]. Their error resilience coding scheme is based on adaptive intra-refresh: it selects the important regions depending on the reported network conditions, i.e. different packet loss rates (PLRs), and on the video motion information. To keep coding efficiency at an acceptable level, the refresh cycle size ranges over (4, 8, 16). The selected area depends on the PLR value reported over the feedback channel. On receiving a low PLR value, e.g. \(10^{-4}\), a smaller cycle size, e.g. 4, is selected. In contrast, under high error-prone conditions, such as a PLR of \(10^{-1}\), the refresh cycle size is increased to obtain a balance between error resilience and H.264|AVC coding efficiency.
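The following sketch illustrates this PLR-driven cycle-size selection; the PLR thresholds are illustrative assumptions, not the exact switching points used in [10]:

```python
def refresh_cycle_size(plr: float) -> int:
    """Map a reported packet loss rate to an intra-refresh cycle size
    from the set {4, 8, 16} described in [10]: low loss selects the
    small cycle, high loss the large one (thresholds are assumed)."""
    if plr <= 1e-4:
        return 4
    if plr <= 1e-2:
        return 8
    return 16
```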

Two recent studies have improved the error resilience of the H.265|HEVC coding standard by treating moving areas as regions of interest. In 2015, the authors considered the moving area of the encoded video as a region of interest (ROI) to be protected against transmission errors. Their error resilience algorithm generates an activity map at frame level; the moving regions are segmented into blocks and, based on the maximum depth level of the CTUs, the moving region parts are calculated and protected at the encoder [5]. In 2016, they improved their region of interest algorithm by utilising rate control in the region of interest extraction process [6]. The objective quality results presented by the authors show a significant improvement of 0.88 dB compared with the H.265|HEVC reference selection method at a PLR of 5% [6]. The moving region extraction methodologies of these studies are based on a work proposed by Hai-Miao Hu et al. [5]. That work (an ROI-based rate control scheme) aimed to improve the coding efficiency of the H.264/AVC coding standard by allocating a larger encoding bit budget to moving regions, improving the perceived quality of the ROI area at the cost of non-ROI quality.

However, using the moving region as a region of interest does not provide precise selectivity of the selected moving areas, because of the flexibility of block partitioning in the H.265|HEVC encoding process.

Under these circumstances, an adaptive slice encoding (ASE) algorithm is developed and proposed based on an understanding of the previous related work.

One of the most challenging tasks in using a region of interest extraction process in the H.265|HEVC coding system is keeping the computational complexity to a minimum. Another challenge is implementing accurate ROI extraction under low delay transmission constraints, such as in conversational video applications.

The rate control tool in video coding is responsible for calculating the best trade-off between image quality and the required bit rate. Most video coding systems are lossy, so it is important to keep the bit rate saving at the highest level while maintaining the highest visual quality for the targeted video applications.

There is a great deal of research interest in proposing efficient algorithms for moving region extraction [11]. The extraction process should take computational analysis into consideration; such analysis is based on the human visual system and includes, for example, frame texture detail, skin colour, and object motion speed. These requirements make rate control adaptation more challenging in real-time processing applications.

The moving region extraction process starts after the motion estimation stage, where quantisation parameters are adjusted accordingly. There is a dilemma between region segmentation and motion estimation priorities: quantisation parameters (QP) need to be adjusted before the rate-distortion optimisation (RDO) process starts, but motion information only becomes available after RDO, whereas QP is set before it. In the moving region extraction process, however, the QP needs to be adjusted based on the motion information of the ROI. To solve this dilemma, the researchers in [12] proposed the motion differencing method, in which the motion vector of each macroblock in the current frame is compared with that of the macroblock in the previous frame. However, this method does not give acceptable results for fast-moving objects. In [13], the researchers proposed distinguishing the importance of each macroblock (MB) using the mean absolute difference (MAD) between the current and previous macroblocks [14]. The researchers in [11] and [15] use the same region segmentation methodology, which gives very accurate results when the background of the moving objects is stable. However, a slight movement in the object background area (e.g. temporal changes in lighting conditions or camera zooming) can adversely affect the extraction of the region of interest areas [16].

2.2 H.265|HEVC encoding with feedback channel

In general, low delay video applications such as video conferencing require special attention to reference picture management. In unreliable networks, an encoder with feedback capabilities usually receives acknowledgement signals from the decoder over a channel with some delay (in milliseconds). In video codec systems, there are two types of acknowledgement signals, transmitted at slice level from the decoder to the encoder [17]. The first type is the positive acknowledgement (ACK) signal, which indicates a correctly received slice. The reference frame is chosen depending on the ACK updates received from the decoder. If the encoder does not receive an ACK signal within a predefined interval, it assumes that an error occurred and applies intra-coding.

The second feedback signal type is the negative acknowledgement (NACK), sent by the decoder to notify the encoder that an error or loss has occurred in the currently decoded bitstream. In addition to acknowledging correctly received slices, the addresses of the corrupted parts are signalled back to the encoder. The reference picture buffer is updated on each received acknowledgement signal. At the same time, the decoder can apply an error concealment technique to reduce temporal–spatial error propagation when an error occurs.
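A schematic sketch of the encoder-side bookkeeping implied by the two feedback modes is given below; the slice identifiers, timeout value, and data structures are hypothetical simplifications, not part of the H.265|HEVC specification:

```python
import time


class FeedbackState:
    """Encoder-side bookkeeping for ACK/NACK feedback: choose
    references known to be intact at the decoder, and force intra
    refresh for slices that were NACKed or never acknowledged."""

    def __init__(self, ack_timeout_s: float = 0.1):  # assumed interval
        self.acked = set()    # slice ids confirmed by the decoder
        self.nacked = set()   # slice ids reported lost or corrupted
        self.timeout = ack_timeout_s

    def on_ack(self, slice_id: int) -> None:
        self.acked.add(slice_id)
        self.nacked.discard(slice_id)

    def on_nack(self, slice_id: int) -> None:
        self.nacked.add(slice_id)

    def must_intra_code(self, slice_id: int, sent_at: float) -> bool:
        """True if the slice was NACKed, or if no ACK arrived within
        the predefined interval (the encoder then assumes an error)."""
        if slice_id in self.nacked:
            return True
        timed_out = time.monotonic() - sent_at > self.timeout
        return slice_id not in self.acked and timed_out
```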

2.3 H.265|HEVC video coding standard

The H.265|HEVC video coding standard is the result of continuous hard work aimed at enhancing the coding efficiency of the previous video codec standards [18].

Several changes have been introduced in the H.265|HEVC coding tools, such as the video coding control signals and the bitstream structure [4]. A new design concept of flexible coding unit sizes has been added to the standard: the basic unit in H.265|HEVC is called a coding tree unit (CTU), within which coding units range from 8 × 8 up to 64 × 64 luma samples. This helps to increase the coding efficiency for video resolutions higher than high definition (HD) [7]. All of these points make the H.265|HEVC coding standard an attractive candidate for meeting the high visual quality requirements of wireless and multimedia applications.

2.3.1 Slice structure in H.265|HEVC system

As discussed earlier, each frame can be represented by one or more slices, and each slice contains a group of dependent and independent slice segments. As can be seen in Fig. 2, the first slice segment is encoded independently of previously encoded slices. The remaining slice segments are encoded depending on the first slice segment, with the same encoding mode for their coding units.

Fig. 2 Slice segment process sequence

In lossy packet networks, the packetized slices should not exceed the maximum transmission unit (MTU) size. The slicing structure helps decrease the transmitted packet length: decreasing the number of encoded CTUs in each slice shortens the transmitted packets and, as a result, reduces error propagation at the decoder [19], as sketched below.
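For illustration, the following sketch packs CTUs into slices so that no packetized slice exceeds the MTU; the MTU value, header overhead, and per-CTU byte sizes are assumed inputs, not values from the paper:

```python
MTU = 1500            # assumed maximum transmission unit in bytes
HEADER_BYTES = 40     # assumed slice/packet header overhead


def pack_ctus_into_slices(ctu_sizes):
    """Greedily group consecutive CTUs into slices whose payload plus
    header stays within the MTU, so that one lost packet affects fewer
    CTUs. A single CTU larger than the MTU still gets its own slice."""
    slices, current, used = [], [], HEADER_BYTES
    for idx, size in enumerate(ctu_sizes):
        if current and used + size > MTU:
            slices.append(current)
            current, used = [], HEADER_BYTES
        current.append(idx)
        used += size
    if current:
        slices.append(current)
    return slices
```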

One main use of the slice segmentation concept in the H.265|HEVC coding system is that it can reduce the effect of errors on corrupted video samples: only the affected slice is discarded or recovered from other correctly received slice segments. Moreover, the slice segment concept helps to expedite the resynchronisation process using the correctly received independent slice segment header.

However, increasing the number of encoded slices adversely affects the coding efficiency. For example, intra- and motion prediction are not allowed across slice boundaries, which reduces spatial frame prediction. Furthermore, the slice structure itself increases the header overhead.

There are three slice types used in the H.265|HEVC coding system: the intra mode slice, the uni-prediction slice (P-slice), and the bi-prediction slice (B-slice). Each slice header contains a complete reference picture list update. The reference picture set concept of the H.265|HEVC coding standard is explained in detail later.

A slice header contains information shared between slice segments; this shared information differs from slice to slice at frame level. A reference selection list is updated in each slice header and signalled explicitly. Parameters shared by the slices of a picture are carried in the picture parameter set (PPS); each PPS refers to a sequence parameter set (SPS), and the video parameter set (VPS) contains information shared across SPSs. The interconnection of the three parameter sets is illustrated in Fig. 3.

Fig. 3 Activation of parameter sets in H.265|HEVC slice headers

3 Proposed work

The aim of the proposed algorithm is to reduce error propagation at slice level. An adaptive encoding algorithm is introduced at the video encoder to encode and protect the most active slices. A coded video sequence is represented as a series of access units (AUs) in sequential order sharing the same sequence parameters. Each access unit is represented by a group of NAL units, and a prefix code, the access unit delimiter, identifies the start of a new AU in the NAL unit bitstream. A primary encoded AU contains a group of VCL NAL units comprising one or multiple slices, which carry the video sample data. A redundant coded picture is encoded as additional VCL NAL units; these additional VCL samples are used for error recovery when the primary video samples are lost or corrupted. In this case, the decoder parses the contents of the correctly received data to recover the corrupted video samples. In error-free conditions, the decoder discards the additional redundant video data. At the end of each coded video sequence, a non-VCL NAL unit is encoded to indicate the end of the NAL unit bitstream.
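The following simplified sketch illustrates this primary/redundant handling at the decoder; the dictionary-based NAL unit representation and its fields are hypothetical stand-ins for the actual H.265|HEVC syntax:

```python
def decode_access_unit(nal_units):
    """Walk the VCL NAL units of one access unit: keep correctly
    received primary slices, substitute a redundant coded slice for a
    corrupted primary one, and discard unused redundant data."""
    redundant = {n["slice_addr"]: n for n in nal_units
                 if n["kind"] == "redundant"}
    decoded = {}
    for nal in (n for n in nal_units if n["kind"] == "primary"):
        if not nal["corrupted"]:
            decoded[nal["slice_addr"]] = nal
        elif nal["slice_addr"] in redundant:
            decoded[nal["slice_addr"]] = redundant[nal["slice_addr"]]
        # else: slice lost with no redundant copy -> conceal later
    return decoded
```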

The concept of the proposed algorithm is illustrated in Fig. 4. For instance, suppose the ship in the video sequence is the most important area that requires protection against errors. The algorithm extracts the active slices, i.e. those covering the ship, and these active slices are then encoded in intra mode.

Fig. 4 Encoding the slice of the most important areas in the ASE algorithm

As discussed earlier in the slice structure section, the independent slice segment header address identifies the exact location of a slice segment at picture level, using the coding tree block (CTB) count in a fixed scan order. The objective of the ASE algorithm is to reduce temporal error propagation by encoding the most sensitive and important coding units (CUs) in intra mode at slice level. The following subsections describe how the activation map indicating the active slices to be intra-coded is generated, as well as the rate control mechanism for the subdivided frame regions.

3.1 Important area protection

The proposed algorithm is described as follows. At the first stage, a slice-level differencing method is applied to the current and previous frames. The active area consists of content changes with new texture information; a moving slice is a highly important slice that must be protected to reduce error propagation. Each slice is mapped to its greyscale representation, and the current and previous encoded slices are projected into row and column projection curves (\({\text{CV}}_{n}^{x}\)) and (\({\text{CV}}_{n}^{y}\)), respectively. Each slice in the current frame is thus projected into one-dimensional vectors.

The greyscale projections along the slice rows, Ln (x), and slice columns, Ln (y), are calculated as in Eqs. (1) and (2), respectively:

$$L_{n} \left( x \right) = \sum\nolimits_{y} {L\left( {x,y} \right)} ,$$
(1)
$$L_{n} \left( y \right)\; = \;\sum\nolimits_{x} {L\left( {x,y} \right)} ,$$
(2)

where L(x, y) are the greyscale values of frame (p). The average values of Ln (x) and Ln (y) are calculated over the number of grey sample rows (r) and columns (c), respectively, as defined in Eqs. (3) and (4):

$$L_{\text{avn}} \left( x \right) = \frac{{\mathop \sum \nolimits_{x} L_{n} \left( x \right)}}{r},$$
(3)
$$L_{\text{avn}} \left( y \right) = \frac{{\mathop \sum \nolimits_{y} L_{n} \left( y \right)}}{c}.$$
(4)

Then, the averaged 1-D projected curves are normalised using Eqs. (5) and (6):

$${\text{CV}}_{n}^{x} = L_{n} \left( x \right) - L_{\text{avn}} \left( x \right),$$
(5)
$${\text{CV}}_{n}^{y} = L_{n} \left( y \right) - L_{\text{avn}} \left( y \right),$$
(6)

where (\({\text{CV}}_{n}^{x}\)) and (\({\text{CV}}_{n}^{y}\)) represent the 1-D projected curves for the slice number (n).
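A compact numpy sketch of Eqs. (1)–(6), assuming the slice is supplied as a 2-D array of luma samples with rows indexed by x and columns by y:

```python
import numpy as np


def projection_curves(luma: np.ndarray):
    """Normalised 1-D projection curves CV_x and CV_y of a greyscale
    slice, following Eqs. (1)-(6)."""
    Lx = luma.sum(axis=1)   # row projection, Eq. (1)
    Ly = luma.sum(axis=0)   # column projection, Eq. (2)
    Lx_avg = Lx.mean()      # Eq. (3): average over the r rows
    Ly_avg = Ly.mean()      # Eq. (4): average over the c columns
    CVx = Lx - Lx_avg       # Eq. (5)
    CVy = Ly - Ly_avg       # Eq. (6)
    return CVx, CVy
```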

For better coding efficiency performance, the extraction process uses an intra-refresh map calculated with the greyscale projection method (GPM) [20]. As a result, moving objects are extracted with accurate selectivity even when the object background is unstable. The GPM is used in image stabilisation applications because of its implementation simplicity and its accuracy in selecting moving objects [20].

A cross-correlation between the current and previous slices is then calculated [20]. The difference vector \({\text{DV}}_{n} \left( p \right)\) for each slice is then calculated as in Eq. (7):

$${\text{DV}}_{n} \left( p \right) = \frac{1}{256}\sum\nolimits_{{\left( {i,j} \right) \in p}}^{\text{TS}} {\left| {L_{n} \left( {i,j} \right) - L_{n - 1} \left( {i + {\text{CV}}_{n}^{x} , j + {\text{CV}}_{n}^{y} } \right) } \right|} ,$$
(7)

where \(L_{n} \left( {i,\;j} \right)\) and \(L_{n - 1} \left( {i,\;j} \right)\) are the luma sample values of the current slice (n) and previous slice (n − 1), and (TS) is the total number of encoded slices in the current frame.

The searching area to find the maximum cross-correlation of the normalised projection curves between the slices of the current and the previous frames can be calculated as in Eq. (8):

$${\text{Searching area}} = \frac{{{\text{number of}}\; ({\text{CU}}_{\text{level1}}^{p} )\; + \;{\text{number of}}\;({\text{CU}}_{\text{level1}}^{p - 1} )}}{2},$$
(8)

where \(({\text{CU}}_{\text{level1}}^{p} )\) and \(({\text{CU}}_{\text{level1}}^{p - 1} )\) are the encoded units of (32 × 32) blocks at coding level 1 for the current and previous frames, respectively. Equation (8) balances the encoding processing delay (the additional computational cost resulting from the motion estimation calculations) against the error resilience performance.

The difference vector \({\text{DV}}_{n} \left( p \right)\) calculations are shown in Fig. 5.

Fig. 5 Difference vector calculation
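The following sketch computes \({\text{DV}}_{n} \left( p \right)\) of Eq. (7) for one slice, assuming the displacement (shift_x, shift_y) has already been obtained as the cross-correlation peak of the projection curves; the circular shift and integer cast are simplifications that ignore border handling:

```python
import numpy as np


def difference_vector(curr: np.ndarray, prev: np.ndarray,
                      shift_x: int, shift_y: int) -> float:
    """DV_n(p) of Eq. (7): absolute luma difference between the
    current slice and the previous slice displaced by the projection
    cross-correlation peak, scaled by 1/256 as in Eq. (7)."""
    shifted = np.roll(prev, shift=(shift_x, shift_y), axis=(0, 1))
    diff = np.abs(curr.astype(np.int32) - shifted.astype(np.int32))
    return float(diff.sum()) / 256.0
```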

3.2 Subdivision of non-active area

In general, people pay more attention to moving objects in the foreground due to the nature of the human visual system; they also focus more on the middle area of the picture. To get the best trade-off between coding efficiency and perceived quality, the non-active area is further subdivided into a high textured area, which contains highly detailed stationary content, and a passive (or flat) area, which covers the fixed background with the lowest texture detail. A simple subdivision example is shown in Fig. 6. The decoded video quality is reduced gradually from the highly important areas to the textured and passive areas, respectively.

Fig. 6 Division areas in the ASE algorithm

An adaptive modified grey projection \(({\text{AMGP}}_{\text{w}} )\) based on the grey projection method in [20] is implemented to achieve a more accurate selection of motion-active slices.

To establish the relation between the encoded slice location within the frame and the adaptive modified grey projection (AMGP) value, different weighting factors ranging from 0.1 to 0.9 were objectively evaluated to obtain the best rate control optimisation with the ASE algorithm. The selected weighting factors were obtained through trial-and-error optimisation experiments on the modified HM16.06+ASE encoder. Due to limited space, one selected test result is presented in Fig. 7, which shows the results obtained for the Akiyo video test sequence encoded at a frame rate of 25 fps.

Fig. 7 Weighting factors for the three ASE areas

The idea of allocating different weighting factors at this stage is to reconcile the proposed algorithm with the video coding efficiency. In the proposed algorithm, the frame content is divided into three main areas, and the weighting factor allocation depends on the encoded areas' sizes, which are proportional to the frame dimensions.

As the human visual system focuses more on the central frame area, the probability of active areas falling there is highest, so the central area is encoded with the highest weighting factor (0.9). The area between the centre and the corners, where moving areas are less probable, is assigned 0.6, and the corner areas are allocated the lowest weighting factor (0.1). These predefined weighting factors were selected through trial-and-error experiments to be optimised with the intra-coding refresh.

A weighting factor is assigned for each frame region according to Eq. (9):

$${\text{AMGP}}_{\text{w}} = \left\{ {\begin{array}{*{20}l} {0.9,\quad {\text{if the block location lies within the central frame area}}} \\ {0.1,\quad {\text{if the block location lies within the corner frame areas}}} \\ {0.6,\quad {\text{otherwise}}} \\ \end{array} } \right..$$
(9)

The active area extraction process mainly depends on the calculated difference vector \({\text{DV}}_{n} \left( p \right)\) and the weighting factor of the current frame. Finally, the most active slice areas in the current frame (p) are encoded.

The encoding unit in the active map is encoded with intra mode as defined in Eq. (10):

$${\text{AMGP}}_{n} \left( p \right) = \left\{ {\begin{array}{*{20}l} {1,\quad {\text{if}}\;{\text{AMGP}}_{\text{w}} \times {\text{DV}}_{n} \left( p \right)/{\text{average}}\left[ {{\text{DV}}_{n} \left( p \right)} \right] > {\text{AMGP}}_{\text{th}} } \\ {0,\quad {\text{otherwise}}} \\ \end{array} } \right.,$$
(10)

where \({\text{DV}}_{n} \left( p \right)\) is the calculated difference vector and \({\text{AMGP}}_{\text{w}}\) is the calculated weighting factor for the currently encoded frame.
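A minimal sketch of the decision logic in Eqs. (9) and (10); the region membership tests and the threshold \({\text{AMGP}}_{\text{th}}\) are assumed to be supplied by the caller:

```python
def amgp_weight(in_centre: bool, in_corner: bool) -> float:
    """Region weighting of Eq. (9): centre 0.9, corners 0.1, else 0.6."""
    if in_centre:
        return 0.9
    if in_corner:
        return 0.1
    return 0.6


def is_active_slice(dv: float, dv_avg: float,
                    weight: float, amgp_th: float) -> bool:
    """Active-map decision of Eq. (10): the slice is intra-coded when
    the weighted, normalised difference vector exceeds AMGP_th."""
    return weight * dv / dv_avg > amgp_th
```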

3.3 Non-active area selection

As discussed earlier, the non-active area of each frame is further divided into two regions according to content features. This further subdivision helps preserve perceived visual quality at the transition from high-quality regions (active areas) to lower-quality ones (textured and passive areas). Furthermore, it allows larger weighting factors for active areas and a lower bit budget for non-active regions, so that more bits are assigned to important frame areas. The extraction of high textured areas from the non-active area is done using mean absolute difference (MAD) calculations between the current and previous frames. In this work, a value of 0.35 is selected as the threshold for generating the high textured map, as defined in Eq. (11):

$$H_{n} \left( p \right) = \left\{ {\begin{array}{*{20}l} {1,\quad {\text{if}}\;H_{n - 1} \left( p \right) < {\text{threshold}}} \\ {0,\quad {\text{otherwise}}} \\ \end{array} } \right.,$$
(11)

where \(H_{n - 1} \left( p \right)\) refers to the macroblock in the previous frame. The remaining map areas are then extracted and encoded as the least complex areas (the passive areas).
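A sketch of this textured/passive classification, assuming block-wise MAD on luma normalised to [0, 1] and following the inequality direction as printed in Eq. (11):

```python
import numpy as np


def textured_map(curr_blocks: np.ndarray, prev_blocks: np.ndarray,
                 threshold: float = 0.35) -> np.ndarray:
    """Split the non-active area: per-block MAD between co-located
    current/previous blocks (shape: n_blocks x h x w, luma in [0, 1]),
    compared against the 0.35 threshold of Eq. (11). Blocks not marked
    here form the passive map."""
    mad = np.abs(curr_blocks - prev_blocks).mean(axis=(1, 2))
    return mad < threshold  # H_n(p) = 1 where Eq. (11) holds
```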

At each slice header, a full reference picture list is extracted from the decoded picture buffer (DPB). To identify whether the current slice is suitable for use in the prediction process, the reference picture set (RPS) data in the slice header is compared with the reference pictures in the DPB.

For error detection and recovery purposes, a feedback channel from the decoder notifies the encoder about errors that have occurred. The H.265|HEVC codec uses a flag named used_by_curr_pic_X_flag; the decoder parses the slice header and checks the flag activation [21]. At the decoder side, the slice header RPS is checked against the available reference pictures in the DPB. If the RPS in the slice header lists an update that is not available in the DPB and the flag is not set, the slice is considered not used in the current prediction process. However, if the flag is activated, the current slice is intended to be used in prediction, which indicates loss or corruption of the reference pictures at the decoder.
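The following schematic sketch illustrates this RPS-versus-DPB check; the picture order count (POC) lists and the flag representation are simplifications of the actual H.265|HEVC syntax:

```python
def missing_references(slice_rps, dpb_pocs, used_by_curr_flags):
    """Compare the slice-header reference picture set (RPS) against the
    decoded picture buffer (DPB). A picture flagged as used by the
    current slice but absent from the DPB signals a lost or corrupted
    reference whose identifier can be fed back to the encoder."""
    dpb = set(dpb_pocs)
    return [poc for poc, used in zip(slice_rps, used_by_curr_flags)
            if used and poc not in dpb]
```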

A flowchart illustrating the proposed ASE algorithm without feedback channel implementation is depicted in Fig. 8a.

Fig. 8 Adaptive slice encoding flowchart. a ASE algorithm without feedback. b ASE algorithm with feedback

3.3.1 Error resilience algorithm with feedback update

The proposed error resilience algorithm is further extended to work with feedback updates of acknowledgement (ACK) signals to enhance the perceived visual quality at the decoder. The H.265|HEVC coding system requires a feedback channel to locate a damaged slice. For more accurate error localisation, the segment header information of the corrupted slices is sent back to the encoder via the feedback channel. This header information contains the most recent update of the reference picture list, including the address of the most recent erroneous slice. A flowchart of the proposed algorithm with feedback channel implementation is shown in Fig. 8b.

3.4 Rate control adaptation of the proposed algorithm

The challenging tasks in implementing region of interest extraction are keeping the computational complexity of the extraction process to a minimum and implementing an accurate ROI process under low delay transmission.

In this algorithm, the encoder optimises the trade-off between the number of intra-coded slices per frame and the coding efficiency target. A frame is divided into passive (flat) areas and high texture (complex) areas. In the HM16 reference software, lambda-domain rate control is used to balance the encoding bit rate (bit allocation budget) and the video quality (target quantisation parameters) [22]. The encoding bit rate is adjusted based on the target bit rate and the picture buffer size for each group of pictures (GOP); the encoder then allocates the required encoding bit budget at the LCU level. Depending on the calculated target bit rate, the number of bits per pixel (bpp) determines the rate-distortion parameter via Eq. (12):

$$\lambda \; = \;\alpha .{\text{bpp}}^{\beta } ,$$
(12)

where bpp is the number of bits per pixel, and \(\alpha\) and \(\beta\) are predefined parameter values. Once \(\lambda\) is calculated, the QP value can be obtained using Eq. (13) and the quantisation step size using Eq. (14).

$${\text{QP}}\; = \;4.2\;\ln \lambda \; + \;13.7,$$
(13)
$${\text{Q}}_{\text{step}} \; = \;2^{{({\text{QP}} - 4)/6}} .$$
(14)
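A small sketch of the λ-domain chain of Eqs. (12)–(14); the clipping range is the standard HEVC QP range, and the default α and β are the commonly cited initial values of the HM implementation (an assumption here, not values stated in this paper):

```python
import math


def rate_control_qp(bpp: float, alpha: float = 3.2003,
                    beta: float = -1.367):
    """R-lambda model of Eqs. (12)-(14): derive lambda from the bits
    per pixel, QP from lambda, and the quantisation step from QP."""
    lam = alpha * (bpp ** beta)              # Eq. (12)
    qp = round(4.2 * math.log(lam) + 13.7)   # Eq. (13)
    qp = max(0, min(51, qp))                 # clip to the valid HEVC QP range
    q_step = 2 ** ((qp - 4) / 6)             # Eq. (14)
    return qp, q_step


# Example: a budget of 0.05 bits per pixel
# qp, q_step = rate_control_qp(0.05)
```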

3.4.1 Network testbed setup

NS3 is chosen and installed on a Linux operating system. The long-term evolution (LTE) module embedded within the NS3 environment simulates the core LTE network. To integrate the NS3 simulator and its LTE module with a real physical Ethernet interface, the hardware-in-the-loop (HIL) platform in [24] is employed. Each node in the NS3/LTE network is connected using the carrier-sense multiple access (CSMA) scheme, and the LTE serving gateway (SGW)/packet data network gateway (PGW) uses a point-to-point internet connection. The NS3/LTE network simulator is configured with the network parameters reported in Table 1.

Table 1 LTE network parameters

3.4.2 Hardware and software testbed setup

Three PCs are used in the experimental work. Two PCs serve as the video server (Dell PowerEdge T410, CPU: quad-core 2.35 GHz, RAM: 16 GB, operating system: Microsoft Windows 10) and the video receiver (Dell XPS, CPU: Intel Core i5-7200 @ 2.5 GHz, RAM: 8 GB, operating system: Microsoft Windows 10). The open-source network simulator version 3 (NS3) is installed on a separate PC (HP Compaq 8200, CPU: Core i5-2500s, RAM: 8 GB, operating system: Ubuntu Server 15.04). The open-source cross-platform multimedia player (VLC) is used to stream the video test sequences at the sender side and to visualise the perceived visual quality at the receiver.

3.4.3 Error-prone environment setup

To evaluate the performance of the proposed algorithm under error-prone conditions, various packet loss rates are injected into the encoded video bitstreams. A modified version of the NAL unit loss software from [25] is used to inject different packet loss rates into the encoded bitstream; it was adapted to support the NAL unit structure of the H.265|HEVC coding standard.
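As an illustration, a minimal sketch of random NAL unit loss injection, under the stated assumption that losses are independent and uniform at the given rate (the actual tool in [25] may use a different loss model):

```python
import random


def inject_nal_losses(nal_units, plr: float, seed: int = 0):
    """Randomly drop whole NAL units at the given packet loss rate,
    mimicking a NAL-unit loss tool; the seed makes a test run
    reproducible across repetitions."""
    rng = random.Random(seed)
    return [nal for nal in nal_units if rng.random() >= plr]
```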

4 Experimental setup

In this section, the hardware and software experimental setup and the video encoding configurations are presented. The main objectives are to test the performance of the proposed algorithm under different error-prone conditions and encoding bit rates, and to assess the computational complexity of the modified HM16 reference software.

Pre-selected video test sequences are used in the experiments. The input sequences are in raw (YUV) format with 4:2:0 colour subsampling. They are classified into two groups according to their texture detail and motion activity: Class A represents sequences with low texture detail and slow motion, and Class B represents sequences with high texture detail and high motion. The sequence characteristics are presented in Table 2.

Table 2 Characteristics of the test video sequence [23]

5 Experimental results and discussion

Each video test sequence is tested 30 times with different seeds, and the averaged Y-PSNR values are recorded. The ASE algorithm's performance is compared with the default reference software (HM16.06), the region-based error-resilient (ROI) algorithm in [26], and the improved region of interest (IROI) algorithm in [27]. All video sequences were randomly injected with packet loss rates (PLRs) ranging from 2 to 18% using the packet loss generator software, with the encoding settings of Table 3. The results of the experimental work are shown in Fig. 9a.

Table 3 Encoding parameters of modified HM 16.06
Fig. 9 Video quality evaluation of the proposed ASE algorithm. a Performance of the proposed algorithm with different PLRs. b Rate distortion performance with BER \(\left( {10^{ - 5} } \right)\)

The evaluation includes both error-free and error-prone conditions. Under error-prone conditions, the video test sequences are injected with random bit errors generated at BER = \(1 \times 10^{ - 5}\). All tested video sequences are in CIF resolution.

Further evaluation tests are performed to measure the effectiveness of the proposed ASE algorithm under packet loss conditions with different encoding bit rates. The objective quality evaluation of the ASE algorithm with different encoding bit rates is shown in Fig. 9b.

Table 4 reports the obtained Y-PSNR gains with and without feedback updates for three video sequences (Coastguard, Hall, and Mobile) in CIF resolution. Notably, the average Y-PSNR of the ASE algorithm across the different PLRs improved by 4.521 dB, 2.283 dB, and 1.076 dB compared to the HM16 reference for H.265|HEVC, the ROI algorithm, and the IROI algorithm, respectively.

Table 4 ASE performance comparison in terms of Y-PSNR (dB)

In error-free conditions (PLR = 0%), the Y-PSNR of the ASE algorithm is reduced by 1.096 dB, 0.605 dB, and 0.318 dB relative to HM16, ROI, and IROI, respectively.

The test results in Table 4 indicate that the proposed algorithm is less effective in error-free conditions. The most complex processing part of the proposed algorithm is the frame partitioning according to the complexity of the frame content; the three transition areas contribute to obtaining the best balance between coding efficiency and error resilience performance.

5.1 Frame by frame video quality assessment

The proposed ASE algorithm is further evaluated using subjective quality assessment. Pre-selected frames are extracted from the raw video test sequence for assessment. The Coastguard sequence in CIF resolution was encoded at 30 fps, and packet errors at a mean rate of 2% were injected into the test sequence. The obtained results are shown in Fig. 10 using frame by frame visual quality assessment. It can be seen from the decoded frames that the ASE algorithm produces better perceptual visual quality than the reference error resilience algorithms.

Fig. 10 Frame by frame visual quality assessment

5.2 Network congestion and time delay

The proposed algorithm is further evaluated by streaming the encoded video over an LTE network under varying network load, with the number of end users ranging from 10 to 30 per base station. The experimental work uses the settings described in the network testbed setup section, and frame-copy concealment is used at the decoder to avoid failures in the decoding process. Two main objectives are targeted in this experimental work. The first is to show the effect of different numbers of LTE network clients sharing bandwidth on the objective decoded video quality.

The second one is the start-up time, as it is a critical factor for meeting the user’s quality of experience requirements [28].

The start-up time is defined as the time the decoder buffer needs before the decoded pictures can be displayed.

The authors in Ref. [28] recommended that the start-up delay in video streaming applications should not exceed 2 s. In this experimental work, we chose 500 ms and 1000 ms as realistic use cases.

Based on the network configuration parameters and the network testbed described earlier, the encoding configuration settings are reported in Table 3. Eighteen video sequences were selected, and the average for each tested client group (10, 20, and 30 end users) is recorded. The network load is categorised into three levels: light (10 users), medium (20 users), and heavy (30 users).

Each test is repeated ten times for more reliable verification.

Our algorithm is integrated with the video evaluation platform.

The average Y-PSNR results for the pre-selected test sequences are recorded for network load behaviour.

Figures 11 and 12 show the effect of increasing the number of users on the perceptual visual quality in terms of the Y-PSNR at the decoder.

Fig. 11 Average Y-PSNR for video streaming with start-up delay (500 ms)

Fig. 12 Average Y-PSNR for video streaming with start-up delay (1000 ms)

It is evident that, as the number of clients increases, the proposed algorithm outperforms the default reference software.

When the encoded video sequences are streamed under a high network load, the number of dropped packets increases significantly due to network congestion. Therefore, the objective video quality deteriorates further at higher network loads in a shared-bandwidth network.

It is noted from Figs. 11 and 12 that as the start-up delay becomes longer, the number of decoded redundant intra slices increases; in return, the probability of recovering damaged slices increases, because the encoded redundant slices are used to resynchronise the corrupted areas. Hence, in real-time video streaming, the decoded video quality in the proposed algorithm is affected by two factors: the encoding bit rate and the GOP structure used.

In general, a longer GOP yields higher coding efficiency because a lower encoding bit rate is required. However, a larger GOP size means more dependent frames are encoded between I-frames, and this longer interval lowers the error resilience, in addition to increasing the picture decoding delay. The acceptable total delay lies between 500 ms and 1000 ms before a picture is ready for display.

5.3 Computational complexity

The aim of this part of the work is to determine the encoding/decoding computational complexity from a processing time perspective. It is worth noting that the reference software (HM) is mainly intended for developing H.265|HEVC video coding algorithms and does not practically support real-time video encoding. Although the HM suffers from slow encoding and decoding execution, some speed improvements have been made across successive HM versions.

In the experimental work, we measured the encoding/decoding processing time of the proposed algorithm and compared the results with the default HM16 reference software.

The video test sequences are encoded using the same encoding settings reported in Table 3, and the experiment uses the same video test sequences reported in Table 2.

As stated in the JCT-VC common test conditions [29], there are three main encoding configurations: low delay-B, all intra (AI), and random access (RA). In the low delay-B configuration, the first frame is encoded as an intra frame and the following frames are encoded as bi-predicted B-frames, which give higher coding efficiency (and coding delay) than uni-predicted P-frames [21, 30].

In the AI configuration, intra mode is used to encode the whole video sequence. This encoding type gives a low encoding time but requires very high bit rates.

In the RA configuration, the encoded video frames are organised in a hierarchical B structure. This mode gives higher compression efficiency than the other modes but is not suitable for low delay applications, because it requires additional processing to reorder the decoded pictures at the far-end decoder.

In this paper, we evaluated the proposed algorithm in the low delay-B configuration with the QP set to 32. The encoding/decoding execution time for each video test sequence without the feedback channel is reported in Table 5.

Table 5 Encoding and decoding time of the ASE algorithm compared to the HM16 reference software

The table presents the encoding and decoding run times as an indication of algorithmic complexity compared with the default reference (HM 16).

To examine the processing time of the proposed error resilience algorithm, the average encoding/decoding time over the 18 video test sequences is obtained. Figure 13 shows the percentage increase in encoding/decoding processing time compared with the HM16 software without the error resilience tool.

Fig. 13 Average increase in encoding/decoding processing time compared to HM16

The results show that the encoder consumes more additional time than the decoder (Fig. 13). The additional computation in the modified HM16 encoder arises from the rate control adaptation for encoding different areas at different bit rates and from allocating different quantisation parameters at the largest coding unit levels. It also stems from segmenting the picture into subregions, a process that includes the differencing method between the current and previous frames as an additional computation.

At the decoder, a large share of the additional time is spent parsing the redundant slices, which increases the reference sample generation at the decoded picture buffer. A further part of the decoding time is spent on the scanning process at slice boundaries, in addition to reference sample generation for intra slice prediction. The increased time in both encoding and decoding also arises from the added set of C++ classes implementing the error resilience tools.

6 Conclusions and recommendation

This paper presents an efficient H.265|HEVC error resilience algorithm to support low delay video delivery applications. The novelty of the algorithm lies in automatically selecting the most active frame regions and protecting them against transmission errors, at the cost of an increase in the encoding bit rate overhead and in encoding/decoding computational complexity. The proposed work also takes coding efficiency into consideration by subdividing the non-active regions into flat and highly textured areas. The bit budget saved in the non-active areas is spent on the active frame areas, yielding the best trade-off between coding efficiency and error resilience performance.

We conducted several simulation scenarios to evaluate the proposed algorithm. First, we presented the different network testbeds used for the modified video codec performance evaluation. The experimental work was conducted in error-prone and error-free environments, with average packet loss rates ranging from 2 to 18%. The results show that the proposed algorithm yields a Y-PSNR gain of 4.52 dB over the HM16 reference software and outperforms the state-of-the-art ROI and IROI algorithms by 2.28 dB and 1.07 dB, respectively. However, in error-free conditions, the proposed algorithm suffered its highest loss, 1.09 dB, against the default HM16 software.

Furthermore, the encoding and decoding processing times of the tested video sequences were analysed and reported in terms of computational complexity. The results showed that the encoding and decoding times increased by 19% and 11%, respectively.

The algorithm was further investigated with start-up video play delays of 0.5 s and 1 s in a long-term evolution (LTE) network. The results showed that when the start-up delay at the decoder increases from 0.5 to 1 s, the objective decoded video quality increases remarkably (1 dB on average). These results indicate that the proposed algorithm can be used in low delay video applications even without a feedback channel. Our future work includes implementing a Gilbert–Elliott model with the proposed algorithm to provide real-time quality-of-service estimation; the model will enable automatic control and adjustment of the encoding parameters.