On the use of deep learning and parallelism techniques to significantly reduce the HEVC intra-coding time

It is well known that each new video coding standard significantly increases in computational complexity with respect to previous standards, and this is particularly true for the HEVC and VVC video coding standards. The development of techniques for reducing the required complexity without affecting the rate/distortion (R/D) performance is therefore always a topic of intense research interest. In this paper, we propose a combination of two powerful techniques, deep learning and parallel computing, to significantly reduce the complexity of the HEVC encoding engine. Our experimental results show that a combination of deep learning to reduce the CTU partitioning complexity with parallel strategies based on frame partitioning is able to achieve speedups of up to 26× when 16 threads are used. The R/D penalty in terms of the BD-BR metric depends on the video content, the compression rate and the number of OpenMP threads, and was consistently between 0.35 and 10% for the video sequence test set used in our experiments.


Introduction
The high-efficiency video coding (HEVC) standard was launched in 2013 [9] by the Joint Collaborative Team on Video Coding (JCT-VC). Although HEVC can compress a video sequence using half the bitrate of its predecessor, this performance improvement comes at the expense of an increment in the computational cost [1].
Great efforts have been made to speed up the encoding process. Several works in the literature have tried to reduce the coding time using modern hardware accelerators [2][3][4][5][6][7][8]. In [6,8], computation of the motion estimation (ME) was moved to the GPU since, as in previous video standards, ME is the most complex task undertaken by the encoder, requiring more than 90% of the encoding time [9]. In [2,4,7], the ME process was accelerated using a similar approach based on FPGAs. In other approaches, various coding processes have been moved to the FPGA, such as the 2D-DCT with variable size [3], the intraframe prediction process [5], and the CABAC entropy encoder [10].
Other works in the literature have used parallel computing strategies to reduce the overall complexity of HEVC encoding, and to take advantage of the multicore processors available in modern HPC servers in order to speed up the overall encoding time for a video sequence [11][12][13][14][15][16]. Several approaches exist, which typically depend on the selected parallelisation strategy (temporal or spatial) and the level at which parallelism is applied (fine, medium, or coarse). For example, in [15], the authors applied a fine parallelism scheme to reduce the complexity of the HEVC Sample Adaptive Offset (SAO) in-loop filter, and obtained a speedup of 1.9×, while in [14], the authors employed a temporal parallelism approach based on wavefront parallel processing, which consisted of a special type of pipeline processing for the Coding Tree Units (CTUs) of a given frame when several OpenMP computing threads were available. The latter approach obtained a speedup of 5.5× using 20 cores, with a BD-rate [17] increment of 1.2%. In [12], a higher-level parallelisation scheme (at the frame level) was proposed based on the partition of each frame using tiles (a new feature available in HEVC). In this approach, a maximum speedup of 9× was obtained for the all intra (AI) coding mode using 10 cores. The study in [16] presented a thorough analysis of the need to adaptively evaluate the workload of the different tiles in order to determine the best CTU partitioning. In [13], the authors developed a parallel HEVC encoder using frame-level parallelism by means of slices rather than tiles, obtaining speedups of up to 9.3× and 8.7× for the AI and Random Access (RA) coding modes, respectively. In [11], a coarse-grained parallelisation scheme was presented (at the sequence level), in which different groups of pictures could be independently encoded by several processing nodes. This parallel approach was well suited to the distributed memory architectures of modern federated clusters, and obtained speedups of up to 11.84× using 12 cores for the RA coding mode, with a BD-rate increment of 1.3%.
Finally, there are other works that have focused on optimisation of the source code of specific parts of the HEVC encoder [18][19][20][21][22][23][24]. In [18,19], a pre-analysis technique was proposed to reduce (a) the size of the search area; (b) the number of reference frames in the inter-frame prediction; (c) the number of intra-prediction modes; and (d) the number of best candidates for the intra-frame prediction process. This approach achieved a 49% reduction in coding time on average for the RA coding mode with an average BD-rate increment of 1.08%. In [21], the authors developed a fast decision method to perform efficient asymmetric mode partition, thus reducing the computational complexity. They also proposed an adaptive motion search area estimator to reduce the overall inter-coding complexity even further. Their results demonstrated that their algorithm could reduce the encoding time by 31.37% in the RA coding mode with a negligible BD-rate increment. In [20], the authors reported on a fast decision mode based on CABAC rate estimation with a coding time reduction of 15%, while in [22], a fast CTU partitioning algorithm was developed in which the CTU texture was used to prune the CTU quad-tree structure. The results showed that the proposed fast coding unit (CU) partitioning algorithm yielded savings of 41% in the encoding time on average, with a BD-rate increment of 0.69%. In [23], a decision tree-based algorithm for CTU partition was presented. The authors implemented three decision tree classifiers, one for each of the three depths of the CU partition. However, the thresholds required by this algorithm needed to be selected manually. This technique was able to reduce the encoding time by 42.1% on average, with a BD-rate increment of 0.7%. The authors of [24] proposed a Bayesian decision rule for an early termination CU algorithm. This Bayesian decision rule was used to estimate a likelihood function and the prior probability of a new scene. The model was then updated for the following frames, to predict the CU size. Although the proposed model had a negligible training time compared with other machine learning models, its accuracy depended on the particular scene, which made it less reliable. The results showed that an average reduction in coding time of 36% could be achieved with a BD-rate increment of 1.08% for the AI coding mode.
With regard to source code optimisation techniques, several authors have developed deep learning approaches to reduce the complexity of the HEVC encoder [25][26][27][28][29][30][31][32][33]. For example, to reduce the complexity of inter-mode prediction in the Low Delay B (LB) coding mode, Zhang et al. [29] proposed a coding unit (CU) depth decision algorithm with a three-level joint classifier based on a support vector machine (SVM), which treated the splitting of CTUs as a three-level hierarchical binary decision problem. The proposed algorithm was able to reduce the encoding time by 51.45% on average, with a BD-rate increment of 1.98%. For the intra-coding mode, Liu et al. [26] developed a convolutional neural network (CNN) approach that predicted the CTU partitioning, thus reducing the coding time by 72% on average, with a BD-rate increment of 4.79%. The authors of [28] proposed a CNN-based algorithm for predicting the CU size for both inter- and intra-prediction coding, where the quantisation parameter (QP) was used as one of the inputs to the classifier. In this scheme, reductions in coding time of 66.47% and 62.94% were achieved for the intra- and inter-coding modes, respectively. In [31], the authors developed a CNN-based algorithm to extract texture and object location features, which were used with a Softmax classifier to predict the CU size. The results showed a reduction in the coding time of 66.89%, with a BD-rate increment of 1.31% for the AI coding mode. In [32], the researchers proposed a fast CU size decision algorithm based on a CNN architecture, where four CNNs were used as classifiers at each of the four depths to make a decision (splitting or non-splitting) for the given QP. The pruning algorithm achieved a coding time reduction of 77% with a BD-rate increment of 3.1% on average for the AI coding mode. The authors of [33] presented CtuNet, a CNN approach that predicted CTU partitioning. The CtuNet framework consisted of three CNNs for the CU sizes of 64 × 64, 32 × 32, and 16 × 16, with a residual network (ResNet18) [34] as the backbone model. This model obtained reductions in the coding time of 63.68% with a BD-rate increment of 1.77% on average, for the AI coding mode.
Recently, Çetinkaya et al. [35] published a survey of CTU depth decision algorithms, ranging from classical statistics-based algorithms to modern deep learning algorithms such as deep neural networks. In another recent paper, Wang and Li [36] designed a one-stage decision network (OSDN) structure to determine the CU/PU partition and prediction mode for intra-coding. Their experimental results showed that the proposed method could reduce the intra-encoding time by 73.69%, with a BD-PSNR loss of 0.1673 dB on average.
The most important contributions of the present work are as follows:
1. A hybrid HEVC encoder that combines two different acceleration strategies based on parallel computing and source code optimisation techniques is designed and developed. The first acceleration technique is a parallel scheme that uses a domain decomposition model based on HEVC slice partitioning, which is particularly suitable for exploiting the shared memory parallelism of multicore processors. The second technique uses optimisation methods at the CTU level to reduce the complexity of the quad-tree splitting process by means of a CNN.
2. The benefits of our hybrid solution are demonstrated, and it is shown to be fully compliant with the HEVC standard, to give good encoding performance for the HEVC, and to achieve outstanding speedups.
3. The hybrid proposal also includes extra parallelisation of the additional processing steps required by the machine learning-based acceleration approach.
The remainder of this paper is organised as follows. In Sect. 2, we explain the deep learning approach used to predict the CU partition and the slice-based parallelism strategy. Sect. 3 describes the proposed hybrid approach for improving the speed of the HEVC coding stage, and in Sect. 4, experimental results from the proposed hybrid algorithm are presented. Finally, in Sect. 5, some conclusions are drawn.

Related work
In this section, we explain the main features of the techniques used in this work to create the hybrid acceleration scheme in order to significantly improve the speedup of the HEVC encoding process.

Neural network algorithm
The HEVC algorithm reduces the bit rate of the encoded video at the cost of a considerable increase in the encoding complexity. One of the most time-consuming processes is the decision on the optimal quad-tree partitioning of each CTU. To find an optimal CTU partitioning from the 83522 possible partitions (see [35]), HEVC searches 85 CUs with different sizes ranging from 64 × 64 to 8 × 8 pixels. In addition to finding the correct CU depth structure, the prediction unit (PU) modes and the transform unit (TU) partitioning must be properly determined for each CU. Thus, the search for the optimal CTU structure requires the largest amount of time in the encoding process [37], since it uses a brute force approach to find the one with the minimum rate-distortion (RD) cost. Several schemes for reducing the computational cost of the CU partition have been reviewed in Sect. 1, some of which reduce the complexity of the algorithm at the cost of an increase in bit rate to maintain the reconstructed video quality; others replace the brute force search for R/D optimisation (RDO) with a deep neural network that is trained to estimate the CTU partitioning. Of the numerous complexity reduction schemes based on deep learning that have been proposed, we highlight the one presented by Xu et al. [28]. The main factor that differentiates this proposal from the alternatives is the definition of a hierarchical CU partition map (HCPM) to represent the CU partition. Given sufficient training data and an efficient HCPM representation, the authors propose a deep CNN structure called an early-terminated hierarchical CNN (ETH-CNN) that can be trained to explore various patterns for the CTU partition and thus reduce the complexity of the HEVC coding process.
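The two figures quoted above can be checked with a short calculation. The following Python sketch assumes a 64 × 64 CTU whose CUs may be split recursively down to 8 × 8 (depths 0 to 3) and counts only the CU quad-tree, not the PU or TU choices; the function names are illustrative.

```python
def count_cus(max_depth: int = 3) -> int:
    """Number of candidate CUs examined in a full quad-tree search."""
    # Depth d holds 4**d CUs: 1 (64x64) + 4 (32x32) + 16 (16x16) + 64 (8x8).
    return sum(4 ** d for d in range(max_depth + 1))


def count_partitions(depth: int = 0, max_depth: int = 3) -> int:
    """Number of distinct quad-tree partitions of a CU at the given depth."""
    if depth == max_depth:
        return 1  # an 8x8 CU cannot be split further
    # Either keep the CU whole (1 option) or split it into four sub-CUs,
    # each of which is partitioned independently of the others.
    return 1 + count_partitions(depth + 1, max_depth) ** 4


print(count_cus())         # 85
print(count_partitions())  # 83522
```

The recursion gives 1, 2, 17 and 83522 partitions for CUs of size 8 × 8, 16 × 16, 32 × 32 and 64 × 64, respectively, which illustrates why an exhaustive search is costly.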
A CTU has a size of 64 × 64 pixels by default, and can either contain a single CU or be recursively split into multiple smaller CUs, based on the quad-tree structure shown in Fig. 1.
In the CU partition structure in HEVC, four different CU sizes are supported by default; these are 64 × 64, 32 × 32, 16 × 16 and 8 × 8, corresponding to four CU depths of 0, 1, 2 and 3. For a coding unit U, the first-level binary label y_1(U) indicates whether U is split (= 1) or not (= 0). If U is split, its sub-CUs of depth one are U_i (i = 1, ..., 4), with second-level labels y_2(U_i); likewise, the sub-CUs of a split U_i are U_{i,j}, with third-level labels y_3(U_{i,j}). As stated above, in HEVC, the binary labels for splitting each CU are obtained using a time-consuming RDO process, but these can be predicted faster via a deep learning algorithm using a simple multi-class classification in a single call to ETH-CNN. Note that the input CTU is extracted from raw images, and only the Y channel is used in ETH-CNN. The structure of ETH-CNN consists of two pre-processing layers, three convolutional layers, and one concatenating layer [28]. Using this ETH-CNN structure, the model is trained to minimise the R/D loss function (see Equation (2)), and can finally be used to predict the CTU partitioning in the form of an HCPM. For each training sample r, the loss function LF_r sums the cross-entropy over all valid elements of the HCPM (see Equation (1)).
Here, ŷ denotes the labels of the hierarchical CU partition map predicted by ETH-CNN, r = 1, ..., NoTS indexes the training samples, and NoTS is the number of training samples. Moreover, H(y, ŷ) is the cross-entropy between the ground-truth (y) and the predicted labels (ŷ). The proposed ETH-CNN model is trained by optimising the global loss function (LF) shown in Equation (2).
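For reference, the two loss functions described above can be written out as follows. This is a sketch reconstructed from the description in the text, assuming that U_i and U_{i,j} denote the sub-CUs at depths 1 and 2 (the notation used for the splitting probabilities), that an element of the HCPM is valid when its parent CU is split in the ground truth, and that the global loss is the average of the per-sample losses; the exact form in [28] may differ in notation.

```latex
\begin{align}
LF_r &= H\bigl(y_1^r(U), \hat{y}_1^r(U)\bigr)
      + \sum_{\substack{i \in \{1,\dots,4\} \\ y_2^r(U_i)\ \text{valid}}}
        H\bigl(y_2^r(U_i), \hat{y}_2^r(U_i)\bigr)
      + \sum_{\substack{i,j \in \{1,\dots,4\} \\ y_3^r(U_{i,j})\ \text{valid}}}
        H\bigl(y_3^r(U_{i,j}), \hat{y}_3^r(U_{i,j})\bigr) \tag{1} \\
LF &= \frac{1}{NoTS} \sum_{r=1}^{NoTS} LF_r \tag{2}
\end{align}
```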
Given an input CTU, ETH-CNN provides the splitting probabilities at each level, P_1(U), P_2(U_i) and P_3(U_{i,j}), for the binary labels y_1(U), y_2(U_i) and y_3(U_{i,j}), to predict the CU partitioning. In general, a decision threshold of 0.5 is set for each level l = 1, 2 and 3. Hence, a CU whose splitting probability at level l exceeds this threshold (P_l(U) > 0.5) is split into four sub-CUs. The authors of [28] also provide a network for inter-coding called ETH-LSTM. However, as our proposal is focused on intra-coding, we use the ETH-CNN network, which was specially developed for intra-coding.
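The thresholding rule above can be sketched as follows. This is a minimal illustration, not the implementation of [28]: the function name `predict_hcpm`, the dictionary layout and the probability values in the usage example are all hypothetical, and the threshold of 0.5 is applied at every level as described in the text.

```python
from typing import Dict, Tuple

THRESHOLD = 0.5  # decision threshold used at levels 1, 2 and 3


def predict_hcpm(p1: float,
                 p2: Dict[int, float],
                 p3: Dict[Tuple[int, int], float]) -> Dict[tuple, int]:
    """Turn splitting probabilities into binary split labels.

    p1       : P_1(U), probability that the 64x64 CU is split
    p2[i]    : P_2(U_i), probability that the i-th 32x32 sub-CU is split
    p3[i, j] : P_3(U_ij), probability that the j-th 16x16 sub-CU of U_i is split
    Keys of the result: () for U, (i,) for U_i, (i, j) for U_ij.
    Children of an unsplit CU receive no label (early termination).
    """
    labels: Dict[tuple, int] = {(): int(p1 > THRESHOLD)}
    if labels[()] == 0:
        return labels  # CTU kept as a single 64x64 CU
    for i in range(1, 5):
        labels[(i,)] = int(p2[i] > THRESHOLD)
        if labels[(i,)] == 0:
            continue  # U_i kept as a 32x32 CU
        for j in range(1, 5):
            labels[(i, j)] = int(p3[(i, j)] > THRESHOLD)
    return labels


# Hypothetical probabilities: the CTU is split, and only U_2 is split again.
p2 = {1: 0.1, 2: 0.9, 3: 0.3, 4: 0.2}
p3 = {(2, j): p for j, p in zip(range(1, 5), (0.8, 0.4, 0.6, 0.1))}
hcpm = predict_hcpm(0.7, p2, p3)
```

In this example, the third-level probabilities are only consulted for U_2, since the other three 32 × 32 sub-CUs are not split.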

Slice-based parallel algorithm
The HEVC standard allows each frame of a video source to be segmented into a set of CTUs, each of which can be configured as an independent block that can be encoded in parallel. The HEVC standard offers two options for dividing the video source to be encoded into independent sets of CTUs: slice and tile partitioning. Slices are sets of consecutive CTUs where the number of CTUs in each set is the same for all slices (except where necessary for the last slice containing the CTUs in the lower right-hand corner of the frame). In the HEVC standard, the number of CTUs per slice needs to be established. The sizes of the slices (in terms of the numbers of CTUs) will determine the number of slices in each frame, depending on both the resolution of the video sequence to be encoded and the size of the CTUs. Note that each CTU is a square set of pixels for which the size is set to 64 × 64 pixels, as specified in the HEVC common test conditions [38].
As each slice contains a data header, it can be decoded independently of the others, even if the data from the others are not available when decoding. Since the size of the header can affect the compression ratio (i.e. the number of bits per pixel in the compressed bit stream), the number of slices in the proposed parallel algorithm should be established with care, in order to avoid an excessive bitstream overhead (see [39]). Each encoding process calculates the slice size, expressed in number of CTUs, depending on (a) the number of CTUs in a frame; (b) the identifier of the encoding process, I_EP; and (c) the total number of available encoding processes, N_EP, as indicated in Algorithm 1. The size of the last slice (in the lower right-hand corner) is either equal to or smaller than the rest of the slices, and its size S_Slice is determined based on the number of processes according to Algorithm 1.
The slice partitioning process in Algorithm 1 aims to achieve a balanced computational load, in which domain decomposition is performed to assign each process the same (or a similar) amount of data. Note that if the computational load assigned to each process is evaluated based on the number of CTUs in a frame, N_CTUs, it is only possible for the encoding process of the last slice to have an imbalanced computational load. Depending on the resolution of the video sequence to be encoded, there may also be CTUs at the right-hand or bottom edges of a frame with fewer than 4096 (64 × 64) pixels. Figure 2 shows two examples: partitioning into two slices of 52 CTUs each, and partitioning into six slices, where the first five slices contain 18 CTUs each and the last slice contains 14. In the last slice, only the first CTU has 4096 (64 × 64) pixels, and the remaining 13 CTUs have only 2048 (64 × 32) pixels. Once the slices have been assigned to the processes, each process must encode the CTUs contained in the assigned slice, and for each CTU, the quad-tree structure must be computed using the brute force R/D algorithm as described in Sect. 2.1.
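As an illustration of the balanced partitioning described above, the sketch below computes the slice assigned to each encoding process. Algorithm 1 is not reproduced in this text, so the ceiling-division rule used here is an assumption, chosen because it matches both examples given for Fig. 2 (104 CTUs split into two and into six slices); the function names are illustrative.

```python
import math
from typing import List, Tuple


def slice_range(n_ctus: int, i_ep: int, n_ep: int) -> Tuple[int, int]:
    """Return (first_ctu, size) of the slice for process i_ep (0-based)."""
    # Assumed rule: every slice holds ceil(N_CTUs / N_EP) CTUs,
    # except the last one, which takes whatever remains.
    s_slice = math.ceil(n_ctus / n_ep)
    first = i_ep * s_slice
    size = max(0, min(s_slice, n_ctus - first))
    return first, size


def slice_sizes(n_ctus: int, n_ep: int) -> List[int]:
    """Slice sizes (in CTUs) for all processes, in raster order."""
    return [slice_range(n_ctus, i, n_ep)[1] for i in range(n_ep)]


print(slice_sizes(104, 2))  # [52, 52]
print(slice_sizes(104, 6))  # [18, 18, 18, 18, 18, 14]
```

Under this rule, only the last process can receive fewer CTUs than the others, which is consistent with the load-balancing remark above.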
In order to significantly reduce the computing time of the HEVC encoding process, we propose a hybridised scheme that includes both a deep learning approach to predict the CU partition and a parallel processing scheme based on slice partitioning, and this is described in the next section.

Hybrid acceleration proposal
The deep learning algorithm described in Sect. 2.1 and the slice-based parallel algorithm in Sect. 2.2 can be complemented by allowing for parallelisation and pre-calculation of CTU partitioning through deep learning. A general flowchart for the proposed hybrid algorithm is shown in Fig. 3. The slice-based parallel algorithm is represented using red boxes, while the blue ones represent the contribution from deep learning. In the first step, all of the OpenMP threads read the configuration parameters and process a set of frames that depends on the total numbers of frames and threads. Each thread computes the HCPM for all the CTUs in the assigned frame set, and the partition map is stored in memory so that it can be accessed by all threads when the CTU partitioning tree is computed for a given slice. Once all the HCPMs have been generated and saved in a concurrent manner (which yields an improvement in computation time compared to other approaches), all threads are synchronised to encode each frame. In this sense, the slice-based parallel algorithm is applied at a higher level. As shown in Fig. 3, only the master thread reads the new frame to be encoded, in order to reduce both the number of disk accesses and the memory requirements. The frame to be encoded will therefore be stored in the shared memory, and will be accessed only for reading. In fact, each thread will only access those CTUs that are part of the slice to be encoded by it. The prediction for the CTU partition obtained from the deep learning approach is used when coding the set of CTUs for the slice assigned to each thread. When each thread has encoded the slice assigned to it, it writes its bit stream into the final bit stream, and this process must be done in the right order, as shown in Fig. 3. Hence, thread 0 is the first to become idle after storing its computed part of the bitstream. This thread can then start reading or receiving the new data, while the rest of the OpenMP threads finish writing to the bitstream.
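The two-phase control flow described above can be sketched as follows. This is a structural illustration only: Python threads stand in for the OpenMP threads, each phase-1 task handles a single frame rather than a frame set, and `predict_hcpm_for_frame` and `encode_slice` are hypothetical placeholders for the ETH-CNN inference and the slice encoder, which are not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def predict_hcpm_for_frame(frame_id: int) -> str:
    """Placeholder for ETH-CNN inference over all CTUs of one frame."""
    return f"hcpm{frame_id}"


def encode_slice(frame_id: int, slice_id: int, hcpm: str) -> str:
    """Placeholder for encoding one slice using the predicted partition."""
    return f"f{frame_id}s{slice_id}[{hcpm}]"


def hybrid_encode(n_frames: int, n_threads: int) -> List[str]:
    bitstream: List[str] = []
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Phase 1: the partition maps of all frames are computed concurrently
        # and kept in shared memory. Consuming the map() results acts as the
        # synchronisation point before encoding starts.
        hcpms: Dict[int, str] = dict(
            zip(range(n_frames),
                pool.map(predict_hcpm_for_frame, range(n_frames))))
        # Phase 2: frames are processed one at a time, one slice per thread.
        for frame_id in range(n_frames):
            futures = [pool.submit(encode_slice, frame_id, s, hcpms[frame_id])
                       for s in range(n_threads)]
            # Sub-bitstreams are appended in slice order, regardless of
            # which thread finishes first.
            bitstream.extend(f.result() for f in futures)
    return bitstream


stream = hybrid_encode(n_frames=2, n_threads=3)
```

Collecting the futures in submission order mirrors the requirement that the slices of a frame be written to the final bitstream in the right order.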


Experimental results
In this section, we present the results of a set of experiments carried out to validate the effectiveness of our proposal. To evaluate the intra-frame coding performance of our hybrid scheme, we compare the slice-based parallel approach proposed in [13], the deep learning approach proposed in [28] and the proposed hybrid approach. All three methods are based on the HEVC reference software HM version 16.3 [40] (which was used as a benchmark), and the AI configuration was applied using the default configuration file encoder_intra_main.cfg. Four QP values (22, 27, 32, 37) were chosen for compression of the selected video sequences, as recommended by the HEVC common test conditions [38]. All experiments were conducted on a server with two processors (Intel(R) Xeon(R) Gold 6140 @ 2.30 GHz) with 18 cores per processor, 400 GB RAM, four Tesla P100-PCIE GPUs and CentOS Linux release 7.6.1810 as the operating system. For the deep learning approaches, we used TensorFlow 1.8 with GPU support for CUDA 9.1 and [28]. Eleven video sequences from the JCT-VC standard test set [38] were used to evaluate and compare our method, as summarised in Table 1. Table 2 shows the speedup and Bjøntegaard delta bit rate (BD-BR) [41] obtained for the Class A video sequences using the schemes in [13,28] and our proposed approach (Prop.). The time reduction is expressed as a speedup in order to directly relate the coding latency to the number of OpenMP threads (Th.) used. All the speedups and the values for the BD-rate were obtained with respect to the reference software, HM version 16.3 [40].
The experimental results from the deep learning approach were similar to those obtained by the authors of [28]; for example, for the Traffic sequence, a reduction of 73.7% in the execution time was achieved for QP = 37, corresponding to an average speedup of 3.7×. The OpenMP approach described in [13] gave speedups of up to 14.65× for 16 threads for the same video sequences, with an efficiency of 75% (where efficiency is defined as the ratio of useful work to the resources expended by each thread in each core). This was as expected, since a slice-based distribution is more efficient for higher-resolution video sequences where the computational load can be equally distributed, as described by the authors of [13]. The proposed approach, which combines both strategies, is able to considerably reduce the coding times. For example, for the BQMall Class C video sequence encoded with QP = 37, a speedup of 37.9× was achieved for 16 threads. These results clearly show that a combination of slice-based parallelisation with a reduction in complexity from deep learning can provide significant levels of acceleration for HEVC intra-frame coding, which are greater than the accelerations obtained by the schemes in [28] and [13] (2.96× and 14.12×, respectively). In a practical scenario where the speed of intra-coding is decisive, the proposed solution offers a much higher performance than all the proposals described in Sect. 1.
The reduction in the complexity of the HEVC intra-frame coding mode is achieved at the expense of a loss of R/D performance. Tables 2, 3, 4 and 5 show the values of BD-BR used to evaluate the R/D performance of the proposed scheme and the other two alternatives [13,28]. As expected, the BD-BR for our hybrid proposal is approximately the sum of the penalties obtained by the approaches in [28] and [13]. For example, it can be seen from Table 5 that for QP = 37, the algorithm proposed in [28] shows an increase in the BD-rate of 1.43% for RaceHorses, whereas the penalty obtained by the algorithm proposed in [13] is 1.76% for 16 threads. Finally, our hybrid model has a penalty of 3.22% for the BD-rate. From an analysis of these results, it can be concluded that deep learning and parallelism do not interfere with or cancel each other out in terms of the video quality.
Table 4 Speedup and BD-BR for Class C video sequences
In Fig. 4, we show the speedup behaviour of the three schemes under evaluation as the number of working threads increases, for three different Class B video sequences encoded with a QP value of 22. For the deep learning approach, the speedup is constant, as it does not use threads, whereas for the slice-based approach, we find a speedup progression that indicates good scalability behaviour, which is maintained for our hybrid proposal.
Finally, Table 6 shows the R/D performance results and the time reductions achieved by several schemes in the literature and the approach presented in this work. These results show that our scheme is able to achieve the greatest time reductions, with values that are consistently above 90%, and R/D performance losses of under 5% on average. However, if the increase in bitrate is unacceptable, a slower configuration (with a lower number of threads) may be chosen, which gives a smaller R/D loss.
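To relate time reductions to speedups, the two measures can be converted into each other. The following worked relation assumes that the time reduction ΔT is defined as the fraction of the reference encoding time that is saved; the symbols T_ref, T and ΔT are introduced here for illustration only.

```latex
\begin{align*}
S &= \frac{T_{\mathrm{ref}}}{T}, \qquad
\Delta T = \frac{T_{\mathrm{ref}} - T}{T_{\mathrm{ref}}} = 1 - \frac{1}{S}
\quad\Longleftrightarrow\quad S = \frac{1}{1 - \Delta T}, \\
\Delta T &= 0.90 \;\Rightarrow\; S = 10, \qquad
S = 35 \;\Rightarrow\; \Delta T = 1 - \tfrac{1}{35} \approx 0.971 .
\end{align*}
```

Under this definition, a time reduction above 90% corresponds to a speedup above 10×.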

Conclusions
In this paper, we present a powerful technique to accelerate an HEVC encoder in the intra-frame coding mode. Our scheme combines two different approaches and exploits their characteristics to reap the benefits of both, and can considerably increase the speedup. Our proposed algorithm combines a slice-based parallel proposal for shared memory systems with a deep learning approach. Although each scheme obtains a significant speedup when applied separately, a combination of both approaches considerably accelerates the HEVC encoder and achieves time savings of more than 90%. Our experimental results show a coding acceleration of up to 35×. There have been many attempts in the literature to speed up intra-encoding in HEVC, but they have not been jointly exploited. Our scheme achieved an acceleration of 35× with regard to the reference software, without the need for additional hardware. However, this acceleration was obtained at the expense of a loss of R/D performance. In our experiments, the maximum BD-rate penalty was 10.14% and the minimum was -0.9%. It was found that the two base algorithms did not interfere with each other, as the results for the BD-rate obtained by the hybrid algorithm were approximately the sum of the penalties of both algorithms. Due to the high level of computational complexity of the newest video coding standards, hybrid approaches that combine different acceleration techniques will be necessary in order to reduce the computational requirements. As a future line of research, we plan to use two levels of parallelisation based on heterogeneous platforms (shared and distributed memory) in order to get closer to real-time encoding with no change in the coding performance.
Fig. 4 Speedup behaviour versus number of threads for the approaches in [13,28] and the proposed scheme (Prop.); panels (a), (b) and (c)