Probabilistic Approach Versus Machine Learning for One-Shot Quad-Tree Prediction in an Intra HEVC Encoder

Evolutions of the Internet of Things (IoT) in the coming years are likely to boost mobile video demand to an unprecedented level. A large number of battery-powered systems will integrate an HEVC video codec, implementing the latest MPEG encoding standard, and these systems will need to be energy efficient. Constraining the energy consumption of HEVC encoders is a challenging task, especially for embedded applications based on software encoders. The most efficient approach to reduce the energy consumption of an HEVC encoder consists in optimizing the quad-tree block partitioning of the image, trading off compression efficiency against energy consumption by efficiently choosing near-optimal pixel block sizes. To reduce the energy consumption of a real-time HEVC Intra encoder, this paper proposes and compares two methods that predict the quad-tree partitioning in “one-shot”, i.e. without iterating. These methods drastically limit the computational cost of the recursive Rate-Distortion Optimization (RDO) process. The first proposed method uses a probabilistic approach whereas the second is based on Machine Learning. Experimental results show that both methods are capable of reducing the energy consumption of an embedded HEVC encoder by 58% for a bit rate increase of 3.93% and 3.6%, respectively.


Introduction
With the progress of microelectronics, many embedded applications now encode and stream live video contents. The HEVC standard [33,34,42] provides a significant compression gain over previous video coding standards [35,37]. This gain reduces the energy needed for transmitting video. On the other hand, the computational complexity of the encoders has significantly increased. The additional complexity brought by HEVC is mostly due to the new quad-tree block partitioning structure of Coding Tree Units (CTUs) and the increase in the number of Intra prediction modes, which exponentially impact the Rate-Distortion Optimization (RDO) process [20].
The main limitation of recent embedded systems, particularly in terms of computational performance, comes from the bounded energy density of batteries. This limitation is a major constraint for image and video applications, video encoding and decoding being for instance the most energy-consuming algorithms on smart phones [3]. A large share of systems are likely to integrate the HEVC codec in the long run and will require to be energy efficient, and even energy aware. As a consequence, energy consumption represents a serious challenge for embedded HEVC real-time encoders. For both hardware and software codecs, a solution to reduce energy consumption is to decrease the computational complexity while controlling compression quality losses.
To reduce the computational complexity of HEVC encoders, several algorithmic solutions have been proposed at the level of quad-tree partitioning. Indeed, choosing the right encoding block sizes is necessary to obtain a good compression ratio, but this choice is difficult and usually results from a costly RDO process. The exhaustive search partitioning solution is the optimal one, obtained by testing all possible partitioning configurations and selecting the one that minimizes the Rate-Distortion (RD)-cost. This process is the most time consuming operation in an HEVC encoder and thus it offers the biggest opportunity of complexity reduction (up to 78% in the considered embedded encoder) [20]. Complexity reduction solutions at the quad-tree level consist in predicting, without encoding, the adequate level of partitioning that offers the lowest RD-cost. As examples of related works, authors in [31] and [4] propose to use the correlation between the minimum depth of the co-located CTUs in the current and previous frames to skip computing some depth levels during the RDO process. Authors in [1,9,15,25,40] use CTU texture complexities to predict the quad-tree partitioning. All these solutions are based on reducing the complexity of an offline (i.e. non-real-time) costly reference encoder called the HEVC test Model (HM). In this paper, we target energy reduction in the real-time context of an optimized software encoder. A real-time encoder such as Kvazaar is up to 10 times faster than HM [39]. The complexity reduction performance of state-of-the-art solutions based on HM is biased since it is measured with respect to a large compression time. The complexity overhead of state-of-the-art solutions is thus comparatively higher in the context of a real-time encoder.
We propose in this paper two energy reduction methods for HEVC Intra encoders based on a CTU partitioning prediction technique. We then compare these methods that drastically limit the recursive RDO process. The first method exploits the correlation between the CTU partitioning and the variance of the CTU luminance samples to predict the quad-tree decomposition in one-shot. The second method uses a Machine Learning approach to perform the same prediction. Machine Learning is an interdisciplinary subfield of computer science that aims to replace manually engineered solutions for extracting information from sensed data. The two methods are compared in terms of prediction accuracy of the quad-tree partitioning as well as in terms of compression performance.
The rest of this paper is organized as follows. Section 2 presents an overview of the HEVC intra encoder and goes through State-of-the-Art methods of complexity reduction techniques. Section 3 details the first proposed probabilistic algorithm of quad-tree partitioning prediction based on variance studies. Section 4 presents the second proposed Machine Learning algorithm of quad-tree partitioning prediction. The two proposed energy reduction schemes are then compared in term of quad-tree partitioning accuracy and performance in Section 5. Finally, Section 6 concludes the paper.

HEVC Encoding and its Rate Distortion Optimisation
An HEVC encoder is based on a classical hybrid video encoder structure that combines Intra-image and Inter-image predictions. While encoding in HEVC, each frame is split into equally-sized Coding Tree Units (CTUs) (Fig. 1). Each CTU is then divided into Coding Units (CUs), appearing as nodes in a quad-tree. A CU gathers the coding information of one block of luminance and two blocks of chrominance (in 4:2:0 representation). In HEVC, the size, in luminance pixels, of CUs is equal to 2N × 2N with N ∈ {32, 16, 8, 4}. The HEVC encoder first predicts the units from their neighbourhood (in space and time). To perform the prediction, CUs may be split into Prediction Units (PUs) of smaller size. In intra prediction mode, PUs are square and have a luminance size of 2N × 2N (or N × N only when N = 4), which can be associated to a quad-tree depth range d ∈ {0, 1, 2, 3, 4}, as illustrated in Fig. 1.
The HEVC intra-frame prediction is complex and supports in total N_pm = 35 modes performed at the level of the PU, including the planar (surface fitting) mode, the DC (flat) mode and 33 angular modes [33]. Each mode corresponds to a different assumption on the gradient in the image. To achieve the best RD performance, the encoder performs an exhaustive search process, named Rate-Distortion Optimization (RDO), testing all possible combinations of quad-tree partitioning and the 35 Intra prediction modes. The Quantization Parameter (QP) impacts the RDO process to tune quality and bitrate. For a given CTU, an RDO exhaustive search tests N_t different decompositions and prediction modes, where N_t = N_pm × Σ_{d=0}^{4} 4^d. This set of tests is the main cause of the HEVC encoding complexity and the target of the energy optimization process developed in this paper.
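Assuming every square PU size from depth 0 to 4 of a 64 × 64 CTU is tested with all 35 modes (an idealized count; real encoders share some computations), the order of magnitude of N_t can be sketched as:

```python
# Rough estimate of the number of prediction tests per 64x64 CTU during
# an exhaustive RDO search (a sketch; exact counts depend on the encoder).
N_PM = 35  # planar + DC + 33 angular intra modes

def rdo_tests_per_ctu(max_depth=4, modes=N_PM):
    # At depth d, a CTU contains 4**d square blocks, each tested with all modes.
    return modes * sum(4 ** d for d in range(max_depth + 1))

print(rdo_tests_per_ctu())  # 35 * (1 + 4 + 16 + 64 + 256) = 11935
```

The exponential growth with depth is what makes one-shot prediction of the partitioning so profitable.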

Software Real-Time HEVC Encoder
For embedded applications, hardware encoding solutions [27] consume much less energy than software solutions. However, when the considered system does not embed a hardware coprocessor, a software HEVC encoder [13,24,36,38] can be used, for instance the HEVC reference software model (HM). HM is widely used, as it has been designed to achieve an optimal coding efficiency (in terms of RD). However, the computational complexity of HM is high and not adapted to embedded applications. To fill this gap, the x265 [24], f265 [38] and Kvazaar [36] HEVC software encoders provide real-time encoding solutions, leveraging parallel processing and low-level Single Instruction Multiple Data (SIMD) optimizations for different specific platforms.
This study is based on the Kvazaar HEVC encoder [36] for its real-time encoding capacity of Ultra High Definition (UHD) videos. The conclusions of this study can however be extended to other real-time software or hardware encoders, as they all depend on a classical RDO process to reach high compression performance.

Complexity Reduction of the Quad-Tree Partitioning
As shown in [20], in a real-time software HEVC Intra encoder, two specific parts of the encoding algorithm provide the highest opportunities of energy reduction: the Intra prediction (IP) level offers at best 30% of energy reduction whereas the CTU quad-tree partitioning level has a potential of energy reduction of up to 78%. Previous studies on low complexity CTU quad-tree partitioning can be classified into two categories: the early termination complexity reduction techniques, which are applied during the RDO process to dynamically terminate the process when further gains are unlikely, and the prediction-based complexity reduction techniques, which are applied before starting the RDO process and predict the quad-tree partitioning with lower complexity processing. In this paper, we focus on prediction-based complexity reduction techniques.
Authors of [4,31,44] propose to reduce the complexity of the HEVC encoder by skipping some depth levels of the quad-tree partitioning. The skipped depths are selected based on the correlation between the minimum depth of the co-located CTUs in the current and previous frames. Results in [4] show an average time saving of 45% for a Bjøntegaard Delta Bit Rate (BD-BR) increase of 1.9%. For the algorithm from [31], results show an average complexity reduction of 21%. Concerning [44], experimental results show that the method can save about 48% of encoding time for a BD-BR increase of 2.9%. In this paper, the objective of the study is to demonstrate a drastic energy reduction in a real-time encoding setup by predicting the CTU partitioning. As a consequence, higher energy reductions are obtained at the expense of higher BD-BR increases.
Works in [1,9,15,25,40] use CTU texture complexities to predict the quad-tree partitioning. Min et al. [1] propose to decide if a CU has to be split, non-split or if it is undetermined, using the global and local edge complexities in four different directions (horizontal, vertical, 45° and 135° diagonals) of the CU and sub-CUs. This method provides a computational complexity reduction of 52% (in the non-real-time HM) for a BD-BR increase of 0.8%. Feng et al. [9] use the information entropy of CU and sub-CU saliency maps to predict the CU sizes. This method reduces the complexity by 37.9% (in HM) for a BD-BR increase of 0.62%.
Khan et al. [15] propose a method using texture variance to efficiently predict the CTU quad-tree decomposition. The authors model the Probability Density Function (PDF) of variance populations by a Rayleigh distribution to estimate some variance thresholds and determine the quad-tree partitioning. This method reduces the complexity by 44% (in HM) with a BD-BR increase of 1.27%. Our experiments have shown that the assumption of a Rayleigh distribution is not verified in many cases. For this reason, our Probabilistic proposed method, based on the variance, does not consider the Rayleigh distribution and thus differs from [15].
In [26], Penny et al. propose the first Pareto-based energy controller for an HEVC encoder. The following results, averaged over one sequence of each video class (A, B, C, D and E), are extracted from [26]: for an energy reduction from 49% to 71% (in HM), the authors achieve a BD-BR increase between 6.84% and 25%, respectively.
Several works have been proposed that use Machine Learning based optimization to reduce the complexity of the HEVC encoding process. Authors of [29,30] propose an Intra CU size classifier based on data-mining with an offline classifier training. The classifier is a three-node decision tree using the mean and variance of CUs and sub-CUs as characteristics. This algorithm reduces the coding time by 52% (in HM) at the expense of a BD-BR increase of 2%. Duanmu et al. [7] present a fast CU partitioning using Machine Learning for screen content video compression. The authors use several characteristics such as CU luma variance, color Kurtosis of the CU and gradient Kurtosis of the CU. Shen and Yu [32] propose a CU splitting early termination algorithm based on a Support Vector Machine (SVM). The RD cost losses due to misclassification are used as features (weights) during the SVM training. In [43], authors model the coding tree determination in HEVC as a three-level hierarchical decision problem using SVM predictors.
These studies are all based on complexity reduction of the HM software encoder and their performance can not be directly translated to real-time encoders. The two methods proposed in the next sections are studied within a real-time optimized encoder and demonstrate high prediction efficiency.

Probabilistic Approach for Predicting an HEVC Quad-Tree Partitioning
The aim of the techniques proposed in this paper is to replace the brute force scheme usually employed in HEVC encoders by a low complexity algorithm that predicts in one-shot the CTU partitioning for Intra prediction without testing all possible decompositions. Following a bottom-up approach (from CU 4 × 4 to 32 × 32), the main idea is to determine the best partitioning of a given CU between 2N × 2N pixels and N × N pixels sub-blocks. Figure 2 illustrates the classification problem which predicts whether the CUs of depth d have to be merged into a CU of depth d − 1.

Figure 3 Quad-tree partitioning of the 6th HEVC intra coded frame of the BasketballDrive sequence. The green (resp. blue) circle shows that the lowest (resp. highest) variance regions tend to be encoded with larger (resp. smaller) units.
It has been shown that the CTU partitioning resulting from the RDO process is highly linked to the QP value and to the texture complexity, which can be statistically represented by the variance of blocks in Intra coding [1,15,40]. Figure 3 shows the CU boundaries of the 6th frame of the BasketballDrive video sequence. It is worth noting that the regions with the lowest variance (smooth) tend to be encoded with larger blocks, as illustrated by the green circle in Fig. 3, while the blue circle shows a region with a high variance (high local activity), which is encoded with smaller blocks. In this section, we use this correlation between the pixel values of a block (variance) and its CTU partitioning to predict the quad-tree decomposition of a CTU and thus drastically reduce the encoding complexity.

Variance-Based Decision for Quad-Tree Partitioning
To study how to predict the quad-tree partitioning from the variance values of CU luminance samples, two populations of CUs at a current depth d are defined: Merged (M) and Non Merged (NM). A CU belongs to the Non Merged population when the full RDO process chooses to encode the CU at the current depth d, while the CU belongs to the Merged population when the RDO process chooses to encode it at a lower depth d′ with d′ < d. With a bottom-up approach (i.e. d from 4 to 1), all CUs of the quad-tree decomposition of all CTUs can be classified into one of these two populations.

Figure 4 CDFs of the Merged population depending on CU size for the sixth frame of the sequence BasketballDrive. Under a specific probability ε, a variance threshold can be extracted from the inverse CDF curve to classify a block as Merged.

The Cumulative Distribution Function (CDF) of the Non Merged population can be used to decide whether a CU has to be merged or not. In our case, the CDF defines the probability of the variance population of a given CU size being less than or equal to a given value. Figure 4 shows the CDFs of CU variances depending on CU size for the sixth frame of the BasketballDrive video sequence. The CDF curves show that the probability for a CU size to be selected during the RDO process decreases when the variance of the CU increases. In other words, it is rare for a CU to have a variance greater than a certain threshold. From this observation, a variance threshold υ_th(ε, d) for each depth d can be extracted from the inverse CDF curve for a specific probability ε. For example, Fig. 4 shows that 80% (ε = 0.8) of CUs 8 × 8 (d = 3) have a variance less than 555, represented by the green dotted lines in Fig. 4. ε is the percentage of coding units whose variance is under the threshold υ_th, the variance threshold that triggers a unit split. Table 1 shows the thresholds υ_th(ε, d) for d ∈ {1, 2, 3, 4} extracted from the CDFs of the 50th frame of the two sequences PeopleOnStreet and ParkScene. The table illustrates the large variation of the threshold value across different video contents. In fact, the thresholds depend on the video content and thus have to be determined on-the-fly from a Learning Frame (F_L).
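The threshold extraction can be sketched as an empirical inverse CDF: sort the variance population of a given CU size and pick the value below which a fraction ε of the samples falls. The variance values below are made up for illustration:

```python
def variance_threshold(variances, eps):
    """Empirical inverse CDF: the variance value below which a fraction
    eps of the population lies (nearest-rank quantile)."""
    s = sorted(variances)
    k = max(0, min(len(s) - 1, int(eps * len(s)) - 1))
    return s[k]

# Hypothetical variance population of 8x8 CUs from a Learning Frame
pop = [10, 40, 80, 120, 200, 310, 420, 555, 900, 1500]
print(variance_threshold(pop, 0.8))  # 555: 80% of CUs are at or below it
```

In the actual method, one such population per CU size is collected from the Learning Frame and one threshold per depth is derived from it.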

Variance Threshold Modelling
Since the thresholds have to be adapted to the video content, they have to be determined on-the-fly from a Learning Frame, i.e. a frame encoded with a full (unconstrained) RDO process. The modelling of the thresholds υ_th(ε, d) could have been conducted using variance PDFs with an approximation of the distribution based on a commonly known probability distribution, but we observed that starting from a CDF curve performs better [21]. An approximation of the thresholds directly from the input features of the Non Merged population provides good results for the CDF curve. From this observation, υ_th(ε, d) can be modeled using the following linear relation:

υ_th(ε, d) = a(ε, d) · μ_υd + b(ε, d) · σ_υd + c(ε, d)    (2)

where μ_υd and σ_υd are the mean and standard deviation of the variance population at depth d. We can summarize the above analysis in the following steps:
-Thresholds from CDFs of variances can be predicted from a reference Learning Frame (F_L).
-Look-Up Tables (LUTs) require light computation and memory overheads for the determination of the thresholds.
-The prediction of thresholds is independent from the QP value (Fig. 5).
-Threshold modelling is accurate, with a mean Rsq of 0.86 for the different depths.
-Thresholds can be precomputed according to the value of ε, taken as a parameter.
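Assuming the linear form of Eq. 2 with offline-trained coefficients a, b, c indexed by (ε, d), the on-the-fly threshold computation is only a LUT lookup and two multiply-adds. The coefficient values below are placeholders, not trained values:

```python
# Sketch of Eq. 2: the threshold is a linear function of the mean and
# standard deviation of the Learning Frame variance population.
LUT = {
    # (eps, depth): (a, b, c)  -- placeholder coefficients
    (0.6, 3): (0.9, 1.2, 15.0),
}

def threshold(eps, d, mean_var, std_var, lut=LUT):
    a, b, c = lut[(eps, d)]
    return a * mean_var + b * std_var + c

print(threshold(0.6, 3, 300.0, 200.0))  # 0.9*300 + 1.2*200 + 15 = 525.0
```

This is why the overhead on constrained frames stays negligible: the costly statistics are gathered only once per Learning Frame.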
The next section describes our first proposed algorithm to predict the CTU partitioning using a variance criterion and the obtained thresholds υ_th(ε, d).

Probabilistic Prediction of the CTU Partitioning
A description of the CTU partitioning is needed to explicitly depict the prediction of the quad-tree and then force the encoder to only encode this specific decomposition. The following section describes the proposed algorithm and the prediction scheme that predicts in one-shot the CTU partitioning using a variance criterion and the thresholds υ_th(ε, d) described in Section 3.1.

Computing the CTU Depth Map
For a given CTU, let υ_d(i, j) be the variance of the luminance sample block of size 2^(6−d) × 2^(6−d) at depth level d and local coordinates (i, j) within the CTU, as illustrated in Fig. 7.
Algorithm 1 describes our proposed probabilistic algorithm that predicts in one-shot the CTU partitioning. The algorithm takes as inputs the luminance samples of the CTU and the table of thresholds υ_th previously computed by Eq. 2, and generates the Coding Depth Map (CDM) associated to the input CTU. In other words, the goal of this algorithm is to determine, from the variance of the luminance samples, the CDM matrix of the CTU. The encoder then only has to use the predicted depths instead of running an RDO process to encode the video, significantly reducing the complexity. First of all, the full CDM is initialized with the depth value 4 (line 1) and all the variances υ_4(x, y) of the CUs 4 × 4 are computed (line 2) using Eq. 3, where p_x,y(i, j) is the luminance component of the sample at coordinates (i, j) in the CU 4 × 4 at position (x, y) and p̄_x,y is the average value of the block. For each depth from 4 down to 1, the algorithm checks whether the 4 neighboring blocks in the Z-scan order have the same depth d; since the algorithm is bottom-up, there is no need to test this condition when d = 4 because the CDM is initialized at d = 4, which is the starting depth of the algorithm. If this condition is true, the algorithm tests whether the blocks have to be merged or not using the variance criterion (line 8) previously detailed in Section 3.1. If the 4 block variances υ_d are lower than the given threshold υ_th(ε, d), then the blocks are merged, the corresponding elements in the CDM are set to d − 1 (line 9) and the variance of the merged block is calculated by applying the combined variance Eq. 4 three times.
Equation 4 computes the variance of the union of two data sets a and b containing the same number of observations n, with μ and υ corresponding to the mean and the variance of the specified data set, respectively [5]: υ_{a∪b} = (υ_a + υ_b)/2 + ((μ_a − μ_b)/2)².
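The combined-variance identity for two equal-size sets can be checked numerically; merging four sub-blocks then takes three pairwise combinations. The luminance samples below are made up:

```python
import statistics

def combine(mu_a, var_a, mu_b, var_b):
    """Mean and (population) variance of the union of two equal-size sets,
    computed only from their per-set moments."""
    mu = (mu_a + mu_b) / 2
    var = (var_a + var_b) / 2 + (mu_a - mu_b) ** 2 / 4
    return mu, var

# Check against a direct computation on made-up luminance samples.
a = [10, 12, 15, 19]
b = [40, 42, 47, 51]
mu, var = combine(statistics.fmean(a), statistics.pvariance(a),
                  statistics.fmean(b), statistics.pvariance(b))
print(abs(var - statistics.pvariance(a + b)) < 1e-9)  # True
```

This is what lets the algorithm propagate variances up the quad-tree without ever re-reading the pixel samples of the merged blocks.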

Refining the CTU Depth Map
To increase the accuracy of the one-shot depth map prediction with a limited impact on the complexity, a second algorithm is designed that refines the CDM. The algorithm, described by Algorithm (2), takes as input a CDM matrix from Algorithm (1) and generates a second CDM called RCDM. The RCDM is the result of merging all groups of four neighboring blocks (in the Z-scan order) having the same depth in the input CDM. Algorithm (2) details the process as follows.
The first step checks whether the input CDM depth is equal to 0; if so, no merge can be applied and the RCDM is also set to 0 (line 2). If not, the CDM is analysed element by element (lines 4-5). Since a depth of 4 in a CDM corresponds to 4 CUs 4 × 4, they are always merged to a depth 3 and thus the value in the RCDM is automatically set to 3 (line 7). Figure 6 shows an example of a CDM (Fig. 6a) and its associated RCDM (Fig. 6b). The grey blocks in the RCDM (Fig. 6b) represent the merged blocks. The next section describes our probabilistic energy reduction scheme.
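A minimal sketch of the refinement step, assuming the CDM is stored as an 8 × 8 grid of depth values with one cell per 8 × 8-pixel unit (the grid granularity and data layout are assumptions, not the paper's exact representation):

```python
def refine_cdm(cdm):
    """Sketch of Algorithm 2: the RCDM allows one depth level less than the
    CDM wherever the four Z-order neighbouring CUs share the same depth."""
    n = len(cdm)
    rcdm = [row[:] for row in cdm]
    for y in range(n):
        for x in range(n):
            d = cdm[y][x]
            if d == 0:
                continue          # CTU-sized CU: nothing larger to try
            if d == 4:
                rcdm[y][x] = 3    # four 4x4 CUs always merge to one 8x8 CU
                continue
            side = 2 ** (3 - d)   # side, in cells, of a depth-d CU
            gx = (x // (2 * side)) * 2 * side
            gy = (y // (2 * side)) * 2 * side
            group = [cdm[j][i] for j in range(gy, gy + 2 * side)
                               for i in range(gx, gx + 2 * side)]
            if all(v == d for v in group):
                rcdm[y][x] = d - 1
    return rcdm

cdm = [[3] * 8 for _ in range(8)]   # uniform depth-3 CDM (all 8x8 CUs)
print(refine_cdm(cdm)[0][0])        # 2: the refined map allows 16x16 CUs
```

The encoder is then constrained to search only the depths allowed by the CDM and RCDM, instead of the full quad-tree.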

Resulting Probabilistic CTU Prediction Method
Based on the previous elements, we propose to limit the recursive search of the RDO process on the CTU quad-tree decomposition by predicting the coding-tree partitioning from video frame content properties. We introduce a probabilistic variance-aware quad-tree partitioning prediction method, illustrated in Fig. 9. First, the video sequence is split into Groups of Frames (GOF). The first frame of a GOF, called the Learning Frame (F_L), is encoded with a full RDO process. From this encoding are extracted the variances υ_d according to the depths d ∈ {1, 2, 3, 4} selected during the full RDO process. Then, the two following statistical moments are computed for each depth d: the means μ_υd and the standard deviations σ_υd of the variance populations υ_d. According to the parameter ε, the set of thresholds υ_th(ε, d) is calculated using Eq. 2 and the LUT of the coefficients a(ε, d), b(ε, d) and c(ε, d) computed off-line (cf. Section 3.2). The other frames of the GOF, called constrained frames (F_C), are encoded with a limited RDO process. For each CTU, Algorithm (1) is applied using the set of thresholds previously computed for the F_L to generate the CDM. However, the CDM generated with Algorithm (1) is very restrictive for the RDO process as it allows only one depth to be searched for each CU of a CTU. To increase the accuracy of the depth map prediction with limited impact on the complexity, the CDM is then refined by Algorithm (2) to generate a second CDM called the RCDM. Finally, the HEVC encoder is forced to only apply the RDO process between the CDM and the RCDM.
To conclude this section, our proposed probabilistic energy reduction scheme takes as input the parameter ε to generate the CDM and the RCDM. Then, the HEVC encoder is forced to only apply the RDO process between the CDM and the RCDM. The next section details the competing Machine Learning method.

Machine Learning Approach for Predicting an HEVC Quad-Tree Partitioning
This section presents our second quad-tree prediction method, based on Machine Learning. This quad-tree prediction is then used to drastically simplify the brute force algorithm usually employed in HEVC encoders.

Machine Learning Based Decision
As in the probabilistic method (Section 3), the Machine Learning-based quad-tree prediction follows a bottom-up approach (from CU 4 × 4 to CU 32 × 32). The classification problem remains to determine whether the CUs of depth d have to be merged into a CU of depth d − 1, as illustrated in Fig. 2. The next section details the training set-up of the learning algorithm.

Training Set-Up for the Coding Tree Structure Determination
Machine Learning efficiency is strongly linked to the diversity of the training data. Video sequences used to train the Machine Learning framework are chosen to cover a vast space of content types. To select a training data set including a large range of video contents and complexities, the Spatial Information (SI) and Temporal Information (TI) metrics [12] are used to characterize video sequences. The TI and SI respectively give the degrees of motion and of spatial detail in the video sequence. Since compression complexity is highly linked to these two spatio-temporal parameters, the set of training sequences for the Machine Learning feature evaluation should span a large range of both SI and TI. Figure 10 shows the SI and TI of the video sequences according to their classes (from A to E). The chosen training sequence set (circled in Fig. 10) is composed of one video sequence of each class, well distributed in terms of SI and TI. Overfitting, i.e. overspecializing a model to a training set, constitutes one of the main risks for the quality of a Machine Learning-based model [6]. Thus, the dataset used for training should result in a low bias. In our case, due to the broad range of resolutions and frame rates across the training sequences, the total number of CTUs for each class is not equally distributed. For instance, sequences with high resolution contain a high number of CTUs with low texture complexities when compared to sequences with low resolution. To avoid such bias, the datasets used for training are forced to be composed of a fixed number of CTUs from each class. To avoid a temporal bias, which would lead to redundant information, the sampled CTUs come from frames uniformly distributed throughout the sequences: 13 frames for class A, 25 frames for class B, 55 frames for class E, 125 frames for class C and 500 frames for class D.
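The temporally uniform frame sampling described above can be sketched as follows; the frame counts per class come from the text, while the sequence length and helper name are illustrative:

```python
# Number of frames sampled per class, uniformly spread over each sequence
FRAMES_PER_CLASS = {"A": 13, "B": 25, "C": 125, "D": 500, "E": 55}

def sample_frames(n_frames_in_sequence, n_samples):
    """Indices of n_samples frames uniformly spread over the sequence."""
    step = n_frames_in_sequence / n_samples
    return [int(i * step) for i in range(n_samples)]

print(sample_frames(500, 5))  # [0, 100, 200, 300, 400]
```

Combined with a fixed CTU count per class, this keeps the dataset balanced both spatially (across classes) and temporally (within each sequence).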
For each depth d, 80,000 instances are randomly sampled from the previously defined data pool, composed of 40,000 instances of each prediction decision at each depth d.
The open source Waikato Environment for Knowledge Analysis (WEKA) Machine Learning framework is used for the training process [11]. WEKA is chosen for its popularity and extensive documentation. It includes a large number of Machine Learning algorithms for data mining tasks, such as REPTree, LMT, RandomForest, BFTree and C4.5 among others. WEKA also provides several useful tools for feature evaluation, which rank the features depending on their usefulness according to a search strategy. For the current study, features have been selected using the information gain provided by the WEKA software. Information gain is based on the Kullback-Leibler Divergence (KLD) [18], also called relative entropy, which measures the divergence between two probability distributions.
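Information gain for a candidate feature threshold can be sketched as the entropy of the class labels minus the weighted entropy after the split. This is the standard definition, not WEKA's implementation, and the variances and decisions below are made up:

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(v) for v in set(labels)) if c)

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on `values <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - cond

# Made-up CU variances and Merge (1) / Non Merge (0) decisions
var = [10, 20, 30, 400, 500, 600]
dec = [1, 1, 1, 0, 0, 0]
print(information_gain(var, dec, 30))  # 1.0: the split is perfectly informative
```

A feature whose best threshold yields a high gain separates the Merge and Non Merge populations well and is therefore kept in the feature vector.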

Decision Trees-Based Partitioning Decisions
State-of-the-art studies described in Section 2.3 gather many characteristics used to predict the coding tree decomposition of a CTU. To predict the coding tree in one-shot, only characteristics that are independent from the encoding process and have a limited computation overhead are considered.
The choice of these features is detailed in [23]. They have been deduced from an extensive study of two factors: the information gain provided by the WEKA software and the computation overhead in a real-time encoder. The features vector for the CU at coordinates (x, y) and depth d of a given CTU, F_d_x,y, is composed of the following 12 features:
-CU var [7,15,21,25,29,30]: the variance of the CU luminance samples of depth d (1 feature).
-Lower-CU var [7,15,21,29,30]: the variances of the four sub-CU luminance samples of depth d + 1 (4 features).
-Upper-CU var [15,21,25,29,30]: the variance of the upper CU luminance samples of depth d − 1 (1 feature).
-Nhbr-CU var [7,15]: the variances of the neighbouring CU luminance samples of depth d in the Z-scan order (3 features).
-Var of lower-CU mean [29,30]: the variance of the means of the 4 sub-CU luminance samples of depth d + 1 (1 feature).
-Var of lower-CU var [29,30]: the variance of the variances of the 4 sub-CU luminance samples of depth d + 1 (1 feature).
-QP: the QP of the frame (1 feature).
The training of the decision trees is performed with the C4.5 algorithm [28] because the trees it generates are lightweight. In terms of information gain, the C4.5 algorithm uses the KLD to select the best feature for each decision. The C4.5 algorithm iterates over all training instances and searches, for each feature, the threshold that achieves the best classification, i.e. the highest information gain. Then, the feature and its corresponding threshold are used to divide the training instances into two subsets. Finally, the process is recursively iterated on the two subsets of training instances.
To measure the accuracy of the decision trees, a 10-fold cross-validation is performed on the training instances. The cross-validation technique evaluates a predictive model by partitioning the original instances into a training set to train the model and a test set to evaluate it. In 10-fold cross-validation, the original instances are randomly split into 10 equally sized subsets. Among the 10 subsets, one subset is used as the validation instances for testing the model, and the remaining 9 subsets are used as training instances. The cross-validation process is then repeated 10 times (the folds), with each of the 10 subsets used exactly once as the validation instances. Let the Percentage of Correctly Classified Instances (PCCI) given by the 10-fold cross-validation be the accuracy of the decision trees.
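The 10-fold protocol can be sketched in a few lines; each instance serves exactly once as validation data:

```python
import random

def k_fold_indices(n_instances, k=10, seed=0):
    """Shuffle instance indices and split them into k equally sized folds."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    fold = n_instances // k
    return [idx[i * fold:(i + 1) * fold] for i in range(k)]

folds = k_fold_indices(80000, k=10)
# Every instance appears in exactly one validation fold
assert sorted(i for f in folds for i in f) == list(range(80000))
print(len(folds), len(folds[0]))  # 10 8000
```

The PCCI is then the fraction of validation instances classified correctly, averaged over the 10 folds.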
Two types of classifiers are defined for each depth d: the Merge and Split decision trees. These two decision trees solve the classification problem illustrated in Section 3 by Fig. 2. Table 2 summarizes the trained tree sizes, numbers of leaves and the PCCI of the 8 decision trees. Results in Table 2 show that both the Merge and Split decision trees reach over 80% of good decisions.
The next section describes how we use the decision trees to predict the CTU partitioning.

Formalisation of the CTU Partitioning Decisions
Let P_S(F_d_x,y) and P_M(F_d_x,y) respectively be the prediction results of the Split and Merge trees for the features vector F_d_x,y. The prediction decision D_d(x, y) is defined as the prediction resulting from the best combination of the decision trees at depth d by Eq. 5, with P_M^{3,4}(x, y) defined accordingly. In other words, for the high depths d ∈ {1, 2}, the algorithm merges the four CUs at depth d only if all five decision trees predict to merge. In contrast, for the low depths d ∈ {3, 4}, the algorithm merges the four CUs at depth d if at least one decision tree predicts to merge.
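The all-versus-any combination rule can be sketched directly; the tree predictions are stand-in booleans and the exact five-tree structure of Eq. 5 is abstracted away:

```python
def merge_decision(d, tree_votes):
    """Combine decision-tree predictions at depth d.
    d in {1, 2}: conservative, merge only on unanimity;
    d in {3, 4}: permissive, merge if any tree votes to merge."""
    return all(tree_votes) if d in (1, 2) else any(tree_votes)

votes = [True, True, False, True, True]
print(merge_decision(1, votes), merge_decision(3, votes))  # False True
```

The asymmetry reflects the cost of errors: wrongly merging into a large CU at a shallow depth hurts compression more than keeping small CUs, so shallow-depth merges require unanimity.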

Machine Learning Prediction Algorithm for CTU Partitioning
Algorithm (3) describes our Machine Learning algorithm that predicts in one-shot the CTU partitioning. First of all, the full CDM is initialized with the depth value 4 (line 1). Then, the algorithm explores the CTU decomposition with a bottom-up approach: from d = 4 to d = 1 (line 2). For the current depth d, the algorithm browses the CDM (lines 4-5) taking the block size δ in the CDM (line 3) into account. Unlike the probabilistic Algorithm (1), the Machine Learning Algorithm (3) does not test whether the 4 neighbor blocks in the Z-scan order have the same depth d, as illustrated in Fig. 11. Indeed, better results are obtained without this condition, contrary to the probabilistic case.
Then, the algorithm tests whether the blocks have to be merged or not, using the Merge and Split decision tree predictions P_M(F^d_{x,y}) and P_S(F^d_{x,y}) detailed in Section 4.1.2 and the combination defined by Eq. 5 (line 7). If the prediction is true, the blocks are merged and the corresponding elements in the CDM are set to d − 1 (line 8). Figure 12 presents a high-level diagram of the resulting Machine Learning CTU partitioning prediction technique. Thanks to the offline training of the decision trees, all the frames are constrained and no learning frame is needed, in contrast to the probabilistic approach (cf Section 3). The features detailed in Section 4.1.2 are computed for the whole frame to minimize the computational complexity overhead. Then, Algorithm (3) is applied on the features to generate the CDM. As with the Probabilistic approach, to increase the accuracy of the depth map prediction with limited impact on complexity, the CDM is refined using Algorithm (2) to generate the RCDM. Finally, the HEVC encoder is forced to apply the RDO process only between the previously generated CDM and RCDM.
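The bottom-up traversal can be modeled as follows. This is a simplified sketch, not the encoder implementation: `predict_merge` is a hypothetical stand-in for the combined decision-tree prediction D^d(x, y), the CDM is modeled as an 8×8 depth map over a 64×64 CTU, and the handling of non-uniform quads when the uniform-depth test is skipped (the Machine Learning variant) is our assumption.

```python
def predict_cdm(predict_merge, require_uniform=True):
    """One-shot bottom-up CDM prediction sketch.

    predict_merge(d, x, y): caller-supplied merge decision at depth d.
    require_uniform: if True, merge only quads whose cells all sit at
    depth d (the probabilistic variant); the ML variant drops this test.
    """
    N = 8                                   # one CDM cell per 8x8 pixel block
    cdm = [[4] * N for _ in range(N)]       # line 1: initialize at depth 4
    for d in range(4, 0, -1):               # line 2: bottom-up, d = 4 .. 1
        step = 2 ** (4 - d)                 # side, in cells, of a depth d-1 block
        for y in range(0, N, step):         # lines 4-5: browse the CDM
            for x in range(0, N, step):
                quad = [cdm[y + dy][x + dx]
                        for dy in range(step) for dx in range(step)]
                if require_uniform and any(v != d for v in quad):
                    continue                # skip non-uniform quads
                if predict_merge(d, x, y):  # line 7: combined tree prediction
                    for dy in range(step):  # line 8: merged cells get d-1
                        for dx in range(step):
                            # never undo a split kept at a deeper level
                            cdm[y + dy][x + dx] = min(cdm[y + dy][x + dx], d - 1)
    return cdm
```

An always-merge predictor collapses the map to a single 64×64 CU (all zeros), while a never-merge predictor leaves the full 4×4 decomposition (all fours), which bounds the behavior of any real predictor.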

Probabilistic Approach versus Machine Learning for One-Shot Quad-Tree Prediction
This section gives the experimental setup and the results obtained for the two proposed energy reduction schemes on the real-time HEVC encoder Kvazaar [36].

Experimental Set-Up and Parameters
To conduct the experiments, 18 video sequences [2] that strongly differ from one another in terms of frame rate, motion, texture and spatial resolution were used. All experiments are performed on one core of the EmETXe-i87M0 platform from Arbor Technologies, based on an Intel Core i5-4402E processor at 1.6 GHz. The HEVC software encoder used is the real-time Kvazaar [16,17,39] in All Intra (AI) configuration. Since the configuration aims at real time, following [20], the Rate-Distortion Optimized Quantization (RDOQ) [14] and Intra transform skipping [19] features are disabled. Each sequence is encoded with 4 different QP values: 22, 27, 32, 37 [2]. For the Probabilistic approach, previous experiments showed that the best prediction is obtained with a threshold in [0.6, 0.7] [21]. For the following experiments, this threshold is fixed to 0.6 and the GOF size is fixed to 50, which is shown in [22] to be an appropriate value for drastic energy reduction.
Bjøntegaard Delta Bit Rate (BD-BR) and Bjøntegaard Delta PSNR (BD-PSNR) [41] are used to measure the compression efficiency difference between two encoding configurations. The BD-BR reports the average bit rate difference in percent for two encodings at the same quality in terms of Peak Signal-to-Noise Ratio (PSNR). Similarly, the BD-PSNR measures the average PSNR difference in decibels (dB) for two different encoding algorithms considering the same bit rate.
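For illustration, a pure-Python sketch of the BD-PSNR computation (BD-BR is obtained analogously by swapping the roles of PSNR and log-rate): each RD curve is fitted with a cubic polynomial in log10(rate), and the fitted curves are averaged over the overlapping interval. With exactly four RD points per configuration, the cubic fit reduces to exact interpolation, so the Vandermonde system is solved directly.

```python
import math

def _cubic_coeffs(xs, ys):
    """Exact cubic through 4 points: solve the 4x4 Vandermonde system
    by Gaussian elimination with partial pivoting."""
    A = [[x**3, x**2, x, 1.0] for x in xs]
    b = list(ys)
    for col in range(4):
        piv = max(range(col, 4), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 4):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * 4
    for r in range(3, -1, -1):
        s = b[r] - sum(A[r][c] * coeffs[c] for c in range(r + 1, 4))
        coeffs[r] = s / A[r][r]
    return coeffs                      # [a3, a2, a1, a0]

def bd_psnr(rates1, psnrs1, rates2, psnrs2):
    """Average PSNR gap (dB) of curve 2 over curve 1 on the
    overlapping log10(rate) interval."""
    lx1 = [math.log10(r) for r in rates1]
    lx2 = [math.log10(r) for r in rates2]
    lo, hi = max(min(lx1), min(lx2)), min(max(lx1), max(lx2))
    def integral(coeffs, a, b_):
        a3, a2, a1, a0 = coeffs
        F = lambda x: a3*x**4/4 + a2*x**3/3 + a1*x**2/2 + a0*x
        return F(b_) - F(a)
    c1 = _cubic_coeffs(lx1, psnrs1)
    c2 = _cubic_coeffs(lx2, psnrs2)
    return (integral(c2, lo, hi) - integral(c1, lo, hi)) / (hi - lo)
```

A sanity check: if the second curve is the first one shifted up by exactly 1 dB at the same rates, the metric returns 1.0.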
To measure the energy consumed by the platform, Intel Running Average Power Limit (RAPL) interfaces are used to obtain the energy consumption of the CPU package, which includes the cores, IOs, DRAM and integrated graphics. As shown in [10], RAPL power measurements are consistent with external measurements, and [8] proves the reliability of this internal measure across various applications. In this work, the power gap between the IDLE state and video encoding is measured. The CPU is considered to be in IDLE state when it spends more than 90% of its time in the C7 C-state. The C7 state is the deepest C-state of the CPU, characterized by all core caches being flushed and the PLL, core clock and all uncore domains being turned off. The power of the board is measured at 16.7 W when the CPU is in idle mode and rises to 31 W on average during video encoding. RAPL shows that 72% of this gap is due to the CPU package, the rest of the power going to the external memory, the voltage regulators and other elements of the board.
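On Linux, the RAPL package counter is exposed through the powercap sysfs interface as a cumulative energy value in microjoules that wraps around at `max_energy_range_uj`. A minimal sketch of how such samples can be turned into consumed energy (the file path follows the standard powercap layout; this is our illustration, not the paper's measurement tooling, and error handling is omitted):

```python
def energy_delta_joules(before_uj, after_uj, max_range_uj):
    """Energy consumed between two RAPL counter samples, in joules,
    compensating for the counter wraparound at max_range_uj."""
    d = after_uj - before_uj
    if d < 0:                  # the cumulative counter wrapped around
        d += max_range_uj
    return d / 1e6             # microjoules -> joules

def read_package_energy_uj(path="/sys/class/powercap/intel-rapl:0/energy_uj"):
    """Read the cumulative CPU-package energy counter (microjoules)."""
    with open(path) as f:
        return int(f.read())
```

Sampling the counter before and after an encoding run and subtracting (with wraparound handling) gives the package energy attributable to that run.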

Experimental Metrics
The performance of the proposed energy reduction schemes is evaluated by measuring the trade-off between Energy Reduction (ER) in % and Rate-Distortion (RD) efficiency using the BD-BR and BD-PSNR. ER is defined by Eq. 8:

ER = (100/4) × Σ_{i=1}^{4} (E_O(QP_i) − E_P(QP_i)) / E_O(QP_i)    (8)

where E_O(QP_i) is the energy consumed to encode a sequence with the original full RDO process and E_P(QP_i) the energy consumed to encode the same sequence with our proposed energy reduction scheme, both with QP = QP_i.
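A direct transcription of Eq. 8, assuming per-QP energy measurements are available (the variable names are ours):

```python
def energy_reduction(e_ref, e_prop):
    """ER in %, averaged over the four QP values (Eq. 8).

    e_ref[i]:  energy to encode the sequence with the full RDO search at QP_i
    e_prop[i]: energy with the proposed one-shot prediction scheme at QP_i
    """
    terms = [(r - p) / r for r, p in zip(e_ref, e_prop)]
    return 100.0 * sum(terms) / len(terms)
```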
The main objective of Algorithms (1) and (3) is to generate a CDM that minimises the prediction error compared to what the full RDO process would generate. To evaluate the accuracy of our predictions, we define the normalized L1 distance between two CDMs in terms of depth levels as follows:

∂(A, B) = (1/64) Σ_{x,y} |A(x, y) − B(x, y)|

where A and B are the two compared CDMs. In other words, the metric ∂(A, B) measures the average gap in number of depth levels between two CDMs A and B of a given CTU. Using Fig. 6 as an example, the distance between the CDMs of Fig. 6a and b is ∂ = (4 + 1 + 16)/64 = 0.3281.

In addition to the distance metric, we define the recall ρ between two CDMs A and B:

ρ(A, B) = (100/64) Σ_{x,y} m(x, y), with m(x, y) = 1 if A(x, y) = B(x, y) and 0 otherwise.

The recall ρ(A, B) represents the share of correct quad-tree decomposition, in terms of pixel area, between the predicted CTU A and the reference CTU B. Using Fig. 6 as an example again, the recall between the CDM of Fig. 6a (considered as predicted) and Fig. 6b (considered as reference) is ρ = 43 × 100/64 = 67.19%.

The recall ρ(P, R) and the distance ∂(P, R) are used in the following sections to evaluate the accuracy of the prediction, with P being the predicted CDM and R the reference CDM generated by a full RDO process (optimal). The average of the ρ(P, R) measurements gives the percentage of good predictions in terms of pixel area; it falls between 0% and 100%, and the closer ρ(P, R) is to 100%, the more accurately the predicted CDMs fit the reference CDMs. The average distance ∂(P, R) represents the mean error in depth levels between the predicted CDMs and the reference ones; the closer ∂(P, R) is to 0, the more precise the predicted CDM P.

The 18 test sequences belong to the 5 classes A, B, C, D and E, each one corresponding to a specific resolution or video content. Two types of metrics are detailed in Table 3. The first is composed of the ρ(P, R) and ∂(P, R) measures defined in Section 5.1.2, averaged across the four QP values, which evaluate the precision of the quad-tree prediction for all constrained frames.
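Both metrics can be computed directly on the 8×8 depth maps; a minimal sketch (function names are ours):

```python
def cdm_distance(A, B):
    """Normalized L1 distance between two 8x8 CDMs, in depth levels."""
    cells = [(a, b) for ra, rb in zip(A, B) for a, b in zip(ra, rb)]
    return sum(abs(a - b) for a, b in cells) / len(cells)

def cdm_recall(A, B):
    """Share (in %) of CDM cells where the predicted depth matches
    the reference depth."""
    cells = [(a, b) for ra, rb in zip(A, B) for a, b in zip(ra, rb)]
    return 100.0 * sum(a == b for a, b in cells) / len(cells)
```

Note that the two metrics are complementary: recall counts only exact matches, while the distance also measures how far off the mismatching cells are.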
The second is composed of the BD-BR and BD-PSNR [41], the common metrics used in video compression to measure the compression efficiency difference between two encodings. The ER values include the energy overhead of the entire energy reduction scheme (feature or variance computation and CDM prediction). In the real-time configuration (see Section 5.1.1), the computational overhead in the real-time encoder Kvazaar is between 1% and 1.9% for the Probabilistic approach and between 1.5% and 2.5% for the Machine Learning approach.

Comparison of the Probabilistic and Machine Learning Approaches for Predicting an HEVC Quad-Tree Partitioning
In terms of one-shot quad-tree prediction accuracy, Table 3 shows that the Machine Learning energy reduction technique achieves better results (around 53% of ρ(P, R) for a distance ∂(P, R) of 0.67 depth level) than the Probabilistic technique (around 50% of ρ(P, R) for a distance ∂(P, R) of 0.79 depth level).
The results show that both energy reduction techniques achieve an average of 58% of energy reduction. In fact, the overhead due to the unconstrained Learning Frame (F_L) and the variance computations of the Probabilistic approach is approximately equal to the overhead of the feature computations of the Machine Learning approach. However, even though the Probabilistic approach does not constrain all the frames (only 49 out of every 50), it causes more encoding degradation than the Machine Learning approach: +0.33% of BD-BR and −0.02 dB of BD-PSNR. These results show that the two quad-tree prediction accuracy metrics ρ(P, R) and ∂(P, R) correlate well with the encoding degradation.
It is noticeable in Table 3 that the Kimono sequence suffers more degradation than the other sequences: a 13.28% BD-BR increase with the Probabilistic approach and a 9.51% BD-BR increase with the Machine Learning approach. This can be explained by the texture specificity of the Kimono sequence, which contains a traveling shot across trees and vegetation in the background. This video sequence has the highest Spatial Information (54.1) due to these details. Nevertheless, the results show that the Machine Learning approach reduces the degradation by 3.77% of BD-BR compared to the Probabilistic approach.
The performance of state-of-the-art solutions (cf Section 2.3) based on HM cannot be directly compared to these results. Indeed, they are measured against a large compression time, far from real time. The complexity overhead of state-of-the-art solutions is thus comparatively higher in the context of a real-time encoder, and previously published results can thus not be directly applied to reduce the energy consumption of a real-time encoder, as the two methods developed here are.

Table 3 reports the recall ρ(P, R), distance ∂(P, R), BD-BR, BD-PSNR and ER of the Probabilistic and Machine Learning drastic energy reduction schemes for each sequence. For the same energy reduction, the Machine Learning technique achieves better results than the Probabilistic technique for both quad-tree prediction accuracy and encoding degradation.

To conclude, the Machine Learning approach achieves better results on average than the Probabilistic approach and does not require unconstrained learning frames to predict the quad-tree partitioning. These two points make the proposed Machine Learning approach a good candidate to build energy reduction methods for real-time HEVC encoders.

Conclusion
This paper proposes and compares two energy reduction methods for real-time HEVC Intra encoders. These methods are based on CTU partitioning prediction techniques that drastically limit the recursive RDO process. The first method exploits the correlation between a CTU partitioning and the variance of the CTU luminance samples to predict the quad-tree decomposition in one shot. The second method uses Machine Learning to predict the quad-tree decomposition in one shot.
Experimental results show that the Machine Learning method has a slight edge over the Probabilistic method and that this prediction accuracy directly impacts the encoding degradation. Both techniques are capable of reducing the energy consumption of the HEVC encoder by 58% (including the algorithm overhead in a real-time encoder) for bit rate increases of respectively 3.93% and 3.6%. The obtained energy gain is substantial and close to the theoretical maximum of 78% that would be obtained if the perfect quad-tree decomposition were known in advance. Future work will use one-shot quad-tree partitioning prediction to control the energy consumption of an HEVC Intra encoder under a given energy budget.