Performance analysis of optimized versatile video coding software decoders on embedded platforms

In recent years, the global demand for high-resolution video and the emergence of new multimedia applications have created the need for a new video coding standard. Therefore, in July 2020, the versatile video coding (VVC) standard was released, providing up to 50% bit-rate savings for the same video quality compared to its predecessor, high-efficiency video coding (HEVC). However, these bit-rate savings come at the cost of high computational complexity, particularly for live applications and on resource-constrained embedded devices. This paper evaluates two optimized VVC software decoders, named OpenVVC and Versatile Video deCoder (VVdeC), designed for low-resource platforms. These decoders exploit optimization techniques such as data-level parallelism using single instruction multiple data (SIMD) instructions and functional-level parallelism using frame-, tile-, and slice-based parallelism. Furthermore, a comparison of decoding runtime, energy, and memory consumption between the two decoders is presented, targeting two different resource-constrained embedded devices. The results show that both decoders achieve real-time decoding of full high-definition (FHD) resolution on the first platform using 8 cores and real-time high-definition (HD) decoding on the second platform using only 4 cores, with comparable average energy consumption: around 26 J and 15 J on the 8-core and 4-core platforms, respectively. Furthermore, OpenVVC showed better results regarding memory usage, with a lower average maximum memory consumption during runtime than VVdeC.


I. INTRODUCTION
A new era of information and communication technologies is emerging, in which video communication plays an essential role in internet traffic. In particular, the significant increase in video traffic driven by the COVID-19 global health situation, together with emerging video formats and applications, has led to the development of a new video coding standard named Versatile Video Coding (VVC)/H.266. The latter was standardized in July 2020 by the Joint Video Experts Team (JVET) of the VCEG working group of ITU-T and the MPEG working group of ISO/IEC JTC 1/SC 29 [1]. VVC enables bit-rate savings of up to 50% [2] with respect to the previous standard, High Efficiency Video Coding (HEVC)/H.265 [3], for the same video quality. However, this achievement comes at the cost of 10× and 2× more complexity compared to HEVC for the encoder and decoder, respectively [4]. In this scenario, the main challenge is to develop real-time VVC codecs that take into account the resource-constrained consumer devices frequently built on embedded platforms.
Each coding standard is released with a reference software implementation available to the scientific community. These solutions incorporate all the basic features of the standard but offer very limited speed performance. In the case of VVC, the reference software is the VVC Test Model (VTM) [5]. Taking this as a starting point, research groups and companies develop their own real-time software and hardware solutions. Depending on the targeted architecture, these solutions mainly exploit the intrinsic parallelism of the algorithms, both at the data and functional levels, to enhance their performance in terms of speed and energy consumption. In the first case, some data operations included in the source code are optimized using Single Instruction Multiple Data (SIMD) instructions [6]. Here, vectorized operations perform mathematical operations on more than one operand with a single processor instruction. The other potential optimization route is to take advantage of the intrinsic parallelism of independently processing pictures [7], or smaller parts of a picture such as slices [8] or tiles [9]. In the latter case, the encoding must activate these normative tools, which break dependencies between adjacent regions.
In this work, two open-source VVC decoders are presented and compared against each other. These solutions, named OpenVVC [11], [12] and Versatile Video deCoder (VVdeC) [13], are optimized using data- and functional-level parallelism techniques. In this paper, we evaluate their performance in terms of decoding run time, power consumption, and memory usage targeting two different embedded platforms. The results show that both decoders achieve 15 to 34 frames per second (fps) for Ultra High-Definition (UHD) sequences with Quantization Parameters (QPs) 27 and 37, and achieve real-time decoding of Full High-Definition (FHD) and High-Definition (HD) sequences on the first target platform using 8 cores. Furthermore, 16 to 28 fps are obtained for FHD sequences with QPs 27 and 37, and real-time decoding is reached for all HD sequences by OpenVVC and VVdeC when targeting the second embedded platform with 4 cores. In terms of energy consumption and maximum memory usage, the experimental results show that VVdeC consumed ×2 more memory than OpenVVC on both target platforms. On the other hand, OpenVVC consumed the same energy as VVdeC on the first Embedded System-on-Chip (ESoC1) platform with 8 cores and around 1.25× VVdeC's energy consumption when targeting the ESoC2 embedded platform with 4 cores.
The rest of the paper is structured as follows. Section II gives a short introduction to the VVC standard. Section III describes the optimizations included in VVC decoders using specific parallelization techniques, along with the state-of-the-art of VVC decoders. Section IV details the proposed optimization techniques included in the OpenVVC decoder. The obtained results and the comparison between OpenVVC and VVdeC performance are provided in Section V. Finally, Section VI concludes the paper.

II. INTRODUCTION TO VVC
Similar to its predecessors, VVC was designed based upon a hybrid coding scheme using intra/inter-prediction coding and transform coding. In Figure 1, the decoding process of VVC is presented. Here, the encoded bit-stream is the input, the decoded video is the output, and each decoding phase is represented by one block.

A. Entropy decoding
Bit-stream decoding begins with this process, which relies on a context adaptive binary arithmetic coding (CABAC) [14] engine similar to, but more advanced than, that of HEVC. Here, an updated multi-hypothesis probability estimation model was adopted and the pre-computed look-up table was eliminated to enhance accuracy. Coefficient coding has been improved by allowing 1×16, 2×8, 8×2, 2×4, 4×2, and 16×1 coefficient group sizes, depending on the transform block size.

B. Inverse quantization and transform
The spatial-domain coefficients are retrieved from the frequency domain by inverse quantization and inverse transformation.

Fig. 1: Block diagram of a VVC decoder.

VVC introduces the Multiple Transform Selection (MTS) [15] tool, used to encode the residuals of inter and intra coding blocks. MTS allows three transforms on rectangular blocks with height and width ≤ 64 for Discrete Cosine Transform (DCT)-II and ≤ 32 for DCT-VIII and Discrete Sine Transform (DST)-VII. Moreover, the high-frequency coefficients are zeroed out when the height and/or width is equal to 64 for DCT-II and 32 for DCT-VIII and DST-VII. In addition, the Low Frequency Non-Separable Transform (LFNST) [16] is applied to the low-frequency transform-domain coefficients after DCT-II for further signal decorrelation.

C. Intra prediction
In the VVC intra-prediction module, 32 additional directional intra-prediction modes were added with respect to HEVC. Moreover, VVC allows wide-angle intra modes for rectangular blocks, which improves prediction accuracy. In addition, the matrix-weighted intra-prediction tool is used as a new intra-prediction mode, taking the above and left neighbouring lines of the prediction block as input. Besides, VVC adopted the cross-component linear model [17] tool, which is applied to predict the chroma components from the luma components.

D. Inter prediction
Inter-prediction takes advantage of the temporal redundancy of the video by removing the correlation in the temporal domain [18]. Here, motion compensation estimates the current coding unit samples from the samples stored in the decoded picture buffer. In addition, an 8-tap filter is applied to luma samples to create the motion-compensated prediction, and a 4-tap filter is applied to chroma samples for interpolation [19]. Furthermore, VVC achieves improved prediction accuracy using decoder-side motion vector refinement [20] and bi-directional optical flow prediction refinement [21].

E. Luma mapping with chroma scaling
Forward Luma Mapping with Chroma Scaling (LMCS) is a new tool introduced in VVC that comes after the inter-prediction stage. It has two parts: Luma Mapping (LMP), used to modify the luma predicted samples, and Chroma Scaling (CSP), used to modify the chroma residues. LMP makes the most of the range of luma code values and provides an efficient reallocation of luma code values in the coding domain. CSP, in turn, changes the value of the chroma residual samples in the chroma coding block to mitigate the artifacts arising from the interaction between the luma signal and its corresponding chroma signal [22].

F. In-loop Filters
VVC in-loop filters consist of the Inverse Luma Mapping with Chroma Scaling (ILMCS) [23], the Deblocking Filter (DBF), the Sample Adaptive Offset (SAO) filter, and the Adaptive Loop Filter (ALF). First, ILMCS is a new addition in VVC that enhances decoding performance by inverse-mapping the luma code values of the reconstructed block. The DBF and SAO in VVC are very similar to those of HEVC [24]. The DBF detects and filters artifacts of the pixels at block boundaries, and SAO minimizes sample distortion over the pixels filtered by the DBF. In addition, unlike HEVC, VVC includes the ALF [23] to reduce the mean-squared error of the decoded pixels. Thus, undesired artifacts introduced by the previous decoding modules, including blurring, blocking, flickering, ringing, and colour shift, are mitigated by the in-loop filters, and the decoded video is obtained.

III. OPTIMIZED AND REAL TIME SOFTWARE DECODERS
There are three levels of parallelism that can be exploited to speed up the video decoding process: data-level, frame-level, and high-level. In this section, we first describe these three levels of parallelism, and then we give a brief description of existing optimized decoders, with an introduction to recent software real-time VVC decoders including OpenVVC and VVdeC.

2) Frame-level parallelism: In frame-level parallelism, multiple frames are processed simultaneously while the dependencies of motion compensation are still satisfied. The length of the motion vectors is, in this case, the determining factor [8]. Video sequences with large motion imply large dependencies between frames, which is a major disadvantage for frame-level parallelism. Hence, sequences in all-intra configuration benefit the most from frame-level parallelism, since there are no motion compensation dependencies. Moreover, in frame-level parallelism, an additional picture buffer and local buffers must be stored for each thread used to decode in parallel. As a result, it demands more memory than sequential decoding.

B. High level parallelism
1) Wave-front parallel processing: Wave-front Parallel Processing (WPP) allows virtual picture partitioning into Coding Tree Unit (CTU) rows [8]. In WPP, the coding dependencies are kept unchanged by the picture partitioning, while the entropy engine is initialized at the start of each CTU line. Therefore, at the beginning of each CTU row, the CABAC context is re-initialized, and it depends at least on the data from the first CTU of the previous row. As a result, the decoding of the rows cannot be completed at the same time, which slightly limits the parallelization efficiency of WPP.
2) Tile parallelism: VVC supports tiles of rectangular shape consisting of CTUs [26]. An example of tile partitioning is shown in Fig. 2. Here, four tiles are labelled A, B, C, and D. Tiles are separated by boundaries that eliminate the prediction dependencies. Moreover, the entropy coding is reinitialized for every tile, which allows tiles to be decoded independently. This enables decoding a picture concurrently using multiple threads. However, the in-loop filtering across a tile boundary can only be carried out once the pixels on both sides of the boundary are reconstructed.

Various parallelization methods have been used to accelerate video codecs on Central Processing Unit (CPU)-, Graphics Processing Unit (GPU)-, and hybrid CPU/GPU-based architectures. Yan et al. accelerated an HEVC decoder by ×4 compared to HM 4.0 using SIMD technologies on an x86 processor in [27]. The authors in [28] and [29] proposed GPU-based implementations of HEVC that satisfied real-time requirements for the decoding of UHD 4K sequences. Gudumasu et al. [8] discussed various optimization techniques to implement VVC using multiple CPU cores on heterogeneous platforms with the purpose of achieving real-time decoding. Here, the decoding tasks were redesigned and parallelized with task parallelization based on load balancing, and data parallelization at the CTU level. The authors in [30] presented a GPU-based motion compensation system to accelerate the VVC decoder that exploits the partitioning of the different Coding Units (CUs) and a proper thread organisation. Furthermore, Wieckowski et al. in [2] described an optimized VVC decoder that achieves real-time decoding on general-purpose CPUs. Here, SIMD-based and multi-threading-based optimisations were adopted. The authors in [31] demonstrated an optimized real-time VVC decoder that takes advantage of SIMD instruction extensions on x86-based CPUs. Moreover, the authors discussed the implementation of frame-, CTU-, sub-CTU-, and task-level parallelization. An optimized VVC software decoder for mobile platforms is presented in [18]. The presented decoder was derived from the VTM-11.0 reference software. Here, the decoder achieved real-time decoding of HD video sequences using SIMD and multi-threading on an ARM-based platform [24].
There are a handful of open-source software decoders available that are compliant with the VVC standard. The first is the already mentioned reference VVC test model, VTM.
Secondly, VVdeC [13], an implementation proposed by the Fraunhofer Institute, is an optimized decoder based on VTM. It includes SIMD and multi-thread parallelization for optimal performance. Thirdly, O266dec [32] is a real-time decoder proposed by Tencent Media Lab. O266dec also uses SIMD and multi-threading parallelization techniques. Unfortunately, this decoder has not been updated in the last 14 months. Finally, OpenVVC is a lightweight open-source software decoder that is available at [11]. It is designed to target different operating systems and hardware architectures. Similar to VVdeC and O266dec, OpenVVC uses data- and functional-level parallelism to optimize the decoding performance. For more details on VVC codecs, Sullivan [10] provides a complete list of available VVC encoder and decoder implementations.

1) Introduction to OpenVVC:
OpenVVC is an open-source software VVC decoder written in the C programming language. It is compiled as a cross-platform library, is compatible with the most-used operating systems, and is optimized for x86 and ARM processors. The latest version of the decoder is compliant with the VVC Main profile. In addition, it is integrated with the VLC [33], GPAC [34], and FFplay [35] video players. OpenVVC provides high decoding speed while using as little memory as possible. It exploits tile and frame parallelization based on multiple CPU cores, and SIMD optimisations to accelerate the decoding process. The decoding process of OpenVVC starts by parsing the parameters of the sequence. Then, the reconstruction tasks, consisting of the Inverse Quantization and Transform (TX), LMCS, Inter Prediction (EP), and Intra Prediction (IP) decoding blocks, are performed at the CU level. Next, the DBF is performed immediately after the reconstruction process is completed at the CTU level. This approach helps to optimize memory usage by avoiding massive storage of the quantization parameter map and CU dimensions that are essential for the DBF process. Finally, ALF is applied after SAO at the CTU-line level, and this process delivers the decoded frame as output.
2) Introduction to Versatile Video Decoder: VVdeC [13] is an open-source software VVC decoder optimized for x86 architectures and developed by the Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute (HHI). Unlike OpenVVC, VVdeC has been developed from the VTM reference software [36]. It supports the VVC Main 10 profile and is capable of decoding all VVC conformance bitstreams [37]. In addition, VVdeC comes with SIMD optimisations and multi-threading parallelization for x86 architectures. The parallelization of VVdeC decoding begins by parsing multiple frames concurrently. Then, in the reconstruction process, tasks are split based on CTU lines and CTUs. Here, a stage is assigned to each CTU for tracking the next stage, and tasks are processed in parallel once their dependencies are satisfied. Task coordination is achieved by assigning a task worker to each CTU. Task workers are scanned by a thread pool, which assigns the available tasks. VVdeC has achieved a decoding time reduction of up to 90% [2] with respect to VTM.

IV. DECODER OPTIMIZATIONS
This section describes the implementation of frame, tile, and Neon-based SIMD parallelism in OpenVVC on EGPP-based platforms.

A. Frames and Tiles parallelization in OpenVVC
In the frame-level parallelism of OpenVVC, a main thread is used to parse the Picture Parameter Set (PPS), Sequence Parameter Set (SPS), and picture/slice headers, and to schedule decoding threads from a thread pool. The main thread then provides the data and updates the internal structures of the available threads in the thread pool for decoding the frame. After the decoding process starts, motion compensation synchronisation between threads is performed for sequences with inter-coding configurations. In fact, this is the most challenging step in frame-level parallelism, where the available thread has to wait for motion compensation before starting to process pixels. When the pixels are ready, the available thread can perform the decoding process. This process is applied at the CTU-line level, since OpenVVC performs decoding and in-loop filtering on a CTU-line basis. Once the decoding process is completed, the decoding threads signal their availability to the main thread and return to the thread pool.
On the other hand, tile-level parallelism is applied to portions of a frame. In fact, every frame is decomposed into rectangular regions of the picture containing multiple CTUs [26]. The main challenge of tile-level parallelism is that tiles can have different run-time complexities. Therefore, the time required to finish one frame is the time needed to finish the longest tile. In this case, at certain processing times, some threads are free, without a task, waiting for the current frame to finish. For more details on this issue, a quality-driven dynamic frame partitioning for efficient parallel processing is described by Amestoy et al. in [38]. A dynamic tile and rectangular slice partitioning solution selects the partitioning configuration that minimizes the trade-off between multi-thread encoding time and quality loss. This is performed by taking into account both the spatial texture of the content and the encoding times of previously encoded frames. Experiments show that the proposed solution, compared to uniform partitioning, significantly decreases multi-thread encoding time with slightly better quality.
The solution integrated in OpenVVC aims to keep all threads active at all times. To do so, it applies a thread pipelining technique that overlooks frame boundaries and focuses only on tiles. Figure 3 illustrates tile pipelining. The tile partitioning forms a 2×2 grid. The tiles are labelled A, B, C, and D for the first frame and A', B', C', and D' for the second frame, and are delimited by the thicker black lines. Prediction dependencies across tile boundaries are broken, and the entropy coding state is reinitialized for each tile. These restrictions ensure that tiles are independently decodable, allowing several threads to decode the same picture simultaneously. As can be observed, regardless of the tile position, as soon as a thread from the thread pool becomes available, the next tile is processed. Thread 2, for example, does not work on any tile of the second frame, since it spends the entire time working on tile D of the first frame. This does not prevent threads 0 and 1 from working on the tiles of the second frame. However, adopting this technique creates a combination of frame and tile parallelism; as a result, dependencies between frames for inter prediction and motion compensation must be taken into account, which OpenVVC handles. Moreover, since OpenVVC processes tiles independently of their frame affiliation, tile size and load optimization at the encoder side do not actually impact the performance of OpenVVC. In the end, tiles are pipelined regardless of their size or load, without waiting for the current frame to finish.

B. SIMD optimisation in OpenVVC
In this study, ARM Neon-based SIMD optimisations were adopted to accelerate OpenVVC targeting EGPP-based platforms. First, the x86 architecture-based SIMD intrinsics used in OpenVVC are replaced by ARM Neon-based SIMD intrinsics. Additional adjustments were required, since Neon intrinsics are not as powerful and complete as the SSE or AVX intrinsics. In particular, in some cases one x86 SIMD instruction had to be replaced with multiple Neon-based SIMD instructions. For instance, the two Neon intrinsics vmull_s16 (widening multiply) and vpaddq_s32 (pairwise add) are needed to replace _mm_madd_epi16.
The ESoCs used in this study support 128-bit SIMD registers. A 128-bit register can be loaded with sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit data elements. This allows concurrent data processing to achieve a theoretical speedup of up to ×16 on 8-bit data. In this study, Neon-based SIMD instructions are used to optimize the most computationally demanding VVC decoder modules, presented in Table I, by adapting the SIMD [39] library. Here, the DST-VII, DCT-II, DCT-VIII, inter-component transform, and LFNST modules of the TX block of the VVC decoder were accelerated using SIMD registers. TX involves several matrix operations, including matrix multiplications for the inverse transformations, which were tackled with SIMD intrinsics. Then, the pixel prediction inside the picture in the IP block was effectively managed by storing masks, clipping, and offset values using SIMD intrinsics. Further, the edge and band filters of SAO use the vceq, vadd, and vsub instructions to handle the mathematical operations. Finally, the ALF filters are parallelized by concurrently storing the filter parameters using shuffle intrinsics. Moreover, the implementation exploits the full 128-bit capacity of the SIMD registers using load and store intrinsic instructions.

V. EXPERIMENTAL RESULTS
In this section, the experimental setup, the test bench used in this study, and the experimental results obtained are presented for the two open-source optimized decoders, VVdeC V1.3 and OpenVVC V1.0, on two EGPP-based embedded platforms.

A. Experimental Setup
This study focuses on low-cost mobile embedded heterogeneous platforms. Therefore, two ESoC platforms, ESoC1 [40] and ESoC2 [41], have been used. The ESoC1 processor consists of eight EGPP cores running at a maximum clock speed of 2.26 GHz and 512 embedded GPU (EGPU) cores running at a maximum clock speed of 1.37 GHz. In addition, ESoC1 has 8 MB of L2 cache, 4 MB of L3 cache, and 32 GB of 256-bit random access memory with 137 GB/s bandwidth. ESoC2 has 4 EGPP cores and 128 EGPU cores running at maximum clock speeds of 1.48 GHz and 0.92 GHz, respectively. Moreover, it has 2 MB of L2 cache and 4 GB of 64-bit random access memory with 25.6 GB/s bandwidth. This work relies only on the EGPP cores; the use of the EGPU cores is left for future work.

B. Test video sequences
Table II presents the different features of the fifteen JVET common test sequences [42] used in this study. The sequences, grouped by resolution class, have been encoded with the VTM-11.0 reference software in 10-bit random access configuration with a 3×3 tile partitioning at two QP values, 27 and 37. Three HD sequences from class E are used: FourPeople, Johnny, and KristenAndSara, alongside six FHD sequences of class B: MarketPlace, RitualDance, Cactus, BasketballDrive, BQTerrace, and ArenaOfValor. We also considered six UHD sequences from class A1: Tango2, FoodMarket4, and Campfire, and class A2: CatRobot1, DaylightRoad2, and ParkRunning3.

C. Results and analyses
Since this study focuses on analysing the decoding performance on embedded platforms, the average energy consumption and the maximum memory usage have also been measured. To do so, the two open-source optimized VVC decoders, VVdeC and OpenVVC, have been run on the two platforms described above.
1) Decoding performance: Firstly, the decoding performance of OpenVVC has been studied for four combinations of frame-tile parallelization, taking advantage of the 8 physical cores integrated in the ESoC1 architecture. The studied configurations include:
• 1-frame and 8-tile per frame in parallel (f1/t8).
• 2-frame and 8-tile per frame in parallel (f2/t8).
• 4-frame and 2-tile per frame in parallel (f4/t2).

This study has been performed to allow a fair comparison between the best configuration of OpenVVC and VVdeC. Fig. 4 shows the average decoding performance in frames per second (fps) for HD and FHD test sequences with QP 27 and QP 37 on ESoC1 using the OpenVVC decoder.

Fig. 4: Average decoding performance (fps) of OpenVVC for QP 27 and 37 sequences on the ESoC1 platform.

It can be seen from Fig. 4 that the least performing configuration is f1/t8 and the best performing configuration is f4/t2 for all QPs and for both HD and FHD sequences. The f4/t2 configuration achieved on average ×1.4 and ×1.3 the frame rate of the f1/t8 configuration for FHD and HD sequences, respectively. These results illustrate the gain brought by combining frame and tile parallelism compared to frame parallelism alone, which is constrained by inter-coding dependencies.

Fig. 5 presents the average decoding performance in fps obtained over ESoC2 for HD sequences using both QPs. ESoC2 has four physical cores; for this reason, only three combinations of frame-tile parallelization have been studied:
• 1-frame and 4-tile per frame in parallel (f1/t4).
• 2-frame and 4-tile per frame in parallel (f2/t4).
• 2-frame and 2-tile per frame in parallel (f2/t2).

In this case, and as shown in Fig. 5, the f2/t2 configuration achieved 62.6 fps for QP 27 and 74.5 fps for QP 37, which is higher on average by 8.1 fps and 3.4 fps than the f1/t4 and f2/t4 configurations for HD sequences, respectively.
The decoding performance and speedup of VVdeC and of the f4/t2 configuration of OpenVVC for QPs 27 and 37 on ESoC1 are presented in Table III for all considered video sequences. It can be seen that the obtained decoding performance results follow the same pattern for both considered QPs.
2) Memory usage: Memory usage is one of the most limiting factors and a likely bottleneck for video decoding on resource-constrained embedded hardware. This part of the study presents the maximum memory (in MB) consumed by OpenVVC and VVdeC on ESoC1 and ESoC2 for the FHD/HD sequences with the two QPs (27, 37). In Fig. 7, the average maximum memory usage for the different OpenVVC configurations and VVdeC is shown. Here, for both FHD and HD sequences, the f1/t8 and f2/t8 configurations used the least and the most memory, respectively. This behavior is expected: as more frames are decoded in parallel, an additional picture buffer and local buffers must be stored for each decoding thread.

3) Energy consumption: Energy consumption is another important factor for video processing on embedded platforms. In this study, the energy consumption was calculated as follows: 1) the power consumption (in mW) is sampled after decoding each frame using the built-in power monitor of both ESoCs, 2) the average power consumption over the entire sequence is multiplied by the total decoding time in seconds, and 3) the result is converted to joules. The average energy consumption in J for the different configurations of the OpenVVC and VVdeC decoders is shown in Fig. 8.

To summarize, there are three important parameters to take into consideration when selecting a video decoder for an embedded platform with limited hardware resources: the performance (fps), the energy consumed to decode a video, and the memory used. The performance of the decoders compared in this paper (VVdeC and OpenVVC) is very similar, and only a small improvement is achieved by VVdeC while the number of cores remains low. The energy consumed to decode a sequence is also very similar between both decoders. Finally, OpenVVC consumes less memory than VVdeC, by a factor higher than ×2.11. This significant reduction makes OpenVVC the best option for implementing a conformant VVC decoder on a multi-core platform with limited resources.

VI. CONCLUSION
This paper presented two open-source VVC decoders, OpenVVC and VVdeC, optimized for low-cost resource-constrained embedded platforms. Both OpenVVC and VVdeC have been optimized at the data-processing level using SIMD operations. In addition, tile- and frame-based parallelization has been implemented in OpenVVC. Both decoders achieved 15 to 34 fps for UHD sequences with QPs 27 and 37, and achieved real-time decoding for all configurations of FHD and HD sequences on ESoC1 using 8 cores. Furthermore, 16 to 28 fps were obtained for FHD sequences at QPs 27 and 37, and real-time decoding was obtained for all HD sequences by OpenVVC and VVdeC on ESoC2 using 4 cores. Moreover, the experimental results for the two most important factors on embedded platforms, the average energy consumption and the maximum memory usage of both decoders, were presented for ESoC1 and ESoC2. VVdeC consumed on average ×2.74 and ×2.96 the memory of the OpenVVC f4/t2 configuration on ESoC1 and of the f2/t2 configuration on ESoC2, respectively. Regarding average energy usage, VVdeC consumed on average ×1.11 the energy of the OpenVVC f4/t2 configuration on ESoC1 and ×0.83 the energy of the OpenVVC f2/t2 configuration on ESoC2.
A. Task level parallelism
1) Single instruction multiple data: SIMD is a data-level parallel processing technique that loads multiple data elements into a single register to operate on. The x86-based architectures offer Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) SIMD intrinsics. Embedded general purpose processor (EGPP)-based platforms support the ARM Neon instruction suite instead of the SSE and AVX SIMD intrinsics. ARM Neon is an advanced SIMD technology designed for mobile devices that supports registers of up to 128 bits. Neon-based SIMD technology can be used in four ways [25]: a) using Neon intrinsics, b) Neon-enabled libraries, c) compiler auto-vectorization, and d) hand-coded Neon assembler.

Fig. 6: Average decoding performance (in fps) of OpenVVC, in brown QP 27 and blue QP 37, and VVdeC, in black QP 27 and red QP 37, for 1 to 8 cores.

TABLE I: Main functions optimized with SIMD.

In addition, loading and storing data in larger SIMD registers helped to accelerate EP, because the prediction information of the pixels is needed multiple times in different EP functions.

TABLE II: Features of the considered VVC test sequences.

In Table IV, the decoding performance of the VVdeC and OpenVVC (f2/t2 configuration) decoders for all considered FHD and HD sequences with QPs 27 and 37 on the ESoC2 platform is shown. Here, both VVdeC and OpenVVC achieved real-time decoding for HD sequences on ESoC2 using 4 cores. Therefore, in the remainder of this study, the ESoC2 results are presented for HD sequences. The average decoding performance with respect to the number of cores is presented in Fig. 6. Here, the decoding frame rates have been recorded for the OpenVVC and VVdeC decoders on ESoC1 and ESoC2. For both QPs, the average results in fps of OpenVVC and VVdeC are similar for one to four cores on ESoC2. Moreover, VVdeC achieved up to ×1.08 the frame rate of OpenVVC on ESoC1 and reached its saturation point at 7 cores for HD sequences. However, for FHD sequences, the performance results of OpenVVC and VVdeC are comparable on ESoC1.

TABLE IV: Decoding performance (in fps) for the considered HD and FHD test sequences at QP27 (top) and QP37 (bottom) on the ESoC2 platform with 1, 2, 3, and 4 cores.
Both open-source optimized video decoders, OpenVVC and VVdeC, reached real-time decoding for FHD and HD sequences on ESoC1 using 8 cores. In addition, both solutions present near-real-time performance for UHD sequences on the ESoC1 platform, where OpenVVC and VVdeC achieved an average of 22 fps for QP 27 and 28.5 fps for QP 37 using 8 cores. Table V and Table VI show the average performance (in fps) of OpenVVC and VVdeC using different numbers of threads on ESoC1 and ESoC2, respectively. The complexity overhead of OpenVVC with respect to VVdeC is around 3% for UHD, 5% for FHD, and 12% for HD sequences on both platforms.

TABLE V: Average performance (fps) of OpenVVC and VVdeC decoders on the ESoC1 platform with 1 and 8 cores.

TABLE VI: Average performance (fps) of OpenVVC and VVdeC decoders on the ESoC2 platform with 1 and 4 cores.