1 Introduction

The growing multimedia portfolio, including big data processing, cloud computing, and the Internet of Things (IoT) [1], has a direct impact on our lifestyle. Multimedia IoT (M-IoT) is considered a major network technology enabling the interconnection and interaction between humans, health centers, industries, and objects such as cameras, vehicles, and sensors [1, 2]. In addition, M-IoT systems combine networking technologies with computer vision, image processing, and connectivity. Thus, they can be used in driving assistance, surveillance such as crime and fire detection, and remote sensing such as high-speed object tracking [3]. Real-world multimedia applications, including smart Industry 4.0 and Agriculture 4.0, smart traffic monitoring, smart cities, smart homes, smart health, and smart environments with intelligent surveillance systems [3], are illustrated in Fig. 1. However, several issues such as interoperability, security, data size, reliability, storage, and computational capacity need to be resolved to process multimedia data [4].

Fig. 1
figure 1

M-IoT use cases

Compared to traditional IoT, M-IoT offers more powerful functionality, such as fast and reliable data delivery. It therefore imposes high quality of service (QoS) requirements and demands an efficient network architecture. In this context, quality of experience (QoE) represents the QoS from the end user's perspective. QoE can be characterized as objective or subjective. Users' objective QoE is difficult to measure and varies considerably according to the requirements of M-IoT devices (larger memory, higher computational power, more power-hungry operation with higher bandwidth, etc.). Service providers, however, rely on subjective QoE to evaluate the network mean opinion score (MOS) [5]. Multimedia data (audio, image, video, etc.) pose several challenges for transmission, storage, and sharing, and especially for processing [6]. Furthermore, M-IoT processing requires efficient feature extraction, event processing, encoding/decoding, energy-efficient computing, QoS, and QoE [7].

As emerging technologies have rapidly evolved, multimedia services and video applications have grown tremendously. Higher image resolutions (4K, 8K), especially for video games and monitoring tasks, are needed to satisfy the QoS specifications of end users. In traditional multimedia encoding methods, data are compressed only once and decoded every time they are played. M-IoT devices are mainly concerned with uploading data in uplink transmission, which poses challenges for computationally constrained M-IoT devices. Therefore, versatile video coding (VVC), a powerful multimedia encoding/decoding technique, has been widely adopted. VVC [8] is the new-generation video coding standard finalized in July 2020 by the Joint Video Experts Team (JVET) as the successor of high-efficiency video coding (HEVC) [9]. As the next standard for sophisticated video coding technology, VVC allows up to 30% BD rate savings while maintaining the same quality as HEVC. Although VVC aims to maintain high-quality compressed video with additional encoding features, there are still compression artifacts that can lead to lower QoE. Hence, the QoE of VVC-compressed video needs to be improved.

On the other hand, since VVC adopts a block-based coding and quantization structure, many different forms of distortion still exist, such as blocking, blurring, and ringing artifacts. Blocking artifacts in particular degrade the visual quality. While these distortions are permanent and cannot be removed entirely, special filters can be used to reduce them. For example, loop filters play an important role in reducing artifacts and in improving video and image quality.

In the VVC standard, three in-loop filtering techniques are applied: the de-blocking filter (DBF), the sample adaptive offset (SAO), and the adaptive loop filter (ALF), the last of which is not used in HEVC. These filters remove video compression artifacts and enhance the visual quality of the reconstructed video. The DBF is designed as a discontinuity-based smoothing filter to minimize artifacts along block boundaries [10, 11]. To reduce ringing artifacts through compensation, SAO is applied after DBF; it adds offsets to samples based on an encoder lookup table and analyzes signal amplitudes using a histogram [12]. ALF is the latest loop filter and is a new feature in VVC; it reduces distortions between the reconstructed and original images [13]. Although these conventional in-loop filters can relieve specific artifacts, it is difficult for them to overcome the complex distortion introduced by video compression. To meet this challenge, powerful deep learning approaches have been used. Among these techniques, the convolutional neural network (CNN) is the most robust and efficient processing method for recognizing and analyzing images and videos [14,15,16].

Several CNN-based filtering approaches have been proposed for video quality enhancement in the HEVC and VVC standards [17, 18]. These CNN-based in-loop filtering and post-processing approaches aim to reduce visual artifacts and achieve high performance. Indeed, given the challenges of 5G and M-IoT technologies, such as low latency, high data rates, and high video and image resolution, the original VVC loop filtering becomes insufficient to meet the resolution requirements of M-IoT-based applications. To address these critical issues, QoE must be considered and improved in order to ensure QoS for end users [3].

In this context, we propose a deep CNN-based in-loop filtering approach, denoted the wide-activated squeeze-and-excitation deep convolutional neural network (WSE-DCNN). The proposed approach provides a new, powerful in-loop filter that replaces the traditional ones (DBF, SAO, ALF) in the VVC standard. The main goal is to effectively remove compression artifacts and enhance the compressed video quality, thereby improving the QoE of end users. The contributions of this paper are summarized as follows:

  • We propose a WSE-DCNN framework-based quality-aware system for the M-IoT context.

  • We implement the proposed scheme in the VVC standard, achieving coding gains under the random access configuration.

  • We adopt an M-IoT scenario in a smart city context in which the QoE of the video quality is improved.

The remainder of this paper is organized as follows: Sect. 2 presents the related work, and Sect. 3 introduces the proposed M-IoT scenario. Then, the proposed deep CNN-based in-loop filtering in the VVC standard is defined in Sect. 4. Next, we evaluate the proposed method in Sect. 5. Finally, Sect. 6 concludes the paper and outlines some perspectives.

2 Related work

In this section, we start by briefly describing several existing works related to multimedia data computing in IoT for video coding. Then, we will present deep CNN-based in-loop filtering methods.

2.1 Video coding for M-IoT

M-IoT poses several challenges for identifying data transmission methods that offer reduced latencies for real-time processing, while ensuring QoS, QoE, and flexible data sizes to meet bandwidth limitations and reduce power consumption. To effectively reduce transmitted data and improve video quality, video compression is the most relevant module to address in this field.

Video compression is therefore necessary for efficient transmission of video data over the band-limited Internet. This need was felt most strongly during the COVID-19 pandemic, when data traffic was used for e-learning, video conferencing, and real-time surveillance. Lee et al. [19] proposed an encoding algorithm using HEVC for compressing high-quality video at 4K and 8K resolutions; the proposed fast algorithm achieves better performance in terms of computational complexity and bit rate. In [20], an IoVT platform is developed that combines HEVC and H.264/advanced video coding (AVC) for reliable real-time video streaming. Meanwhile, in the mHealth context, video compression is considered a key technology and has therefore been widely used for real-time medical video communications. With regard to healthcare applications, Panayides et al. [21] studied VVC and AV1 (AOMedia Video 1) encoding in mHealth video communication scenarios. In [22], a comparative study between HEVC and VVC was presented in the context of video telehealth systems; the obtained results show that VVC achieves BD rate savings of up to 40% for high-quality full high definition (FHD) video. On the other hand, Alarifi et al. [23] proposed a novel hybrid cryptosystem for secure HEVC streaming in IoT multimedia applications.

However, the aforementioned methods remain limited in terms of the QoE and QoS required by the new generation of wireless networks, which could usher in new immersive experiences such as virtual reality (VR) and augmented reality (AR).

2.2 Deep learning-based video coding

Recently, deep learning (a branch of artificial intelligence) has seen great success in computer vision tasks [24, 25], especially for video encoding [26,27,28]. Indeed, deep neural networks have been adopted to improve coding tools, including intra- and inter-prediction, transformation, quantization, and loop filtering, for the HEVC and VVC standards. Regarding HEVC, Bouaafia et al. [28] proposed a machine learning-based complexity reduction method for the inter-prediction process, which achieves a good rate-distortion (RD) versus complexity trade-off. Additionally, for intra-coding, a fast CNN-based algorithm to improve HEVC intra-coding performance is introduced in [29]. With regard to in-loop filtering in HEVC, Pan et al. [30] proposed an enhanced deep convolutional neural network (ED-CNN) to replace DBF and SAO in order to eliminate artifacts; the suggested scheme achieved 6.45% BD rate reduction and 0.238 dB PSNR gains. The variable-filter-size residue learning convolutional neural network (VRCNN) was proposed in [31] as a replacement for both DBF and SAO in HEVC intra-coding, achieving BD rate savings of 4.6%.

For the VVC standard, Ma et al. [32] developed a new CNN model, MFRNet, to improve in-loop filtering and post-processing. The proposed technique was integrated into the VVC test model to remove visual artifacts and enhance video quality. Furthermore, an in-loop filter based on a dense residual convolutional neural network (DRN) was proposed for VVC, applied after DBF and before SAO and ALF [18]. To reduce CU partitioning complexity, a fast intra-CU coding technique for H.266/VVC was implemented based on an enhanced DAG-SVM classifier model [33]; experimental results show that this model achieves a 54.74% encoding time reduction. In addition, Park et al. [34] suggested a lightweight neural network (LNN)-based fast decision algorithm to eliminate redundant VVC block partitioning, achieving a trade-off between encoding complexity and compression performance. However, these approaches do not take QoE into account for the VVC standard in an M-IoT context.

In this context, we propose an in-loop filter based on the wide-activated squeeze-and-excitation deep CNN (WSE-DCNN) approach to enhance VVC video quality and achieve coding gains.

3 Proposed M-IoT scenario-based architecture for multimedia data

Without loss of generality, we propose an M-IoT scenario in the context of a smart city, as illustrated in Fig. 2. It consists of a set of M-IoT devices, such as cameras and multimedia devices, that are capable of acquiring multimedia content from the real, physical world. The sensed multimedia data are then sent to the centralized cloud for processing, via the network layer, using different transmission technologies such as LP-WAN [35]. M-IoT devices are mainly concerned with uploading data in uplink transmission, which poses challenges for computationally constrained M-IoT devices. Several metrics can be considered at this step, such as delay, jitter, and packet loss rate. Our interest here lies in the centralized computing tasks, such as M-IoT data compression and encoding/decoding.

Fig. 2
figure 2

M-IoT scenario-based centralized video quality enhancement

After the M-IoT data acquisition step, data are compressed once and decoded whenever played. Traditionally, video encoding/compression is achieved by exploiting spatial and temporal redundancies. In this context, video quality is considered the main challenge in the VVC standard and must be improved in this phase, especially when the huge volume of collected multimedia data is structured/unstructured, arrives with high velocity, and spans different resolutions. Therefore, the QoE, which depends on the video quality performance, is the metric that should be maximized. Based on the model given in [36], the QoE is expressed as a function of the bit rate (BR), as formulated in (1).

$$\begin{aligned} {\textit{QoE}}_{{\textit{BR}}} = a \times {\textit{log}}({\textit{BR}}) +b \end{aligned}$$
(1)

where a and b denote coefficients determined experimentally. This metric, together with PSNR, will be used to evaluate the video quality of the proposed WSE-DCNN-based in-loop filtering.
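For illustration, the following minimal Python sketch evaluates the model of Eq. (1); the coefficient values used here are placeholders only, since a and b must be fitted experimentally.

```python
import math

# Placeholder coefficients: a and b are fitted experimentally, as stated above.
A_COEFF = 4.5
B_COEFF = 10.0

def qoe_from_bitrate(bitrate_kbps, a=A_COEFF, b=B_COEFF):
    """Logarithmic QoE model of Eq. (1): QoE_BR = a * log(BR) + b (natural log assumed here)."""
    return a * math.log(bitrate_kbps) + b

print(qoe_from_bitrate(2000.0))  # QoE estimate at 2 Mbps
```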

4 WSE-DCNN-based in-loop filtering for VVC-based M-IoT

The proposed WSE-DCNN framework is integrated into the VVC standard, replacing the original VVC in-loop filter module, as shown in Fig. 3. The main purpose of this approach is to enhance the visual quality of the reconstructed frame while maintaining coding gains. The rate distortion optimization (RDO) technique is applied in order to decide whether or not the proposed WSE-DCNN-based loop filter is used for each coding unit (CU). The RDO cost is given by Eq. (2).

$$\begin{aligned} J = D + \lambda R, \end{aligned}$$
(2)

where D represents the distortion between the original and the reconstructed frame, R indicates the number of coding bits needed, and \(\lambda \) is the Lagrange multiplier controlling the trade-off between D and R. To avoid a reduction in RDO efficiency, coding tree unit (CTU)-level on/off control is applied. Frame-level filtering is switched off to prevent over-signaling if the quality enhancement is not worth the cost of the signaled bits. For each \({\textit{CTU}}\), the \({\textit{CTU}}\) control flag is enabled when the \({\textit{RDO}}\) cost indicates better quality for the filtered \({\textit{CTU}}\); otherwise, the flag is disabled.
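A minimal sketch of this CTU-level decision is given below, assuming the encoder has already measured the filtered and unfiltered distortions, the extra signaling bits, and the Lagrange multiplier; the function names are ours and do not belong to the VTM software.

```python
def rdo_cost(distortion, rate_bits, lam):
    """RDO cost of Eq. (2): J = D + lambda * R."""
    return distortion + lam * rate_bits

def ctu_filter_flag(d_filtered, d_unfiltered, extra_rate_bits, lam):
    """Enable the WSE-DCNN filter for a CTU only if the filtered RDO cost,
    including the extra signaled bits, beats the unfiltered cost."""
    j_on = rdo_cost(d_filtered, extra_rate_bits, lam)
    j_off = rdo_cost(d_unfiltered, 0.0, lam)
    return j_on < j_off

# Example: the filter lowers the distortion from 120.0 to 95.0 at a cost of one flag bit.
print(ctu_filter_flag(95.0, 120.0, extra_rate_bits=1.0, lam=20.0))  # True
```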

Fig. 3
figure 3

Proposed VVC standard framework

Fig. 4
figure 4

WSE-DCNN architecture

The concept of the proposed architecture is illustrated in Fig. 4. The architecture is shared by the luma (Y) and the two chroma (U and V) components, so the three components are filtered simultaneously. The proposed WSE-DCNN model takes six inputs: three carry the reconstructed \({\textit{YUV}}\) components, and the other three carry the quantization parameter (\({\textit{QP}}\)) map and the coding unit (CU) partition maps for luminance and chrominance. These inputs are first normalized to provide better convergence in the learning phase and then fed to the WSE-DCNN-based in-loop filter. Hence, the three (Y/U/V) reconstructions are normalized to [0, 1] based on the highest bit-depth value. The normalized values \(P''(x,y)\) are obtained by Formula (3).

$$\begin{aligned} P''(x,y)=\frac{P'(x,y)}{2^{B}-1} ,\quad x=1,\ldots ,W,\ y=1,\ldots ,H \end{aligned}$$
(3)

where B denotes the bit depth, \(P'(x,y)\) is the reconstructed sample and \(P''(x,y)\) its normalized value in the normalized Y/U/V at position (x, y), and W and H are the width and the height of the reconstructed frame, respectively.

Different quantization parameters (QPs) produce reconstructed videos of varying quality. Feeding the QP as an additional input makes it possible for a single set of network parameters to fit reconstructions of different qualities. The \({\textit{QP}}\) is therefore normalized to a \({\textit{QPmap}}\) following Formula (4).

$$\begin{aligned} {\textit{QPmap}}(x,y)=\frac{{\textit{QP}}}{63} ,\quad x=1,\ldots ,W, y=1,\ldots ,H \end{aligned}$$
(4)

The remaining inputs are the \({\textit{CU}}\) partitions of the luma (\({\textit{Y}}\)) and chroma (\({\textit{UV}}\)) components, since blocking artifacts are mainly caused by the CU block partitioning. The \({\textit{CU}}\) block partition is converted into coding unit maps (\({\textit{CUmaps}}\)), normalized, and then fed as an input to the network. For example, for each \({\textit{CU}}\) in each frame, the boundary positions are filled with two and the other positions with one. Two \({\textit{CUmaps}}\) are thus obtained, one denoted \(Y-{\textit{CUmap}}\) for luma and the other \(UV-{\textit{CUmap}}\) for chroma.
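The following NumPy sketch illustrates how the input planes can be built from Eqs. (3) and (4) and the CUmap convention above; the helper names and the CUmap normalization factor are our own choices, not part of the VTM software.

```python
import numpy as np

def normalize_reconstruction(plane, bit_depth=10):
    """Eq. (3): scale reconstructed samples to [0, 1] by the maximum sample value 2^B - 1."""
    return plane.astype(np.float32) / (2 ** bit_depth - 1)

def build_qp_map(qp, height, width):
    """Eq. (4): constant map holding the normalized QP (VVC QP range 0-63)."""
    return np.full((height, width), qp / 63.0, dtype=np.float32)

def build_cu_map(boundary_mask):
    """CUmap convention above: 2 on CU boundaries, 1 elsewhere, then scaled to [0, 1].
    `boundary_mask` is a hypothetical boolean map of CU boundary positions."""
    cu_map = np.where(boundary_mask, 2.0, 1.0).astype(np.float32)
    return cu_map / 2.0  # normalization factor chosen here for illustration

# Example on a 64x64 luma patch encoded at QP 37 with a toy 16x16 CU grid
rec_y = np.random.randint(0, 1024, (64, 64))
boundaries = np.zeros((64, 64), dtype=bool)
boundaries[::16, :] = True
boundaries[:, ::16] = True
planes = [normalize_reconstruction(rec_y), build_qp_map(37, 64, 64), build_cu_map(boundaries)]
```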

As shown in Fig. 4, the WSE-DCNN-based in-loop filter processes its inputs at three levels. At the first level, the three \({\textit{YUV}}\) components are processed through WSE blocks, and each of them is fused with its corresponding \({\textit{CUmap}}\): the \({\textit{CUmap}}\) is multiplied by its corresponding channel before being concatenated with the feature maps. At the second level, the feature maps of the different channels are concatenated, together with the \({\textit{QPmap}}\), and then processed by several WSE blocks. Finally, at the last level, the three channels are processed separately again to generate the final residual image, which is added to the original input, so the network is implemented as a residual CNN. The WSE module is the basic unit of the WSE-DCNN-based in-loop filter proposed for the VVC standard and shown in Fig. 4. This basic unit is composed of a wide-activated convolution [37] and a squeeze-and-excitation (\({\textit{SE}}\)) [38] operation. The wide-activated convolution performs very well in super-resolution and noise reduction tasks; it consists of a wide \(3\times 3\) convolution followed by a ReLU (rectified linear unit) [39, 40] activation function and a narrow \(1\times 1\) convolution. The \({\textit{SE}}\) operation then weights each channel: it exploits the complex relationships between different channels and generates a weighting factor for each channel.

The WSE unit consists of the following phases, as depicted in Algorithm 1, given a feature map X of shape \(H\times W \times C\), where C denotes the number of channels:

  • A wide \(3\times 3\) convolution followed by ReLU and a convolution layer with a \(1\times 1\) kernel. \(Y_1\) is the output of the wide convolution, defined in Algorithm 1 line 4, and \(Y_2\) is the output of the second convolution layer, given in line 5.

  • Each channel obtains a single value through the squeeze operation using global average pooling (GAP), giving \(Y_3(k)\) as shown in Algorithm 1 line 10.

  • The excitation operation consists of two fully connected layers followed by ReLU and sigmoid (\(\sigma \)) activation functions, respectively. As shown in Algorithm 1 line 13, \(Y_4\) is the output of the first fully connected layer followed by ReLU, which reduces the channel dimension by a ratio r. The second fully connected layer, followed by the sigmoid activation function, is denoted by \(Y_5\) in line 14; it gives each channel a smooth gating value in the range [0, 1].

  • According to the WSE function, each \(Y_2\) channel is multiplied by its corresponding gating value, as defined in line 19.

  • Finally, when the number of input channels equals the number of output channels C, a skip connection is added directly from input to output to learn the residue. Otherwise, no skip connection is used. A code sketch of this unit is given after this list.
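As a concrete illustration, the following TensorFlow/Keras sketch implements one WSE unit along the lines described above; the channel count, expansion factor, and reduction ratio r are placeholder values, and the layer arrangement is our reading of Algorithm 1 rather than the exact reference implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def wse_block(x, channels=64, expansion=4, r=16):
    """One WSE unit: wide 3x3 conv + ReLU and narrow 1x1 conv (Y1, Y2),
    squeeze by global average pooling (Y3), excitation by two fully connected
    layers with ReLU and sigmoid (Y4, Y5), channel-wise gating, and an optional
    residual skip connection when input and output channel counts match."""
    y1 = layers.Conv2D(channels * expansion, 3, padding='same', activation='relu')(x)
    y2 = layers.Conv2D(channels, 1, padding='same')(y1)
    y3 = layers.GlobalAveragePooling2D()(y2)                # squeeze: one value per channel
    y4 = layers.Dense(channels // r, activation='relu')(y3)
    y5 = layers.Dense(channels, activation='sigmoid')(y4)   # gating values in [0, 1]
    gated = y2 * layers.Reshape((1, 1, channels))(y5)       # scale each Y2 channel
    if x.shape[-1] == channels:
        return layers.Add()([x, gated])                     # residual skip connection
    return gated

# Example: apply one WSE block to a 64x64 single-channel (luma) feature map
x_in = layers.Input(shape=(64, 64, 1))
feat = layers.Conv2D(64, 3, padding='same')(x_in)
model = tf.keras.Model(x_in, wse_block(feat))
```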

5 Experimental results

In this section, we evaluate the proposed WSE-DCNN-based in-loop filtering scheme in the VVC standard in terms of RD performance and QoE. Then, a comparison with the state of the art is made.

5.1 Dataset collection

In this work, we exploited the public large-scale video dataset BVI-DVC [41], specifically developed for training deep video compression methods. According to [41], all selected sequences are progressive-scanned at a spatial resolution of \(3840\times 2160\), with frame rates ranging from 24 fps to 120 fps, a bit depth of 10 bits, and in YCbCr 4:2:0 format. All of them are truncated to 64 frames without scene cuts, using the segmentation method described in [42]. To further increase data diversity and provide data augmentation, the 200 video clips were spatially down-sampled to \(1920\times 1080\), \(960\times 540\), and \(480\times 270\) using a Lanczos third-order filter. The BVI-DVC dataset thus includes 800 video sequences with different contents at four resolutions. Table 1 summarizes the key features of the BVI-DVC training dataset used in this study.

Table 1 Key features of BVI-DVC video training database [41]

In this context, we selected 80% of the video sequences from the BVI-DVC dataset for training and 20% for validation. These sequences are compressed with the VVC reference software (VTM-4.0) [43] at QP values (22, 27, 32, 37) under the random access configuration. For each QP, the reconstructed video frames, including luma and chroma components, and their corresponding ground truth are divided into \(64\times 64\) patches, which are selected in random order.

5.2 Deep model training, testing, and evaluation

The proposed deep learning model is trained offline in a supervised manner. During the training phase, TensorFlow GPU [44] is used as the deep learning framework. The training parameters used in our experiments are as follows: the batch size is set to 128, the number of training epochs to 200, and the learning rate to 0.001, with a decay factor of 0.1 applied every 50 epochs. The Adam optimizer [45] is used to train the model. The training platform runs Windows 10 with an Intel® Core™ i7-3770 @ 3.4 GHz CPU, 16 GB RAM, and an NVIDIA GeForce RTX 2070 GPU.
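The following sketch mirrors this training configuration in TensorFlow/Keras; the tiny stand-in network and random patches are placeholders used only to make the snippet self-contained, the real model being the WSE-DCNN of Fig. 4 trained on the compressed/ground-truth patch pairs described above.

```python
import numpy as np
import tensorflow as tf

# Stand-in model and data so that the snippet runs on its own; they replace
# the actual WSE-DCNN and the BVI-DVC 64x64 patch pairs described in the text.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),
    tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(1, 3, padding='same'),
])
x_train = np.random.rand(256, 64, 64, 1).astype('float32')
y_train = np.random.rand(256, 64, 64, 1).astype('float32')

def lr_schedule(epoch, lr):
    # Decay the learning rate by a factor of 0.1 every 50 epochs (initial value 1e-3)
    return lr * 0.1 if epoch > 0 and epoch % 50 == 0 else lr

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss='mse')  # Eq. (5): pixel-wise mean square error
model.fit(x_train, y_train, batch_size=128, epochs=200,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```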

The mean square error (MSE) between the ground truth image and the reconstructed image is applied as the loss function [35]. Equation (5) defines the MSE loss function.

$$\begin{aligned} L(\theta )=\frac{1}{N}\sum _{i=1}^{N}||F(Y_i,\theta )-X_i ||_{2}^{2} \end{aligned}$$
(5)

Let \(X_{i}\) be the ground truth of the proposed model, where \(i \in \{1,\ldots ,N\}\). The output of the WSE-DCNN model is denoted by \(F(\cdot )\), where \(Y_{i}\) represents the compressed images, \(i \in \{1,\ldots ,N\}\), and \(\theta \) is the parameter set of the proposed framework. The loss function converges to a minimum value, which indicates that our model is well trained. To demonstrate the efficiency of the proposed WSE-DCNN network, Fig. 5 shows the MSE loss and the validation peak signal-to-noise ratio (PSNR) during the training process. The PSNR is defined by Eq. (6) [46]. As can be seen, the MSE loss function converges well.

$$\begin{aligned} {{\textit{PSNR}}}= {10\times {\textit{log}}_{10}} \frac{(2^{B} - 1)^{2}}{{\textit{MSE}}} \end{aligned}$$
(6)

where B is the number of bits per sample of the video sequence and \({\textit{MSE}}\) is defined in Eq. (5).
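A small sketch of Eq. (6), computing the PSNR of a 10-bit reconstruction against its ground truth (the function name is ours):

```python
import numpy as np

def psnr(reference, reconstructed, bit_depth=10):
    """Eq. (6): PSNR = 10 * log10((2^B - 1)^2 / MSE) for B-bit samples."""
    mse = np.mean((reference.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(((2 ** bit_depth - 1) ** 2) / mse)

# Example on two random 10-bit frames differing by small noise
ref = np.random.randint(0, 1024, (64, 64))
rec = np.clip(ref + np.random.randint(-4, 5, ref.shape), 0, 1023)
print(psnr(ref, rec))
```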

Fig. 5
figure 5

Training MSE loss and validation PSNR

In the testing phase, our proposed WSE-DCNN model is integrated into the VVC standard to replace the traditional in-loop filtering method. All simulations are run under the VVC JVET common test conditions (\({\textit{CTC}}\)) [47] using the random access configuration at \({\textit{QP}}\) values (22, 27, 32, 37). The VTM-4.0 software with the traditional filters enabled is used as the anchor in our experiments. From the VVC CTC, 17 test sequences were used for performance evaluation, covering class A1 (\(3840\times 2160\)), class A2 (\(3840\times 2160\)), class B (\(1920\times 1080\)), class C (\(832\times 480\)), and class D (\(416\times 240\)). To evaluate the coding performance of the proposed model, the Bjontegaard delta bit rate (BD rate) [48] is applied as the assessment metric.

Table 2 Performance evaluation of the proposed model under random access configuration

5.3 WSE-DCNN evaluation

The RD performance of the proposed model compared to the original VVC standard is shown in Table 2. Columns Y, U, and V report the BD rate of the Y, U, and V components, respectively. The ratios of encoding and decoding time of the proposed model relative to the original one are denoted by \(T_{{\textit{enc}}}\) and \(T_{{\textit{dec}}}\). These ratios are defined by Eq. (7), where \(T_{{\textit{Pro}}}\) is the coding complexity of the proposed method and \(T_{{\textit{Orig}}}\) is the coding complexity of the original VVC.

$$\begin{aligned} T=\frac{T_{{\textit{Pro}}}}{T_{{\textit{Orig}}}}\times 100\% \end{aligned}$$
(7)

As shown in Table 2, when integrated into VVC under the random access configuration, the proposed scheme achieves mean coding gains of 2.85% BD rate savings for the luma Y component, and 8.89% and 10.05% BD rate reductions for the chroma U and V components, respectively. The proposed scheme offers significant RD compression performance, mainly for the U and V chrominance components, across all test sequences. The coding performance clearly differs across sequences, which means that the proposed model is influenced by the video content. Moreover, the proposed model performs well in terms of coding gains for high-motion or richly textured video sequences, such as Tango2, DaylightRoad2, Kimono2, and RaceHorses. In summary, the proposed technique outperforms the VVC with its traditional in-loop filtering algorithm in terms of RD performance.

Regarding complexity, the encoding and decoding time differences between the proposed VVC algorithm and the original VVC standard are also summarized in Table 2. An NVIDIA GeForce RTX 2070 GPU is used to measure the encoding and decoding time of the proposed filtering technique. From the table, we observe that, on average over all test sequences (class A1 to class D), the encoding time ratio is 122% and the decoding time ratio is 1648.76% compared to the original VVC under the random access configuration. The proposed scheme greatly increases the decoding time because of the network forward pass and the CPU-GPU memory copy operations. In contrast, the proposed model incurs only a small increase in encoding time compared to the original VVC algorithm.

To further demonstrate the effectiveness of our filtering model integrated into the VVC standard, PSNR is also used as a quality measure, calculated over the three components by the following equation [46]:

$$\begin{aligned} {\textit{PSNR}}_{YUV}= \frac{ 6\times {\textit{PSNR}}_{Y} + {\textit{PSNR}}_{U}+ {\textit{PSNR}}_{V}}{8} \end{aligned}$$
(8)
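For reference, a one-line helper implementing this luma-weighted combination (the example PSNR values are arbitrary):

```python
def psnr_yuv(psnr_y, psnr_u, psnr_v):
    """Eq. (8): luma-weighted combined PSNR, (6*PSNR_Y + PSNR_U + PSNR_V) / 8."""
    return (6.0 * psnr_y + psnr_u + psnr_v) / 8.0

print(psnr_yuv(38.2, 42.5, 43.1))  # 39.35 dB
```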
Fig. 6
figure 6

Ablation study. Subjective visual quality comparison (the 12th frame of BQSquare with \({\textit{QP}}=37\)): a original; b VVC without in-loop filtering (\({\textit{PSNR}}=31.17\) dB); c VVC (\({\textit{PSNR}}=31.37\) dB); d VVC-based proposed model (\({\textit{PSNR}}=31.68\) dB)

Fig. 7
figure 7

Comparison of QoE variation with respect to bit rate

The BQSquare video sequence, encoded with \({\textit{QP}}\) equal to 37 under the random access configuration, is used to assess the subjective visual quality and further verify the effectiveness of the proposed model. Figure 6 compares the subjective video quality. It is clear that frame details are blurry when compressed by the original VVC standard but become clearer after being filtered by the proposed model. In fact, the proposed model effectively eliminates blocking artifacts as well as ringing and blurring, improving the visual quality compared to the VVC standard with and without traditional in-loop filtering. In addition, we compare the QoE as a function of bit rate for the proposed technique and the original VVC, for class A1 to class D under the random access configuration at the four QPs, as shown in Fig. 7. It is remarkable that the suggested technique meets the QoE requirements of end users, especially for high-resolution video sequences, such as those in class A1, class A2, and class B.

Table 3 Coding performance comparison with other approaches

We also compared the proposed approach with other CNN-based filtering models. Table 3 compares the coding performance with the approaches of [18, 49,50,51] in terms of RD performance and complexity under the random access configuration using the VVC CTC. In [18], the authors proposed an in-loop filter algorithm based on a dense residual convolutional neural network (DRN) to improve the reconstructed video quality; this network is integrated after \({\textit{DBF}}\) and before \({\textit{SAO}}\) and \({\textit{ALF}}\) into the VVC VTM-4.0 test model and is trained on the DIV2K dataset [52]. Moreover, a CNN-based in-loop filter is proposed for both intra- and inter-pictures, placed before \({\textit{ALF}}\) with \({\textit{DBF}}\) and \({\textit{SAO}}\) disabled [49]; this method is implemented in the VVC VTM-3.0 standard. In [50], the authors proposed a CNN-based loop filter used alongside all the traditional filters in VVC, trained on the DIV2K dataset [52] and implemented in VTM-5.0. In addition, Huang et al. [51] proposed a novel multi-gradient convolutional neural network-based in-loop filter for VVC to replace the original DBF and SAO filters; this network is trained on the DIV2K dataset [52] and implemented in VTM-3.0.

Fig. 8
figure 8

RD performance curves of the proposed model compared to the other three approaches

As shown in Table 3, the proposed WSE-DCNN framework integrated into the VVC standard achieves the best \({\textit{RD}}\) performance for both the luminance and the two chrominance components over all test sequences from class A1 to class D, compared to the previously proposed approaches. These results imply that the proposed model performs well in terms of both objective and subjective visual quality. Regarding computational complexity, the proposed method outperforms the other approaches in terms of encoding time for class A2, whereas the methods in [18, 49, 50] exceed our scheme in terms of complexity reduction elsewhere. In conclusion, the suggested method delivers efficient RD performance on almost all test sequences, demonstrating the efficiency and generality of the WSE-DCNN solution compared to other methods, while the computational complexity of the VVC standard remains a limitation.

For further evaluation, we provide the \({\textit{RD}}\) performance curves of the suggested in-loop filtering model versus the other three methods under the random access configuration at four \({\textit{QPs}}\) for class A1 to class D. Figure 8 shows the comparison in terms of \({\textit{RD}}\) performance (PSNR versus bit rate). Comparing with the related approaches, we can conclude that the proposed filtering model considerably improves the \({\textit{RD}}\) performance of the VVC standard. Our proposed in-loop filtering model works especially well on high-resolution video sequences, such as those in class A1, class A2, and class B.

6 Conclusion

In this paper, we proposed a deep learning algorithm for the VVC standard to enhance visual video quality while improving users' QoE. The proposed WSE-DCNN framework is implemented in the VVC standard to replace the in-loop filtering in order to alleviate coding artifacts such as ringing, blocking, and blurring. The proposed VVC filtering technique is used in an M-IoT scenario-based smart city context to support the centralized cloud in meeting the required user video quality. Compared to the traditional VVC filters, the simulation results prove that the proposed framework achieves the best compression performance in terms of objective and subjective quality, with BD rate savings of about \(-2.85\)%, \(-8.89\)%, and \(-10.05\)% for the Y, U, and V components, respectively. This demonstrates the effectiveness of the proposed technique for video quality enhancement. Future work includes reducing the VVC computational complexity (encoding and decoding time) [23].