1 Introduction

Video is widely employed in different fields, ranging from social media to self-driving cars. Although modern cameras can capture high-quality videos in many situations, there are cases in which their quality is significantly reduced. For example, when videos are captured in poor lighting conditions or compressed to reduce storage requirements, visible artifacts are introduced, degrading both the user experience and the performance of many computer vision tasks.

Video restoration aims to recover the clean video sequence from its degraded version. Different video restoration tasks can be defined: video denoising aims to remove noise, whose level can be high when videos are captured in challenging imaging conditions; video deblurring aims to remove blur caused by out-of-focus subjects, moving objects, or camera shake; video super-resolution aims to produce a high-resolution version of a video from its low-resolution counterpart; video compression artifact reduction aims to reduce the artifacts introduced by the compression algorithms applied to limit video storage requirements.

Many video restoration methods have been proposed over the years. They can be broadly divided into two categories: traditional methods and deep learning-based methods. In this paper, we focus our attention on deep learning methods because they represent the emerging trend within the scientific community. We consider all the aforementioned restoration tasks to provide a global picture of the advances in video restoration because, although some methods are proposed to address a specific task, their building blocks and main features are not task-specific. In fact, some architectures have been shown to be effective across different restoration tasks.

In this paper, we provide a comprehensive review of recent advances in video restoration using deep learning, analyzing the main features of some representative methods in an organized and structured manner, and highlighting their strengths and weaknesses. To the best of our knowledge, this is the first review of video restoration methods considering baseline schemes, architectural design strategies, convolution types, alignment techniques, and loss functions. Many surveys related to single-image restoration methods exist (Wang et al. 2020b; Tian et al. 2020a; Koh et al. 2021; Liu et al. 2020). Here we consider the video domain, which has been less investigated and presents several distinct challenges. Recently, Liu et al. (2022) conducted a study on video super-resolution based on deep learning, focusing on alignment strategies. In this work, we extend the analysis to other video restoration tasks and to other aspects of video restoration methods.

Our main contributions can be summarized as follows:

  • We provide a comprehensive review of existing video restoration methods based on deep learning, analyzing in a hierarchical manner their main features related to architectural choices, motion handling and loss functions, and discussing their advantages and limitations.

  • We describe the characteristics of standard benchmark datasets, including their size in terms of number of sequences and frames, the resolution and the format of the contained video sequences, and classify them according to whether they contain synthetic or real distortions.

  • We summarize the performance of the reviewed methods on the considered benchmark datasets, both in terms of effectiveness, reporting results with the standard Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) metrics, and in terms of efficiency, reporting the declared computational complexity and/or runtime on given input resolutions and hardware configurations.

  • We discuss the main challenges and future research directions in video restoration using deep learning, such as the need for real-time processing, improved strategies for frame alignment, multi-distortion restoration methods, better metrics and datasets, as well as the combination with traditional methods.

2 Background

Video restoration is the task that aims to remove artifacts introduced in videos by internal factors (e.g., noise) or external factors (e.g., camera shaking), producing a video of better quality. There is a huge variety of methods addressing the problem of video restoration. In recent years, research has been focused on the use of deep learning techniques. Therefore, this article reviews only methods in this category.

It is possible to see video restoration as a multi-image restoration task, where each video frame is restored using an image restoration method. However, this solution does not exploit the temporal correlation among frames and may achieve suboptimal performance when the artifacts are strong, producing temporally inconsistent results due to the introduction of new temporal artifacts, such as flickering.

The main difference between image and video restoration methods is that the latter can exploit the temporal redundancy present in videos. Temporal redundancy means that the same information is contained within multiple frames, and video restoration methods can take advantage of this redundant information to recover details that may be missing in one frame. Indeed, neighboring frames typically contain the same objects, and such objects may appear with different levels of detail because of artifacts altering their appearance.

Fig. 1 Example of two consecutive frames containing different levels of distortion. Since in the target frame the person is blurry (red rectangle), some information in the next frame (green rectangle) can be used to restore the target frame, improving the final outcome. Images reprocessed from the REDS dataset (Nah et al. 2019a)

For instance, Fig. 1 shows two consecutive frames representing the same scene, but some contents in Fig. 1b appear sharper than in Fig. 1a. In such a case, temporal redundancy can be used to improve the quality of the results by aggregating sharper information from other frames. Even if neighboring frames contain the same objects, they may be located in different positions due to motion. Hence, an appropriate mechanism able to align frames is usually implemented.

The general framework of video restoration methods is reported in Fig. 2.

Fig. 2 General framework for video restoration methods. A sequence of adjacent frames is used as input. The alignment module aligns adjacent frames with the target one, the feature fusion module fuses the information contained in the aligned features, and the reconstruction module reduces the artifacts to produce the restored frame

Given the target frame, video restoration methods take advantage of adjacent frames to obtain additional information useful to restore it. Typically, N previous and N subsequent frames are used to gather information both from the past and the future. Three modules with different purposes can be identified: (i) the alignment module is used to align input frames with the target one so that the obtained representations are spatially aligned; (ii) the fusion module aggregates the aligned representations and further refines them; (iii) the reconstruction module uses the fused representations to reduce artifacts and produce the restored frame. These modules can be implemented in different ways by different restoration methods, as discussed in the following sections.
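To make this three-module decomposition concrete, the following PyTorch-style sketch shows a minimal (and deliberately simplified) implementation of the pipeline; the module internals, names, and hyperparameters are illustrative placeholders and do not correspond to any specific published method.

```python
import torch
import torch.nn as nn

class VideoRestorationNet(nn.Module):
    """Toy alignment -> fusion -> reconstruction pipeline."""
    def __init__(self, num_frames=5, channels=64):
        super().__init__()
        self.feat = nn.Conv2d(3, channels, 3, padding=1)
        # (i) alignment: a plain convolution standing in for MEMC or deformable alignment
        self.align = nn.Conv2d(2 * channels, channels, 3, padding=1)
        # (ii) fusion of the aligned representations
        self.fuse = nn.Conv2d(num_frames * channels, channels, 3, padding=1)
        # (iii) reconstruction of the restored target frame
        self.rec = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1))

    def forward(self, frames):                       # frames: (B, 2N+1, 3, H, W)
        b, t, _, h, w = frames.shape
        feats = [self.feat(frames[:, i]) for i in range(t)]
        target = feats[t // 2]
        aligned = [self.align(torch.cat([f, target], dim=1)) for f in feats]
        fused = self.fuse(torch.cat(aligned, dim=1))
        return frames[:, t // 2] + self.rec(fused)   # restore the central (target) frame
```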

3 Video restoration methods

Video restoration using deep learning is an active research field, and many methods have been proposed over the years. Although the differences among these methods may be quite large, they share some characteristics related to architectural choices, motion handling approaches and learning strategies. Therefore, instead of analyzing each method in isolation, we identify and review the main characteristics of video restoration methods and discuss their advantages and possible limitations. A graphical organization of the features analyzed in this paper is reported in Fig. 3. Table 1 provides a brief description of each feature and summarizes its advantages and limitations, which are further clarified and motivated in the following.

Fig. 3 Hierarchical organization of the features of video restoration methods reviewed in this paper. MEMC refers to motion estimation motion compensation

Table 1 Brief description and summary of the main advantages and limitations of the video restoration features analyzed according to the hierarchical organization in Fig. 3

3.1 Architectures

Defining the right neural architecture is one of the most critical problems in the field of video restoration, as it impacts the final performance both in terms of effectiveness and efficiency. In this section, we describe the two possible baseline schemes that can be used by video restoration methods. Then we review the main design strategies and the convolution types used to model the spatial and temporal relationships among frames.

3.1.1 Baseline schemes

Video restoration methods take advantage of temporal redundancy to access the information contained in the temporal neighborhood. To this end, two baseline schemes can be used, i.e., the multi-frame and the recurrent approaches, that are schematically represented in Fig. 4.

Fig. 4 Schematic representation of the main baseline schemes

3.1.1.1 Multi-frame

The simplest strategy to give the network access to temporal information is the multi-frame approach. It consists of using a fixed-size temporal sliding window centered on the target frame. The target frame and its neighboring frames are stacked together to form the input to the restoration method, as shown in Fig. 4a. The size of the temporal window is a hyperparameter to tune and is usually set between three (Caballero et al. 2017; Guan et al. 2019; Claus and Gemert 2019) and seven (Jo et al. 2018; Xue et al. 2019; Deng et al. 2020). Too small a window may prevent the network from fully exploiting the potential information in the temporal neighborhood (Zhang et al. 2018a; Haris et al. 2019), whereas too large a window increases the computational complexity (Claus and Gemert 2019) and may include frames containing irrelevant information due to large object motion (Zhang et al. 2018a). Methods based on the multi-frame scheme usually process each frame multiple times, depending on the window size, and this might result in a waste of computational resources. Although these limitations can be addressed by using different strategies, the multi-frame scheme is also widely employed by recent methods (Paliwal et al. 2021; Chen et al. 2021; Vaksman et al. 2021), as it is a simple yet effective solution to exploit temporal context.
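As an illustration, the snippet below sketches how a fixed-size sliding window can be built in PyTorch; the function name and the replicate-padding policy at sequence borders are our own assumptions, since different methods handle borders differently.

```python
import torch

def sliding_window(video, t, n=2):
    """Return the 2n+1 frames centered on index t from `video` (T, C, H, W),
    replicating the border frames when the window exceeds the sequence."""
    length = video.shape[0]
    idx = [min(max(i, 0), length - 1) for i in range(t - n, t + n + 1)]
    return video[idx]                              # shape (2n+1, C, H, W)

# Example: a five-frame window around frame 0 of a 30-frame clip
clip = torch.rand(30, 3, 128, 128)
window = sliding_window(clip, t=0)                 # uses frames [0, 0, 0, 1, 2]
```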

3.1.1.2 Recurrent

An alternative solution to capture information from the temporal context is the use of Recurrent Neural Networks (RNNs). Following this approach, illustrated in Fig. 4b, each frame is progressively passed through the network, which extracts its features, aggregates them into a hidden state to be used for future frames, and uses relevant information from the previously processed frames to restore it. Methods using the recurrent scheme are usually faster than multi-frame ones because each frame is processed only once, and they can potentially achieve better performance because they can exploit a larger temporal window. However, they require suitable mechanisms to aggregate the features extracted from multiple frames. To this end, different strategies have been proposed (Hyun Kim et al. 2017; Nah et al. 2019b; Zhong et al. 2020; Zhou et al. 2019; Isobe et al. 2020). Hyun Kim et al. (2017) developed a strategy to blend the feature maps of previous frames with those of the current frame using a convolutional layer. Nah et al. (2019b) realized an iterative procedure using the outputs of RNN cells as inputs to the same cell multiple times. Isobe et al. (2020), inspired by Dynamic Filter Networks (Jia et al. 2016), implemented a module that adapts the hidden state to the appearance of the current frame by using correlation to highlight only the most similar features.

An important aspect of recurrent methods is how the information is propagated through the framework. Usually, it is propagated from the initial frame to the last one (Nah et al. 2019b; Zhong et al. 2020; Zhou et al. 2019; Hyun Kim et al. 2017; Zhao et al. 2021). Such unidirectional propagation may be suboptimal because the amount of information received when processing different frames is different, as the first frames have access to less information than the last ones. Some methods (Huang et al. 2017b; Chan et al. 2021a, 2022; Zhu et al. 2022) use bidirectional information propagation, where information is propagated both forward and backward so that each frame can also benefit from the information coming from subsequent frames. Chan et al. (2021a) conducted a study demonstrating that bidirectional propagation improves the restoration performance.
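The following sketch illustrates the bidirectional recurrent scheme under simplified assumptions (a single convolution per cell and plain summation of the two directions); real methods use more elaborate cells and fusion mechanisms.

```python
import torch
import torch.nn as nn

class BidirectionalRecurrent(nn.Module):
    """Toy bidirectional propagation: features flow backward and forward along
    the sequence, so each frame also receives information from future frames."""
    def __init__(self, channels=64):
        super().__init__()
        self.channels = channels
        self.backward_cell = nn.Conv2d(3 + channels, channels, 3, padding=1)
        self.forward_cell = nn.Conv2d(3 + channels, channels, 3, padding=1)

    def forward(self, frames):                       # frames: (B, T, 3, H, W)
        b, t, _, h, w = frames.shape
        # backward pass: propagate a hidden state from the last frame to the first
        state = frames.new_zeros(b, self.channels, h, w)
        backward_feats = []
        for i in range(t - 1, -1, -1):
            state = torch.relu(self.backward_cell(torch.cat([frames[:, i], state], dim=1)))
            backward_feats.insert(0, state)
        # forward pass: propagate from the first frame to the last and merge
        state = frames.new_zeros(b, self.channels, h, w)
        outputs = []
        for i in range(t):
            state = torch.relu(self.forward_cell(torch.cat([frames[:, i], state], dim=1)))
            outputs.append(state + backward_feats[i])
        return torch.stack(outputs, dim=1)           # per-frame features (B, T, C, H, W)
```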

3.1.2 Design strategies

When designing a deep neural network, there are several issues to deal with. For instance, deep architectures suffer from the vanishing gradient problem, which can degrade the performance by preventing layers close to the input from being properly optimized. Another issue is feature modulation, since not all the features extracted by neural networks carry information that is actually useful for the considered task. To tackle these problems, several design strategies have been proposed, and researchers build their networks by combining them in different ways. Figure 5 reports the most common architectural design strategies, which are analyzed in detail in the following.

Fig. 5 Schematic representation of the common architecture design strategies

3.1.2.1 Residual learning

He et al. (2016) proposed ResNet and demonstrated that residual learning can facilitate the training process and improve accuracy for image classification. Since then, it has been widely adopted for other computer vision tasks, including video restoration. There are two possible implementations of residual learning in the design of a CNN: global and local residual learning.

Global residual learning is used to model situations where the output is highly correlated with the input, so that it is easier to learn the residual between them than the full mapping. It is usually realized by adding a skip connection from the input to the output, so that the network only needs to learn, for example, the difference between input and output (Su et al. 2017; Guan et al. 2019; Zhou et al. 2019; Wang et al. 2019; Deng et al. 2020), as shown in Fig. 5a.

Local residual learning is primarily used to mitigate the vanishing gradient problem, and it consists in using blocks composed of groups of convolutions with skip connections, as in ResNet or variants of it (Zhang et al. 2018a; Nah et al. 2019b). Some works (Nah et al. 2017; Lim et al. 2017) empirically found that slight modifications of the original ResNet block were beneficial for the restoration performance. Figure 5b illustrates the design of a network adopting local residual blocks.
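A minimal sketch combining both forms of residual learning is reported below; the block design (two convolutions without batch normalization, as in the modified ResNet blocks mentioned above) and all hyperparameters are illustrative.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Local residual learning: a simplified ResNet-style block without
    batch normalization, as commonly used in restoration networks."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)                    # local skip connection

class ResidualRestorer(nn.Module):
    def __init__(self, channels=64, num_blocks=8):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x):
        # global residual learning: only the difference between the degraded
        # input and the restored output has to be learned
        return x + self.tail(self.blocks(self.head(x)))
```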

3.1.2.2 Dense connections

Dense connections have quickly spread since the introduction of DenseNet (Huang et al. 2017a). A dense block is characterized by several skip connections that forward the output of each layer directly to the input of the subsequent layers, so that each layer receives collective knowledge from all the previous layers. Figure 5c shows the architecture design of a network with dense connections.

Similar to residual learning, dense connections help mitigate the vanishing gradient problem. In addition, they enable feature reuse, making it possible to learn richer patterns, they increase the network receptive field, and they allow networks with fewer parameters, because dense blocks have a relatively small growth rate, i.e., the number of channels added by each layer. Some methods (Guan et al. 2019; Jo et al. 2018) adopted dense connections, observing an improvement in the overall restoration results. Zhong et al. (2020) integrated dense blocks within RNN cells mainly to reduce the computational complexity of their model.
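The following sketch shows a possible implementation of a dense block with a small growth rate; the layer count and growth value are arbitrary examples.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all preceding feature maps;
    `growth` is the (small) number of channels each layer adds."""
    def __init__(self, in_channels=64, growth=16, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels + i * growth, growth, 3, padding=1)
            for i in range(num_layers)])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = torch.relu(layer(torch.cat(features, dim=1)))  # dense connection
            features.append(out)
        return torch.cat(features, dim=1)           # collective knowledge of all layers
```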

3.1.2.3 Attention mechanism

Attention mimics cognitive attention, defined as the ability to choose and concentrate mainly on relevant stimuli. In computer vision, the attention mechanism can be considered as a dynamic selection process that is realized by weighting features according to their importance in producing the output. Figure 5d shows a general implementation of the attention mechanism, where a sigmoid activation is used to produce weights between 0 and 1 and element-wise multiplication is used to modulate the input by suppressing irrelevant features. Typical attention types are: (i) channel attention, used to select the most important channels; (ii) spatial attention, used to select the most important regions; (iii) temporal attention, used to select the most important frames (Guo et al. 2022).
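The sketch below illustrates the channel and spatial variants of this mechanism in a Squeeze-and-Excitation-like form; it is a generic example, not the attention module of any specific method discussed next.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: a sigmoid produces per-channel weights in [0, 1]
    that modulate the input features (Squeeze-and-Excitation style)."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))            # element-wise modulation

class SpatialAttention(nn.Module):
    """Spatial attention: a single-channel sigmoid map weights each location."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, 7, padding=3)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))
```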

Wang et al. (2019) used temporal attention to identify the frames most similar to the target one, and spatial attention to mitigate errors arising from wrong frame alignment. Mehta et al. (2021) inserted the Squeeze-and-Excitation channel attention module (Hu et al. 2018) into their network to better model dependencies across channels. Similarly, Zhong et al. (2020) adopted the same module but with slight modifications to improve the fusion of features from past and future frames. Zhao et al. (2021) designed a spatial attention module using deformable convolutions (Zhu et al. 2019) to highlight artifact-rich areas in each frame, such as boundary areas of moving objects, so that their model can focus more on removing artifacts in such areas. Paliwal et al. (2021) combined channel and spatial attention to identify errors related to optical flow computation, such as occlusions, by using Squeeze-and-Excitation blocks and Convolutional Block Attention Modules (CBAM) (Woo et al. 2018).

Designing architectures with attention modules can increase the overall effectiveness because they allow the network to distinguish relevant features from irrelevant ones and to weight them accordingly. The main disadvantage is related to efficiency, since including attention leads to an increase in the number of parameters and operations.

3.1.2.4 Multi-path learning

Multi-path learning refers to processing features using multiple and separate paths that finally merge the complementary information. Multi-path learning can be either global or local.

In the global version, multiple parallel paths focus on different aspects of the input, as shown in Fig. 5e. Usually, two separate paths are used by video restoration methods (Jo et al. 2018; Chen et al. 2021; Isobe et al. 2020; Zhou et al. 2019). Jo et al. (2018) used one path to learn upscaling filters (Jia et al. 2016) and the other to learn high-frequency components, with the two paths sharing most of the weights. Isobe et al. (2020) separated low-frequency and high-frequency components, i.e., structures and details in the spatial domain, and processed them using separate branches. Chen et al. (2021) proposed a two-branch network with independent weights, where one branch is used to extract spatial features from individual frames and the other one to extract temporal features from multiple frames. These features are finally merged using a stack of convolutions.

Local multi-path learning is inspired by Inception modules (Szegedy et al. 2015), which are composed of multiple paths containing convolutions with different kernel sizes to analyze the input using multiple receptive fields. Mehta et al. (2021) included local multi-path learning in their network using layers composed of three convolutions with filters of size \(3\times 3\), \(5\times 5\) and \(7\times 7\), whose results are finally summed up. Zhao et al. (2021) employed two local branches with different receptive fields to increase the accuracy of offset prediction for deformable convolutions (Zhu et al. 2019).
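A possible implementation of such an Inception-like local multi-path layer is sketched below; summation of the three branches follows the description above, while the channel sizes are illustrative.

```python
import torch.nn as nn

class LocalMultiPath(nn.Module):
    """Three parallel convolutions with 3x3, 5x5 and 7x7 kernels analyze the
    input with different receptive fields; their outputs are summed up."""
    def __init__(self, channels=64):
        super().__init__()
        self.path3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.path5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.path7 = nn.Conv2d(channels, channels, 7, padding=3)

    def forward(self, x):
        return self.path3(x) + self.path5(x) + self.path7(x)
```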

While global multi-path learning can provide better modeling capabilities, as there are multiple paths focusing on improving different aspects of the input, local multi-path learning allows to extract multi-scale features by looking at the input with multiple receptive fields.

3.1.2.5 Coarse-to-fine processing

In visual recognition, coarse-to-fine processing refers to applying a method to a downscaled version of the image, i.e., coarse, and then gradually increasing its resolution and propagating the results to the fine version. In a coarse-to-fine architecture, as illustrated in Fig. 5f, the input is downscaled multiple times and processed by the network starting from the coarsest level (i.e., the lowest resolution), and the output is first upscaled and then propagated to the upper level until the finest level (i.e., original resolution) is reached. The main idea behind this approach is that the network can process the main structures at the coarsest level, while focusing on the details at the finest level. Propagating the outputs to upper levels allows the network to reuse features from lower levels, avoiding repeated computations and focusing on higher-level abstractions. All the levels usually share the same structure.
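The following sketch illustrates a possible coarse-to-fine pipeline: the input is repeatedly downscaled, processed starting from the coarsest level, and each intermediate output is upscaled and concatenated with the next finer input. The three-level pyramid and the per-level sub-networks are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFine(nn.Module):
    """Process a pyramid of inputs from the coarsest to the finest level,
    upscaling and reusing each intermediate result at the next level."""
    def __init__(self, channels=16, levels=3):
        super().__init__()
        # one small sub-network per level (weights could also be shared)
        self.levels = nn.ModuleList([
            nn.Sequential(nn.Conv2d(6, channels, 3, padding=1), nn.ReLU(inplace=True),
                          nn.Conv2d(channels, 3, 3, padding=1))
            for _ in range(levels)])

    def forward(self, x):                            # x: (B, 3, H, W)
        pyramid = [x]
        for _ in range(len(self.levels) - 1):
            pyramid.append(F.avg_pool2d(pyramid[-1], 2))        # downscale
        out = torch.zeros_like(pyramid[-1])                     # start at the coarsest level
        for level, inp in zip(self.levels, reversed(pyramid)):
            if out.shape[-2:] != inp.shape[-2:]:
                out = F.interpolate(out, size=inp.shape[-2:],
                                    mode='bilinear', align_corners=False)  # propagate up
            out = level(torch.cat([inp, out], dim=1))           # refine at this scale
        return out
```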

The coarse-to-fine design is typically adopted in the context of optical flow estimation, where it is known to be an effective solution for modeling large motion between objects and improving the estimation accuracy (Amiaz et al. 2007). Some methods for video restoration (Caballero et al. 2017; Guan et al. 2019; Yang et al. 2018) developed a coarse-to-fine module for motion estimation and compensation, which starts by computing optical flow at the lowest resolution and then propagates the estimated flow to upper levels for refinement, whereas others (Xue et al. 2019; Chan et al. 2021a, 2022) integrated an existing coarse-to-fine network (Ranjan and Black 2017) inside their framework to increase flow estimation accuracy.

A limitation of this approach is that coarse-to-fine networks may struggle in detecting small fast moving objects in coarse levels because they are removed by downscaling operations, and thus they are not suitable to handle large motion in this case (Savian et al. 2020).

3.1.3 Convolution types

In video restoration, both spatial and temporal correlations among neighboring frames must be properly modeled to produce detail-rich and temporally coherent results. To this end, different convolution types can be used.

3.1.3.1 2D convolutions

2D convolutions are the most commonly used type: a 2D filter is centered on each spatial location of the feature map, and the element-wise products between the local neighborhood and the filter weights are summed up. A 2D convolution transforms a 2D matrix of features into a different 2D matrix of features that is passed as input to the next layer. Video restoration methods use 2D convolutions to process and fuse features coming from multiple input frames. The first convolutional layer typically fuses all the frames, and the next layers have only a limited effect in modeling additional temporal information because, after the application of the first layer, the temporal dimension is squeezed and later convolutions only operate on the spatial dimensions (Fan et al. 2019). Therefore, 2D convolutions are effective in abstracting spatial dependencies, but they are not fully adequate in capturing temporal ones.

3.1.3.2 3D convolutions

A solution to take into account the temporal correlation among frames is the use of 3D convolutions. The main difference is that filter depth and input depth in 3D convolutions are not constrained to be equal as in 2D convolutions. Thus, a 3D filter can move in all the three dimensions, i.e., height, width, and depth. At each position, the element-wise product and addition produce one number, hence the output is a 3D data structure. 3D convolutions can capture spatial relationships in the input data, as 2D convolutions do, but they can model temporal relationships as well (Tran et al. 2015). While Zhang et al. (2018a) employed only 3D convolutions, other methods (Jo et al. 2018; Chen et al. 2021; Vaksman et al. 2021) used 3D convolutions together with 2D ones to better handle spatial and temporal information. The main limitation of 3D convolutions is related to efficiency, since applying them increases the number of operations to perform.
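The short example below contrasts the two convolution types on a stack of seven frames: the 3D convolution preserves the temporal dimension and can model it at every layer, whereas the 2D convolution collapses it into the channel dimension after the first layer.

```python
import torch
import torch.nn as nn

frames = torch.rand(1, 3, 7, 128, 128)              # (B, C, T, H, W): seven input frames

# 3D convolution: the kernel also slides along time, so temporal relationships
# can be modeled at every layer.
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
print(conv3d(frames).shape)                          # torch.Size([1, 64, 7, 128, 128])

# 2D convolution: the frames must first be folded into the channel dimension,
# after which later layers only operate on the spatial dimensions.
conv2d = nn.Conv2d(3 * 7, 64, kernel_size=3, padding=1)
print(conv2d(frames.flatten(1, 2)).shape)            # torch.Size([1, 64, 128, 128])
```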

3.1.3.3 Deformable convolutions

Deformable convolutions were introduced by Dai et al. (2017) to address the limited capability of CNNs in modeling large and unknown transformations, which originates from the rigid sampling grid of standard convolutions. In deformable convolutions, 2D offsets are added to the regular grid sampling locations, deforming the constant receptive field of the standard convolution operation. For each location, the applied deformation depends on the input features: the offsets are computed from the input feature map using additional convolutional layers, whose weights are learned during training. Zhu et al. (2019) proposed an enhanced version of deformable convolutions, where modulation scalars, i.e., position-specific weights used to modulate the contribution of each sampling location, are learned along with the 2D offsets. In video restoration, deformable convolutions are typically used for frame alignment (Wang et al. 2019; Tian et al. 2020b; Deng et al. 2020; Yue et al. 2020; Chan et al. 2022; Zhao et al. 2021). Using deformable convolutions, a network can adapt its receptive field according to object scales, thus being able to handle the large pixel displacements caused by motion. However, additional parameters representing the 2D offsets and modulation scalars must be learned during training.
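As a sketch of this idea, the example below predicts offsets and modulation scalars from the input features with extra convolutions and feeds them to torchvision's DeformConv2d; the layer sizes are illustrative, and passing a modulation mask assumes a reasonably recent torchvision version.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableLayer(nn.Module):
    """The 2D offsets and the modulation scalars are predicted from the input
    itself by extra convolutions, then passed to the deformable convolution."""
    def __init__(self, channels=64, k=3):
        super().__init__()
        self.offset_conv = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)  # x/y offsets
        self.mask_conv = nn.Conv2d(channels, k * k, k, padding=k // 2)        # modulation
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, x):
        offsets = self.offset_conv(x)              # where each kernel tap samples
        mask = torch.sigmoid(self.mask_conv(x))    # position-specific weights in [0, 1]
        return self.deform(x, offsets, mask)
```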

3.1.3.4 Efficient convolutions

Model efficiency is crucial in real-time applications. A possible solution to reduce model complexity is to replace standard convolutions with more efficient learnable layers, such as separable and depth-wise convolutions (Chollet 2017; Howard et al. 2017; Mehta et al. 2021; Xiao et al. 2021; Vaksman et al. 2021).

Separable convolutions exploit the separability of the convolution operation along the spatial dimensions, so that a two-dimensional kernel can be decomposed into two one-dimensional kernels, reducing the number of parameters. However, since not all kernels are separable, the use of separable convolutions may degrade the performance.

In depth-wise convolutions, each input channel is convolved with its own kernel and, instead of summing the results over channels as in standard convolutions, the output channels are simply stacked together. These convolutions were introduced to increase efficiency because the total number of operations to perform is lower than in regular convolutions, but they may also lead to a decrease in performance (Bao et al. 2020).
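The sketch below shows the usual combination of a depth-wise convolution (realized with the groups argument) followed by a 1×1 point-wise convolution, together with a rough parameter count for a 64-channel example.

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, k=3):
    """A depth-wise convolution (one kernel per input channel, groups=in_ch)
    followed by a 1x1 point-wise convolution, replacing a standard k x k conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch),   # depth-wise
        nn.Conv2d(in_ch, out_ch, 1))                                # point-wise

# Rough parameter count (bias ignored) for 64 -> 64 channels with k = 3:
# standard conv: 64 * 64 * 9 = 36,864 weights
# separable version: 64 * 9 + 64 * 64 = 4,672 weights
layer = depthwise_separable(64, 64)
```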

3.2 Motion handling

Motion is an intrinsic characteristic of video data, and video restoration methods must deal with it to exploit the spatial information of adjacent frames. In this section, we first analyze the main alignment techniques used by video restoration methods to align neighboring frames with the target one, then we discuss why some methods perform alignment at feature level instead of at frame level.

3.2.1 Alignment techniques

Alignment techniques are used by video restoration methods to spatially align adjacent frames with the target one, so that information referring to the same objects in multiple frames is located at the same spatial positions and can be aggregated and accessed more easily. Many solutions to align frames have been proposed and can be grouped into a few categories, as reported in the following. The choice of alignment technique is important for a video restoration method, since it can have a measurable impact on the final performance, as demonstrated by several studies (Chan et al. 2021a, b; Zhou et al. 2022).

3.2.1.1 Motion Estimation Motion Compensation (MEMC)

The most common technique for handling motion in video restoration is the Motion Estimation Motion Compensation (MEMC) approach (Xue et al. 2019). This solution aligns frames in two steps: first, it performs motion estimation, which aims to estimate per-pixel motion between a source and a target frame, and then applies motion compensation, which aims to warp the source frame to the target one according to the estimated motion. Motion estimation is typically done by optical flow computation (Beauchemin and Barron 1995), which is the task that computes per-pixel motion vectors between two frames. Given the source frame \(I_s\) and the target frame \(I_t\), the flow map \(F_{s\rightarrow t}\) describing how pixels moved can be defined as:

$$\begin{aligned} F_{s\rightarrow t} = ME(I_s, I_t) \end{aligned}$$
(1)

where ME is the motion estimation operation. Motion compensation shifts the pixel positions in the source frame \(I_s\) according to the per-pixel vectors contained in the flow map \(F_{s\rightarrow t}\). The warped frame \(\hat{I}_t\) is obtained as:

$$\begin{aligned} \hat{I}_t = MC(I_s,F_{s\rightarrow t}) \end{aligned}$$
(2)

where MC is the warping operation that can be implemented by using bilinear interpolation or the sampling layer of a Spatial Transformer Network (STN) (Jaderberg et al. 2015).
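A minimal sketch of the MC operation in Eq. (2), based on bilinear interpolation via grid_sample, is reported below; it assumes the flow map stores horizontal displacements in the first channel and vertical displacements in the second.

```python
import torch
import torch.nn.functional as F

def warp(source, flow):
    """Motion compensation via bilinear interpolation: warp `source` (B, C, H, W)
    toward the target frame according to `flow` (B, 2, H, W), assumed to store
    horizontal displacements in channel 0 and vertical displacements in channel 1."""
    b, _, h, w = source.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack((xs, ys), dim=0).float().to(source.device)    # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                                # sampling positions
    # normalize coordinates to [-1, 1] as required by grid_sample
    norm_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    norm_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((norm_x, norm_y), dim=-1)                     # (B, H, W, 2)
    return F.grid_sample(source, grid, mode='bilinear', align_corners=True)
```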

Optical flow computation was originally defined as a handcrafted optimization problem (Weinzaepfel et al. 2013; Revaud et al. 2015; Hu et al. 2017), but the growing spread of deep learning has led to the development of CNN-based models that can produce more accurate results than traditional methods (Dosovitskiy et al. 2015; Ranjan and Black 2017; Sun et al. 2018; Teed and Deng 2020). Some video restoration methods (Xue et al. 2019; Pan et al. 2020; Chan et al. 2021a; Son et al. 2021; Paliwal et al. 2021) directly integrated an existing CNN-based method for optical flow estimation within their architectures. Xue et al. (2019) adopted SpyNet (Ranjan and Black 2017) as flow estimation network and STN (Jaderberg et al. 2015) to perform frame warping, while Chan et al. (2021a) used the same model but opted for plain bilinear interpolation. In contrast, Pan et al. (2020) employed PWC-Net (Sun et al. 2018), Paliwal et al. (2021) used RAFT (Teed and Deng 2020), whereas Son et al. (2021) adopted LiteFlowNet (Hui et al. 2018) due to its efficiency. Other methods (Caballero et al. 2017; Yang et al. 2018; Guan et al. 2019) developed their own modules to perform frame alignment using MEMC. Caballero et al. (2017) built a Spatio-Temporal Motion Compensation (STMC) module, adopting a coarse-to-fine processing approach that propagates coarser flows to upper levels for progressive refinements. Due to excessive downscaling, the accuracy of the estimated motion vectors was reduced. Therefore, Yang et al. (2018) and Guan et al. (2019) later improved upon STMC by introducing an additional flow estimation layer without any downscaling operation.

Fig. 6 Example of motion estimation and compensation between two frames. In the warped frame \(\hat{F}_t\), motion estimation and compensation artifacts are visible (red squares). Black pixels on the right-hand side of \(\hat{F}_t\) are due to occlusions. Images reprocessed from the DVD dataset (Su et al. 2017)

Existing optical flow estimation methods are not designed to receive degraded frames as input; hence, a retraining procedure is necessary to adapt them to the considered task, typically starting from pretrained models. Accurate ground truth for optical flow estimation cannot be obtained unless a dataset is synthetically generated. A possibility is to estimate flow maps by applying pretrained models to ground truth frames and use the obtained maps as ground truth to adapt flow estimation methods to degraded frames. However, the domain gap between datasets for optical flow methods and those for video restoration methods may lead to inaccurate flow estimations (Son et al. 2021). Therefore, a common solution is represented by self-supervised training, where the model is used to compute optical flow between two frames, a warping operation is performed to align them according to the estimated flow, and a warping loss is employed to guide the learning procedure (Caballero et al. 2017; Xue et al. 2019; Pan et al. 2020; Son et al. 2021; Paliwal et al. 2021).

The MEMC strategy for motion handling is widely used by video restoration methods and has multiple advantages and disadvantages. Accurate flow map prediction enables accurate alignment, making the process of information extraction and fusion easier because information referring to the same objects in multiple frames is located at the same spatial locations. In addition, self-supervised learning represents an effective training strategy to adapt models to compute optical flow even on frames affected by artifacts, when ground truth flow maps are not available. However, when videos contain luminance changes, fast motion, or occluded objects, the performance of methods based on MEMC alignment may considerably degrade (Savian et al. 2020). Errors in flow map prediction imply errors in frame alignment, introducing new artifacts that damage the entire restoration process (Tassano et al. 2020). Figure 6 shows an example of artifacts introduced by wrong motion estimation. To address this problem, different solutions were proposed (Tassano et al. 2019; Paliwal et al. 2021; Son et al. 2021). Tassano et al. (2019) suggested preprocessing input frames individually using a CNN with the aim of removing part of the artifacts before flow estimation, because optical flow is highly sensitive to noise. In contrast, Paliwal et al. (2021) postprocessed warped frames using residual modules (Zamir et al. 2020) with attention mechanisms (Hu et al. 2018; Woo et al. 2018) to discard artifacts introduced by MEMC errors. Son et al. (2021) provided multiple alignment candidates so that the network can leverage multiple alignment solutions and find the most appropriate one. Another drawback of using MEMC for frame alignment is related to the computational complexity, since the estimation of per-pixel flow maps and the warping operation on high-resolution frames considerably impact the overall complexity (Bovik 2009).

3.2.1.2 Deformable alignment

Deformable convolutions in video restoration were introduced as an alternative strategy to align frames without the need to explicitly compute the optical flow between them (Tian et al. 2020b). Depending on the input, the network can decide the best transformations to apply to obtain aligned features, from which it will extract the information needed to restore the target frame.

Fig. 7 Deformable alignment. Reference features and features from adjacent frames are processed together to estimate position-specific spatial offsets to deform the rigid sampling grid of standard convolutions

Different implementations of deformable alignment exist (Tian et al. 2020b; Wang et al. 2019; Chan et al. 2022; Deng et al. 2020). The general framework is illustrated in Fig. 7. The features extracted from the target frame and those extracted from an adjacent frame are initially fused (e.g., by concatenation) and then processed by a CNN to estimate the deformable offsets that deform the sampling grid of the standard convolution used to process, and consequently align, the features of the adjacent frame. As a result, since deformable convolutions can capture motion cues, the produced features will be spatially aligned with the reference ones.

Tian et al. (2020b) were the first to propose deformable alignment in video restoration, adopting an alternating sequence of regular convolutions for deformable offset estimation and deformable convolutions to perform alignment. Inspired by this work, Wang et al. (2019) developed a deformable alignment module implementing a coarse-to-fine processing approach that propagates the learned offsets from lower levels to upper ones, progressively increasing offset accuracy. A similar solution was later used by Yue et al. (2020). These solutions perform deformable alignment in a pairwise manner, i.e., computing the deformable offsets by taking into account the target frame and only one of its adjacent frames at a time, thus failing to fully exploit temporal correlations among multiple frames. To address this limitation, some methods (Deng et al. 2020; Zhao et al. 2021; Xu et al. 2021) adopted an encoder-decoder architecture to predict deformable offsets by jointly processing the entire stack of frames, better exploiting temporal correlations among frames and also increasing offset prediction accuracy due to the large receptive field of encoder-decoder architectures.

Using deformable alignment instead of MEMC for handling motion brings multiple advantages. While in optical flow only one spatial offset for each spatial location is estimated, deformable convolutions learn multiple and complementary offsets (e.g., nine in the case of a \(3\times 3\) kernel) that can mitigate the problem of occlusions and reduce errors caused by large motion (Chan et al. 2021b). Deformable alignment is also less sensitive to varying illumination and motion conditions than the MEMC approach. Moreover, the module for deformable alignment can be trained together with the restoration framework in an end-to-end manner, without requiring any adaptation as in MEMC. The main issue in using deformable alignment is related to the training process, which may suffer from instability due to offset overflow, degrading the overall performance of the models (Chan et al. 2021b). Chan et al. (2022) tackled this issue by designing a flow-guided deformable alignment scheme, where optical flow is used to guide the deformable alignment. More precisely, they employed optical flow to warp features from the previous frame to the target ones and used them to predict offsets for deformable convolutions.

3.2.1.3 Non-local search

In video restoration, non-local search represents an alignment strategy mainly introduced to obtain a global receptive field, thus overcoming the limitation of convolution operations that perform computations in local areas. The main idea behind this approach is to allow even distant pixels to contribute to the alignment process regardless of motion magnitude. The goal of non-local search is to find pixels within the adjacent frame that are most similar to the ones in the target frame, and use them to perform alignment. Computing pixel similarity between two frames allows to detect region patches belonging to the same objects, whose similarity is expected to be high. Figure 8 shows an example of similar patches in three adjacent frames. Using non-local search, video restoration methods can compute pixel similarity to find matching region patches and combine them to perform frame alignment.

Fig. 8 Example of similar patches in consecutive frames. Using non-local search, video restoration methods can localize matching patches in multiple frames and use them to produce aligned features. Images reprocessed from the BSD dataset (Zhong et al. 2020)

Several methods using non-local search to handle object motion have been proposed (Yi et al. 2019; Xu et al. 2019; Li et al. 2020a; Davy et al. 2019; Vaksman et al. 2021). While some methods (Yi et al. 2019; Xu et al. 2019; Li et al. 2020a) integrate non-local search within their network as a learnable component, others (Davy et al. 2019; Vaksman et al. 2021) employ it to generate aligned frames to use as inputs to their CNNs by adopting a handcrafted procedure. Inspired by non-local networks (Wang et al. 2018), Yi et al. (2019) computed pixel correlation between each pixel of the target frame and all the pixels of adjacent frames, then they generated output pixels by performing a weighted sum of pixels of adjacent frames using correlations as weights. Xu et al. (2019) included non-local search within ConvLSTM modules (Xingjian et al. 2015). They computed the similarity between the pixels of the current frame and all the pixels of the previous frame to generate a similarity matrix, which is later used to update the ConvLSTM outputs. Instead of working at pixel level, Li et al. (2020a) locally selected the top-K patches in the adjacent frames that are most correlated with a given patch in the reference frame. They are sorted according to their similarity, fused using convolutional layers, and used to generate aligned feature maps. Davy et al. (2019) proposed a non-local search to produce aligned feature maps to use as input to their CNN. For each pixel in the target frame, they centered a patch on it and searched for similar patches in the temporal neighborhood. Then, they sorted these patches and created a vector containing the central pixels of each patch. Since the use of only central pixels does not allow to properly consider the spatial dependencies among pixels, thereby limiting the alignment effectiveness, Vaksman et al. (2021) crafted different versions of the target frame by directly aggregating patches from adjacent frames. After finding all the possible overlapping patches of the target frame, they searched for the most similar patches in the adjacent frames for each of them. Then, they created different versions of the target frame by stitching non-overlapping patches together, starting from the most similar ones.

Methods adopting non-local search are less sensitive to motion magnitude, since arbitrarily distant pixels can be involved in the alignment process. Their main drawback is related to the increase in computational complexity caused by the computation of pixel similarity. Some methods (Davy et al. 2019; Vaksman et al. 2021; Li et al. 2020a) addressed this problem by limiting the search area, which becomes a hyperparameter to tune. Instead, Xu et al. (2019) proposed to reduce the frame spatial dimension using pooling operations before computing pixel similarity, at the cost of reduced accuracy.
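To make the patch-matching idea concrete, the brute-force sketch below finds, for every non-overlapping patch of the target frame, the most similar patch in a neighboring frame and rebuilds an aligned frame from the matches; unlike practical methods, the search area is not restricted here, and frame sizes are assumed to be divisible by the patch size.

```python
import torch
import torch.nn.functional as F

def best_matching_patches(target, neighbor, patch=8):
    """For every non-overlapping patch of `target` (C, H, W), find the most
    similar patch in `neighbor` (smallest L2 distance, brute force) and rebuild
    an aligned frame from the matches. H and W must be divisible by `patch`;
    practical methods restrict the search to a local window for efficiency."""
    c, h, w = target.shape
    t_patches = F.unfold(target.unsqueeze(0), patch, stride=patch)   # (1, c*p*p, Nt)
    n_patches = F.unfold(neighbor.unsqueeze(0), patch, stride=1)     # (1, c*p*p, Nn)
    dists = torch.cdist(t_patches[0].T, n_patches[0].T)              # (Nt, Nn) distances
    best = dists.argmin(dim=1)                                       # closest match per patch
    matched = n_patches[0][:, best].unsqueeze(0)                     # gather matched patches
    return F.fold(matched, (h, w), patch, stride=patch).squeeze(0)   # stitch them back
```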

3.2.1.4 Implicit alignment

Methods adopting implicit alignment do not include any specific module for frame alignment, but rely on the capability of the network to learn transformations that make the most of the information shared across frames. The key element in implicit alignment is the receptive field of the network layers, which has to be large enough to cover the possible pixel displacements in order to accurately align frames. Convolutions have a receptive field restricted to the kernel size, which is typically between \(3\times 3\) and \(7\times 7\). A common solution to enlarge the receptive field is to stack convolutions and to use pooling operations. Thus, video restoration methods usually implement encoder-decoder architectures, in which the encoder typically contains pooling operations (Su et al. 2017; Zhou et al. 2019; Wang et al. 2020a; Tassano et al. 2020; Chen et al. 2021), or adopt residual blocks, which contain stacked convolutions (Zhang et al. 2018a; Nah et al. 2019b; Isobe et al. 2020).

Su et al. (2017) demonstrated that state-of-the-art performance could be obtained by using implicit alignment, developing an encoder-decoder architecture to extract and fuse information from multiple frames. Later, Nah et al. (2019b) proposed to combine an encoder-decoder architecture with residual blocks at the bottleneck, and to insert it within RNN cells for both frame restoration and hidden state update. Zhong et al. (2020) used a similar approach, replacing residual blocks with dense blocks and adding attention modules (Hu et al. 2018) for feature reweighting. Tassano et al. (2020) proposed to cascade two encoder-decoders, developing a two-stage architecture to avoid flow-related artifacts. A similar approach was later adopted by Wang et al. (2020a), who used the same two-stage architecture but preceded by an encoder-decoder to restore single frames before the aggregation. Some methods (Jo et al. 2018; Zhang et al. 2018a; Chen et al. 2021) also included 3D convolutions for better motion handling, since these are more suitable to model video data because they can also move along the temporal dimension. Jo et al. (2018) combined 2D and 3D convolutions within dense blocks, developing a dense residual network. Zhang et al. (2018a) adopted only 3D convolutions, integrating them in residual blocks and cascading multiple modules. In contrast, Chen et al. (2021) used an encoder-decoder architecture with 3D convolutions to generate aligned features, while using a parallel network with 2D convolutions to obtain only spatial information from single frames. Zhou et al. (2019) proposed to enrich an encoder-decoder with a Filter Adaptive Convolutional (FAC) module that assigns position-specific weights to regular convolutions, as objects in the scene do not have the same motion and should be treated accordingly.

Using implicit alignment prevents the artifacts related to wrong motion estimation that are typically introduced by methods using the MEMC technique. In addition, it avoids the need to design ad-hoc modules for frame alignment, because the burden of finding suitable frame transformations is entirely left to the network. However, the lack of dedicated mechanisms for alignment might make it difficult to properly align features, especially in the presence of large motion, because of the fixed and limited receptive field of convolutions, which may not have access to a context large enough to properly combine the information coming from the input frames (Chan et al. 2021a). Enlarging the receptive field by stacking convolutions quickly increases the computational complexity, while using pooling operations may remove important details.

3.2.2 Alignment levels

Different alignment techniques can be adopted to align adjacent frames with the target one. These strategies can be applied either directly to input frames or to features extracted from them.

3.2.2.1 Frame level

Alignment at frame level is typically adopted by methods using the MEMC alignment strategy. Indeed, several methods perform alignment by computing optical flow and warping frames before the actual restoration process (Caballero et al. 2017; Xue et al. 2019; Yang et al. 2018; Guan et al. 2019; Pan et al. 2020; Paliwal et al. 2021). The warping operation is directly applied to adjacent frames to align them with the target one for later processing. However, spatial warping introduces information loss on frame details because of the interpolation operation required to handle fractional flow offsets (Chan et al. 2021b). Chan et al. (2021a) showed experimentally that performing alignment at frame level using optical flow may also introduce blurriness and other types of artifacts. Some methods based on the deformable alignment strategy (Deng et al. 2020; Zhao et al. 2021; Xu et al. 2021) apply deformable convolutions to input frames to produce aligned feature maps later used as inputs to their restoration networks. Similarly, some methods (Davy et al. 2019; Vaksman et al. 2021) apply non-local search to input frames to create multiple frame versions to be fed to their restoration networks. The main advantage of performing alignment at frame level is the possibility of using self-supervised training, where alignment can be directly guided via loss functions imposed between aligned and reference frames. Moreover, this approach increases the interpretability of the alignment phase, allowing a straightforward inspection of the results.

3.2.2.2 Feature level

Instead of directly trying to align frames, an alternative solution is to align the features extracted from them. All the methods adopting an implicit alignment strategy perform alignment at feature level by progressively applying feature transformations. Chan et al. (2021a, 2021b) conducted a study on the impact of moving the alignment phase from frame to feature level, showing that the latter improves the performance. This outcome motivated the development of some video restoration methods (Chan et al. 2021a, 2022), which adopt a MEMC alignment strategy where optical flow is estimated and applied to features rather than to frames. Similarly, some methods adopting deformable alignment (Tian et al. 2020b; Wang et al. 2019; Yue et al. 2020) apply deformable convolutions to feature maps instead of frames. In this case, an encoder is used to extract features from the frames before alignment, and deformable convolutions are applied to them. The key advantage of feature alignment is that it leverages the capability of neural networks to learn the most suitable internal representations of the input frames, making the alignment process easier and more accurate. Besides, alignment at feature level makes models more robust to noise (Sun et al. 2018).

3.3 Loss functions

Loss functions are used in training to quantify the error made by the network in the forward pass. Backpropagation is then used to adjust the network weights so that in the following iteration the network makes its outputs closer to the ground truth. In this section, we discuss the main loss functions used to train deep video restoration methods.

3.3.1 Reconstruction loss

The most widely used loss function is the reconstruction loss, which measures the pixel-wise difference between restored and ground truth frames. Common reconstruction loss functions are the L2 loss (Mean Squared Error) and the L1 loss (Mean Absolute Error). The L2 loss is known to produce oversmooth results because of the low weight given to small errors. To alleviate this problem, several methods adopt loss functions based on the L1 loss. Variants of the plain L1 and L2 losses are the Huber loss (Huber 1992), used to make the model less sensitive to outliers, and the Charbonnier loss (Charbonnier et al. 1994), which adds a small constant so that the loss remains smooth and never reaches exactly zero. The main drawback of using a reconstruction loss is that frames are compared without considering any kind of texture awareness, which may lead to perceptually unsatisfying results. Therefore, using a reconstruction loss in combination with other types of loss functions is often preferred (Zhang et al. 2018a; Zhou et al. 2019; Li et al. 2020a; Chen et al. 2021; Paliwal et al. 2021).
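As a reference, a minimal implementation of the Charbonnier loss is given below; the value of the small constant is an illustrative choice.

```python
import torch

def charbonnier_loss(restored, ground_truth, eps=1e-3):
    """Charbonnier loss: a smooth variant of the L1 loss; the small constant
    eps keeps the loss differentiable when the error approaches zero."""
    return torch.sqrt((restored - ground_truth) ** 2 + eps ** 2).mean()
```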

3.3.2 Adversarial loss

In video restoration, applying adversarial learning (Goodfellow et al. 2014) means using the restoration network as a generator and adding a discriminator to judge whether the input frame is real or not. In this way, the generator can be improved by making frames more and more similar to real ones, so that the discriminator will no longer be able to distinguish them. Since the task of the generator is more complex, training typically starts from the generator, and the discriminator is added after a number of iterations (Lucas et al. 2019). The adversarial loss is useful to force the generator to remove artifacts that may still be present in the restored frames. Paliwal et al. (2021) conditioned the discriminator using a gradient-based mask for the identification of textured regions, allowing it to detect high-frequency artifacts in smooth areas and classify them as fake, consequently encouraging the generator to remove them. In general, using only the adversarial loss for training restoration methods leads to training instability (Gulrajani et al. 2017), and the restoration network may produce results substantially different from the desired ones (Mustafa et al. 2022). Consequently, the adversarial loss is often used in combination with the reconstruction loss, requiring hyperparameter optimization of the regularization terms that weight the contribution of each loss (Zhang et al. 2018a; Paliwal et al. 2021).

3.3.3 Perceptual loss

The perceptual loss assesses the semantic difference between two frames and measures visual similarity by comparing frame content at feature level. The features are extracted by a neural network usually trained on other tasks, such as image classification. A common practice is to adopt VGG-based features (Chen and Koltun 2017) using a VGG model (Simonyan and Zisserman 2014). Although the perceptual loss can produce perceptually satisfying results, using it alone may lead to training instability (Blau and Michaeli 2018). Therefore, it is usually used in combination with a reconstruction loss, with the additional cost of assigning a proper regularization term to each component of the total loss (Zhou et al. 2019). Using the perceptual loss adds a computational overhead to the training process, increasing the overall training time and memory requirements.
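A possible implementation of the perceptual loss using a fixed, ImageNet-pretrained VGG-19 from torchvision is sketched below; the layer at which features are compared and the use of the L1 distance are illustrative choices, and inputs are assumed to be normalized as expected by the pretrained model.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

class PerceptualLoss(torch.nn.Module):
    """Compare frames in the feature space of a frozen, ImageNet-pretrained
    VGG-19; `layer` selects the truncation point and is an illustrative choice."""
    def __init__(self, layer=21):
        super().__init__()
        self.features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:layer].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)                 # the VGG is fixed, not trained

    def forward(self, restored, ground_truth):
        # inputs are assumed to be normalized with the ImageNet statistics
        return F.l1_loss(self.features(restored), self.features(ground_truth))
```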

3.3.4 Temporal consistency loss

Temporal consistency is an important feature of video restoration methods because they should restore frames without introducing new temporal distortions, such as flickering. Although temporal consistency can be addressed by leveraging information from multiple frames, it can be further improved with the use of proper loss functions. A temporal consistency loss enforces temporal coherence between consecutive frames by focusing on the temporal domain rather than on the spatial one. Typically, the output of the network at timestep t is compared to the outputs at timesteps \(t-1\) and \(t+1\), which are aligned with it via optical flow estimation. Different implementations of the temporal consistency loss exist (Yue et al. 2020; Lai et al. 2018; Chen et al. 2021). Yue et al. (2020) first restored the frame at timestep t using its adjacent frames at timesteps \(t-1\) and \(t+1\), and then generated two new versions of the restored frame using two redundant noisy shots at timestep t. Finally, they imposed an L1 loss between the restored frame and each of the two generated frames. Lai et al. (2018) proposed a temporal consistency loss based on the warping error between consecutive frames: the output of the network at timestep \(t-1\) is warped to the output at timestep t via optical flow estimation, and the L2 loss is computed between them. Similarly, Chen et al. (2021) used optical flow estimation to warp the previous restored frame to the current restored frame, and did the same for the ground truth frames. Then, they computed the L1 loss on the difference between restored frames and the difference between ground truth frames. The application of a temporal consistency loss is beneficial for video restoration methods because temporal consistency can be explicitly enforced via the loss function and learned during training. When introduced into methods using a multi-frame baseline scheme, the main drawback of a temporal consistency loss is that it requires either redundant computations (Yue et al. 2020) or a modification of the output of the network (Chen et al. 2021), as multiple frames must be restored in a single training iteration. It also needs to be used in combination with the reconstruction loss, requiring a proper regularization term (Yue et al. 2020; Chen et al. 2021). Moreover, the optical flow computation in the temporal consistency loss increases the training time.
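A sketch of the warping-error formulation is given below; it reuses the warp() function sketched in Sect. 3.2.1.1 and assumes the optical flow between the two restored frames has already been estimated.

```python
import torch.nn.functional as F

def temporal_consistency_loss(restored_prev, restored_curr, flow_prev_to_curr):
    """Warping-error formulation: the restored frame at t-1 is warped to time t
    with the estimated optical flow and compared to the restored frame at t.
    `warp` is the motion-compensation sketch from Sect. 3.2.1.1."""
    warped_prev = warp(restored_prev, flow_prev_to_curr)
    return F.mse_loss(warped_prev, restored_curr)
```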

3.3.5 Detail-preserving loss

Restoration methods usually treat low and high frequencies in the same way, consequently producing oversmooth results (Hang et al. 2020). A detail-preserving loss allows restoration methods to improve their capability of recovering details by forcing the details contained within restored and ground truth frames to be the same. To this end, several solutions have been proposed (Li et al. 2020a; Xu et al. 2021; Isobe et al. 2020). Li et al. (2020a) used an edge detector to extract edge information from ground truth frames, generating a mask to highlight edges and force their model to pay more attention to them. Xu et al. (2021) introduced a loss function based on the Fast Fourier Transform (FFT) (Nussbaumer 1981): they computed the FFT on restored and ground truth frames and used L2 loss on both amplitude and phase components. Isobe et al. (2020) extracted high-frequency components on both restored and ground truth frames and computed a Charbonnier loss (Charbonnier et al. 1994) between them. Since the goal of a detail-preserving loss is to improve the detail recovery capability of neural networks, it should be used in combination with other loss functions, thus requiring a regularization term to weight its contribution in the overall loss (Li et al. 2020a; Xu et al. 2021; Isobe et al. 2020).
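The following sketch gives a possible FFT-based formulation in this spirit, computing the L2 distance on the amplitude and phase of the 2D Fourier transforms of restored and ground truth frames; note that the phase term ignores 2π wrapping for simplicity.

```python
import torch

def fft_loss(restored, ground_truth):
    """L2 distance on the amplitude and phase of the 2D Fourier transforms;
    the phase term ignores 2*pi wrapping for simplicity."""
    r_fft = torch.fft.rfft2(restored)
    g_fft = torch.fft.rfft2(ground_truth)
    amplitude = torch.mean((r_fft.abs() - g_fft.abs()) ** 2)
    phase = torch.mean((torch.angle(r_fft) - torch.angle(g_fft)) ** 2)
    return amplitude + phase
```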

3.4 State of the art

Here we summarize the characteristics of the state-of-the-art video restoration methods introduced in the previous sections, according to the hierarchical organization in Fig. 3. Table 2 reports the main features of the architecture used (baseline scheme, design strategy, convolution type), how the methods handle motion (alignment technique and alignment level), and the loss functions used (reconstruction loss, adversarial loss, perceptual loss, temporal consistency loss, detail-preserving loss). For each method based on MEMC, we also report how optical flow is computed and how the warping operation is performed. Besides, we report the number of input frames for the methods based on the multi-frame baseline scheme. Note that some methods in Table 2 have two baseline schemes, which means that they are recurrent methods that, at each timestep, use a stack of frames as done by multi-frame methods.

Table 2 Summary of the state-of-the-art methods

4 Benchmark datasets

Video restoration methods based on deep learning require benchmark datasets both for training and evaluation. Through the years, several datasets have been proposed for the different restoration tasks. We summarize their characteristics in Table 3.

Table 3 Benchmark datasets for video restoration. DB, SR, DN, and CAR respectively mean deblurring, super-resolution, denoising, and compression artifact reduction

Some datasets provide both degraded input and pristine ground truth sequences, while others only provide the pristine ground truth and the degraded input sequences must be synthetically generated. This solution could be feasible for video compression artifact reduction, because the artifacts introduced by using compression algorithms, such as JPEG2000 (Marcellin et al. 2000) or High Efficiency Video Coding (HEVC) (Sze et al. 2014), appear exactly as in the final application. Conversely, for video denoising, deblurring and super-resolution, this solution may not be optimal because the introduced distortions are merely an approximation of the real ones. For instance, artifacts introduced by adding Gaussian white noise are different from the ones derived from real low-light conditions.

Methods trained on synthetically generated approximated artifacts may perform suboptimally when applied to real-world distortions and, hence, creating datasets with realistic distortions is important to ensure the practical applicability of the restoration methods.

4.1 Datasets with real distortions

Creating video datasets containing real distortions, such as noise and blur, is a challenging task because this requires an acquisition system able to capture noisy/blurry and clean frames simultaneously. Different methods were proposed to generate paired datasets with videos affected by real-world artifacts. In the following, we shortly describe how existing datasets were created.

4.1.1 Beam-splitter deblurring (BSD) (Zhong et al. 2020)

The dataset was built using a beam splitter acquisition system with two synchronized cameras. The system could capture pairs of blurred and sharp videos in one shot by controlling the exposure time and the exposure intensity. A center-aligned synchronization scheme was adopted, so that the sharp exposure time lies exactly in the middle of the blurry exposure time. The dataset contains sharp/blurry videos captured at 15 frames per second (FPS) with different exposure times: 1ms-8ms, 2ms-16ms and 3ms-24ms.

4.1.2 Captured raw video denoising (CRVD) (Yue et al. 2020)

The dataset contains RAW videos captured using a surveillance camera at 20 FPS. Since capturing dynamic scenes at low International Organization for Standardization (ISO) sensitivity generates motion blur, sequences containing static objects were recorded, and the objects were manually moved between captures to create object motion. For each static moment, multiple frames were captured, and the ground truth was obtained by averaging them, with the additional application of the Block-Matching and 3D filtering (BM3D) denoising algorithm (Dabov et al. 2007) to remove the remaining noise. Videos were captured using different ISO values, ranging from 1600 to 25600, to cover different levels of noise.

4.1.3 MFQEv2 (Guan et al. 2019)

The dataset is composed of multiple sequences coming from different sources, i.e., Xiph.org, VQEG and JCT-VC (Bossen 2013), containing different contents. The video sequences in this dataset are provided in the YUV domain without compression, and compressed sequences are obtained using the HEVC compression standard (Sze et al. 2014). We called this dataset MFQEv2 to differentiate it from MFQE2.0, which is instead a state-of-the-art method.

4.2 Datasets with synthetic distortions

A common practice to generate video sequences for training video restoration methods is to take the clean video sequences and synthetically add the artifacts to obtain input/output pairs. Several datasets proposed for video restoration contain videos collected either from the web or from datasets for other related tasks, such as quality assessment or segmentation, that are synthetically distorted. In the following, we describe these datasets and the types of artifacts present.

4.2.1 GOPRO (Nah et al. 2017)

The dataset was generated using a camera capturing 240 FPS videos. Based on the idea that a long exposure can be approximated by averaging frames captured with a short exposure (i.e., 1/240 s in the case of 240 FPS videos), each blurred frame is obtained by averaging 7 to 13 sharp frames to produce different blur strengths, and the mid-frame among the averaged frames is considered the ground truth.
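A minimal sketch of this blur synthesis procedure is shown below; note that, unlike this simplified version, a more faithful pipeline would perform the averaging in an approximately linear signal space:

```python
import numpy as np

def synthesize_blur(sharp_frames):
    """Average 7-13 consecutive 240 FPS frames (each of shape (H,W,3),
    uint8) to approximate a long exposure; the central frame is used as
    ground truth. Averaging is done directly on pixel values here."""
    stack = np.stack(sharp_frames).astype(np.float64)
    blurry = stack.mean(axis=0).round().astype(np.uint8)
    ground_truth = sharp_frames[len(sharp_frames) // 2]
    return blurry, ground_truth
```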

4.2.2 Deep Video Deblurring (DVD) (Su et al. 2017)

Since a long exposure can be approximated by accumulating a number of short exposures (Telleen et al. 2007), motion blur at 30 FPS can be obtained by recording videos at 240 FPS, subsampling them every 8 frames, and finally averaging each group of 7 consecutive frames. To use all the frames, optical flow was computed between adjacent high-FPS frames to generate additional frames, which were then averaged. To avoid bias towards a specific device, different devices were used to capture the sequences. In addition, to avoid problems related to noise, all the sequences were recorded in good lighting conditions.

4.2.3 Realistic and Dynamic Scenes (REDS) (Nah et al. 2019a)

Proposed for the New Trends in Image Restoration and Enhancement (NTIRE) 2019 video restoration challenges, the dataset was recorded with a camera at 120 FPS. A CNN-based method (Niklaus et al. 2017) was used to increase the frame rate from 120 to 1920 FPS, and a duty cycle of 0.8 was used to generate blurry frames (from 1920 FPS sharp frames to 24 FPS blurry frames), whereas potential noise and compression artifacts were suppressed by downscaling the original frames. To better mimic the camera imaging pipeline and produce more realistic results, the Camera Response Function (CRF) and inverse CRF were estimated, and the blurry frames are computed in the signal space (obtained by applying the estimated inverse CRF) and converted back to the RGB color space (using the estimated CRF). For another challenge, additional distortions were introduced by compressing the blurry frames using MPEG-4 (Sikora 1997) with quality 60%. Moreover, for video super-resolution, both the sharp and blurry frames were downscaled by a factor of four using bicubic interpolation.

4.2.4 Vimeo90K (Xue et al. 2019)

The dataset is composed of sequences with different contents downloaded from the Vimeo video platform. Since only ground truth sequences are provided, any kind of artifact must be introduced synthetically. The authors of the dataset released the code to add noise, i.e., Gaussian noise and mixed noise (Gaussian + Salt & Pepper) for video denoising, to compress videos using the JPEG2000 algorithm (Marcellin et al. 2000) for video compression artifact reduction, and to reduce the spatial resolution by a factor of four using bicubic interpolation for video super-resolution.
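As an example, a possible implementation of the mixed-noise degradation is sketched below; the noise level and the salt & pepper ratio are hypothetical and may differ from those used in the released code:

```python
import numpy as np

def add_mixed_noise(frame, sigma=15.0, sp_ratio=0.01, rng=None):
    """Add Gaussian noise (std `sigma`) plus salt & pepper noise
    (fraction `sp_ratio` of pixels) to a uint8 frame of shape (H,W,3)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = frame.astype(np.float64) + rng.normal(0.0, sigma, frame.shape)
    mask = rng.random(frame.shape[:2])
    noisy[mask < sp_ratio / 2] = 0.0            # pepper
    noisy[mask > 1.0 - sp_ratio / 2] = 255.0    # salt
    return np.clip(noisy, 0, 255).astype(np.uint8)
```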

4.2.5 Densely Annotated Video Segmentation 2017 (DAVIS) (Pont-Tuset et al. 2017)

Originally proposed for video object segmentation, this dataset is also employed in video restoration, in particular by video denoising methods. No code to add artifacts is provided.

5 Performance evaluation

5.1 Evaluation metrics

Defining common evaluation metrics to assess deep learning methods is important to objectively measure and compare their performance.

Metrics for the evaluation of restoration methods can be: (i) full-reference, which use reference frames; (ii) reduced-reference, which use partial information of reference frames (e.g., features); (iii) no-reference, which do not use any reference. Many metrics have been proposed to assess video quality (Li et al. 2019). Among them, the most common in video restoration are Peak Signal-to-Noise Ratio (PSNR) (Hore and Ziou 2010) and Structural Similarity Index (SSIM) (Wang et al. 2004). In the following, we describe them in more detail and mention other, less frequently used metrics.

5.1.1 Peak signal-to-noise ratio

Peak Signal-to-Noise Ratio (PSNR) (Hore and Ziou 2010) is a full-reference metric used to measure the quality of reconstruction algorithms. It is defined as the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation. PSNR is computed as a function of the Mean Squared Error (MSE). When dealing with images, MSE compares the true pixel values of the original image with those of the degraded one. Given two images I and K of size \(n \times m\), where I is the original image and K is its degraded version, MSE is computed as follows:

$$\begin{aligned} MSE(I,K) = \frac{1}{n \times m} \sum _{i=0}^{n-1} \sum _{j=0}^{m-1} (I_{i,j} - K_{i,j})^2 \end{aligned}$$
(3)

Given MSE between I and K, PSNR is computed as follows:

$$\begin{aligned} PSNR(I,K) = 20\cdot \log _{10}\frac{MAX}{\sqrt{MSE(I,K)}} \end{aligned}$$
(4)

where MAX is the maximum pixel value of the dynamic range of the images, i.e., 255 for 8-bit images. Since MSE measures pixel errors, and low values of MSE imply high values of PSNR, the higher the PSNR, the better. When the compared images are identical, MSE is 0 and PSNR tends towards infinity.
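For clarity, a minimal NumPy implementation of Eqs. (3) and (4) could look as follows (assuming 8-bit images, so MAX = 255):

```python
import numpy as np

def mse(img_true, img_test):
    # Eq. (3): mean squared difference between original and degraded image.
    diff = img_true.astype(np.float64) - img_test.astype(np.float64)
    return np.mean(diff ** 2)

def psnr(img_true, img_test, max_val=255.0):
    # Eq. (4): PSNR in dB; identical images yield infinity.
    err = mse(img_true, img_test)
    return float("inf") if err == 0 else 20.0 * np.log10(max_val / np.sqrt(err))
```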

5.1.2 Structural similarity index

Structural Similarity Index (SSIM) (Wang et al. 2004) is a full-reference metric for measuring the perceptual similarity between two images. SSIM considers image degradation as the perceived change in structural information, relying on the idea that image pixels have strong inter-dependencies, especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene. Instead of relying on traditional error summation, as PSNR does, SSIM models image distortions as a combination of three factors: luminance distortion, contrast distortion and structural distortion. Given two images X and Y of the same size, SSIM is computed as follows:

$$\begin{aligned} SSIM(X,Y) = \frac{(2\mu _x\mu _y + c_1)(2\sigma _{xy} + c_2)}{(\mu ^2_x + \mu ^2_y + c_1)(\sigma ^2_x + \sigma ^2_y + c_2)} \end{aligned}$$
(5)

where \(\mu _x\) and \(\mu _y\) are the average pixel values, \(\sigma ^2_x\) and \(\sigma ^2_y\) are the pixel variances and \(\sigma _{xy}\) is the pixel covariance of X and Y. The constants \(c_1\) and \(c_2\) are used to stabilize the division when the denominator is close to zero. They are respectively computed as \((k_1L)^2\) and \((k_2L)^2\), where L is the dynamic range of pixel values (255 for 8-bit images), and \(k_1=0.01\) and \(k_2=0.03\) by default. SSIM assumes values in the [0, 1] range. Also in this case, the higher the SSIM, the better. When the compared images are identical, SSIM is equal to 1.
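A simplified implementation of Eq. (5) using global image statistics is sketched below; note that the reference implementation of Wang et al. (2004) computes these statistics over local sliding windows (typically Gaussian-weighted) and averages the resulting SSIM map, so the values returned by this sketch are only indicative:

```python
import numpy as np

def ssim_global(x, y, data_range=255.0, k1=0.01, k2=0.03):
    # Eq. (5) computed from global statistics, with c1=(k1*L)^2, c2=(k2*L)^2.
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```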

5.1.3 Other metrics

Zhang et al. (2018b) proposed the Learned Perceptual Image Patch Similarity (LPIPS), a full-reference metric that first uses a pretrained CNN to extract neural features from both degraded and reference frames, and then compares them. The MOtion-tuned Video Integrity Evaluation (MOVIE) index (Seshadrinathan and Bovik 2009) is a full-reference metric that uses a multi-scale framework to evaluate video fidelity, integrating both spatial and temporal aspects of distortion assessment. Soundararajan and Bovik (2012) proposed the Spatio-Temporal Reduced Reference Entropic Differences (STRRED), a reduced-reference metric that computes wavelet coefficients of frame differences modeled as a Gaussian scale mixture, and measures the difference in the amount of spatial and temporal information contained in distorted and reference frames. Lai et al. (2018) proposed the Warping Error (WE), a full-reference metric that evaluates the temporal consistency of enhanced frames: it uses optical flow to estimate pixel motion between two adjacent frames, aligns them according to the estimated flow, and measures the pixel-wise error. Recently, Agarla et al. (2020, 2021) presented a no-reference video quality assessment method based on a CNN that approximates the Mean Opinion Score (MOS) by considering both quality attributes, such as sharpness and noisiness, and the semantics of videos.
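As an example of how such metrics are typically used, the snippet below computes LPIPS with the publicly available lpips Python package; tensor shapes and values are placeholders, and inputs are expected as NCHW tensors in [-1, 1]:

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")              # AlexNet-based variant
restored = torch.rand(1, 3, 256, 256) * 2 - 1  # placeholder frame in [-1, 1]
reference = torch.rand(1, 3, 256, 256) * 2 - 1
distance = loss_fn(restored, reference)        # lower = perceptually more similar
print(distance.item())
```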

5.2 Performance evaluation of the methods

Here we analyze the performance of the state-of-the-art video restoration methods on the different restoration tasks. Tables 4, 5, 6 and 7 respectively report the performance of video denoising, video deblurring, video super-resolution and video compression artifact reduction methods. For each method we report information about the datasets used for training and evaluation, and the performance in terms of PSNR and SSIM as reported in the original papers. We only considered the results obtained on the datasets reported in Sec. 4, even though some methods have also been evaluated on less common or custom datasets (Pan et al. 2017; Maggioni et al. 2012). Note that entries in each table are grouped by method to highlight the source of the reported information. A direct comparison among different methods may not be fair since each of them is potentially trained with different settings (such as the software used for synthetic distortion generation).

Video denoising methods are commonly tested on videos containing additive white Gaussian noise (AWGN) (Xue et al. 2019; Mehta et al. 2021; Chen et al. 2021; Tassano et al. 2019, 2020; Wang et al. 2020a; Vaksman et al. 2021). Some methods (Paliwal et al. 2021; Yue et al. 2020; Chen et al. 2021) are also evaluated on real noisy scenes. Here, video denoising is performed either in the sRGB or in the RAW domain, directly processing the output of camera sensors. Based on the results reported in Table 4, MMNet (Chen et al. 2021) and PaCNet (Vaksman et al. 2021) are the best performing methods in removing AWGN from videos. Concerning the removal of real noise from sRGB frames, MaskDNGAN (Paliwal et al. 2021) can produce better results than RViDeNet (Yue et al. 2020) and MMNet (Chen et al. 2021). In the RAW domain, MaskDNGAN (Paliwal et al. 2021) and RViDeNet (Yue et al. 2020) achieve almost the same denoising performance.

Table 4 Performance of the state-of-the-art denoising methods

The performance of deblurring methods is reported in Table 5. Here the best methods are MB2D (Park et al. 2020) and PVDNet (Son et al. 2021). EDVR (Wang et al. 2019) and DLBRGAN (Zhang et al. 2018a) also achieve competitive performance.

Table 5 Performance of the state-of-the-art deblurring methods

Video super-resolution can be performed using different upscaling factors, i.e., \(\times 2\), \(\times 3\) and \(\times 4\). In Table 6 we only report the performance obtained by video super-resolution methods using the \(\times 4\) upscaling factor, which is the most common one. Two degradation types are usually evaluated: bicubic downscaling (BI), which is performed by downscaling frames using bicubic interpolation, and Gaussian downscaling (BD), which is performed by applying a Gaussian filter (with standard deviation \(\sigma =1.6\)) to frames and then downscaling them using bicubic interpolation. The performance on the Y channel of the YCbCr color space is usually evaluated in addition to the one in RGB. BasicVSR++ (Chan et al. 2022) achieves the best performance on all the considered datasets, color channels, and degradation types, demonstrating its superiority compared to the other methods. It is followed by BasicVSR (Chan et al. 2021a) and EDVR (Wang et al. 2019), which obtain competitive performance.
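To make the two degradation types concrete, the following sketch reproduces them with OpenCV under our own assumptions (kernel size, border handling), which may differ from the exact scripts used by the reviewed methods:

```python
import cv2

def degrade_bi(frame, scale=4):
    # BI: bicubic downscaling by the given factor.
    h, w = frame.shape[:2]
    return cv2.resize(frame, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)

def degrade_bd(frame, scale=4, sigma=1.6):
    # BD: Gaussian blur (sigma=1.6) followed by bicubic downscaling.
    # The 13x13 kernel size is an assumption (roughly 4*sigma on each side).
    blurred = cv2.GaussianBlur(frame, (13, 13), sigma)
    return degrade_bi(blurred, scale)
```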

Table 6 Performance of the state-of-the-art super-resolution methods

Table 7 documents the results obtained by video compression artifact reduction methods in restoring videos compressed using JPEG2000 (Marcellin et al. 2000) and HEVC (Sze et al. 2014). ToFlow (Xue et al. 2019) achieves a higher PSNR than EVRNet (Mehta et al. 2021) in removing compression artifacts introduced by JPEG2000 when the compression is strong (\(q=20\)), while the latter is considerably better when the compression is weaker (i.e., q is higher). The two methods are equivalent in terms of SSIM. The performance on MFQEv2 (Guan et al. 2019) is commonly measured using \(\Delta\)PSNR and \(\Delta\)SSIM: \(\Delta\)PSNR is obtained as PSNR\((\hat{F}, \bar{F})\) - PSNR\((F, \bar{F})\), where \(\hat{F}\) is the enhanced frame, \(\bar{F}\) is the ground truth frame and F is the compressed frame; \(\Delta\)SSIM is computed in a similar way. The higher the \(\Delta\)PSNR and \(\Delta\)SSIM, the better. Moreover, the restoration performance is evaluated on the Y channel of the YUV color space. The best performing method is RFDA (Zhao et al. 2021), which obtains the highest \(\Delta\)PSNR and \(\Delta\)SSIM at every compression level. It is followed by STDF (Deng et al. 2020) and MFQE2.0 (Guan et al. 2019).
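For reference, \(\Delta\)PSNR can be computed as in the following self-contained sketch, which directly follows the definition above:

```python
import numpy as np

def _psnr(reference, image, max_val=255.0):
    err = np.mean((reference.astype(np.float64) - image.astype(np.float64)) ** 2)
    return float("inf") if err == 0 else 20.0 * np.log10(max_val / np.sqrt(err))

def delta_psnr(enhanced, compressed, ground_truth):
    # PSNR(enhanced, gt) - PSNR(compressed, gt): quality gain over the input.
    return _psnr(ground_truth, enhanced) - _psnr(ground_truth, compressed)
```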

Table 7 Performance of the state-of-the-art compression artifact reduction methods

Efficiency is another important criterion for the evaluation of video restoration methods. In Table 8, we report the results using five metrics commonly adopted to evaluate the efficiency of CNNs. Giga operations (GOPs), Giga floating-point operations (GFLOPs) and Giga multiply-accumulate operations (GMACs) measure the number of operations required by a method to process its input. The lower, the better. Runtime reports how many seconds the methods require to restore a frame at a given resolution. All the values are taken from the original papers. Note that the methods may not be directly comparable because these metrics were computed on different devices and using different software, which might produce slightly different results. In addition, runtime is computed using different Graphics Processing Units (GPUs), whose performance changes based on the specific model. For video super-resolution methods we report only the information for the \(\times 4\) upscaling factor. Since the number of operations performed by the methods is positively correlated with the running time, i.e., a higher number of operations implies a higher running time (Bianco et al. 2018), here we comment only on aspects related to the running time.

We take into account high-resolution videos, i.e., videos containing frames whose size is greater than \(1280\times 720\) pixels. Since videos at 30 FPS require a processing time lower than about 0.033 seconds (i.e., 1/30 s) per frame, we can observe that none of these methods achieves real-time restoration even using high-performing GPUs. Based on the results in Table 8, RSDN (Isobe et al. 2020) and BasicVSR (Chan et al. 2021a) are the most efficient methods, approaching real-time restoration performance.
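As an indication of how per-frame runtime is typically measured on a GPU, a possible PyTorch sketch is reported below; the single-frame model interface, input resolution, and warm-up settings are our own assumptions and may differ from the protocols of the original papers:

```python
import time
import torch

@torch.no_grad()
def seconds_per_frame(model, height=720, width=1280, runs=50, device="cuda"):
    """Average per-frame runtime of `model` on random input at the given
    resolution. Assumes a single-frame interface; multi-frame or recurrent
    methods need the corresponding input stack or state."""
    model = model.to(device).eval()
    frame = torch.rand(1, 3, height, width, device=device)
    for _ in range(5):                   # warm-up iterations
        model(frame)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(frame)
    torch.cuda.synchronize()
    return (time.time() - start) / runs  # real time requires < ~0.033 s
```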

EVRNet (Mehta et al. 2021) uses two slightly different models to perform the different tasks: one for super-resolution (first row) and one for denoising and compression artifact reduction (second row). In contrast to the other models, it is very lightweight because it was designed to work on edge devices, such as smartphones. PaCNet (Vaksman et al. 2021) requires about 30 seconds to restore a frame at \(854\times 480\) resolution, even though it has a limited number of parameters. This is due to the preliminary alignment process based on non-local search, which explicitly crafts artificial frames by aggregating similar patches coming from adjacent frames. DVDNet (Tassano et al. 2019) requires about 8 seconds for frames at \(960\times 540\) resolution, where 6 seconds are dedicated to the alignment process performed using MEMC.

Table 8 Efficiency of the state-of-the-art video restoration methods

6 Challenges and future trends

Despite the progress made in video restoration using deep learning, there are still many issues to address. In this section, we point out the main challenges and future trends as emerged from the analysis presented in this paper.

6.1 Real-time restoration

State-of-the-art video restoration methods are characterized by high reconstruction performance. Nevertheless, efficiency still represents an obstacle that makes their application to several real-world problems challenging, especially those requiring real-time computations. Recent methods are typically evaluated on highly performing hardware, such as GPUs, that may not be available in some practical scenarios. Due to the increasing popularity of mobile devices, for example, one may expect to run these models on smartphones and hand-held cameras, which are characterized by limited resources in terms of computational power, memory, and battery consumption. Designing lightweight models able to run on such devices in real time would considerably extend their applicability to real-world problems, and investigations towards this direction are important.

6.2 Improved alignment strategies

The effectiveness of video restoration methods strictly depends on the adopted solution for motion handling. Methods based on optical flow are sensitive to light changes, fast motion, and occluded objects, while methods using implicit alignment are limited by the local receptive field of standard convolutions. Some solutions, such as deformable convolutions, were proposed to address these limitations, but they introduce training instability and increase computational complexity. According to the investigation made by Chan et al. (2021b, 2022), a possible future trend is the exploration of the relationships among existing alignment strategies, with the purpose of developing new solutions that combine all the underlying advantages.

6.3 All-in-one video restoration methods

Most of the video restoration methods proposed during the past few years tackle only one restoration task. Although some methods have been shown to be flexible with respect to different types of distortion (Xue et al. 2019; Wang et al. 2019; Mehta et al. 2021), they have been optimized for only one task at a time. In real-world scenarios, videos may be simultaneously affected by multiple distortions, because artifacts are introduced at different stages of the camera pipeline: for example, noisy videos are often also compressed afterwards. Therefore, designing robust all-in-one methods that can address multiple restoration tasks at the same time, i.e., restoring videos containing multiple distortion types, would extend their applicability to real-world cases. Some methods in this direction have been recently developed (Rota et al. 2022; Katsaros et al. 2021).

6.4 More representative evaluation metrics

Common metrics for the evaluation of video restoration methods are PSNR and SSIM. However, their values are not well correlated with human perception, meaning that high values of these metrics can be obtained even when the results are perceptually unsatisfactory. To this end, several metrics that better correlate with human perception have been proposed, both for image (Zhang et al. 2018b; Kim and Lee 2017; Reisenhofer et al. 2018) and video assessment (Park et al. 2012; Bampis et al. 2018; Agarla et al. 2020, 2021), but currently there is no globally-accepted measure for video restoration. Thus, there is a need to define and converge on an accurate, perception-based metric for the evaluation of restoration results. Temporal consistency is an important aspect of video restoration, but it is usually underestimated and only occasionally evaluated. In most video restoration papers, only metrics applied to each individual frame are used, without taking into account any dependency among frames. It would instead be appropriate to also employ metrics for temporal consistency evaluation, such as STRRED (Soundararajan and Bovik 2012), MOVIE (Seshadrinathan and Bovik 2009) or the Warping Error (Lai et al. 2018).

6.5 Datasets with realistic distortions

Despite the large availability of video datasets for training video restoration methods, the distortions they contain are usually synthetically generated (e.g., noise is typically modeled as additive Gaussian white noise and downscaling degradation is modeled using interpolation methods). Since real-world distortions could have different characteristics with respect to synthetic ones, methods trained on these datasets may underperform when applied to real scenarios. Some datasets with realistic artifacts were proposed (Zhong et al. 2020; Yue et al. 2020), but the difficulty of the collection task largely constrained the acquisition conditions, thereby limiting their potential applicability. Developing complex acquisition systems able to model realistic distortions is a challenge, but could be beneficial to extend the applicability of restoration methods to real-world tasks.

6.6 Combining traditional and deep learning methods

Video restoration methods based on deep learning have three main disadvantages with respect to traditional methods (López-Tapia et al. 2021): (i) they less frequently incorporate domain knowledge, which in turn makes them less robust to videos containing unseen degradations; (ii) they need a large amount of data to learn the non-linear mapping between inputs and outputs, which requires a time-consuming video collection process; (iii) they are less interpretable, which limits their applicability to some sensitive contexts. These problems could potentially be tackled using Deep Unfolding Networks (DUNs), which implement the conventional iterative optimization process of traditional methods using deep neural networks (Gregor and LeCun 2010). Although many works adopting DUNs have been proposed for different image restoration tasks (Dong et al. 2018; Zhang et al. 2020; Gong et al. 2020; Li et al. 2020b; Ren et al. 2021), fewer are designed for the video domain (Chiche et al. 2020; Sun et al. 2021).

7 Conclusions

In this paper, we provided a review of video restoration methods based on deep learning. We selected well-established and recent methods for video restoration, and analyzed in a structured manner their main features related to architectural choices, strategies for motion handling, and loss functions.

For each restoration task we detailed the characteristics of benchmark datasets and classified them based on the types of distortions they contain. Despite the large availability of video datasets, we highlighted that most of them contain synthetic distortions that may differ from real ones, limiting the applicability of video restoration methods.

The main evaluation criteria are also discussed and used to compare the performance of the considered methods, providing an overview of the most promising methods in terms of both effectiveness and efficiency. We noticed that, even though video restoration has made much progress in recent years, current methods cannot yet restore high-resolution frames in real time.

Possible future research directions include the development of methods able to run on resource-limited devices in real time, the study of more robust alignment strategies, the development of methods that address multiple restoration tasks at the same time, the definition of more suitable and globally-accepted metrics for result evaluation, the acquisition of freely available datasets containing real-world distortions, and the combination of traditional and deep learning methods.