1 Introduction

High-resolution (HR) images and videos are highly desirable and strongly in demand for most electronic imaging applications, not only for providing better visualisation but also for extracting additional information. However, HR images are not always available, since high-resolution imaging setups can be expensive, especially given the inherent physical limitations of the sensors, the optics manufacturing technology, the data storage and the sensor’s communication bandwidth. Therefore, it is essential to find an effective image-processing approach that increases the resolution at low cost, without replacing the existing imaging system. To address this challenge, the concept of super-resolution (SR) has been widely pursued. This technique aims to produce a single HR image, or HR video, from a set of successive low-resolution (LR) images captured of the same scene, in order to overcome the limitations and possibly ill-posed conditions of the imaging system [53]. SR has been an active area of research over the last two decades for a variety of applications, such as satellite imaging [10, 21], medical imaging [19, 46], forensic imaging [29, 47] and video surveillance systems [20, 67].

Most SR methods consist of two main parts: (I) image registration and (II) image reconstruction. Image registration estimates the motion between the LR images, while image reconstruction combines the registered images to reconstruct the HR image. In image registration, the motion between the reference image and its neighbouring LR images must be estimated accurately in order to reconstruct the super-resolved image [45, 57]. When the camera is moving and the scene is stationary, global motion occurs. Conversely, when the camera is fixed and the scene is moving, non-global (local) motion occurs. This paper primarily focuses on the first scenario.

1.1 Image registration

Image registration methods operate either in the spatial domain or in the frequency domain. Frequency-domain methods are usually limited to global motion models, whereas spatial-domain methods allow more general motion models. In the frequency domain, Vandewalle et al. [57] presented an algorithm that accurately registers a series of aliased images based on their low-frequency, aliasing-free part. They used a planar motion model to estimate the shift and rotation between the images, particularly for the scenario in which a set of images is captured over a short period of time with small camera motion. Vandewalle’s method performs better than other frequency-domain registration methods, such as those of Marcel et al. [39] and Lucchese and Cortelazzo [36]. Its advantage is that it discards the high-frequency components, where aliasing may have occurred, which makes it more robust. In the spatial domain, Keren et al. [30] developed an iterative planar motion estimation algorithm that uses different down-sampled versions of the images to estimate the shift and rotation parameters based on Taylor series expansions. The goal of this pyramidal scheme is to increase the accuracy of estimating large motion parameters. Keren’s method and Vandewalle’s method have been well accepted for tackling global motion [57]. However, the existing sub-pixel registration methods become inaccurate when the motion is non-global. Several recent approaches deal with general motion estimation in video SR. For example, Liu and Sun [35] used optical flow techniques to register multiple images with sub-pixel accuracy, whereas Liao et al. [34] used an ensemble of optical flow models to reconstruct the original HR frames with rich high-frequency details.
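As a simple illustration of frequency-domain registration, the sketch below estimates a global translation by plain phase correlation. This is not the Vandewalle or Keren algorithm discussed above: it is a minimal example under stated assumptions that recovers integer-pixel shifts only, whereas the cited methods achieve sub-pixel accuracy and also handle rotation.

```python
# A minimal sketch of frequency-domain translational registration via phase
# correlation (illustrative only; not the registration methods cited above).
import numpy as np

def phase_correlation_shift(reference, target):
    """Estimate the integer (dy, dx) such that target ~= reference shifted by (dy, dx)."""
    F_ref = np.fft.fft2(reference)
    F_tgt = np.fft.fft2(target)
    # Normalised cross-power spectrum; its inverse FFT peaks at the shift.
    cross_power = F_tgt * np.conj(F_ref)
    cross_power /= np.abs(cross_power) + 1e-12
    correlation = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Wrap indices beyond half the image size to negative shifts.
    if dy > reference.shape[0] // 2:
        dy -= reference.shape[0]
    if dx > reference.shape[1] // 2:
        dx -= reference.shape[1]
    return dy, dx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((128, 128))
    moved = np.roll(np.roll(img, 3, axis=0), -5, axis=1)  # circular shift by (3, -5)
    print(phase_correlation_shift(img, moved))            # expected: (3, -5)
```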

1.2 Image reconstruction in spatial and frequency domain

Image reconstruction methods can also be classified into frequency-domain-based and spatial-domain-based approaches. The first frequency-domain-based SR approach was proposed by Tsai and Huang [55]; it formulates the system equations relating the HR image to the observed LR images by estimating the relative shifts between a sequence of down-sampled, aliased and noise-free LR images. Kim et al. [31] extended this method with a weighted least-squares solution based on the assumption that the blur and noise characteristics are the same for all LR images. A major advantage of frequency-domain-based SR methods is that they are usually theoretically simple and computationally inexpensive. However, these methods are insufficient for real-world applications, as they are limited to global translational motion and linear space-invariant blur during the image acquisition process.

Among the spatial-domain-based SR approaches, the non-uniform interpolation method [40, 56] is one of the most intuitive and has relatively low computational complexity. However, its degradation model is applicable only if all LR images have the same blur and noise characteristics. The iterative back-projection (IBP) method [22, 43] can accommodate both global translational and rotational motion. However, the solution might not be unique owing to the ill-posed nature of the SR problem, and the selection of some parameters is usually difficult. The projection onto convex sets (POCS) method [17, 42] benefits from an efficient observation model and proper a priori information. Its disadvantages are the lack of a unique solution, a slow convergence rate and a high computational cost. Regularisation-based SR methods include the maximum likelihood (ML) method [54] and the maximum a posteriori (MAP) method [48, 49]. The ML method considers only the relationship between the observed LR images and the original HR image, without a priori information, while the MAP method considers both. An extension of this approach, the hybrid ML/MAP-POCS method [16], was proposed to guarantee a single optimal solution. Spatial-domain-based SR methods can tackle real-world applications better because they accommodate both global and non-global motion models, linear space-variant blur and noise during the image acquisition process.

1.3 Wavelet-based image reconstruction

In addition to the frequency-domain and spatial-domain efforts, work has also been carried out in the wavelet domain. The wavelet-domain-based SR reconstruction approach exploits both the spatial and frequency domains, integrating the properties of both to reconstruct an HR image from the observed LR images. The wavelet transform (WT) is an effective tool that divides an image into low- and high-frequency sub-bands, each of which is examined independently at a resolution matched to its scale [9]. The underlying idea is that image features at different scales can be separated, analysed and manipulated: global features can be examined at coarse scales, while local features can be analysed at fine scales [40]. The attractive properties of the WT, such as locality, multi-resolution and compression, make it effective for analysing real-world signals [7]. The discrete wavelet transform (DWT) has been employed as a powerful tool in many image and video processing applications to isolate and preserve the high-frequency components of an image. The DWT decomposes a given image into one low-frequency sub-band and three high-frequency sub-bands using dilations and translations of a single wavelet function called the mother wavelet [18].
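A minimal sketch of this one-level decomposition is shown below, assuming the third-party PyWavelets (pywt) package is available; the wavelet name 'bior4.4' corresponds to the db.9/7 filters used later in this paper.

```python
# One-level 2-D DWT of an image into four sub-bands, using PyWavelets.
import numpy as np
import pywt

image = np.random.rand(128, 128)  # stand-in for a greyscale LR frame

# LL is the low-frequency approximation; LH, HL and HH carry the horizontal,
# vertical and diagonal high-frequency details, respectively.
LL, (LH, HL, HH) = pywt.dwt2(image, 'bior4.4', mode='periodization')
print(LL.shape, LH.shape, HL.shape, HH.shape)   # each sub-band is 64 x 64

# The inverse DWT recovers the image from the four sub-bands.
reconstructed = pywt.idwt2((LL, (LH, HL, HH)), 'bior4.4', mode='periodization')
print(np.allclose(reconstructed, image))        # True (up to floating point)
```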

One of the challenges in SR is to preserve or recover the true edges of objects while suppressing noise, which is usually difficult to achieve simultaneously with frequency-based methods because edges and noise have similar responses in the frequency bands. The WT offers an alternative way to analyse true edges and noise separately. Wavelet-based SR reconstruction methods therefore essentially manipulate the wavelet coefficients in the sub-bands containing high-frequency spatial information. A common assumption of WT-based methods is that the LR image is the low-pass filtered sub-band produced by the WT of the HR image [52]. The existing literature on WT-based methods covers both the single-frame case and the multi-frame (video) case. For the multi-frame case, Izadpanahi et al. [25] presented an SR technique using the DWT and bicubic interpolation. They applied an illumination enhancement method based on singular value decomposition before the registration of the LR frames to reduce illumination inconsistencies between the frames. Anbarjafari et al. [3] proposed an SR technique for LR video sequences using the DWT and the stationary wavelet transform (SWT). However, these methods have limited performance across varying noise levels, motion levels, wavelet functions, and numbers of frames used.

1.4 Other types of SR approaches

Recently, learning-based SR methods have emerged to further boost the efficiency of SR. These methods consist of two main parts: learning and recovering. In the learning part, a dictionary containing a large number of LR and HR patch pairs is constructed. In the recovering part, the LR frame is divided into overlapping patches, and each patch is matched to its most similar LR patch in the dictionary. The HR frame is obtained by incorporating the corresponding HR patches into the LR frame. Takeda et al. [51] introduced a method based on extending the steering kernel regression framework to 3-D signals for video de-noising, spatiotemporal upscaling and SR, without the need for explicit sub-pixel-accuracy motion estimation; to generate better results, multi-dimensional kernel regression was applied. Yang et al. [65] proposed a sparse-coding method in which LR and HR patch pairs in the dictionary share the same sparse representation, so that the sparse representation of an LR patch can be applied to the HR dictionary to obtain the HR patch. Li et al. [33] introduced an adaptive subpixel-guided auto-regressive (AR) model in which key-frames are up-sampled by sparse regression while non-key-frames are super-resolved by simultaneously exploiting the spatiotemporal correlations. Deep-learning-based SR approaches [6, 8, 13, 14, 27, 28, 37, 38, 50, 59, 61,62,63,64] have been developed in recent years to improve SR results and to better model complex image contents and details. For example, Dong et al. [14] proposed a super-resolution convolutional neural network (SRCNN) to perform sparse reconstruction. Jiang et al. [28] addressed the problem of learning the mapping functions (i.e. projection matrices) by introducing the non-local self-similarity and local geometry priors of the training data for fast SR. However, this type of method is usually computationally costly and requires a large amount of training data.

1.5 Focus of this study

This paper proposes a robust video super-resolution approach based on a combination of the discrete wavelet transform, new edge-directed interpolation and soft-thresholding, for increasing the spatial resolution and recovering the noiseless high-frequency details of noisy LR video frames with global motion. The approach integrates the merits of image registration and reconstruction methods in both the frequency domain and the spatial domain, and is particularly useful when the camera is moving and the observed scene is stationary. One motivation is to provide flexibility across a variety of motion levels, noise levels, wavelet functions and numbers of LR frames used, since existing wavelet-based SR methods have limited performance with respect to these factors and this potential has not yet been fully explored. The performance of the approach is tested on three well-known videos, and its robustness is then evaluated through empirical tests with various motion levels, noise levels, wavelet functions and numbers of frames, factors on which most existing wavelet-based SR methods offer only limited discussion.

2 Methods

2.1 Observation model

The observation model describes the relationship between the original HR image and the observed LR images. The image acquisition process in the spatial domain involves warping, blurring, down-sampling and noise addition to produce the LR images from the HR image, as shown in Fig. 1. The blurring step generates blurred images from the warped HR image based on the point spread function introduced by the camera. The down-sampling step produces down-sampled (aliased) LR images from the warped and blurred HR image, and the noise step represents the additive noise applied to each observed LR image. Let us assume that the HR image can be represented in vector form as \( x={\left[{x}_1,{x}_2,\dots ,{x}_{L_1{N}_1\times {L}_2{N}_2}\right]}^T \), where \( L_1{N}_1\times {L}_2{N}_2 \) is the size of the HR image. Assume that \( L_1 \) and \( L_2 \) represent the down-sampling factors in the horizontal and vertical directions, respectively, and that each LR image has the size \( N_1\times N_2 \). Let the k-th LR image be denoted in vector form by \( {y}_k={\left[{y}_{k,1},{y}_{k,2},\dots ,{y}_{k,{N}_1\times {N}_2}\right]}^T \), k = 1, 2, …, p, where p is the number of LR images. If each LR image is corrupted by additive Gaussian noise, the observation model can be represented as

$$ {y}_k=D{B}_k{M}_kx+{n}_k $$
(1)

where \( M_k \) is the warp matrix of size \( L_1N_1L_2N_2\times L_1N_1L_2N_2 \), \( B_k \) is the camera blur matrix, also of the same size, D is the down-sampling matrix of size \( N_1N_2\times L_1N_1L_2N_2 \), and \( n_k \) represents the \( N_1N_2\times 1 \) noise vector. It is assumed that all LR images have the same blur, and so the matrix \( B_k \) can be substituted by B. These operations can be incorporated into one matrix [41, 67] and can be expressed as

$$ {y}_k= DB{M}_kx+{n}_k={H}_kx+{n}_k $$
(2)
Fig. 1 The observation model
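To make the model concrete, the sketch below simulates Eq. (2) for one frame: scipy's rotate/shift operations stand in for the warp \( M_k \), a Gaussian kernel stands in for the camera blur B, decimation implements D, and Gaussian noise is added. All parameter values are illustrative assumptions, not the settings used in the paper.

```python
# A minimal sketch of the observation model in Eq. (2): each LR frame is a
# warped, blurred, down-sampled and noise-corrupted version of the HR image.
import numpy as np
from scipy import ndimage

def degrade(hr, angle_deg, shift_yx, factor=4, blur_sigma=1.0, noise_sigma=0.01, rng=None):
    """Generate one LR observation y_k = D B M_k x + n_k from an HR image x."""
    rng = np.random.default_rng() if rng is None else rng
    # Warp (M_k): a global rotation followed by a translation.
    warped = ndimage.rotate(hr, angle_deg, reshape=False, order=3, mode='reflect')
    warped = ndimage.shift(warped, shift_yx, order=3, mode='reflect')
    # Blur (B): Gaussian point spread function, assumed identical for all frames.
    blurred = ndimage.gaussian_filter(warped, sigma=blur_sigma)
    # Down-sampling (D) by L1 = L2 = `factor` in both directions.
    lr = blurred[::factor, ::factor]
    # Additive Gaussian noise (n_k).
    return lr + rng.normal(0.0, noise_sigma, lr.shape)

if __name__ == "__main__":
    hr = np.random.rand(512, 512)
    motions = [(0.0, 0.0, 0.0), (1.0, 2.0, -1.5), (-0.5, -2.5, 1.0), (0.8, 1.5, 2.0)]
    lr_frames = [degrade(hr, a, (sy, sx)) for a, sy, sx in motions]
    print(lr_frames[0].shape)   # (128, 128)
```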

2.2 Proposed video super resolution technique

Recovering the missing high-frequency details of the given LR frames is the fundamental goal of video SR methods. The first step is sub-pixel image registration, which estimates the motion parameters between the reference frame and each of the neighbouring LR frames. When the camera is moving and the scene is stationary, global motion, including translation and rotation, occurs. In this work, Keren’s method [30], one of the most accurate spatial-domain methods for sub-pixel image registration, is selected for global motion estimation.

For image reconstruction, conventional interpolation methods (e.g. nearest neighbour, bilinear and bicubic) address the problem of reconstructing an HR image from the available LR image. However, these methods generally yield images with blurred edges and undesirable artefacts because they do not use any information pertinent to the edges in the original image. Therefore, a wavelet-based method is applied to preserve the high-frequency details (i.e. edges) and consequently construct the HR image from the given LR image. In the proposed technique, the discrete wavelet transform (DWT) is employed to isolate and preserve the high-frequency components of the image, and the interpolation is then applied to the high-frequency sub-bands, because interpolating the isolated high-frequency components in these sub-bands preserves more edges than a direct interpolation of the image. A number of DWT-based interpolation methods [2, 11, 12] have been developed to preserve the high-frequency components in the interpolated sub-bands. Nevertheless, the blurring effect of the employed interpolation method causes a potential loss of edges in these sub-bands; for example, the bicubic interpolation method produces blurring around the edges, even though it is well accepted for resolution enhancement. Dual-tree complex wavelet transform (DT-CWT)-based interpolation methods [23, 26] have also been applied to address this problem by using an alternative interpolation method. Jagadeesh and Pragatheeswaran [26] applied edge-directed interpolation (EDI) [1] to the high-frequency sub-bands produced by the DT-CWT. This method was later extended by Izadpanahi and Demirel [23] for video SR. Recently, the same authors applied the new edge-directed interpolation (NEDI) [32] method to better preserve the edges of the interpolated high-frequency sub-bands generated by the DT-CWT for local-motion-based video SR [24]. However, none of these existing wavelet-based methods have tackled the problem of noisy high-frequency details corrupted by the limitations of imaging systems.

In the current work, a combination of the DWT, NEDI and an adaptive threshold is proposed not only to preserve the high-frequency details but also to recover the noiseless high-frequency information. A one-level DWT decomposes the input LR reference frame into four frequency sub-bands (LL, LH, HL and HH) in the frequency domain. The high-frequency sub-bands (LH, HL and HH) are interpolated using the NEDI method with the scale factor α. The block diagram of the proposed video SR technique is shown in Fig. 2, which illustrates how the DWT-NEDI combination with an adaptive threshold, together with the registration and reconstruction methods, produces the estimated high- and low-frequency sub-bands, respectively, for the inverse DWT. Real video sequences are commonly corrupted by noise such as additive Gaussian noise. Therefore, to better preserve the edges and remove potential noise in the estimated high-frequency sub-bands, a thresholding procedure with an adaptive threshold is applied to the produced wavelet coefficients. Many types of thresholding functions have been introduced for modifying estimated wavelet coefficients, such as hard, soft, semisoft and garrote thresholding [5]. This paper employs the soft-thresholding technique proposed by Donoho [15] and extended by Zhang [66].
A universal threshold τ for the considered sub-band can be calculated by

$$ \tau =\sigma \sqrt{2\log (N)/N} $$
(3)

where σ is the standard deviation of the sub-band, and N is the total number of pixels. The nonlinear soft-thresholding function is defined as

$$ {X}_{out}\left(i,j\right)=\left\{\begin{array}{ll}{X}_{in}\left(i,j\right)-\tau & {X}_{in}\left(i,j\right)>\tau \\ {}0& \left|{X}_{in}\left(i,j\right)\right|\le \tau \\ {}{X}_{in}\left(i,j\right)+\tau & {X}_{in}\left(i,j\right)<-\tau \end{array}\right. $$
(4)
Fig. 2 Block diagram of the proposed video SR technique

Equation (3) is chosen considering the prospect of automation in the proposed method and the successful application of this equation in similar studies [66].

The rationale for including this thresholding process is that the energy of a signal is often concentrated in a few coefficients, while the energy of noise is spread among all coefficients in the wavelet domain. Therefore, the nonlinear soft-thresholding tends to keep the few larger coefficients representing the signal and reduces the noise coefficients to zero. The universal threshold is intuitively expected to remove the noise uniformly, since Gaussian noise has the same variance over different scales in the transform domain [66]. In the spatial domain, on the other hand, when the LR frames are precisely registered by Keren’s method, the registered frames can be combined to reconstruct the missing high-frequency information and produce the low-frequency sub-band. In this work, the structure-adaptive normalised convolution (SANC) reconstruction method [44] is applied with half of the scale factor, α/2. This algorithm fuses the irregularly sampled LR frames to recover the high-frequency details and generate the estimated LL sub-band, as the LL sub-band produced by the DWT does not contain any high-frequency information. Finally, the inverse DWT (IDWT) is applied to produce a super-resolved frame by combining the estimated LL sub-band and the processed high-frequency sub-bands.
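The thresholding step of Eqs. (3) and (4) is straightforward to express in code; the minimal sketch below computes the universal threshold from the sub-band itself and applies the nonlinear soft-thresholding to it.

```python
# Direct transcription of Eqs. (3) and (4) for one high-frequency sub-band.
import numpy as np

def universal_threshold(subband):
    """Eq. (3): tau = sigma * sqrt(2 * log(N) / N), computed from the sub-band itself."""
    N = subband.size
    return np.std(subband) * np.sqrt(2.0 * np.log(N) / N)

def soft_threshold(subband, tau):
    """Eq. (4): coefficients below tau are set to zero, the rest shrink towards zero."""
    return np.sign(subband) * np.maximum(np.abs(subband) - tau, 0.0)

# Example: threshold one high-frequency sub-band (random coefficients here).
LH = np.random.randn(64, 64)
LH_hat = soft_threshold(LH, universal_threshold(LH))
print(np.count_nonzero(LH_hat), "of", LH.size, "coefficients survive")
```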

The combination of the DWT with NEDI aims to recover the edge details of the directional high-frequency sub-bands and to decrease the undesirable inter-directional interference in the SR process. This merit cannot be achieved using the NEDI method alone, as indicated in the results section. The application of the soft-thresholding function is based on the hypothesis that the large coefficients in the high-frequency sub-bands reflect the true edges of objects, while the small coefficients reflect the noise, as demonstrated in Fig. 3. Figure 3a shows the image reconstructed from the high-frequency sub-bands only, using the IDWT without thresholding; both true edges and noise can be clearly observed. Figure 3b shows the same reconstruction with the small coefficients removed; the noise is significantly reduced, particularly in the background, while most of the true edges of the human body are preserved. Figure 3c shows the reconstruction with the large coefficients removed; it is dominated by noise, and very little true edge information can be observed. To demonstrate the importance of this process, a region of the HR image produced by the proposed method without the soft-thresholding is shown in Fig. 3d, where the noise can be clearly observed.

Fig. 3 An example to help justify the use of the thresholding process. a The image reconstructed from the high-frequency sub-bands only, without thresholding; b the reconstruction with the small coefficients removed; c the reconstruction with the large coefficients removed; d the HR image produced without thresholding

The proposed technique can be summarised by the following steps (a simplified code sketch is given after the list):

  1. Consider four consecutive frames from the LR video;

  2. Estimate the motion parameters between the reference frame and each of the other LR frames using the global motion estimation algorithm proposed by Keren [30];

  3. Apply a one-level DWT to decompose the input LR reference frame into four frequency sub-bands;

  4. Apply the NEDI method to the LH, HL and HH high-frequency sub-bands with the scale factor of α;

  5. Calculate the threshold τ for each high-frequency sub-band;

  6. Apply the nonlinear soft-thresholding process to each high-frequency sub-band to create the estimated \( \widehat{\mathrm{LH}} \), \( \widehat{\mathrm{HL}} \) and \( \widehat{\mathrm{HH}} \);

  7. In the spatial domain, employ SANC with half the scale factor, α/2, to create the estimated \( \widehat{\mathrm{LL}} \);

  8. Apply the IDWT to \( \left(\widehat{\mathrm{LL}},\widehat{\mathrm{LH}},\widehat{\mathrm{HL}},\widehat{\mathrm{HH}}\right) \) to produce the output super-resolved frame.
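The following is a high-level sketch of steps 1-8 under stated simplifications: the motion parameters are assumed known rather than estimated by Keren's method (step 2), cubic-spline zooming with scipy.ndimage.zoom stands in for NEDI (step 4), and a naive shift-and-average fusion stands in for SANC (step 7). It is intended only to show how the pieces fit together, not to reproduce the paper's Matlab implementation; PyWavelets (pywt), NumPy and SciPy are assumed.

```python
import numpy as np
import pywt
from scipy import ndimage

def threshold_subband(subband):
    """Eqs. (3) and (4) applied to one sub-band (see the earlier sketch)."""
    tau = np.std(subband) * np.sqrt(2.0 * np.log(subband.size) / subband.size)
    return np.sign(subband) * np.maximum(np.abs(subband) - tau, 0.0)

def super_resolve(frames, shifts, alpha=4, wavelet='bior4.4'):
    """frames[0] is the reference LR frame; shifts[k] is its known (dy, dx) offset."""
    reference = frames[0]

    # Steps 3-4: one-level DWT of the reference frame, then interpolation of the
    # three high-frequency sub-bands by the scale factor alpha (cubic-spline zoom
    # here, in place of NEDI).
    LL, (LH, HL, HH) = pywt.dwt2(reference, wavelet, mode='periodization')
    LH, HL, HH = (ndimage.zoom(b, alpha, order=3) for b in (LH, HL, HH))

    # Steps 5-6: adaptive soft-thresholding of each interpolated sub-band.
    LH, HL, HH = (threshold_subband(b) for b in (LH, HL, HH))

    # Step 7 (stand-in): align the LR frames on the reference grid, average them,
    # and resample to the size of the interpolated sub-bands (scale alpha / 2).
    aligned = [ndimage.shift(f, (-dy, -dx), order=3, mode='reflect')
               for f, (dy, dx) in zip(frames, shifts)]
    LL_hat = ndimage.zoom(np.mean(aligned, axis=0), alpha / 2.0, order=3)

    # Step 8: the inverse DWT combines the estimated sub-bands into the HR frame.
    return pywt.idwt2((LL_hat, (LH, HL, HH)), wavelet, mode='periodization')

if __name__ == "__main__":
    lr_frames = [np.random.rand(128, 128) for _ in range(4)]
    known_shifts = [(0.0, 0.0), (0.5, -1.0), (-1.5, 0.5), (1.0, 1.0)]
    print(super_resolve(lr_frames, known_shifts).shape)   # (512, 512)
```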

3 Results

The proposed super-resolution technique was tested on three well-known video sequences, namely “Mother & daughter”, “Akiyo” and “Foreman”. The files were downloaded from the public database Xiph.org. The proposed algorithm and the other methods used for comparison were implemented in Matlab 2015. The original high-resolution test videos were resized to 512 × 512 pixels and are considered the ground truth for evaluating the performance of the proposed approach. The frame rate of the test videos is 30 frames per second and each video sequence has 100 frames. Based on the observation model, the input LR video frames of size 128 × 128 pixels were created as follows: each original HR video frame is (1) blurred by a low-pass filter, (2) down-sampled in both the vertical and the horizontal directions by a scale factor of 1/4, and (3) corrupted by additive white Gaussian noise at a given signal-to-noise ratio (SNR).
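Step (3) specifies the noise through an SNR value in dB. The short sketch below shows one common way to convert such an SNR into a noise standard deviation and add the noise; the exact conversion used in the paper is not stated, so this is an assumption.

```python
# Add white Gaussian noise to an LR frame so that it has a requested SNR (dB).
import numpy as np

def add_awgn(lr_frame, snr_db, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(lr_frame.astype(float) ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return lr_frame + rng.normal(0.0, np.sqrt(noise_power), lr_frame.shape)

noisy_lr = add_awgn(np.random.rand(128, 128), snr_db=30.0)
```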

3.1 Visual and quantitative performance evaluation

This example aims to evaluate the overall performance of the proposed technique with a typical selection of parameters against other methods. Four shifted and rotated LR frames were generated and down-sampled for each original HR frame, and Gaussian noise was then added with an SNR value of 30 dB. The motion vectors were randomly produced with a standard deviation (STD) of 2 for the shift and 1 for the rotation. The wavelet function was chosen as db.9/7.

Figure 4 shows the super-resolved frames produced by the proposed method and the other methods, selected arbitrarily from the video sequences Akiyo, Mother & daughter and Foreman, respectively. The proposed technique produces the best visual quality in terms of preserving the edges and removing the noise and aliasing artefacts in comparison with the other considered methods, and preserves more of the edge information of the original HR video frame without smearing. For example, the edges of the face in Akiyo produced by the proposed method are much cleaner than in the images produced by the other methods; similar visual results have been observed for other edges in Akiyo and for the other tested videos. Additionally, the noise and aliasing artefacts have been removed by the proposed method in comparison with the other methods. For example, the aliasing artefacts on the mother’s shoulders and hands have been substantially removed by the proposed technique, as has the Gaussian noise on the face in Foreman, while this noise and these artefacts are clearly present in the images produced by the other methods. From the motion estimation point of view, the aliased high-frequency components introduced by the down-sampling process appear to have different motions from the low-frequency components, which causes incorrect motion estimation [35]. Moreover, a higher noise level introduces errors into the motion estimation.

Fig. 4 Results of produced HR images using different SR methods for a randomly selected frame of the Akiyo, Mother & daughter and Foreman video sequences, respectively. Column 1: input LR frame; Column 2: bicubic; Column 3: Vandewalle-SANC; Column 4: Keren-SANC; Column 5: the proposed technique

To further investigate the improvement offered by the proposed method, Fig. 5 shows the local PSNR maps for different scenarios of the Akiyo example. The local PSNR map was calculated using a 5 × 5 pixel window. Figure 5a shows the PSNR distribution between the raw HR image and the interpolated HR image produced by the nearest-neighbour method, which indicates the location of the noise introduced by the degrading process. Regions of the human body have more noise (blue regions) introduced by the degrading process, while background regions have less (yellow regions); the blue regions tend to be areas with fine features (such as the boundary of the human body), while the yellow regions tend to contain coarser structures. The corresponding map for the proposed method is shown in Fig. 5b, inspection of which shows that both the background and the fine features have been better recovered. To break down the contribution of each component, Fig. 5c and d show the PSNR gain of the proposed method over NEDI and Keren-SANC, respectively. Fine features are significantly improved in comparison with NEDI, owing to the consideration of adjacent frames, while the improvement in coarse structures is relatively small; conversely, coarse structures are significantly improved in comparison with Keren-SANC, while the improvement in fine features is relatively small. These observations demonstrate that the proposed method improves the quality of both the background and the true edges, whereas the other methods offer only one of these merits.

Fig. 5 An example showing the improvement distribution of the proposed technique. a The distribution of PSNR between the HR image obtained by nearest-neighbour interpolation and the raw image; b the distribution of PSNR between the super-resolved image obtained by the proposed method and the raw image; c the PSNR gain of the proposed method over NEDI; d the PSNR gain of the proposed method over Keren-SANC

For quantitative evaluation of the experimental results, the nearest-neighbour and bicubic interpolation methods, state-of-the-art resolution enhancement methods including NEDI [32], DASR [2], DWT-SWT [11] and DWT-Dif [12], and the state-of-the-art SR methods Keren-SANC and Vandewalle-SANC have been implemented for comparison with the proposed technique. The differences between the super-resolved images from different techniques can be small and are sometimes difficult to inspect visually. In this paper, the peak signal-to-noise ratio (PSNR) between the super-resolved image and the original HR image, one of the most commonly used objective fidelity criteria for evaluating image quality, is used. It is calculated by

$$ PSNR=10{\log}_{10}\left(\frac{L^2}{MSE}\right) $$
(5)

where L is the maximum fluctuation in the image; if the image is represented by 8-bit grayscale, the value of L is 255. MSE represents the mean square error between the super-resolved image \( \widehat{X}\left(i,j\right) \) and the original HR image X(i, j). It is calculated by

$$ MSE=\frac{1}{W\times H}{\sum}_{i=1}^W{\sum}_{j=1}^H{\left[\widehat{X}\left(i,j\right)-X\left(i,j\right)\right]}^2 $$
(6)
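A minimal sketch of Eqs. (5) and (6) for 8-bit greyscale images (L = 255) is shown below; it is a direct transcription of the two formulas rather than part of the reported implementation.

```python
import numpy as np

def psnr(sr, hr, peak=255.0):
    """PSNR in dB between a super-resolved image and the ground-truth HR image."""
    mse = np.mean((sr.astype(float) - hr.astype(float)) ** 2)   # Eq. (6)
    return 10.0 * np.log10(peak ** 2 / mse)                     # Eq. (5)

# Example with two random 8-bit images (purely illustrative values).
a = np.random.randint(0, 256, (512, 512))
b = np.random.randint(0, 256, (512, 512))
print(psnr(a, b))
```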

To complement the quantitative analysis, the structural similarity (SSIM) [58] image quality measure has also been applied. The SSIM index evaluates the visual effect of three characteristics of an image, namely luminance, contrast and structure, and combines the three components as a product. It is defined by

$$ SSIM=\frac{\left(2{\mu}_{\widehat{X}}{\mu}_X+{C}_1\right)\left(2{\sigma}_{\widehat{X}X}+{C}_2\right)}{\left({\mu}_{\widehat{X}}^2+{\mu}_X^2+{C}_1\right)\left({\sigma}_{\widehat{X}}^2+{\sigma}_X^2+{C}_2\right)} $$
(7)

where \( {\mu}_{\widehat{X}},{\mu}_X \) are the local means of the images \( \widehat{X},X \), \( {\sigma}_{\widehat{X}} \), σX are the corresponding standard deviations, \( {\sigma}_{\widehat{X}X} \) is the cross-covariance between \( \widehat{X} \) and X, and C1, C2 are two constants introduced to avoid instability.
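The simplified sketch below evaluates Eq. (7) globally over the whole image rather than over local windows as in the original SSIM definition [58], with the constants chosen as is usual for 8-bit images; in practice a library implementation (e.g. skimage.metrics.structural_similarity) would normally be used instead.

```python
import numpy as np

def ssim_global(sr, hr, peak=255.0):
    """Single-window SSIM following Eq. (7), with the usual constants for 8-bit images."""
    sr, hr = sr.astype(float), hr.astype(float)
    C1, C2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mu_s, mu_h = sr.mean(), hr.mean()
    var_s, var_h = sr.var(), hr.var()
    cov = np.mean((sr - mu_s) * (hr - mu_h))   # cross-covariance term
    return ((2 * mu_s * mu_h + C1) * (2 * cov + C2)) / \
           ((mu_s ** 2 + mu_h ** 2 + C1) * (var_s + var_h + C2))
```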

Table 1 compares the averaged PSNR and SSIM results over 100 frames for the proposed method and the other considered methods on the test videos. The proposed method achieves the highest average PSNR and SSIM values (PSNR of 31.48 dB, 30.57 dB and 23.88 dB; SSIM of 0.90, 0.91 and 0.84) for the three tested videos. This improved performance is due to the fact that the DWT-based SR reconstruction approach is more effective at recovering the high-frequency details of the given LR frames, where the true edges are preserved and the noise is removed thanks to the nonlinear soft-thresholding technique. Additionally, the combination of the DWT and NEDI enables the recovery of the edge details of the directional high-frequency sub-bands and reduces the undesirable inter-directional interference in the SR process. For the Mother & daughter and Akiyo videos, the proposed technique, which is based on the global motion of the entire frame, is well suited and produces better PSNR results than the interpolation and classic SR methods (16% and 11% improvement over Keren-SANC, respectively). For the Foreman video, although the PSNR achieved by the proposed method is higher than that of the other considered methods (17% improvement over Keren-SANC), the performance gain could be further increased by modelling local motion, i.e. dividing each frame into multiple blocks and processing each block individually. Similar observations can be made based on the SSIM values.

Table 1 The averaged PSNR and SSIM values of 100 frames produced from different methods for three tested videos

3.2 Performance for a variety of noise levels

To demonstrate the robustness of the proposed method against noise, which benefits from the adaptive thresholding process, four shifted and rotated LR frames were generated and down-sampled for each original HR frame, with the motion vectors randomly produced with a standard deviation of 2 for the shift and 1 for the rotation. The wavelet function was chosen as db.9/7. The noise level was increased by reducing the SNR from 50 dB to 20 dB in steps of 5 dB. The first 10 frames of the Akiyo video were processed by the proposed method and the other methods, and the results were averaged.

Table 2 compares the averaged PSNR results of the different methods. The last column shows the percentage PSNR improvement of the proposed technique over Keren-SANC. The proposed method consistently performs best at every noise level. Furthermore, the gain is even larger for frames corrupted by high noise levels (15%, 21% and 18% improvement for 30 dB, 25 dB and 20 dB, respectively) than for those with low noise levels (10%, 8%, 5% and 3% improvement for 50 dB, 45 dB, 40 dB and 35 dB, respectively).

Table 2 The averaged PSNR results of 10 frames from the Akiyo test video for each noise level, ranging from 20 dB to 50 dB in 5 dB steps

3.3 Performance for a variety of wavelet functions

The above results were produced using db.9/7, the wavelet function most widely used in image SR applications. This section discusses the behaviour of the proposed approach with other wavelet functions. Our previous research showed that the selection of the wavelet function can affect the performance of SR techniques [60]. In this experiment, the same parameters were chosen except that the noise level was fixed at 30 dB and the wavelet function was varied. Table 3 shows the averaged PSNR and SSIM results of the first 10 frames of the Akiyo video sequence produced by the proposed method using nine wavelet functions: db1, db2, sym16, sym20, coif1, coif2, bior4.4 (db.9/7), bior5.5 and bior6.8. The results show that the proposed technique performs well with wavelet functions other than db.9/7, in some cases even better. Note that bior4.4 is equivalent to db.9/7 [4]. In terms of both PSNR and SSIM values, the five best-performing wavelet functions are sym20, sym16, bior6.8, bior4.4 and coif1, respectively, although the differences between them are not significant.
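For readers reproducing this comparison, the tested families map directly onto standard wavelet names; the snippet below uses the PyWavelets naming convention as an assumption about tooling, while the paper's own implementation is in Matlab.

```python
# Swap the wavelet function used in the one-level decomposition.
import numpy as np
import pywt

candidates = ['db1', 'db2', 'sym16', 'sym20', 'coif1', 'coif2',
              'bior4.4', 'bior5.5', 'bior6.8']   # bior4.4 corresponds to db.9/7
frame = np.random.rand(128, 128)
for name in candidates:
    LL, (LH, HL, HH) = pywt.dwt2(frame, name, mode='periodization')
    print(name, LL.shape)
```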

Table 3 The averaged PSNR and SSIM values of 10 frames produced by the proposed technique for different wavelet functions

3.4 Performance for a variety of motion levels

This section discusses the effect of the motion level (shift and rotation) on the performance of the proposed algorithm. In this experiment, the shifts in the horizontal and vertical directions and the rotation angle were randomly selected, with the standard deviation (STD) varied from 1 to 4 while generating the input LR frames from an HR frame. Four shifted and rotated LR frames were generated and down-sampled for each original HR frame. The wavelet function was chosen as db.9/7, and the noise level was fixed at 30 dB.

The averaged PSNR and SSIM results of the first 10 frames from the Akiyo video produced by the proposed technique with different motion levels are shown in Table 4. The proposed method produces higher PSNR and SSIM values when the motion level is relatively small, because the estimation of a small motion is usually more accurate and leads to a better reconstruction. When the motion level is large, the PSNR and SSIM values drop as expected, because a large motion is more difficult to measure accurately and errors in the motion estimation prevent the original HR frames from being reconstructed correctly. From the smallest to the largest considered motion, decreases of 12% and 7% in the PSNR and SSIM values, respectively, have been observed.

Table 4 The averaged PSNR and SSIM values of 10 frames produced by the proposed technique with different motion levels

3.5 Performance for a variety of numbers of frames

This section evaluates the effect of the number of frames used on the performance. In all previous experiments, the shift and rotation parameters were assumed to be randomly produced. In real applications, however, the camera usually moves in one direction, which means the shifts change monotonically. The shift comprises two components, a horizontal shift and a vertical shift. To simplify the process, in this experiment the rotation angle was randomly selected with a standard deviation of 1, and only the shifts were varied. The shifts were produced according to

$$ {\Delta }_x(i)={\Delta }_y(i)=\frac{i}{N} $$
(8)

where i (i = 1, 2, …, N) denotes the time index of the LR frames and N denotes the total number of frames used. All other parameters are the same as in the previous experiment.
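For concreteness, the monotone shift sequence of Eq. (8) can be generated as follows (illustrative sketch only):

```python
import numpy as np

N = 8                     # total number of LR frames used
i = np.arange(1, N + 1)
dx = dy = i / N           # Eq. (8): monotonically increasing shifts
print(list(zip(dy, dx)))
```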

Table 5 shows the averaged PSNR and SSIM results of the first 10 frames of the Akiyo video sequence produced by the proposed technique using 4, 8, 16 and 32 frames, respectively. As expected, higher PSNR and SSIM values were achieved with a larger number of sampled frames. However, the improvement is only about 1% when the number of frames changes from 4 to 32. This is because the motion model used is simple, so 32 frames contribute very little extra information compared with 4 frames. A larger improvement could be expected when the motion is more complicated or is corrupted by more noise.

Table 5 The averaged PSNR and SSIM values of 10 frames produced by the proposed technique when sampling different numbers of frames

4 Conclusions

A robust video super-resolution reconstruction approach based on combining the discrete wavelet transform, new edge-directed interpolation and nonlinear soft-thresholding has been proposed in this paper for noisy LR video sequences with global motion, in order to recover the noiseless high-frequency details and increase the spatial resolution; the approach integrates properties of image registration and reconstruction methods. Firstly, the iterative planar motion estimation algorithm by Keren is used to estimate the motion parameters between a reference frame and its neighbouring LR frames in the spatial domain, and the registered frames are combined by the SANC reconstruction method to produce the estimated low-frequency sub-band. Secondly, the DWT is employed to decompose each input LR reference frame into four frequency sub-bands in the frequency domain; the NEDI is employed to process each of the three high-frequency sub-bands, which are then filtered using the adaptive thresholding process to preserve the true edges and reduce the noise in the estimated high-frequency sub-bands. Finally, by combining the estimated low-frequency sub-band and the three high-frequency sub-bands, a super-resolved frame is recovered through the inverse DWT process.

Subjective results show that this approach can better preserve the edges and remove potential noise in the estimated high-frequency sub-bands, since a direct interpolation would blur the areas around the edges. Three well-known videos (100 frames each) have been tested, and the quantitative results demonstrate the superior performance of the proposed method. The proposed method achieves the highest averaged PSNR and SSIM values (31.48 dB, 30.57 dB and 23.88 dB for PSNR; 0.90, 0.91 and 0.84 for SSIM) for the three videos, and the averaged improvement over Keren-SANC is 16%, 11% and 17%, respectively. The performance against noise has also been analysed. An analysis of the contribution of each component clearly demonstrates that the proposed method improves the quality of both the background and the true edges, whereas the other methods usually offer only one of these merits.

One of the motivations of this paper is to address the limited performance of most existing wavelet-based SR methods across a variety of motion levels, noise levels, wavelet functions and numbers of frames used, and to analyse empirically how these factors affect the performance of the proposed method. The conclusions are:

  • The proposed technique produced averaged PSNR improvements of 10%, 8%, 5% and 3% for frames corrupted by low-level noise with SNR values of 50 dB, 45 dB, 40 dB and 35 dB, respectively, and improvements of 15%, 21% and 18% for frames corrupted by high-level noise with SNR values of 30 dB, 25 dB and 20 dB, respectively.

  • The proposed technique performs well with wavelet functions other than db.9/7, in some cases even better, although the differences between them are not significant.

  • The performance of the proposed method is affected by the level of motion. From the smallest to the largest considered motion, decreases of 12% and 7% in the PSNR and SSIM values, respectively, have been observed.

  • If the motion is simple, increasing the number of sampled frames gives only a limited improvement owing to the limited extra information; if the motion is complex and corrupted by a high level of noise, a significant improvement is expected when more frames are used.

A limitation of this method is that it can only be applied to video sequences with global motion. However, it can be extended to local motion by dividing the video frame into multiple blocks and then applying this method to each block.