Introduction

Understanding changes in climate patterns and their impact on society is of great importance. Rising temperatures, rising sea levels, and the escalating frequency of extreme weather events make many aspects of our society vulnerable. These vulnerabilities extend to our health, urban infrastructure, natural resources, energy systems, and transportation systems (Nicholls and Cazenave 2010; Trenberth 2012; Vandal et al. 2017; Davarpanah et al. 2023). Therefore, local and regional projections of future climate change are essential for conducting risk assessments and designing adaptation plans (Venetsanou et al. 2019; Mahjour et al. 2024). Earth System Models (ESMs) are physics-based numerical models that currently run on massive supercomputers (Vandal et al. 2017). These models simulate Earth's past climate and project future scenarios under changing atmospheric greenhouse gas emissions (Passarella et al. 2022; Vandal et al. 2017).

ESMs, however, typically operate at coarse horizontal resolutions, around 1\(^{\circ }\) to 3\(^{\circ }\), which can lead to inaccuracies in representing crucial physical processes, such as extreme precipitation (Vandal et al. 2017; Schmidt 2010; Passarella et al. 2022; Kharin et al. 2007). Recent progress has allowed global ESMs to operate at higher horizontal resolutions, for example, 0.25\(^{\circ }\) and 0.125\(^{\circ }\), for extended durations (Hurrell et al. 2013; E3SM Project 2018; Hohenegger et al. 2023). This development has improved the simulation of regional average climate conditions and extreme events (Mahajan et al. 2015), but at a high computational cost. To address this computational challenge, downscaling techniques, such as statistical downscaling and dynamical modeling, have been employed in the literature to generate fine-resolution ESM data (Passarella et al. 2022; Vandal et al. 2017).

Statistical downscaling converts coarse-resolution (CR) climate data into fine-resolution (FR) projections by incorporating observational data through statistical methods (Passarella et al. 2022). For spatial downscaling, statistical methods fall into two categories: (i) regression models and (ii) weather classification. Regression models include automatic statistical downscaling (Hessami et al. 2008), Bayesian model averaging (Zhang and Yan 2015), expanded downscaling (Bürger 1996; Bürger and Chen 2005), and bias-corrected spatial disaggregation (BCSD) (Thrasher et al. 2012). Weather classification, on the other hand, includes methods such as nearest-neighbor estimates (Hidalgo et al. 2008) and hierarchical Bayesian inference models (Manor and Berkovic 2015). Recently, regression-based statistical downscaling has been extended to include machine learning techniques, such as neural networks (Park et al. 2022; Oyama et al. 2023; Kumar et al. 2023; Vu et al. 2016), quantile regression neural networks (Cannon 2011), and support vector machines (Ghosh 2010). These techniques demonstrate superior performance compared to conventional statistical downscaling methods (Passarella et al. 2022; Oyama et al. 2023). In addition, computer vision-based statistical downscaling techniques, known as super-resolution, have emerged (Vandal et al. 2017).

These super-resolution methods generalize patterns across fine resolutions and have been shown to learn local-scale patterns more efficiently than other statistical downscaling techniques (Park et al. 2022; Vandal et al. 2017). Vandal et al. (2017) introduced augmented stacked super-resolution convolutional neural networks (SRCNN), which downscaled climate and ESM-based simulations based on observational and topographical data across the continental United States. The method outperformed BCSD, artificial neural networks, Lasso, and SVM in the downscaling task, achieving a bias of 0.022 mm/day, a correlation of 0.914, and an RMSE of 2.529 mm/day. Furthermore, Passarella et al. (2022) proposed a fast SRCNN for ESMs (FSRCNN-ESM) that surpasses augmented stacked SRCNN and FSRCNN models in downscaling surface temperature, surface radiative fluxes, and precipitation in ESM data over North America. The FSRCNN-ESM model demonstrated better \(L_1\) and \(L_2\) error metrics, quantifying the reconstruction skill compared to the stacked FSRCNN, with the majority of samples in the testing dataset having an \(L_1\) error of less than 10% and an \(L_2\) error of less than 1%.

Other studies include the use of super-resolution generative adversarial networks (SRGAN), which enhance the resolution of coarse 100-km climate data by a factor of 50 for wind velocity and 25 for solar irradiance data (Stengel et al. 2020). The semivariograms of the ground truth data exhibit an initial variance jump (nugget) due to cloud cover discontinuities. While bicubic interpolation and CNN super-resolution fail to capture the nugget for solar irradiance data, SRGAN reproduces it by learning small-scale texture and discontinuities (Stengel et al. 2020). Advancements in this domain include physics-informed SRGAN, which integrates climatologically important physical information, such as pressure and topography, into the network to downscale temperature and precipitation data by a factor of 50 (Oyama et al. 2023). Compared to a cumulative distribution function-based downscaling method, physics-informed SRGAN achieved a 3.6 times lower average mean squared error across 12 sites (Oyama et al. 2023). Kumar et al. (2023) employed SRGAN and compared it with other deep learning approaches, such as stacked SRCNN (Vandal et al. 2017), UNET (Serifi et al. 2021), and ConvLSTM (Harilal et al. 2021), for downscaling precipitation data. Their findings highlight the superiority of SRGAN, which achieved a correlation coefficient of 0.8806, while UNET, ConvLSTM, and DeepSD achieved 0.8399, 0.8311, and 0.8037, respectively (Kumar et al. 2023).

The literature shows promising areas where CNN-based deep learning has been employed to downscale ESM data (Passarella et al. 2022; Park et al. 2022; Vandal et al. 2017). However, there is a lack of systematic comparison of state-of-the-art super-resolution techniques on E3SM data covering the entire globe. In this study, we address this shortcoming by exploring five downscaling methods: SRCNN (Dong et al. 2014), FSRCNN-ESM (Passarella et al. 2022), efficient sub-pixel convolutional neural networks (ESPCNN) (Talab et al. 2019), enhanced deep residual networks (EDRN) (Lim et al. 2017), and SRGAN (Ledig et al. 2017). The five models are selected based on a literature review of CNN-based super-resolution algorithms (Passarella et al. 2022; Oyama et al. 2023; Park et al. 2022; Soltanmohammadi and Faroughi 2023). SRCNN is selected because it represents a fundamental version of CNN-based super-resolution (Dong et al. 2014). FSRCNN-ESM is selected because it was shown to achieve higher accuracy for E3SM data downscaling than the stacked SRCNN family (Passarella et al. 2022). ESPCNN is chosen for its lightweight nature and faster training time (Talab et al. 2019). The selection of EDRN (Lim et al. 2017) and SRGAN (Ledig et al. 2017) is motivated by their deeper architectures, allowing us to analyze their performance relative to the previous models. The overarching goal is to employ the vanilla architectures of these selected models and to conduct a comprehensive comparison of their efficacy in downscaling E3SM datasets.

We compare these techniques on a dataset from the Energy Exascale Earth System Model (E3SM) (E3SM Project 2018). Our focus is on key surface variables, including surface temperature (TS), shortwave heat flux (FSNS), and longwave heat flux (FLNS), over the entire globe. We allocate 80% of the sliced first 9 years of ESM simulation data (at monthly intervals) for training, reserving the remaining 20% for validation. Additionally, we perform blind testing on two distinct years outside of the training and validation sets. To facilitate model comparison, we employ evaluation metrics including absolute point error (APE) (Faroughi et al. 2022), peak signal-to-noise ratio (PSNR) (Deng 2018), structural similarity index measure (SSIM) (Dosselmann and Yang 2011), learned perceptual image patch similarity (LPIPS) (Zhang et al. 2018), and mean squared error (MSE).

Energy Exascale Earth System Model (E3SM) data

In this study, we employ the monthly output from a 30-year period of the 1950-control simulation, using the global fine-resolution (0.25\(^{\circ }\)) configuration of the Energy Exascale Earth System Model (E3SM) (E3SM Project 2018). E3SM is selected over other Earth system models because of its open accessibility through the Earth System Grid Federation (ESGF) distributed archives (E3SM Project 2018). The E3SM simulation data undergo bilinear interpolation from their original non-orthogonal cubed-sphere grid to a regular 0.25\(^{\circ }\) \(\times \) 0.25\(^{\circ }\) longitude-latitude grid, resulting in the interpolated model data known as E3SM-FR (Passarella et al. 2022). To generate the corresponding coarse-resolution input data, the fine-resolution data are further interpolated onto a 1\(^\circ \) \(\times \) 1\(^\circ \) grid using a bicubic (BC) method (Passarella et al. 2022). This interpolation removes the fine-scale details present in the original fine-resolution data, creating coarse-resolution profiles. When applying deep learning techniques to gridded E3SM data, each grid point is treated as an image pixel. The E3SM fine-resolution data are organized into an image of 720 \(\times \) 1440 pixels, while the corresponding coarse-resolution data have dimensions of 180 \(\times \) 360 pixels. The global data are then divided into 18 slices; within each slice, the coarse-resolution images measure 60 \(\times \) 60 pixels and the fine-resolution images measure 240 \(\times \) 240 pixels. We select surface temperature, shortwave heat flux, and longwave heat flux to evaluate the super-resolution models. As depicted in Fig. 1, the total number of samples used in the study is 11 (years) \(\times \) 12 (months) \(\times \) 3 (variables) \(\times \) 18 (slices). All three variables are incorporated collectively by normalizing each of them and saving them as an RGB image. This normalization brings the variables to a standardized scale, enabling the use of a multi-channel network during training and thus enhancing the regeneration process. The total dataset is divided into training, validation, and blind testing sets. The training set consists of 80% of the data from the first 9 years, while the remaining 20% is used for validation. For blind testing, we consider two distinct sets corresponding to two randomly selected years (the 15\(^{th}\) and 19\(^{th}\) years), focusing on the surface variables surface temperature, shortwave heat flux, and longwave heat flux.
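For illustration, a minimal sketch of this slicing and channel-stacking step is given below, assuming min-max normalization and NumPy arrays; the placeholder random fields and function names are ours, not part of the E3SM tooling:

```python
import numpy as np

def normalize(x):
    # Min-max scale to [0, 1] so all three variables share a common range
    return (x - x.min()) / (x.max() - x.min())

def slice_globe(field, tile=240):
    # Split a 720 x 1440 global field into 18 tiles of 240 x 240 (3 rows x 6 columns)
    nrow, ncol = field.shape[0] // tile, field.shape[1] // tile
    return [field[i*tile:(i+1)*tile, j*tile:(j+1)*tile]
            for i in range(nrow) for j in range(ncol)]

# Placeholder fields standing in for one month of TS, FSNS, and FLNS
ts, fsns, flns = (np.random.rand(720, 1440) for _ in range(3))
channels = [slice_globe(normalize(v)) for v in (ts, fsns, flns)]
# One 240 x 240 x 3 "RGB" training image per slice
images = [np.stack(trio, axis=-1) for trio in zip(*channels)]
assert len(images) == 18 and images[0].shape == (240, 240, 3)
```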

Fig. 1 Schematic procedure of data pre-processing for the super-resolution technique. Panels (a), (b), and (c) show surface temperature, shortwave heat flux, and longwave heat flux, respectively, for the first month of year 1, obtained from the global fine-resolution configuration of E3SM. The figure indicates the division of each month into 18 distinct sections, as visually represented. Each FR month comprises a matrix of dimensions 720 by 1440; through the division into 18 segments, each section becomes a 240 by 240 matrix, subsequently converted into a 240 by 240 image

Table 1 Comparison of the number of trainable parameters and the training time (in hours) for the algorithms used in this study over 10,000 epochs

Methodologies

This section provides an overview of the five architectures used in our study: super-resolution convolutional neural networks (SRCNN), the fast super-resolution convolutional neural network for Earth system models (FSRCNN-ESM), efficient sub-pixel convolutional neural networks (ESPCNN), enhanced deep residual networks (EDRN) (Lim et al. 2017), and super-resolution generative adversarial networks (SRGAN). The SRCNN model uses bicubic interpolation as input and features a shallow architecture (Soltanmohammadi and Faroughi 2023; Dong et al. 2014). In contrast, both FSRCNN-ESM and ESPCNN use the coarse-resolution images as input (Passarella et al. 2022; Talab et al. 2019). EDRN also takes the coarse-resolution image as input; it employs a deep architecture and incorporates skip connections to retain important features from earlier layers (Soltanmohammadi and Faroughi 2023; Lim et al. 2017). The SRGAN architecture consists of a generator and a discriminator, where the generator produces a fine-resolution image from the coarse-resolution input (Soltanmohammadi and Faroughi 2023; Ledig et al. 2017). The algorithms and their architectures are detailed in Appendix A, and the evaluation metrics used to assess the models' performance are described in Appendix B.
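As a concrete reference point for the shallowest of these models, the following is a minimal PyTorch sketch of an SRCNN-style network; the 9-1-5 kernel sizes and 64-32 filter widths follow Dong et al. (2014) and are not necessarily the exact configuration trained here:

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Sketch of a three-layer SRCNN; kernel sizes and widths follow
    the 9-1-5 configuration of Dong et al. (2014)."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x):
        # x is the bicubic-upsampled coarse input, already at 240 x 240
        return self.net(x)

out = SRCNN()(torch.randn(1, 3, 240, 240))  # -> (1, 3, 240, 240)
```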

Table 2 Comparison of evaluation metrics for fine-resolution data of surface variables generated by different models and bicubic interpolation using the training (Train.) and validation (Vali.) sets

Computational experiments

Training and validation

The training process is carried out on an NVIDIA RTX A6000 GPU with 47.5 GB of dedicated GPU memory (vRAM). The computer used for training is equipped with an Intel(R) Xeon(R) W7-3455 processor and 512 GB of RAM. Detailed information on the number of parameters and the training time for each algorithm can be found in Table 1. The super-resolution generative adversarial network has the most trainable parameters (2,039,939) and the longest training time (41.10X that of SRCNN). This is because the discriminator network in SRGAN compares two RGB images sized \(240\times 240\). The efficient sub-pixel convolutional neural network is the most computationally efficient, with a training time of 0.64X relative to SRCNN, despite having more trainable parameters than SRCNN and FSRCNN-ESM. This is due to the efficient training approach used in ESPCNN, as discussed by Shi et al. (2016).
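Trainable-parameter counts of the kind reported in Table 1 can be reproduced for any PyTorch model with a one-line helper; this sketch assumes the SRCNN class from the earlier example:

```python
import torch.nn as nn

def count_trainable(model: nn.Module) -> int:
    # Sum the element counts of all parameters that receive gradients
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example usage with the SRCNN sketch shown earlier
print(count_trainable(SRCNN()))
```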

Table 2 presents the evaluation metrics for each algorithm on the training set. The results show a clear trend: PSNR increases from bicubic interpolation to EDRN, with SRGAN exhibiting the lowest PSNR values. In contrast, the SSIM values do not show a clear pattern across the dataset, although SRGAN has the lowest SSIM. The mean squared error shows the inverse trend, decreasing from bicubic interpolation to EDRN, with SRGAN having a value slightly higher than that of bicubic interpolation but lower than the other methods. Lower PSNR and SSIM for SRGAN have been reported previously (Ledig et al. 2017); however, these metrics do not fully capture super-resolution quality. While PSNR and SSIM may decrease, the fine-scale resolution of the generated images often improves. Therefore, additional metrics, such as LPIPS, are important for assessing the performance of different algorithms. LPIPS measures perceptual similarity between images and is a more faithful measure of image resolution than PSNR and SSIM. In this regard, LPIPS decreases from 0.297 for bicubic interpolation to 0.234 for SRGAN, indicating that SRGAN generates more accurate fine-resolution images than the other models. Table 2 also shows the evaluation metrics for each algorithm on the validation set. The results are similar to those for the training set, indicating that the models do not overfit the training dataset; we report both sets to give a comprehensive view of model performance across the training and validation phases. Next, we evaluate the performance of each algorithm on the blind testing dataset.
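As an illustration of how these metrics could be computed per image pair, the following sketch assumes fields normalized to [0, 1] and uses the scikit-image and lpips packages; the random arrays stand in for ground truth and prediction:

```python
import numpy as np
import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# gt and pred stand in for a 240 x 240 x 3 ground truth / prediction pair
gt, pred = np.random.rand(240, 240, 3), np.random.rand(240, 240, 3)

mse  = np.mean((gt - pred) ** 2)
psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)

# LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1]
to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
lpips_val = lpips.LPIPS(net='alex')(to_tensor(gt), to_tensor(pred)).item()
```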

Fig. 2 Comparison of the fine-resolution outputs generated by SRCNN, FSRCNN-ESM, and ESPCNN with bicubic interpolation and the ground truth for the zoomed-in cross-section of net longwave flux

Blind testing

SRCNN, FSRCNN-ESM, and ESPCNN

SRCNN takes the bicubic interpolation image as input and generates a fine-resolution image using three convolutional layers. For blind testing, we randomly consider a zoomed-in section of the longwave heat flux for winter (December) from the first blind testing set. The analysis focuses on a zoomed-in cross-section of the longwave heat flux over North America to assess the fine-scale resolution of the generated fine-resolution profile, as depicted in Fig. 2. The results show that SRCNN outperforms bicubic interpolation in terms of evaluation metrics, with an LPIPS of 0.197 compared to 0.209 for bicubic interpolation. Furthermore, SRCNN is capable of reproducing the boundary of the zoomed-in cross-section; however, it struggles to recover the internal structure and strong gradients of the selected cross-section of the longwave heat flux.

FSRCNN-ESM uses the coarse-resolution image as input, unlike SRCNN, which uses the bicubic interpolated image. It is an improved extension of FSRCNN in which an additional SRCNN-like convolutional layer is applied after the deconvolutional step to enhance accuracy. During blind testing, as with SRCNN, we focus on a zoomed-in section of the longwave heat flux for winter (December) from the first blind testing set. Figure 2 provides a zoomed-in cross-section of the longwave heat flux to assess the fine-scale details generated by FSRCNN-ESM. The outcomes indicate that FSRCNN-ESM surpasses bicubic interpolation and SRCNN in terms of LPIPS and is capable of generating boundaries for the zoomed-in longwave heat flux cross-section. However, like SRCNN, it has limitations in reproducing the internal structure and strong gradients of the cross-section.
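A hedged sketch of this design is shown below; the layer widths (d = 56, s = 12, m = 4) are FSRCNN defaults rather than the exact values of Passarella et al. (2022), and only the overall structure (coarse input, deconvolutional upsampling, extra refinement convolution) mirrors the description above:

```python
import torch
import torch.nn as nn

class FSRCNNESMSketch(nn.Module):
    """Sketch of the FSRCNN-ESM idea: FSRCNN-style feature extraction,
    shrinking, mapping, and deconvolution, plus an extra SRCNN-like
    refinement convolution after the upsampling step."""
    def __init__(self, c=3, d=56, s=12, m=4, scale=4):
        super().__init__()
        mapping = []
        for _ in range(m):
            mapping += [nn.Conv2d(s, s, 3, padding=1), nn.PReLU(s)]
        self.body = nn.Sequential(
            nn.Conv2d(c, d, 5, padding=2), nn.PReLU(d),      # feature extraction
            nn.Conv2d(d, s, 1), nn.PReLU(s),                 # shrinking
            *mapping,                                        # non-linear mapping
            nn.Conv2d(s, d, 1), nn.PReLU(d),                 # expanding
            nn.ConvTranspose2d(d, c, 9, stride=scale,        # deconvolution upsampling
                               padding=4, output_padding=scale - 1),
        )
        self.refine = nn.Conv2d(c, c, 5, padding=2)          # extra SRCNN-like layer

    def forward(self, x):
        # x: 60 x 60 coarse tile -> 240 x 240 output
        return self.refine(self.body(x))

out = FSRCNNESMSketch()(torch.randn(1, 3, 60, 60))  # -> (1, 3, 240, 240)
```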

The efficient sub-pixel convolutional neural network, like FSRCNN-ESM, takes a coarse-resolution image as input but uses an efficient sub-pixel convolutional layer for super-resolution. We evaluate ESPCNN using the same cross-section for winter (December) from the first blind testing set, considering the longwave heat flux. The results show that ESPCNN is capable of generating fine-resolution images of E3SM data, similar to SRCNN and FSRCNN-ESM, as shown in Fig. 2. Compared to bicubic interpolation, SRCNN, and FSRCNN-ESM, the boundaries are slightly sharper and the internal structure is somewhat clearer, leading to improvements in LPIPS. However, due to its limited architectural depth, ESPCNN cannot produce an accurate internal structure and fine details in the E3SM data, resulting in blurriness similar to that seen in FSRCNN-ESM and SRCNN.
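The sub-pixel idea can be sketched in a few lines: all convolutions operate on the 60 \(\times \) 60 coarse grid, and a final pixel shuffle rearranges channels into the 240 \(\times \) 240 output, which is what makes training comparatively cheap. The filter widths below follow Shi et al. (2016) and are assumptions:

```python
import torch
import torch.nn as nn

class ESPCNSketch(nn.Module):
    """Minimal ESPCN-style network (after Shi et al. 2016): convolutions run
    at coarse resolution; one sub-pixel shuffle produces the 4x upscaling."""
    def __init__(self, c=3, scale=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c, 64, 5, padding=2), nn.Tanh(),
            nn.Conv2d(64, 32, 3, padding=1), nn.Tanh(),
            nn.Conv2d(32, c * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),  # rearranges channels into a 4x larger grid
        )

    def forward(self, x):
        return self.net(x)

out = ESPCNSketch()(torch.randn(1, 3, 60, 60))  # -> (1, 3, 240, 240)
```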

EDRN

The enhanced deep residual network is a deep residual network designed for the super-resolution task. Figure 3(a) shows the coarse-resolution input, the super-resolution output generated by EDRN, the ground truth images, and the error images for surface temperature, shortwave heat flux, and longwave heat flux during spring (March) from the first set of blind testing data. The results show that EDRN accurately reconstructs fine-resolution images, preserving gradients and closely resembling the internal E3SM data structure, all while maintaining a low absolute point error. EDRN faces the greatest challenge in reconstructing the longwave heat flux, with an LPIPS of 0.238 compared to 0.231 for the shortwave heat flux and 0.179 for the surface temperature. Hence, to assess the fine-scale details, we highlight a zoomed-in cross-section in Fig. 3(b). Compared to bicubic interpolation, EDRN enhances PSNR and SSIM and reduces LPIPS for the zoomed-in cross-section. However, it falls short of fully reconstructing the fine-scale features of the E3SM data, although it excels in generating boundaries with strong gradients thanks to its deeper, more complex architecture compared to SRCNN, FSRCNN-ESM, and ESPCNN.
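The core building block of this architecture can be sketched as follows; the residual scaling factor of 0.1 and the width of 64 feature maps follow the EDSR-style design of Lim et al. (2017) and are not necessarily our trained configuration:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR-style residual block (after Lim et al. 2017): no batch
    normalization, and a small residual scaling factor that stabilizes
    training of deep stacks."""
    def __init__(self, n_feats=64, res_scale=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
        )
        self.res_scale = res_scale

    def forward(self, x):
        # Skip connection keeps early-layer features available to later layers
        return x + self.body(x) * self.res_scale

x = torch.randn(1, 64, 60, 60)
assert ResBlock()(x).shape == x.shape
```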

Fig. 3 Comparison of the EDRN performance in generating fine-resolution E3SM data. Panel (a) compares the fine-resolution surface variables generated by EDRN with the global fine-resolution configuration of E3SM. The top row shows the surface temperature, while the second and third rows show the shortwave heat flux and the longwave heat flux, respectively. For each row, the left column displays the coarse-resolution surface variable, followed by the EDRN-generated fine-resolution surface variable, the ground truth fine-resolution surface variable from E3SM, and the absolute point error between the generated and ground truth fine-resolution surface variables. Panel (b) compares the coarse-resolution, bicubic interpolation, EDRN-generated, and ground truth images of a net longwave flux cross-section

SRGAN

The super-resolution generative adversarial network, with a more complex architecture than EDRN, comprises a generator and a discriminator. Figure 4(a) shows the coarse-resolution input, the super-resolution output generated by SRGAN, the ground truth, and the error images for surface temperature, shortwave heat flux, and longwave heat flux during the summer (August) month of the first blind testing set. The results show that SRGAN can capture boundaries with strong gradients and fine-scale details in internal structures with minimal absolute point error, as is evident in the error images for the surface variables in Fig. 4(a). Moreover, Fig. 4(b) displays a zoomed-in cross-section of the longwave heat flux, similar to that of EDRN, indicating SRGAN's accuracy in capturing boundaries, internal structure, and fine-scale details, despite a low SSIM value. Here, the choice of LPIPS as an evaluation metric for SRGAN is justified because it is designed to capture the perceptual similarity between the ground truth and the generated fine-resolution images, while PSNR and SSIM rely on mean squared error and structural similarity, respectively. For example, the average PSNR value for SRGAN, presented in Table 2 for the training and validation phases and in Table 3 for blind testing, is lower than that of bicubic interpolation. However, there are instances, as shown in Fig. 4(b), where there is a noticeable increase in PSNR.
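A minimal sketch of an SRGAN-style generator loss illustrates why its outputs favor perceptual fidelity over pixel-wise scores; the choice of VGG feature layer, the loss weight, and the omission of ImageNet input normalization are simplifying assumptions, not the exact setup of Ledig et al. (2017):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

# Frozen VGG19 feature extractor (up to relu5_4); ImageNet input
# normalization is omitted here for brevity
vgg = vgg19(weights=VGG19_Weights.DEFAULT).features[:36].eval()
for p in vgg.parameters():
    p.requires_grad = False

def generator_loss(sr, hr, disc_pred, adv_weight=1e-3):
    # Content loss: distance between VGG feature maps rather than raw
    # pixels, which is why SRGAN trades PSNR/SSIM for perceptual quality
    content = nn.functional.mse_loss(vgg(sr), vgg(hr))
    # Adversarial loss: push discriminator scores on SR images toward "real"
    adversarial = nn.functional.binary_cross_entropy_with_logits(
        disc_pred, torch.ones_like(disc_pred))
    return content + adv_weight * adversarial
```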

Fig. 4 Comparison of the SRGAN performance in generating fine-resolution E3SM data. Panel (a) compares the fine-resolution surface variables generated by SRGAN with the global fine-resolution configuration of E3SM. The top row shows the surface temperature, while the second and third rows show the shortwave heat flux and the longwave heat flux, respectively. For each row, the left column displays the coarse-resolution surface variable, followed by the SRGAN-generated fine-resolution surface variable, the ground truth fine-resolution surface variable from E3SM, and the absolute point error between the generated and ground truth fine-resolution surface variables. Panel (b) compares the coarse-resolution, bicubic interpolation, SRGAN-generated, and ground truth images of a net longwave flux cross-section

Model comparison

Table 3 Comparison of evaluation metrics for fine-resolution data of surface variables generated by different models and bicubic interpolation using blind testing set 1 (BT 1) and blind testing set 2 (BT 2)
Fig. 5 A comparison of the various algorithms employed in this study for reconstructing fine-resolution from coarse-resolution images in the blind test, revealing a consistent trend of decreasing learned perceptual image patch similarity values from bicubic interpolation to SRGAN. This trend shows that SRGAN achieves more accurate reconstruction than all other algorithms

Fig. 6 A comparison of the various algorithms employed in this study for reconstructing a fine-resolution cross-section of surface temperature. The trend indicates that SRGAN reconstructs the perceptual features with higher accuracy than all other algorithms

In the final analysis, we compare all five models to assess their effectiveness in generating fine-resolution E3SM data. Figure 5 shows a comparison of the different algorithms for the longwave heat flux in the summer (August) from the first blind testing set. Based on the evaluation metrics, the PSNR and SSIM values increase from bicubic interpolation to EDRN, with SRGAN having the lowest SSIM and a PSNR just above bicubic interpolation. However, SRGAN has the lowest LPIPS value, indicating its superior performance in generating fine-resolution images compared to the other models. Furthermore, Fig. 6 shows a zoomed-in cross-section comparison of all the models for winter (February) in the first blind testing dataset. A similar trend is observed, except that FSRCNN-ESM performs better than ESPCNN in terms of LPIPS and PSNR; the two algorithms have very close values for these evaluation metrics. Additionally, LPIPS drops significantly, from 0.291 for bicubic interpolation to 0.198 for SRGAN, reflecting SRGAN's capability to generate boundaries with strong gradients, internal structures, and fine-scale details of the surface variable.

Fig. 7 A comparison of the mean squared error (MSE) between the fine-resolution predictions and the ground truth along the edge of the white square. The location values represent positions on the white square, starting from the top-left as the initial point and proceeding around the square, with the last of the 800 points ending back at the top-left. Panel (a) illustrates the surface temperature (TS), Panel (b) displays the shortwave heat flux (FSNS), and Panel (c) shows the longwave heat flux (FLNS)

Fig. 8 Violin plots for the evaluation metrics. Panels (a) to (d) depict the peak signal-to-noise ratio, structural similarity index measure, learned perceptual image patch similarity, and mean squared error values, respectively, obtained from all five super-resolution models and bicubic interpolation. The symbol \(\mu \) denotes the mean value

To understand the rationale behind SRGAN's lower SSIM and PSNR values, we present Fig. 7, which displays the mean squared error between the ground truth and the fine-resolution images generated by all of the algorithms in this study along the edges of the white square. The mean squared error for SRGAN slightly surpasses that of bicubic interpolation for surface temperature, but it remains lower than that of all the other algorithms. The shortwave heat flux has the lowest MSE. The longwave heat flux is an exception: there, SRGAN's mean squared error is second best after EDRN. For the majority of the features, SRGAN has a higher mean squared error because its loss function is based on perceptual similarity rather than statistical pixel-wise metrics such as mean squared error. SRGAN therefore generates fine-resolution images that are more perceptually realistic, even though they score worse on PSNR, SSIM, and MSE. Additionally, to analyze the trends in the evaluation metrics, we plot their values as violin plots. Figure 8(a) shows that the high-density distribution of PSNR for bicubic interpolation ranges from 15 to 35 dB. In contrast, the distribution for SRCNN is more compact and lies toward the upper bound, with PSNR values ranging from 18 to 30 dB, indicating a more concentrated PSNR distribution. Furthermore, SRCNN shows a higher mean PSNR of 24.59 dB, compared to 24.13 dB for bicubic interpolation. Similarly, PSNR performance increases from FSRCNN-ESM to EDRN, with a high-density distribution of PSNR values for EDRN ranging from 18 dB to 36 dB and a mean PSNR of 25.52 dB, indicating improved performance compared to the other algorithms. Note that SRGAN's high-density PSNR distribution lies toward the lower bound compared to the other models, leading to the lowest mean value of all the algorithms.

In Fig. 8(b), the structural similarity index measure shows a high-density distribution ranging from 0.69 to 0.98, with a mean value of 0.87 for bicubic interpolation. Applying SRCNN results in a more compact distribution, ranging from 0.73 to 0.93, with a reduced mean value of 0.83. For FSRCNN-ESM, the values range from 0.75 to 0.95, with a mean of 0.87, and for ESPCNN, the values range from 0.75 to 0.97, also with a mean of 0.87. EDRN improves the range, from 0.82 to 0.99, with a corresponding improvement in the mean to 0.90. EDRN shows a compact, high-density distribution of SSIM values toward the upper bound, demonstrating superior performance compared to all other algorithms. As with PSNR, SRGAN's SSIM has a high-density distribution at lower values than the other models, leading to the lowest mean SSIM of all the algorithms. Figure 8(c) shows the LPIPS distribution for each algorithm. The results indicate that the high-density distribution becomes lower and more compact from bicubic interpolation to SRGAN. This trend is also reflected in the mean LPIPS values, which decrease from 0.258 for bicubic interpolation to 0.193 for SRGAN, suggesting that SRGAN outperforms the other algorithms. The LPIPS metric is more informative than PSNR and SSIM here, as it captures perceptual similarity rather than statistical image differences. Figure 8(d) shows the inverse trend of PSNR, i.e., the mean squared error decreases from bicubic interpolation to EDRN, with SRGAN having the lowest mean squared error.
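For reference, Fig. 8-style violin plots can be produced as sketched below; the per-image PSNR samples here are random placeholders loosely centered on the reported means, not the study's data:

```python
import numpy as np
import matplotlib.pyplot as plt

models = ["Bicubic", "SRCNN", "FSRCNN-ESM", "ESPCNN", "EDRN", "SRGAN"]
# Placeholder per-image PSNR samples; means loosely follow the text above
rng = np.random.default_rng(0)
psnr = [rng.normal(mu, 3.0, 200) for mu in (24.13, 24.59, 24.8, 25.0, 25.52, 23.5)]

fig, ax = plt.subplots()
ax.violinplot(psnr, showmeans=True)              # one violin per model
ax.set_xticks(range(1, len(models) + 1), labels=models)
ax.set_ylabel("PSNR (dB)")
plt.show()
```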

Finally, when comparing Tables 2 (training and validation) and 3 (blind testing), we observe that the blind testing dataset yields better results for all evaluation metrics across all models, including bicubic interpolation. To validate the results obtained from the first blind test, we apply all the models to a second blind testing dataset, as shown in Table 3. The results show a pattern similar to that of the first blind test. Therefore, all the models perform better, in terms of metrics, on the blind testing datasets than on the training and validation datasets. This trend could be attributed to the smaller number of data points in the blind testing sets compared to the training and validation datasets. Furthermore, the blind tests show the same trend as the training and validation datasets: EDRN achieves the highest peak signal-to-noise ratio and structural similarity index measure, along with the lowest mean squared error, while SRGAN achieves the lowest learned perceptual image patch similarity.

Our results show that vanilla super-resolution techniques provide useful predictions but may face challenges in predicting fine-resolution features and high-gradient regions because of their purely statistical nature, which does not incorporate the known physical laws behind the data (Oyama et al. 2023; Faroughi et al. 2022). The most effective vanilla super-resolution techniques, e.g., EDRN and SRGAN, provide better predictions in those regions but are computationally expensive, creating a trade-off between accuracy and computational cost. Current research mainly aims to address these challenges by improving the accuracy of super-resolution techniques through incorporating contextual and physical information into the input array (Oyama et al. 2023; Passarella et al. 2022), modifying the architecture, such as the activation functions (Sitzmann et al. 2020; Dupont et al. 2021), and using advanced learning techniques such as model compression (Cheng et al. 2018; Choudhary et al. 2020) and transfer learning (Tan et al. 2018; Zhang et al. 2017) to reduce computational cost. Future research can incorporate all of these advances into a unified framework that offers high accuracy in capturing fine-resolution details at a low computational cost. With such a framework, it will be possible to generate accurate E3SM projections at very fine resolutions, which are necessary to address the complexities posed by climate change and to support adaptation planning and critical decision-making.

Conclusions

We investigated the application of super-resolution deep learning techniques to downscale Energy Exascale Earth System Model (E3SM) data, generating fine-resolution data from their coarse-resolution simulation counterparts. We compared five super-resolution methods: super-resolution convolutional neural networks (SRCNN), the fast super-resolution convolutional neural network for ESMs (FSRCNN-ESM), efficient sub-pixel convolutional neural networks (ESPCNN), enhanced deep residual networks (EDRN), and super-resolution generative adversarial networks (SRGAN). We compared the accuracy of each method in producing fine-resolution data using several statistical criteria, including PSNR, SSIM, and LPIPS. In general, these models provide finer-resolution E3SM data that capture fine-scale features better than bicubic interpolation, although each model showed different performance trends. In terms of PSNR and SSIM, EDRN outperformed all other methods, while SRGAN excelled in LPIPS. The SRCNN technique generates fine-resolution data with low PSNR and SSIM, because SRCNN uses a bicubic image as input instead of a coarse-resolution image and has a shallow architecture compared to the other methods. In contrast, FSRCNN-ESM and ESPCNN, which take coarse-resolution images as input, produce fine-resolution data with better PSNR and SSIM than SRCNN. EDRN, which uses a residual network, accurately reconstructed the internal structure and regenerated strong gradients, as evidenced by PSNR improvements of 1.39 dB and 1.57 dB over bicubic interpolation in the first and second blind testing datasets, respectively; however, it was unable to fully preserve the fine-scale details of the E3SM data. In contrast, SRGAN reproduced both the fine-scale details at boundaries and the internal structure of the E3SM data, but at a higher computational cost (14X that of EDRN and 41X that of SRCNN). Despite the visual closeness of SRGAN's outputs to the ground truth images, the PSNR and SSIM metrics did not reflect its true performance. The LPIPS metric, however, accurately highlighted this important aspect of SRGAN's effectiveness, showing a 25% improvement compared to bicubic interpolation and a 7% improvement compared to EDRN for both blind testing sets. Future research could focus on unified frameworks incorporating physical information, architecture modifications (e.g., activation functions), and advanced learning techniques such as model compression and transfer learning to achieve high accuracy in capturing fine-resolution details at a low computational cost.