1 Introduction

Climate change causes increasingly unpredictable weather, threatening agriculture, transportation, and human safety. Meteorologists are continually developing methods for accurate and timely weather prediction. Traditional numerical weather prediction (NWP) relies on solving a set of nonlinear equations to produce forecasts. However, NWP faces several challenges. It is highly sensitive to initial conditions: even small differences can significantly affect the results, so accuracy decreases as the prediction horizon lengthens. In addition, the computational cost of solving these equations rises dramatically as the size and complexity of the data increase, resulting in an increasing reliance on supercomputers for NWP [1].

Satellites collect atmospheric data for meteorology, but the high dimensionality of these data makes analysis difficult. In recent years, machine learning and deep learning have been successful in many fields and are therefore being used to efficiently collect, extract, and analyze meaningful data over wide, high-resolution areas. Both traditional machine learning [2, 3] and deep learning-based models [4,5,6,7,8] can improve short- and long-term prediction accuracy over NWP.

Since the 2010s, deep learning-based methods have been widely used for weather forecasting. Such methods can be applied to several specific areas, such as the prediction of climate change [9, 10], air quality [11, 12], and extreme weather conditions, including extreme temperature [13, 14], forest fires [15, 16], flooding [17], cloud-to-ground lightning [18, 19], and typhoons [20, 21].

Big data tasks often involve missing values [22], which may cause loss of efficiency, added complexity in data processing, and bias from inaccurate data [23]. Therefore, in addition to the structure of the model and the size of the datasets, data preprocessing is crucial to realizing the benefits of deep learning, since it helps the model learn from the data and interpret it [24].

Due to technical limitations (e.g., sensor failures) or objective climatic conditions (e.g., cloud cover and atmospheric pollution) [25,26,27,28], satellite sensors are often unable to generate images with high spatial and temporal resolution, resulting in missing values [29]. The absence of these values may reduce data integrity and accuracy and make it difficult to properly identify and analyze surface features, thus having a negative impact on many applications.

A proper interpolation strategy can improve the performance of a model when dealing with imperfect data with missing values, which appear as blank or white dots in the satellite image [23]. Deep learning is widely used for data interpolation because of its ability to efficiently capture complex spatial and temporal patterns. However, these models usually require a large amount of computational resources for training and inference, and the time and computational costs increase significantly when preprocessing large-scale satellite images.

Over the past decades, many traditional interpolation methods have been proposed. Masking missing values in the dataset is the simplest method. Mean values or other dataset-specific values can also replace missing values. Maximum likelihood techniques sample probabilistic models for interpolation. These methods are increasing in popularity and do not require prior experience [30]. In addition, some statistical algorithms, such as Kriging, which was established over fifty years ago, have been frequently used in climate data interpolation and have numerous variations [31,32,33,34].

This study evaluates missing value interpolation methods in data preprocessing for regression weather prediction in continuous datasets. Various interpolation strategies were investigated to improve multichannel weather prediction with a UNet-based deep learning algorithm in a specific region. Ten interpolation methods were analyzed. Model performance is evaluated using seven metrics: mean squared error (MSE), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), percentage rate (PR), cosine similarity, Minkowski distances, and a newly proposed measurement, ‘Shift’, that attempts to minimize the impact of slight movements that may occur after the prediction. Several improvements were made to traditional evaluation metrics to reduce the impact of slight location inaccuracy. To reduce time and avoid large calculations, this study mainly focuses on basic interpolation methods.

2 The problem, dataset, and the prediction framework

2.1 The problem

Many studies have addressed the management of missing values in data preprocessing [35, 36]. However, data mining and interpolation methods are still a challenge. Poorly chosen methods can lead to unnecessary complexity and large distortions in the results, resulting in misleading conclusions [37]. The removal of missing values often biases the learning process and discards important data [29]. The elimination of useful data makes it difficult to draw conclusions [30]. In addition, some solutions, such as using a fuzzy similarity matrix to describe fuzzy relations, focus on discrete values and discretize continuous values before interpolation, which might cause the conversion process to lose valid characteristics [38]. Furthermore, some computing algorithms concentrate on missing conditional qualities rather than values [39].

As the size of datasets grows, information extraction and utilization become more challenging. Deep learning has powerful feature extraction and modeling capabilities, which makes it well suited to data processing. However, deep learning-based methods, as well as some machine learning-based algorithms (e.g., random forest), are inefficient and computationally intensive due to the need to build equations or predictive models for each missing point. This makes information extraction more time-consuming and reduces its efficiency [40]. Therefore, basic interpolation methods may be more suitable than complex methods when dealing with large-scale data, as they can speed up the computational process and improve overall efficiency while maintaining accuracy.

Due to the importance of missing value computation in deep learning-based prediction models, it is worth investigating which interpolation strategy improves model performance the most. This study explores a specific weather prediction challenge and intends to generalize its findings for further studies.

Additionally, when traditional evaluation methods are employed, image displacement, which frequently occurs during data collection, might make prediction distortions appear worse than they are. Therefore, this study applied the newly proposed evaluation strategies to reduce the negative impact of location inaccuracy and make comparisons more accurate.

2.2 The datasets

This study used the IEEE Big Data Competition 2021 weather4cast dataset [41] to investigate weather prediction. The dataset, collected in the Nile area from February 2019 to February 2020, includes four channels of information at each time stamp.

- Cloud Top Temperature (CTT): Obtained from cloud surfaces. The cloud temperature for cloudy regions and the surface temperature for unclouded regions.

- Convective Rainfall Intensity (CRI): Convective rainfall accumulation capacity for hours.

- Probability of occurrence of tropopause folding (ASII-TF): Linked to upper-level frontogenesis and jet stream dynamics, and extreme weather.

- Cloud mask (CMA): 1 for the clouded region and 0 for the unclouded region.

Images are taken every 15 min throughout the year. Each 4 km by 4 km 256 \(\times \) 256 image represents one of these four channels. Longitude, latitude, altitude and “time lead” (time serial number) are also provided. Each day has 96 files (24 h x 4 times per hour) with 8 images: 4 channels and 4 additional features.

Data for about 312 days, comprising 29,866 valid data files, were available for each weather product (CTT/CRI/ASII-TF/CMA), each file containing a 256 by 256 weather array. For several days and times, the data source did not provide full files. There are 27,597 sets of four consecutive time steps, of which 85% were used for training and 15% for validation. A total of 63 sets were randomly and evenly selected for testing.

Each weather product has 1.96 billion valid values. Unprocessed CTT contains approximately 88.8 million missing values (approximately 2974.30 per image), with a value range of 174 K to 343 K; CRI contains 219,136 missing values (approximately 7.34 per image), with a value range of 0 mm/h to 34 mm/h; ASII-TF contains 218,904 missing values (approximately 7.36 per image), with a value range of 0 to 100%; CMA contains 45,059 missing values (approximately 1.51 per image) and has only 2 valid values: 0 for the unclouded region and 1 for the clouded region. During preprocessing, the values of each product are normalized to [0, 1] and missing values are replaced with computed values.

2.3 The prediction framework structure

Each training input covered four time steps with four channels each. At each time step, the [1, 4, 256, 256] array was concatenated with the additional features mentioned in Sect. 2.2, resulting in a [1, 8, 256, 256] array. The final input was a [4*8, 256, 256] array. The model outputs a [1, 4, 256, 256] array containing the four channels, without the additional features, at a single time step, i.e., the weather at the next time step. Iterative one-step predictions were made for the [32, 256, 256] test inputs. The final output was an [n, 4, 256, 256] array covering the next n time steps of the four channels in chronological order; n is 6 in this study. The following sections provide further information. Figure 1 illustrates the structure of this work.
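The tensor shapes described above can be illustrated with a short NumPy sketch. It is not the code used in this study: the names `build_input`, `rollout`, and `persistence` are hypothetical, the additional features are assumed to be identical at every time step, and a trivial persistence model stands in for the trained network.

```python
import numpy as np

def build_input(history, extras):
    """Stack 4 time steps into one [32, H, W] input.

    history: [4, 4, H, W] array (time step, weather product).
    extras:  [4, H, W] array of additional features, assumed constant over time.
    """
    frames = [np.concatenate([history[t], extras], axis=0)  # [8, H, W] per time step
              for t in range(history.shape[0])]
    return np.concatenate(frames, axis=0)                   # [4*8, H, W]

def rollout(model, history, extras, n_steps=6):
    """Iterative one-step prediction; returns an [n_steps, 4, H, W] array."""
    history = history.copy()
    outputs = []
    for _ in range(n_steps):
        x = build_input(history, extras)[None]   # add batch axis: [1, 32, H, W]
        y = model(x)[0]                          # next-step products: [4, H, W]
        outputs.append(y)
        # Slide the time window: drop the oldest step, append the prediction.
        history = np.concatenate([history[1:], y[None]], axis=0)
    return np.stack(outputs)

def persistence(x):
    """Placeholder model: repeat the 4 products of the most recent time step."""
    return x[:, -8:-4]
```

With `n_steps=6`, `rollout` returns an array of shape [6, 4, H, W], matching the output shape described in the text.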

Fig. 1

The structure of preprocessing, training, and evaluation. Each input is a matrix [n, 8*4, 256, 256] (n is the batch size, 8 is the number of weather attribute channels, and 4 is the number of time steps), and each output is a matrix [x, 4, 256, 256] (x is the number of time steps after the input that the model predicts, and 4 is the number of weather products)

Missing values are generally classified into three types [42, 43]: missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR). A detailed analysis of the data shows that the missing values in this study are MCAR: they are random and independent and are not affected by observed or missing data. An appropriate missing value computation method can complete datasets, enhance data analysis, and improve the performance of weather forecasts.

2.4 Interpolation methods

Efficiency is the main focus of this study. For this big data task, the number of missing values was high. For CTT, each image had 2972.50 missing pixels on average. On a relatively high-performance computer, some traditional interpolation methods (e.g., singular value decomposition) take more than 24 h to finish an epoch and 2 to 3 months to complete training. To avoid such time-consuming processes, most of the methods considered here are low-complexity and involve minimal computation.

Table 1 shows 14 strategies, using 10 algorithms with different parameters. Samples of traditional interpolation methods were chosen. Standard and successful machine learning methods were also used. In addition, this study proposed a computing strategy for gradually filling in missing values from outside to inside to interpolate missing points with as much relevant data as possible. Each method emphasizes different data features with various data distributions.

The methods were classified as follows: masking missing values (the default method provided on the data source), filling constant values, filling with basic statistics (like the mean, median, etc.), filling with common machine learning interpolation methods (like KNN), and the newly proposed method mentioned above.

This section describes the interpolation methods used in this study. Strategies are abbreviated in parentheses for clarity. Table 1 shows a brief list of the abbreviated interpolation methods, and each method is explained in detail below.

Table 1 Ten interpolation methods and 14 strategies

2.4.1 Default method: mask missing values

The default way to handle missing values was provided by the data source (referred to as ‘Mask’). It is also the default method used in the original study [41]. White noise spots show where missing values are on the topographic map of the source picture. Each image file has two arrays of the same size: a data array that is used for training and testing, and a Boolean array that indicates whether each point is missing. Based on the information in the second array, the missing points in the first array are filled with a default number and ignored by the next stage of the procedure.

2.4.2 Constant value

The constant value (referred to as ‘Value X’) interpolates all missing values with a specific value. As the data distribution range after preprocessing was between 0 and 1, two X values were used: 0 as the minimum valid value and 1 as the maximum valid value. Those two values were treated as two baselines to approximate extreme data distributions. This study assumed such an interpolation may help to maintain the overall stability of the dataset, especially when missing data are considered absent or inapplicable.
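As a minimal sketch of this strategy, the function below fills every flagged pixel with one constant. It assumes the data array and a Boolean missing-value array of the same shape are already loaded; the name `fill_constant` is hypothetical.

```python
import numpy as np

def fill_constant(values, missing, fill_value):
    """Return a copy of `values` with every pixel flagged in `missing` set to `fill_value`."""
    filled = values.astype(float, copy=True)   # leave the input array untouched
    filled[missing] = fill_value
    return filled

# 'Value 0' and 'Value 1' correspond to fill_value=0.0 and fill_value=1.0
# on data that has been normalized to [0, 1].
```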

2.4.3 Statistic value

Four basic, standard statistical values were used for interpolation, which is the classical choice based on the overall data distribution [44,45,46]. Such simple methods usually provide efficient interpolation and are particularly useful when missing data are randomly distributed and the dataset is large. The four values are the mean (referred to as ‘Mean’), median (referred to as ‘Median’), and mode (referred to as ‘Mode’) of the valid values in the image, as well as an adjusted method: the mean value of a limited region of N * N pixels (referred to as ‘Mean/N’).

The last method fills the missing point [x, y] with the mean of the valid values between \([x-\frac{N}{2}, y- \frac{N}{2}]\) and \([x+\frac{N}{2}, y+\frac{N}{2}]\) (when N is even) or between \([x- \frac{N - 1}{2}, y - \frac{N - 1}{2}]\) and \([x + \frac{N + 1}{2}, y + \frac{N + 1}{2}]\) (when N is odd). The level of detail in the input data is not uniform: some regions are smooth over a large range, while others contain more detailed features. The missing values are also unevenly distributed: some areas have only one isolated missing point, while others contain large missing blocks.

Thus, three values of N (5, 10, and 15) were used. The small 5 \(*\) 5 square captures small localized variations in the image and provides more detailed information. Setting N to 5 cannot guarantee enough contextual information, but it is less computationally expensive and can capture local details. The medium 10 \(*\) 10 square balances capturing localized features against computational cost. The large 15 \(*\) 15 square suits wide, smooth areas and data with large missing regions. Setting N to 15 can handle large missing regions and provide smoother interpolation results, but it requires more computation.

Although the global statistics are very efficient and fast enough for big data tasks, they can produce large deviations when the spatial and temporal gaps are too large [47, 48]. For this reason, this study also implemented interpolation with the mean value of a small region, and three region sizes (5 * 5, 10 * 10, and 15 * 15 pixels) were tried to find the optimal solution.
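The ‘Mean/N’ strategy could be sketched as follows. This is an illustrative implementation rather than the code used in the study: the window is treated as the half-open range from x - floor(N/2) to x + ceil(N/2), which yields exactly N pixels per side for both even and odd N, and falling back to the global mean when a window contains no valid pixel is an added assumption. The name `fill_local_mean` is hypothetical.

```python
import numpy as np

def fill_local_mean(values, missing, n):
    """Fill each missing pixel with the mean of the valid pixels in an n x n window."""
    filled = values.astype(float, copy=True)
    valid = ~missing
    global_mean = values[valid].mean()        # fallback for windows without valid data
    lo, hi = n // 2, n - n // 2               # window rows: [x - lo, x + hi)
    h, w = values.shape
    for x, y in zip(*np.nonzero(missing)):
        r0, r1 = max(x - lo, 0), min(x + hi, h)   # clip the window at the image border
        c0, c1 = max(y - lo, 0), min(y + hi, w)
        window_valid = valid[r0:r1, c0:c1]
        if window_valid.any():
            filled[x, y] = values[r0:r1, c0:c1][window_valid].mean()
        else:
            filled[x, y] = global_mean
    return filled
```

The global ‘Mean’ strategy is the limiting case in which the window covers the whole image.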

2.4.4 Classic interpolation algorithms

The linear, K-nearest neighbors (KNN), and inverse distance weighted (IDW) algorithms compute missing values by interpolating missing points from neighboring valid values. They are typical methods that have proven effective for years, and they complete the interpolation process quickly enough to meet the timeliness requirements of big data.

  • Linear interpolation (referred to as ‘Linear’) replaces isolated missing points by averaging the four closest valid values in the top, bottom, right and left directions. When a missing point is connected to other missing points in a certain direction, the two valid values found in this direction in the first stage should be weighted before calculation, according to the straight-line distance between each detected valid point and the missing point. This interpolation method is simple, highly efficient, and easy to implement. It guarantees continuity of the interpolation results within the region, which is consistent with the natural continuity of geographic data such as temperature or the quantity of rainfall [49]. Thus, linear interpolation has been widely used in recent years, especially for weather data [50,51,52,53].

  • KNN (referred to as ‘KNN/K’) is a machine learning algorithm that can be used for interpolation. It interpolates based on the similarity of the data and is particularly useful when spatial interrelationships between data points are critical to the interpolation results. It is a popular interpolation method in weather prediction, since it is intuitive, requires no parametric assumptions, and is flexible enough to adapt to the local distribution of geographic data [54,55,56,57,58]. KNN computation begins with choosing a reasonable K. All distances are then calculated using Eq. (1), and the K pixels closest to the missing location in the feature space are selected. Measuring the distance between points requires a considerable number of calculations, but the cost is acceptable [59].

    $$\begin{aligned} d(x_i, x_j) = \left( \sum _{l=1}^{n}|x_{i}^{(l)}-x_{j}^{(l)}|^p\right) ^{\frac{1}{p}} \end{aligned}$$
    (1)

    For images with simpler textures, a smaller K value ensures that the clustering centers represent the main color or texture features in the image. For images containing complex details, larger K values can capture richer and more detailed features. Since it is too time-consuming and computationally intensive to run the interpolation test for every K value separately, this study computed the silhouette coefficient on several sets of training data. The results showed that smaller values of K (from 3 to 8) have higher silhouette coefficients. Thus, this study chose K = 5 for interpolation. To avoid the risk of losing richer and more complex details, K = 15 was also chosen.

  • IDW (referred to as ‘IDW’) follows Tobler’s first law of geography [30]. It is based on an inverse-distance-weighted average, in which data points closer to the target point have higher weights, and can be computed using Eq. (2), where \(Z_i\) is the value of the i-th known point, \(d_i\) is its distance to the target point, and k is the power parameter. This calculation is consistent with the reality of geospatial data, and it can use the relationship between geographic locations effectively to predict the value of unknown points [40]. These characteristics give IDW the flexibility to adapt to different data distributions and to local or global variations, which makes IDW a reliable choice for the interpolation of temperature and other geographic data [60,61,62,63]. IDW-based adaptive approaches with different distance-decay relationships have been proposed [37, 64]. In certain instances, these advanced methods can outperform constant parameters and universal kriging. However, due to computational costs, only the traditional IDW method was used in this study.

    $$\begin{aligned} f(x, y)=\frac{\sum _{i=1}^{n}\frac{1}{d_i^k}Z_i}{\sum _{i=1}^{n}\frac{1}{d_i^k}} \end{aligned}$$
    (2)
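A small NumPy sketch of the inverse-distance-weighted average in Eq. (2) is given below. It is only an illustration: the source does not state how the n known points are selected, so the square search window (`radius`), the default power of 2, and the name `idw_fill` are all assumptions.

```python
import numpy as np

def idw_fill(values, missing, power=2, radius=3):
    """Fill missing pixels with an inverse-distance-weighted mean of nearby valid pixels."""
    filled = values.astype(float, copy=True)
    valid = ~missing
    h, w = values.shape
    for x, y in zip(*np.nonzero(missing)):
        r0, r1 = max(x - radius, 0), min(x + radius + 1, h)   # clipped search window
        c0, c1 = max(y - radius, 0), min(y + radius + 1, w)
        rows, cols = np.nonzero(valid[r0:r1, c0:c1])
        if rows.size == 0:
            continue                      # no valid neighbour in range: leave unchanged
        rows, cols = rows + r0, cols + c0
        d = np.hypot(rows - x, cols - y)  # Euclidean distances, always > 0 here
        weights = 1.0 / d ** power
        filled[x, y] = np.sum(weights * values[rows, cols]) / np.sum(weights)
    return filled
```

Only originally valid pixels contribute to each estimate, so the result does not depend on the order in which missing pixels are visited.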

2.4.5 A new proposed method: ‘step by step’

This paper proposes a cluster-efficient method for computing missing blocks. The method (named ‘Step by Step’, referred to as ‘SbS’) calculates missing values by iteratively working inward from the surrounding valid data. Each iteration computes only the missing pixels surrounded by n valid values and interpolates each using the average of its n valid neighbors. In the initial round, n is 8; it is reduced by 1 in each round until it reaches 3, after which it remains unchanged until the interpolation is complete.

Isolated missing islands are calculated using the eight surrounding points. For scattered missing points that contain neighboring missing ones, this method estimates the outermost points surrounded by as many valid values as possible in each round and repeats until there is no missing value.

Missing points sometimes form a block. The outermost points have enough information, but the innermost points have only missing neighbors. This approach interpolates from the outside to the inside so that each missing point is calculated using as much valid information as possible. For such a block, the interpolation starts in the sixth cycle, when n reaches 3, and computes the outermost points, reducing the missing block’s size from X * Y to (X-2) * (Y-2). This continues until the missing block disappears.
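The round-based procedure can be sketched as follows. This is one possible reading of the description, not the authors' implementation: a pixel is treated as ready once it has at least n valid 8-neighbours, all ready pixels in a round are filled from the state at the start of that round, and a guard stops the loop if no pixel can be filled. The names `neighbour_sums` and `step_by_step_fill` are hypothetical.

```python
import numpy as np

def neighbour_sums(arr):
    """Sum over the 8 neighbours of every pixel (zeros beyond the image border)."""
    h, w = arr.shape
    padded = np.pad(arr, 1)
    total = np.zeros((h, w), dtype=float)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr or dc:
                total += padded[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
    return total

def step_by_step_fill(values, missing, n_start=8, n_min=3):
    """Fill missing pixels from the outside in, lowering the neighbour threshold each round."""
    filled = np.where(missing, 0.0, values).astype(float)
    valid = ~missing
    n = n_start
    while not valid.all():
        count = neighbour_sums(valid.astype(float))          # valid neighbours per pixel
        total = neighbour_sums(np.where(valid, filled, 0.0)) # sum of their values
        ready = ~valid & (count >= n)
        if ready.any():
            filled[ready] = total[ready] / count[ready]      # mean of valid neighbours
            valid = valid | ready
        elif n == n_min:
            break            # nothing can be filled any more; remaining pixels stay 0.0
        n = max(n - 1, n_min)
    return filled
```

On a 3 * 3 missing block inside valid data, this sketch fills the corners first, then the edge centres, and finally the innermost pixel.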

Table 2 Three inaccuracy groups and evaluation metrics/strategies

2.5 Evaluation metrics

Interpolation distorts images, and most prediction models generate noise; both distort the expected outcomes. The following measurements compare the ground-truth image with the predicted image to evaluate the model’s performance with different interpolation strategies.

This study classified probable inaccuracies into three groups, intensity, structure, and location, based on the authors’ understanding. Seven measurements and their related tactics were employed independently and in combination to obtain a clearer and more comprehensive conclusion.

Intensity inaccuracy metrics evaluate pixel values in the corresponding images to analyze the precision of a prediction. They consider physical meaning, but not perceived visual quality. Structure inaccuracy metrics address the appearance of the image. They can analyze shape distortion and vector distances between corresponding images. Furthermore, this study introduces location inaccuracy to alleviate the negative influence of possible displacement in the prediction process, which is often overstated in prior work when analyzing the precision of the prediction results and the performance of the associated work.

This section describes the evaluation metrics used for this research. For clarity, the metrics’ abbreviated names are provided in parentheses. Table 2 lists all the evaluation metrics. The following section provides more information.

2.5.1 Intensity inaccuracy

MSE, PSNR, and PR were used for intensity inaccuracy. These measurements concentrate on the difference between the values of two corresponding points in the predicted image and the true image.

Mean Squared Error (referred to as ‘MSE’) [65]: MSE is the most widely used image quality measurement. It takes the average squared difference between the truth data and the predictions, as given by Eq. (3).

$$\begin{aligned} MSE=\frac{1}{n}\sum _{i=1}^{n}(y_i-\hat{y_i})^2 \end{aligned}$$
(3)

Peak Signal-to-Noise Ratio (referred to as ‘PSNR’) [65]: PSNR is the ratio of the maximum possible signal power to the signal-affecting noise power, calculated by Eq. 4. It approximates the human sense of reconstruction quality and is widely used to measure image quality [39]. The larger the result, the better the performance of the model.

$$\begin{aligned} PSNR(dB)=10*log_{10}\left( \frac{MAX^2}{MSE}\right) =20*log_{10}\left( \frac{MAX}{\sqrt{MSE}}\right) \end{aligned}$$
(4)
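For reference, MSE and PSNR can be computed in a few lines of NumPy. The sketch uses the standard definitions, with PSNR taken as 10 log10(MAX^2 / MSE), and assumes MAX = 1 because the data are normalized to [0, 1]; the function names are illustrative.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between truth and prediction."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))

def psnr(y_true, y_pred, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for a perfect prediction."""
    err = mse(y_true, y_pred)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))
```

For example, a uniform error of 0.1 on normalized data gives an MSE of 0.01 and a PSNR of 20 dB.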

Percentage rate (referred to as ‘PR’): the percentage error of a predicted value is 100% * | true value - predicted value | / | true value |. In this study, ‘PR: A%’ means that A% of predicted values have a percentage error <20% of their true values. Five A values were used to evaluate the deviation of the data after the prediction.

Fig. 2

Examples of small displacements that might occur during prediction

2.5.2 Structure inaccuracy

SSIM, cosine similarity, and Minkowski distances analyzed structure inaccuracy. They are sensitive to geometric deformations [39], focus on the structure or vector retrieved from the image, and calculate the similarity or distance between parallel images.

The Structure Similarity Index Measure (referred to as ‘SSIM’): SSIM assesses structural similarity between two images based on perception [39]. It contains three elements: brightness, contrast, and structure, and is calculated by Eq. 5. Performance is improved when the SSIM is closer to 1.

$$\begin{aligned} SSIM(x, y)=\frac{(2\mu _x\mu _y+c_1)(2\sigma _{xy}+c_2)}{(\mu _x^2+\mu _y^2+c_1)(\sigma _x^2+\sigma _y^2+c_2)} \end{aligned}$$
(5)

Cosine Similarity (referred to as ‘Cosine’): Cosine similarity evaluates sequence similarity. It is the vector cosine of the angle between two sequences, calculated by Eq. 6. The similarity value only considers directions and ignores their length.

$$\begin{aligned} SC(A, B)=cos(\theta )=\frac{\overrightarrow{A}\cdot \overrightarrow{B}}{||\overrightarrow{A}||||\overrightarrow{B}||} \end{aligned}$$
(6)

Minkowski distances (referred to as ‘Distances’): The Minkowski distance, a normed vector space metric that generalizes the Manhattan and Euclidean distances, is determined using Eq. (7). When p=1, the Minkowski distance is the Manhattan distance, and when p=2, it is the Euclidean distance.

$$\begin{aligned} dist(x, y)=\left( \sum _{i=1}^n{|x_i-y_i|}^p\right) ^{\frac{1}{p}} \end{aligned}$$
(7)
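Cosine similarity and the Minkowski distance can be sketched as below, treating each image as a flattened vector. The flattening and the function names are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two images, viewed as flat vectors."""
    a = np.ravel(a).astype(float)
    b = np.ravel(b).astype(float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def minkowski_distance(x, y, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    diff = np.abs(np.ravel(x).astype(float) - np.ravel(y).astype(float))
    return float(np.sum(diff ** p) ** (1.0 / p))
```

Because cosine similarity ignores vector length, two images that differ only by a constant positive scale factor have a similarity of 1, while their Minkowski distance is non-zero.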

2.5.3 Location inaccuracy

Fig. 3

The new set of images

The displacement in Fig. 2 is small. In such a case, the expected distortion cannot be shown by intensity or structure metrics. This work expanded the predicted image to a set, as shown in Fig. 3, that includes the original image and eight new images with small displacements from [-1, -1] to [1, 1]. Each image in the new set was compared with the true image, the best performance score was adopted, and the direction of its displacement was recorded.

Location inaccuracy accounts for the effect of a small displacement. In Fig. 2, the intensity inaccuracy is high, but the displacement does not disrupt the shape of the pattern. Such displacements can be diagnosed as location inaccuracy, making the analysis easier.

Based on the idea of location inaccuracy, two strategies were used: results after the shift (referred to as ‘Shift MSE/PSNR/SSIM’) and improvement after the shift (referred to as ‘Improved MSE/PSNR/SSIM’), calculated as shifted MSE / PSNR / SSIM - MSE / PSNR / SSIM.
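The shift evaluation can be illustrated for MSE as follows. The sketch makes one assumption that the text does not specify: when the prediction is displaced, only the region where the two images still overlap is compared, so no border values need to be invented. The names `overlap` and `shift_mse` are hypothetical.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def overlap(truth, pred, dr, dc):
    """Crop both images to the area they share after moving `pred` by (dr, dc)."""
    h, w = truth.shape
    t = truth[max(dr, 0):h + min(dr, 0), max(dc, 0):w + min(dc, 0)]
    p = pred[max(-dr, 0):h + min(-dr, 0), max(-dc, 0):w + min(-dc, 0)]
    return t, p

def shift_mse(truth, pred, max_shift=1):
    """Best MSE over displacements in [-max_shift, max_shift]^2.

    Returns (shifted MSE, best displacement, shifted MSE - original MSE).
    """
    base = mse(truth, pred)
    best, best_shift = base, (0, 0)
    for dr in range(-max_shift, max_shift + 1):
        for dc in range(-max_shift, max_shift + 1):
            score = mse(*overlap(truth, pred, dr, dc))
            if score < best:              # lower MSE is better
                best, best_shift = score, (dr, dc)
    return best, best_shift, best - base
```

For MSE a negative improvement value means the shifted comparison is better; for PSNR and SSIM the comparison would be reversed, since higher values are better.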

2.5.4 Extra process

MSE, PSNR, and SSIM are often analyzed using extra strategies to enrich their meaning and make the evaluation more accurate. This study examined MSE, PSNR, and SSIM using two additional processes, as described below.

Weight (referred to as ‘Weight’): The four weather products are not equally important. The CTT product, which represents the local temperature, is the most important. Each channel should also prioritize extreme global conditions. These datasets commonly had CTT with 1e-3 (0.1 mm/h) or lower precipitation, which should be considered valid instead of discarded. In this work, scenarios with different data distributions were weighted to reflect their relative relevance, eliminate irrelevant interference elements, and provide accurate and useful weather predictions. Figure 4 shows the weights of the products under various conditions. The ‘Weight’ operation is applied to MSE, PSNR, and SSIM, and the ‘Weight MSE / PSNR / SSIM’ values are recorded.

Fig. 4

Weight set for different channels

Mean / N (referred to as ‘Mean’): The Mean was used to fully assess each part of the image by removing distant influences and focusing on locally extreme situations. After weighting, ‘Mean’ splits each image into blocks of N * N pixels. This study set N, the number of pixels along each block’s length and width, to 16. ‘Mean Weight MSE / PSNR / SSIM’ values were recorded.

3 Result

This section will describe the implementation details, evaluation and comparison principle, and the experiment results.

3.1 Implementation

The models were built on the UNet-18 model provided as a baseline in the big data competition [41], using the PyTorch Lightning framework. Some modifications were made to increase model performance. Unless otherwise specified, the hyperparameters are as listed below.

The combination of UNet with DenseNet outperforms the prior UNet design [66], therefore each convolution layer block was replaced with a micro-DenseNet block to train the model more deeply and accurately [67]. This change utilized DenseNet’s dense connectivity to reduce information loss and promote feature reuse, allowing the model to learn complex features more efficiently.

In addition, the baseline model used ReLU as the activation function, in which a positive input is passed through unchanged and a negative input is set to zero: \(f(x) = max(0, x)\). In this study, Mish is used as an alternative to ReLU, as it can improve model performance through smooth nonlinear transformations and better gradient flow. Mish is defined by Eq. (8) [68].

$$\begin{aligned} f(x) = x * \tanh (softplus (x)) \end{aligned}$$
(8)
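The difference between the two activations can be seen in a short NumPy sketch, using the standard definition of Mish, x tanh(softplus(x)), with softplus(x) = ln(1 + e^x). It is written in plain NumPy for illustration only and is independent of the framework used for training.

```python
import numpy as np

def softplus(x):
    # ln(1 + exp(x)), evaluated in a numerically stable way
    return np.logaddexp(0.0, x)

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * np.tanh(softplus(x))

def relu(x):
    """ReLU activation: max(0, x)."""
    return np.maximum(0.0, x)
```

Unlike ReLU, Mish is smooth and lets small negative values pass through (for x = -1 it returns roughly -0.30 rather than 0), while for large positive inputs both functions approach the identity.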

The training batch size was 64, and the number of workers was 8 due to hardware limitations. Since weight decay improves generalizability, the AdamW optimizer [69] with a weight decay of 6e−2 was employed to improve calculation efficiency. The learning rate was decreased in stages from 1e-3 to 2e-4 to accelerate training, suppress oscillation, and converge to a local minimum [70]. The final learning rate was set to be cyclical, ranging from 1.5e\(-\)4 to 2.5e\(-\)4, to speed up training and improve precision [71]. For best performance, all hyperparameters and strategies were extensively optimized.

Ten interpolation methods and seven evaluation metrics were applied independently. The results are presented in the following section. Each channel’s results were also evaluated using MSE, PSNR, and SSIM to directly analyze each weather product’s prediction and better understand how the data distribution affects the various interpolation strategies, thus improving the model’s performance.

3.2 Comparison principle

This study combined all evaluation results with varying result values to provide a more precise view and compare overall performance more fairly. Prediction results for each evaluation metric were scored and ranked using Table 3’s formulas.

Table 3 The calculated formula of scores for each evaluation metric

3.3 Overall performance

Table 5 shows the overall performance, including the results for the 7 metrics, sorted by the performance scores calculated as in Table 3. The ability of interpolation strategies to increase model performance was assessed using various metrics. The total accuracy score is calculated from intensity inaccuracy (referred to as ‘II’), structure inaccuracy (referred to as ‘SI’), and location inaccuracy (referred to as ‘LI’), using Eqs. (9) to (12). The principle of calculating the score for each metric with different tactics is provided in Table 4.

Table 4 The calculation of the score of each metric group
Table 5 The overall performance. The top 3 strategies and the default method are written in bold

Table 5 provides more information recorded in this study.

$$\begin{aligned}&Total Accuracy Score = 0.5*II Score + 0.3*SI Score \nonumber \\&\qquad \qquad \qquad \qquad \quad \qquad \qquad +0.2*LI Score \end{aligned}$$
(9)
$$\begin{aligned}&II Score=0.4*MSE Score+0.4*{PSNR} Score \nonumber \\&\qquad \qquad \qquad +0.2*PR Score \end{aligned}$$
(10)
$$\begin{aligned}&SI Score=0.6*SSIM Score+0.4*Cosine Score \end{aligned}$$
(11)
$$\begin{aligned}&LI Score=0.8*Shifted Score +0.2*Improved Score \end{aligned}$$
(12)
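Eqs. (9) to (12) are plain weighted sums, which can be written directly as code. The function names below are illustrative, and the per-metric scores are assumed to have been computed beforehand according to Table 3.

```python
def ii_score(mse_score, psnr_score, pr_score):
    """Intensity inaccuracy score, Eq. (10)."""
    return 0.4 * mse_score + 0.4 * psnr_score + 0.2 * pr_score

def si_score(ssim_score, cosine_score):
    """Structure inaccuracy score, Eq. (11)."""
    return 0.6 * ssim_score + 0.4 * cosine_score

def li_score(shifted_score, improved_score):
    """Location inaccuracy score, Eq. (12)."""
    return 0.8 * shifted_score + 0.2 * improved_score

def total_score(ii, si, li):
    """Total accuracy score, Eq. (9)."""
    return 0.5 * ii + 0.3 * si + 0.2 * li
```

Since the weights in each equation sum to 1, a strategy that obtains the same score on every metric receives that same value as its total score.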

Ten of the 14 interpolation strategies outperform ‘Mask’. ‘Linear’ works best. However, ‘Linear’ takes 1.5 h to complete one epoch of training, 5 times longer than ‘Mean / 5’ and ‘Mean / 10’, which are not far behind ‘Linear’. When time is considered, interpolation using the mean value of such a constrained region would be the best choice. Despite outperforming some common methods, ‘SbS’ did not perform as intended.

In addition, a strategy’s computational complexity affects the efficiency of interpolation but does not necessarily improve model performance. A complex algorithm such as ‘IDW’ improves performance, while ‘KNN’, a complex procedure involving numerous calculations, does not work as expected.

Depending on its parameters, the same interpolation method can perform differently. ‘KNN / 15’ outperformed ‘KNN / 5’. Among the mean-based variants, ‘Mean / 10’ and ‘Mean’ (which can be seen as ‘Mean / 256’) also decreased in performance. For a given method, there is an optimal number of neighbor points for estimating the value of a missing point, and using more or fewer than this number reduces performance. This optimum varies by method.
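To make the role of the neighborhood size concrete, the following is a minimal sketch of a constrained-region mean fill in plain Python. It is not the implementation used in this study: it assumes that missing points are marked as `None`, that the region is a square window around the missing point, and that `radius` (a hypothetical parameter) controls its size, loosely mirroring the N in ‘Mean / N’.

```python
def region_mean_fill(grid, radius):
    """Fill None entries with the mean of valid values in a square window.

    grid   : list of equal-length rows; None marks a missing point (assumption).
    radius : half-width of the window around each missing point (hypothetical
             parameter standing in for the region size).
    """
    rows, cols = len(grid), len(grid[0])
    filled = [row[:] for row in grid]  # copy so the input is not modified
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is not None:
                continue
            # Collect valid values from the original grid inside the window,
            # clipping the window at the grid borders.
            neighbours = [
                grid[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
                if grid[i][j] is not None
            ]
            # If the window holds no valid value, the gap is left as None.
            if neighbours:
                filled[r][c] = sum(neighbours) / len(neighbours)
    return filled


# Toy row, not data from this study: the filled value depends on the window.
row = [[None, 0.0, 10.0]]
print(region_mean_fill(row, radius=1))  # uses only the adjacent value
print(region_mean_fill(row, radius=2))  # also includes the farther value
```

In this toy row, the small window uses only the nearest valid value, while the larger window averages in a more distant one, which illustrates why a region that is too small or too large can both hurt the estimate.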

3.4 Performance of 4 separate channels

The four weather products have distinct meanings and data distributions. To gain a more complete understanding, this study examined MSE, mean weight MSE, PSNR, and SSIM for each product. Table 6 shows the performance over all weather products (referred to as ‘Weather Products’ or ‘WPs’) and for each product (referred to as ‘CTT’ / ‘CRI’ / ‘ASII’ / ‘CMA’). The scores are calculated by Eqs. (13) and (14).

$$\begin{aligned} \text{Weather Products Score} &= 0.3 \times \text{CTT Score} + 0.2 \times \text{CRI Score} + 0.3 \times \text{ASII Score} + 0.2 \times \text{CMA Score} \end{aligned}$$
(13)
$$\begin{aligned} \text{Each Product Score} &= 0.2 \times \text{MSE Score} + 0.2 \times \text{Mean Weight MSE Score} + 0.3 \times \text{SSIM Score} + 0.3 \times \text{PSNR Score} \end{aligned}$$
(14)
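The two-level weighting of Eqs. (13) and (14) can be sketched in Python as follows. The weights are taken from the equations; the function names, the key `MeanWeightMSE`, and the nested-dictionary layout are assumptions for illustration, and the per-metric scores are treated as given inputs.

```python
# Weights copied from Eqs. (13) and (14).
PRODUCT_WEIGHTS = {"CTT": 0.3, "CRI": 0.2, "ASII": 0.3, "CMA": 0.2}
METRIC_WEIGHTS = {"MSE": 0.2, "MeanWeightMSE": 0.2, "SSIM": 0.3, "PSNR": 0.3}


def each_product_score(metric_scores):
    """Eq. (14): weighted score of one product from its four metric scores."""
    return sum(w * metric_scores[name] for name, w in METRIC_WEIGHTS.items())


def weather_products_score(scores_by_product):
    """Eq. (13): weighted score over the four products.

    scores_by_product is a hypothetical nested dict:
    {product name: {metric name: score}}.
    """
    return sum(
        weight * each_product_score(scores_by_product[product])
        for product, weight in PRODUCT_WEIGHTS.items()
    )


# Hypothetical scores, not values from this study.
example = {
    "CTT":  {"MSE": 8, "MeanWeightMSE": 7, "SSIM": 9, "PSNR": 6},
    "CRI":  {"MSE": 5, "MeanWeightMSE": 5, "SSIM": 4, "PSNR": 3},
    "ASII": {"MSE": 9, "MeanWeightMSE": 8, "SSIM": 9, "PSNR": 7},
    "CMA":  {"MSE": 6, "MeanWeightMSE": 6, "SSIM": 5, "PSNR": 4},
}
print(weather_products_score(example))
```

Since ‘CTT’ and ‘ASII’ together carry 60% of the weight, a strategy that does well only on ‘CRI’ and ‘CMA’ can gain at most 40% of the attainable weather products score.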
Table 6 The overall performance, the top 3 strategies and the default method are written in bold
Fig. 5 Performance of models in different evaluation measurements

Considering the four products, the weather product ranking in Table 6 is close to the overall ranking in Table 5, but the product scores are often lower. As in Sect. 3.3, ‘Linear’ is the best, and ‘IDW’ also performed well in both rankings.

Interpolation methods behave differently on different products, because the data distribution can affect an interpolation method’s performance. No single interpolation strategy is best for all four weather products.

Surprisingly, ‘Value 0’ did well on ‘CRI’ and ‘CMA’. Since most values in the ‘CRI’ and ‘CMA’ products are close to 0, these products probably favor methods that fill in small values. The prediction tends to be higher than the ground truth, and filling with 0 countered this trend and considerably improved performance on these products. For the same reason, methods that tend to predict greater values were less effective.

3.5 Scores of each metric

For each evaluation metric, this paper compared the trained model’s prediction to the ground truth. The outcomes and related information are shown in Fig. 5 and analyzed below. To simplify the results, only the top 3 strategies and the default method are shown in the figure. Table 7 records the best result among all strategies for each metric.

Table 7 Best scores (Rank 1 Strategy) with each metric

No interpolation strategy works best for all measurements. However, most good-performing strategies scored well on all evaluation metrics and weather products, while the low-performing strategies always performed poorly.

Most methods worked well on ‘SSIM’, but none of the methods performed well on ‘PSNR’ (from 18.63 to 23.31) or ‘PR’ (less than 5%, 10%), indicating that interpolation cannot significantly improve the model’s precise accuracy. ‘PSNR’ and ‘SSIM’ decreased after the mean process (mentioned in Sect. II.E.4: Extra Process). Surprisingly, ‘Mask’ outperforms most strategies for ‘Cosine’.

The newly proposed method ‘SbS’ performed best in ‘MSE’, improving the model’s performance by 9.91% over ‘Mask’.
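As a reading aid for this kind of figure, the sketch below shows one common way to express an improvement in an error metric relative to a baseline. It assumes that the reported percentage is a relative reduction in MSE with respect to ‘Mask’, which the text does not state explicitly; the function name and the input values are hypothetical and are not results from this study.

```python
def relative_improvement(baseline_error, new_error):
    """Percentage reduction of an error metric (lower is better) vs. a baseline.

    Assumes the improvement is defined as
    100 * (baseline_error - new_error) / baseline_error.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical MSE values, chosen only to illustrate the arithmetic.
baseline_mse = 0.020   # stand-in for the baseline strategy
candidate_mse = 0.018  # stand-in for a candidate strategy
print(relative_improvement(baseline_mse, candidate_mse))
```

A negative result under this definition would mean that the candidate strategy has a higher error than the baseline.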

Unfortunately, the interpolation methods had little effect on the location inaccuracy. For the newly proposed evaluation metric ‘Shift’, ‘Mean / 15’ and ‘IDW’ performed slightly better. Although each model’s predictions improved slightly in ‘MSE’, ‘PSNR’, and ‘SSIM’ after ‘Shift’, they did not differ significantly from the unshifted results. In general, the small change after ‘Shift’ does not reflect the capacity to increase model performance that was expected.

4 Discussions and conclusions

This study examines data interpolation strategies for a deep learning-based weather prediction task. Several conclusions can be drawn from comparing the weather prediction model’s performance under different interpolation strategies.

Most strategies outperformed the default ‘Mask’, but no strategy was universally superior. ‘Linear’ scored highest for most metrics in both cases. However, it takes almost 1.5 h to finish an epoch, considerably longer than other viable strategies. ‘Mean / 10’ and ‘Mean / 5’ performed well overall and may be better suited to big data workloads, as they are practical and efficient.

Moreover, some simple interpolation methods can outperform complicated ones. Furthermore, hyperparameters like K in KNN can affect method performance.

The new interpolation method proposed in this study (mentioned in Sect. II.D.5: A New Proposed Method ‘Step by step’) achieves reasonable results in both evaluations and has the highest score in ‘MSE’. On the other hand, the proposed evaluation metric ‘Shift’ (mentioned in Sect. II.E.3: ‘Location Inaccuracy’) has only a slight influence.

For future experimentation, this study plans to test more interpolation methods, some of which are time-consuming. Additionally, parameterized methods can be tested with more settings, for example to see whether ‘KNN / K’ can improve performance with a proper K value. Thirdly, the ways in which weather data segmentation and data transformation affect weather predictions can be explored. Finally, more tasks can be carried out to make the outcome more widely relevant and valid for more datasets and models.

This study is expected to be supplemented with more computing strategies or evaluation measures to produce a more comprehensive and universal investigation. In addition, it is hoped that the conclusions will serve as a basic overview and continuation of earlier findings on interpolation methods, provide as much useful information as possible, and offer potential suggestions in various fields. They are expected to be valuable for future weather prediction or data preprocessing projects, especially for meteorological datasets, as well as for other machine learning-based tasks applied to other models and datasets.