1 Introduction

Climate change is a major environmental threat affecting many countries globally. Food production, water availability, forest biodiversity, and livelihoods highly relate to it. According to projections made by the Intergovernmental Panel on Climate Change (IPCC), the effects of global warming on human society and the environment would vary throughout time and space [1, 2]. When considering the effects of climate change on Earth and its atmosphere, the temperature parameter is often considered to be the most important of all meteorological variables [1]. Over the last 100 years, the average air temperature near Earth's surface has risen by little under 1 °C. Global warming alters the climate of the planet and raises average temperatures all around the globe [2]. The Asian winter monsoon has diminished due to a decrease in snow cover at mid-to-high latitudes, which has raised temperatures along the East Asian coast. Asia would likely experience significantly rising trends in mean surface temperature (0.25 to 0.34 °C per decade under RCP4.5 and 0.42 to 0.6 °C per decade under RCP8.5) [2]. The World Meteorological Organization (WMO) and the United Nations Environment Program (UNEP) developed the Intergovernmental Panel on Climate Change (IPCC) to study the causes of climate change and global warming because it is widely believed that human activity is the main contributor (UNFCCC, 2005) [3]. The extraction of greenhouse gases from air conditioners, refrigerators, and other appliances, as well as the increase in atmospheric CO2, combustion of fossil fuels, deforestation, and other factors, all contribute to global warming. Bangladesh was ranked second among Asian nations and sixth overall in the Global Risk Index 2011 for countries that are most susceptible to natural disasters due to climate change. Bangladesh contributes only 0.3% of the emissions that cause global warming due to its low energy use. But Bangladesh is one of the worst-affected countries by the effects of global warming due to its geographic characteristics. The distinctive features that set Bangladesh’s climate apart from that of other tropical regions are its high temperatures, abundant rainfall, and seasonal change [4]. Bangladesh has seen an average summer temperature of 27.5 °C over the past 30 years, which is somewhat higher than the summer average [2]. Due to the extreme poverty in this country, the problems brought on by climate change are exacerbated (ICDDR B, 2019). Action Aid’s study report identified Bangladesh as the sixth most susceptible nation to famine, hurricanes, and floods [3]. Forecasting air temperatures aids meteorologists in determining the possibility in any region of the country. Air temperature is also considered an important element in evapotranspiration, which is essential for managing water supplies and agricultural operations. Many decision-making industries, including energy, transportation, and tourism, rely on accurate air temperature forecasting. Therefore, the most important component of environmental research involving functional eco-environmental systems is precisely estimating air temperature [1]. Therefore, numerous scientists and researchers over the world are attempting several investigations and creating sophisticated mathematical models to anticipate the air temperature.

Numerical-based and machine learning (ML)-based techniques are the two primary categories of weather forecasting methods used today. Numerical-based weather prediction (NWP) models include erroneous assumptions, unclear physical parameterization, and physical correlations of parameters and mechanisms of atmospheric dynamics. The model output may need to be post-processed to improve the models’ effectiveness in practical applications. It raises the cost of calculation due to complicated mathematical formulas. However, ML-based techniques have gained popularity recently due to their lower processing costs and insensitivity to the multicollinearity of the input variables [5]. Hanoon et al. proposed different machine learning algorithms including Gradient Boosting Tree (GBT), Random forest (RF), Linear regression (LR), multi-layered perceptron neural network (MLP-NN), and radial basis function neural network (RBF-NN) for the prediction of air temperature [6]. The findings indicate that the MLP-NN exhibits commendable performance in forecasting daily temperature. Azamathulla presents artificial neural networks (ANNs) and gene expression programming (GEP) to predict the monthly atmospheric temperature in Tabuk, Saudi Arabia [7]. In previous research, individual ML models have typically been used to verify the predictions and demonstrate their superiority. A single forecast model is challenging to adapt to various weather parameters, even if it can increase forecast accuracy by modifying parameters and selecting features during the forecasting process. Numerous studies have demonstrated that developing ensemble and hybrid models by integrating multiple single forecast models can efficiently harness the benefits of various models and increase the precision and reliability of weather forecasts [8]. The daily temperatures in five cities throughout Belgium were predicted using a 2-layer spatiotemporal stacked LSTM model presented by Karevan et al. [9]. The results show that the spatiotemporal stacked LSTM outperformed stacked LSTM. Roy employs three deep neural networks: Multi-Layer Perceptron (MLP), Long Short Term Memory Network (LSTM), and a hybrid of Convolutional Neural Network (CNN) with LSTM [10]. Out of these models, the CNN + LSTM combination showcases the top performance, with LSTM closely trailing behind. Lee et al. utilized MLP, RNN, and CNN models to predict daily average, minimum, and maximum temperatures. They incorporated input features at a frequency greater than what had been employed in prior studies [11]. Notably, CNN, primarily utilized for processing satellite images rather than numeric weather data in temperature forecasting, surpasses the other models in performance. Mohammadi et al. developed some novel hybrid models combining autoregressive (AR), multi-layered perceptron (MLP), and autoregressive conditional heteroscedasticity (ARCH) to estimate minimum, maximum, and mean air temperatures in Northwestern Iran for both daily and monthly time scales. The research concludes that the hybrid MLP-AR models demonstrated the highest performance out of all the models tested [12]. Zhou employed a hybrid model [i.e., an artificial neural network hybridized with the powerful hetaeristic Honey Badger Algorithm (HBA-ANN)] for forecasting monthly temperatures in the hottest and coldest regions of the world [13]. Nketiah et al. employed RNNs to construct temperature forecast models for five Chinese cities, employing five distinct model configurations. They also implemented the Ridge Regularizer (L2) during the neural network training process to prevent both overfitting and underfitting. In addition, hyperparameters were fine-tuned using the Bayesian optimization method [14].

While some studies have successfully applied ensemble and hybrid models for temperature forecasting, there may still be untapped potential in exploring different combinations of models or ensemble techniques. Further research could focus on identifying more effective ways to integrate and leverage the strengths of different models. The studies mentioned focus on specific regions, such as Belgium and Iran, and specific global extremes. There is a research gap in understanding how these models perform in a wider range of geographical contexts, including regions with different climate characteristics or extreme weather conditions. While individual studies have proposed specific lag-based schemes, there is a gap in comprehensive comparisons across various statistical schemes based on input lags. Such a comparative analysis can provide insights into the relative performance of different approaches. Additionally, previous works have not explored different statistical tests or trend analyses to identify the most appropriate approach for a given context, which may potentially lead to unreliable results.

The majority of Bangladesh is unaffected by initiatives connected to climate change, and there has been relatively limited studies based on daily temperature forecasts undertaken in this nation. The ability to effectively train policymakers and employees for mitigation and adaptation actions depends on their understanding of the nature and scope of potential climatic changes in south-eastern Bangladesh. The study uses Cox’s Bazar, Bangladesh, a tropical climate case study, to forecast air temperature. The research area is distinct and located in the popular tourism eastern coastal region of the Bay of Bengal. The study applies an expansion of the well-known stacking model to complete the forecasting task. The model has been used to analyze a 20-year weather dataset that Bangladesh Meteorological Department (BMD) collected. In the current study, we proposed a perceptible stacking methodology to execute a hybrid scheme that combines the models—LR, Ridge, SVR, RF, and LGBM. They both have complimentary benefits and drawbacks which can be utilized in the stacking ensemble approach. The method chose a meta-learner and base-learners from 12 candidate models to create the stacking model's structure. By contrasting the stacking model with the individual models, the improvement in the performance of temperature forecast is demonstrated. We also compare three types of statistical schemes regarding historical time-series values of lagged days in the feature engineering stage. Also, statistical tests and trend analysis performed in this work ultimately enhanced the quality and reliability of our research findings.

2 Materials and methods

2.1 Methodology

A well-planned approach is essential for doing the investigation systematically. The elaborate framework used to conduct this study is shown in detail in Fig. 1. The methodology is mainly divided into two phases: Phase I: Preparing the data, and Phase II: Training the model.

Fig. 1
figure 1

Flowchart of the proposed stacking ensemble learning-based model for temperature forecasting

2.1.1 Phase I: preparing the data

The phase contains gathering observed data (i.e., raw data collected form BMD), data formatting, data preprocessing, and train-test splitting. Initially, the observed weather data of 20 years (from January 1, 2001, to December 31, 2021) were obtained from BMD. Once the extraneous data had been removed, the data had been rearranged, and descriptive statistics had been calculated. In the data preprocessing stage, there are several steps i.e., missing value imputation, outlier handling, feature engineering and data normalization. After preprocessing, the dataset was divided into two sets: (i) Training set (80%) and (ii) Testing set (20%).

2.1.2 Phase II: training the model

The phase includes the stacking ensemble model set-up and model training along with out-of-fold cross-validation. The model is constructed with level-0 base-learners and level-1 meta-learner. Level-0 base learners were chosen form 11 candidate ML models based on a performance index.

Lastly, the performance of the predicted values of each model was evaluated. The forecasting results were compared with the test dataset of the target variable in terms of mean squared error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and coefficient of determination (R2). We also performed statistical tests to distinguish the significance of the performance.

2.2 Background of machine learning models

2.2.1 Stacking ensemble learning model

An ensemble approach for ML, stacked generalization was first presented by Wolpert [15]. The stacking model allows a variety of efficient models to carry out classification or regression tasks and get predictions that outperform all individual models in the ensemble. Two or more level-0 models constitute the framework of a stacking model, jointly with a level-1 model, which incorporates the predictions of the base models. Level-0 Model (Base-learner) predictions are fitted to the training set of data. The level-1 Model (Meta-learner) gains knowledge about the best methods for incorporating the forecasts of the base models. The base-model predictions derived using data from out-of-sample are used to train the meta-learner [16].

2.2.2 Candidate machine learning models

Choosing base-learning and meta-learning combinations is a primary concern while designing stacking ensemble architecture. Stacking is suitable when several ML models have different learning skills and make distinct assumptions on the predictive modeling performance. The 12 candidate models are LR, lasso, ridge, ElasticNet, decision tree (DT), bagging regression (BR), RF, adaptive boosting (AdaBoost), LGBM, extreme gradient boosting (XGBoost), SVR, and k-nearest neighbor (KNN). Among them, all models are non-linear except LR. The meta-model is frequently straightforward, allowing for an easy interpretation of the basic model predictions. As a result, the meta-model is often a basic linear model. The base-models are selected from the remaining 11 models. Table 1 provides the characteristics of the five models selected for generating a stacking model.

Table 1 A brief overview of the key aspects of five different regression models

2.2.3 Repeated K-fold cross-validation

Common cross-validation technique uses K-fold cross-validation (K-fold CV) for evaluating learner performance. K is any integer number. The training set's samples are divided randomly into K folds (subsets), and this procedure is repeated K times. The Kth fold of the dataset serves as the test set for all iterations, and the remaining K-1 folds serve as the training set. The process repeats until each of the K folds had served as the test set [17].

The noisy performance estimate provided by K-fold CV can make choosing a final model crucial to address the problem. An alternate approach is to execute the K-fold CV technique repeatedly and then display the mean performance of overall folds and repeats. Repeated k-fold CV is the name of this technique.

2.2.4 Grid-search cross-validation

The grid search cross-validation (GSCV) is an exhaustive search method that combines parametric search with model assessment indexes and CV techniques. The model is trained using a variety of possible parameter combinations before the grid search ultimately selects the one that yields the best results in terms of error or accuracy.

2.3 Performance metrics

During the testing phase, many metrics can be utilized to compare the performance of the models under consideration. The four performance indicators utilized for evaluation in this study are mean square error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and coefficient of determination (R2), shown in Table 2. In the following formula, n represents the number of observations, \({\widehat{\mathrm{y}}}_{\mathrm{i}}\) represents the predicted value, and \({\mathrm{y}}_{\mathrm{i}}\) represents the actual value.

Table 2 Evaluation metric rules

Computation time: The running time of the used model is contrasted with the running time of comparison models in many articles to aid in the trade-off between the model's accuracy and computation complexity. Execution time, for instance, becomes crucial for training the model when training on continuously changing training data, such as weather conditions [20].

Statistical testing: The statistical tests were carried out in several articles to demonstrate the significance of the findings. Longer horizons and occasionally more complex models are two factors that contribute to the running time growing as the forecast horizon is increased [20]. In this research, a commonly employed nonparametric test, the Friedman test, was performed, which assesses the average rankings of benchmarking models for comparison. The Friedman test was followed by a Nemenyi post hoc analysis, which aids in identifying the most effective model for this particular dataset, ensuring accurate predictions in real-time applications of these models [5].

Trend Analysis: In weather forecasting, trend analysis plays a pivotal role in understanding long-term climate patterns and making informed predictions. In this study, the Mann–Kendall (MK) test was conducted, which is a powerful non-parametric statistical method used to detect trends or monotonic changes in time series data [21].

3 Experimental analysis

Common ML libraries, like scikit-learn and Keras, and additional libraries, such seaborn and matplotlib, were utilized in the simulation testbed with an 11th Gen Intel(R) Core (TM) i7-1165G7 @ 2.80 GHz, 1.69 GHz with 8 GB RAM, 64-bit software, and an × 64-based CPU.

3.1 Study area and data acquisition

For this study, daily weather data over the period between 1 January 2001 and 31 December 2021 of the Cox’s Bazar weather station situated in the South-East part of Bangladesh was obtained from the BMD. The municipality covers 6.85 km2 and borders the Chittagong District, the Bay of Bengal, the Bandarban District, and the Bay of Bengal [19]. Cox’s Bazar has a tropical monsoon climate with an elevation of 0 feet above sea level. The average annual temperature of the district was − 0.71% lower than that of Bangladesh (27.03 °C). Cox’s Bazar experienced 158.42 wet days (43.4% of the time) and received about 140.97 mm (5.55 inches) of precipitation yearly. Figure 2 displays the position of the Cox’s Bazar weather station, while Table 3 provides details regarding the geometric characteristics of the site. The data was provided in Excel files on a daily time scale, which contained date, temperature, maximum temperature, minimum temperature, humidity, wind speed, pressure, and rainfall.

Fig. 2
figure 2

Research the surrounding region and pinpoint the precise position of the Cox’s Bazar weather station

Table 3 Geometric details of the selected site

3.2 Data handling and pre-processing

3.2.1 Formatting

The unnecessary columns, rows, and terms were deleted from the provided files (For example, Station name, and Station ID) and converted into CSV format for further use. The dataset contains 7670 rows and 7 columns, as shown in Table 4, which represents the first five records of the data obtained from January 1, 2001, to December 31, 2021. Here date, temperature, maximum temperature, lowest temperature, humidity, wind speed, pressure, and rainfall are the primary characteristics. The average, high, and low temperatures for the next day are the dependent variables of interest.

Table 4 The first five records in the dataset before data preprocessing

3.2.2 Checking

Table 5 shows the descriptive statistics having mean, minimum, maximum, and standard deviation. Each variable, except the wind speed, contained 7670 records (The wind speed variable had one null value). The mean for daily average temperature was 26.16 °C, the daily maximum temperature was 31.13 °C, and the daily minimum temperature was 22.16 °C. The minimum and maximum values of the average temperature were 15.50 °C and 32 °C. The measured maximum temperature values ranged from 19.70 to 39.50 °C, while the measured values of the minimum temperature ranged from 10.30 to 30 °C. According to the standard deviation, it can be seen that the distribution of total rainfall was 27.11 mm, which was the most dispersed from the mean, and the lowest one was of wind speed of 2.28 m/s.

Table 5 Descriptive Statistics of daily weather dataset (based on minimum, mean, maximum, SD, 25th, 50th, and 75th percentiles)

3.2.3 Pre-processing

Each dataset needs a pre-processing technique that ML algorithms demand. Prior to modeling, it is necessary to handle null values and outliers. Moreover, feature engineering is required to build and train better features in order to achieve effective ML.

3.2.3.1 Missing value handling

The data set was considered complete when the mean wind speed was used to replace the missing value.

3.2.3.2 Outliers handling

The winsorization method, also known as the inter quartile range (IQR) method, was used to replace the outliers with the 25th and 75th percentiles of data for these variables [22]. According to this procedure, the following equation is given below.

$${\text{IQR}} = {\text{Q}}_{3} - {\text{Q}}_{1}$$
(1)

where \({\mathrm{Q}}_{3}\) and \({\mathrm{Q}}_{1}\) represent the first quartile and third quartile. A limit for the minimum and maximum outlier values are set to cap the outliers. According to the IQR method, the normal data is defined within a range (lower limit as \({\mathrm{Q}}_{1}-1.5\times \mathrm{IQR}\) and upper limit as \({\mathrm{Q}}_{3}+1.5\times \mathrm{IQR}\)).

3.2.3.3 Feature engineering

The act of choosing, modifying, and converting unprocessed data into features that can be applied in supervised learning is known as feature engineering. We separately analyzed two complementary effects related to the feature engineering techniques: (i) The impact of the correlation-based feature selection strategy, and (ii) The inclusion of historical series values acquired in several past days.

The impact of the correlation-based feature selection strategy: Date-related features were created from the date column to extract useful information. It results the modified data with the following additional columns: month, ‘day_of_month’, day_of_year, week_of_year, day_of_week, year, is_month_start, is_month_end. The data already contained other attributes provided by the weather station. The process of feature selection involves lowering the number of input variables. Adding more features will make the training process difficult; however it may not always result in more accuracy [23]. The correlation heatmap was generated using the Python matplotlib and seaborn packages, which apply (1) to determine the correlation coefficient (r) between the input features [17].

$$r = \frac{{\sum (x_{i} - x^{ - } ) \left( {y_{i} - y^{ - } } \right)}}{{\sqrt {\sum (x_{i} - x^{ - } ) ^{2} } \sum (y_{i} - y^{ - } )^{2} }}$$
(2)

where \(\mathrm{r}\) represents the correlation coefficient, \({\mathrm{x}}_{\mathrm{i}}\) represents the values of x-variable in a sample, \({\mathrm{x}}^{-}\) represents the mean values of \(\mathrm{x}\)-variable, \({\mathrm{y}}_{\mathrm{i}}\) represents the values of y-variable in a sample, and \({\mathrm{y}}^{-}\) represents the mean values of \(\mathrm{y}\)-variable. The correlation between every two variables is presented through the Pearson’s Correlation Coefficient (PCC) heatmap in Fig. 3. The connection can be anywhere from − 1 to 1. + 1 means that there is a positive correlation, − 1 means that there is a negative correlation, and 0 means that there is no correlation. In Fig. 4 green denotes a positive, whereas red denotes a negative. The correlation magnitude increases as the color intensity does. It has been seen that the temperature has a high positive relationship with the minimum temperature. However, each of the three temperatures has a high negative correlation with the pressure. SelectKBest, a filter-based feature selection method provided by the sci-kit-learn library, has been used. It works based on the PCC between the two input variables, which can help filter out the most relevant features. A subset of the 10 most correlated features were selected and three features i.e., day_of_month, is_month_start, and is_month_end were discarded as they have the least correlation with the target variables.

Fig. 3
figure 3

Correlation among the different features visualized as a heatmap

Fig. 4
figure 4

Temperature, maximum temperature, and lowest temperature, as well as their associated ACF and PACF curves with regard to 60 delays

The inclusion of historical series values acquired in several past days: the influence of the past value on the current value for any given observation (i.e., a time lag) was estimated using the time-lagged data from the temperature time series, as it might be advantageous to consider the significant correlated lags [23]. In Fig. 4, autocorrelation function (ACF) and partial autocorrelation function (PACF) curves with 60 lags were plotted in order to choose the appropriate input lags. Seasonality makes it is difficult to choose the best lags for the daily scale when using ACF [1]. PACF shows a high correlation with the first lag and a low correlation with the second and third lag. Therefore, compared to the ACF, the PACF is more appropriate for choosing the input lags for forecasting the small scale of time series [1]. The PACF curve shows that, the lags, out of the upper and lower bounds, can be ignored as they have lower correlations, which may result in inaccurate predictions. The time lags beginning from 1 day earlier (t−1) to 11 days earlier (t−11) are more prominent and above a 95% confidence level for each target variable. Some feature engineering methods extend beyond simply adding raw lagged values by computing some statistical values based on past values. Among them, rolling window and exponentially weighted moving average (EWMA) schemes are generally effective for improving the performance of time series forecasting models; but using appropriate method is important to avoid overfitting.

Rolling mean features involve calculating the mean of the time series data over a specified window size and using the mean values as input features for the forecasting model. The number of lagged values determines the window size. Exponentially weighted mean features are similar to rolling mean features; but instead of using a fixed window size to calculate the mean, they use an exponentially decaying weighting scheme. The decay factor in the weighting scheme can be determined to control the importance given to recent observations. Table 6 presents the selected features applied in this study.

Table 6 Selected features based on feature engineering

3.2.4 Normalization

Data normalization is frequently employed in ML techniques to lessen the impact of the variety of data. In this research, for data normalization, robust scaling is used [24]. It uses median and interquartile range (IQR) to scale input values. Robust scaling is resistant to the negative effects of outliers. The formula is as follows:

$${\text{Scaled Value}} = { }\frac{{{\text{Original Value}} - {\text{Input Median}}}}{{\text{Input IQR}}}$$
(3)

4 Model set-up

4.1 Data partition

A training set and a testing set were created by dividing the dataset in advance of the training process. Eighty percent (80%) of the dataset was used for training, while the other twenty percent (20%) was used for testing. The model can be constructed and fitted to the available data with the aid of the training set. Estimating the model's efficacy on new data (data not used to train the model) was made easier with the help of the testing set.

4.2 Stacking model construction

Each of the possible models adheres to a slightly different set of rules for fitting data sets, and each has its own set of benefits when working with certain classes of variables.

4.2.1 Model selection

In this study the linear regression (LR) was chosen as the meta-learner. Since linear regression does not need a second hyper-parameter to be adjusted and has weights that can be interpreted to indicate base-learner significance, it is frequently used in regression for the meta-learner. We employed 11 potential models with the default hyper-parameters to compare their performance on the training dataset using repeated tenfold CV with several iterations of 3. The negative mean squared error (NMSE) is considered as the selection score, given in Table 7. The base-learners for the stacking-model were picked from the four models that did the best. Table 7 shows that LGBM had the lowest average CV error, with an MSE of 0.912, 0.895, and 0.897, respectively. The other three best-performed models were Ridge, SVR, and RF. Therefore, LGBM, Ridge, SVR, and RF were selected as base-learners.

Table 7 MSE (°C) of the 11 candidate models obtained by repeated K-fold CV (selected models are bolded)

The selected models can characterize the base-learners that should possess diversity. The radial basis function (RBF) kernel was proved efficient in prior meteorological investigations of SVR application [16]. For the maximum learning problems, RF has approximately the same error rate as the other methods and is less prone to overfitting [8]. LGBM offers effective parallel training with the advantages of a quick training speed, low memory consumption, and the capacity to process massive volumes of data, partially addressing the drawbacks of conventional models [15]. Ridge performs well when a data set has multicollinearity (correlations between predictor variables).

4.2.2 Hyper-parameter optimization

Since the performance of the base-learners is affected by a number of hyper-parameters, the grid search CV was used to find the optimal combination of hyper-parameters within the allowed range based on the results of prior studies [25]. It used negative mean squared error (NMSE) and tenfold CV to rank the effectiveness of the many permutations of the model's hyper-parameters that were provided. Then, the optimized models were applied as base-learners of the stacking model. The gird search values used for each base model are presented in Table 8.

Table 8 Hyper-parameters of the base-learners and range of grid search values used in the study

4.2.3 Stacking model implementation

The proposed model implemented the stacking regression with fivefold CV, which used the concept of out-of-fold predictions to prepare the input data for the level-1 meta-learner. The training dataset was divided into five folds. In five successive rounds, four folds were used to fit the level-0 base-learners. In each iteration the optimized level-0 models were applied to the remaining one subset. The resulting predictions were then stacked and provided to the level-1 meta-learner. Figure 5 shows the fivefold CV for training and testing set division to fit and evaluate the stacking model.

Fig. 5
figure 5

Fivefold CV for training and testing set division to fit and evaluate the stacking model

5 Result and discussion

5.1 Inter-comparison of model performances

The forecasting was evaluated and compared to that of the optimized base models in terms of MSE, MAE, MAPE, and R2 for the three target variables. In addition, we performed an experimental analysis on the feature engineering statistical schemes—lag-based, rolling window, and EWMA. The experimental outcomes for each variable for the training and testing phases are presented in Tables 9, 10, 11.

Table 9 Evaluation results of forecasting daily average temperature, maximum temperature and minimum temperature using lag-based scheme (the best results are boldfaced)
Table 10 Evaluation results of forecasting daily temperature, maximum temperature and minimum temperature using rolling window scheme (the best results are boldfaced)
Table 11 Evaluation results of forecasting daily temperature, maximum temperature and minimum temperature using EWMA scheme (the best results are boldfaced)

All models demonstrate satisfactory performance in predicting average, maximum, and minimum temperatures during the training phase. Notably, the RF model outperforms others in predicting all three temperature variables (average, maximum, and minimum). On the contrary, the performance of the Ridge model was inferior compared to the other models. For average temperature (T) prediction during the training phase, the RF model exhibits the best performance, with MSE values of 0.546, 0.568, and 0.597 for lag-based, rolling window, and EWMA schemes, respectively. The RF model also yields MAE values ranging from 0.572 to 0.591, MAPE values ranging from 2.212 to 2.231%, and R2 values ranging from 0.938 to 0.943. Following closely is the stacking model, with MSE ranging from 0.686 to 0.715, MAE ranging from 0.623 to 0.635, MAPE ranging from 2.421 to 2.471, and R2 ranging from 0.925 to 0.928. In contrast, Ridge models exhibit the least favorable performance, with MSE ranging from 0.909 to 0.912, MAE ranging from 0.713 to 0.715, MAPE ranging from 2.784 to 2.792, and R2 values of 0.905.

In the case of forecasting maximum temperature (Tmax), the RF model outperforms the others, closely followed by the stacking model in most instances. The RF model achieves MSE values ranging from 1.074 to 1.229, mean absolute error (MAE) values ranging from 0.769 to 0.805, MAPE values ranging from 2.533% to 2.658%, and R2 values ranging from 0.829 to 0.805. For minimum temperature (Tmin) prediction, the RF model achieves MSE values ranging from 0.425 to 0.437, MAE values ranging from 0.494 to 0.497, MAPE values ranging from 2.281% to 2.295%, and a R2 value of 0.972. The RF model's performance is followed by LGBM and then the stacking model across all three statistical schemes. Additionally, in the training phase, the lag-based scheme tends to enhance the accuracy of predictions for T, Tmax, and Tmin for all models, with the exception of LGBM’s performance in forecasting Tmax using the EWMA scheme.

All models demonstrate strong predictive capabilities, particularly in forecasting T and Tmin during the training phase. However, it’s important to emphasize that assessing model performance based on the testing dataset is of paramount importance. In the training phase, models are exposed to complete data, including input features and target values, which can potentially lead to overfitting. Therefore, excluding the evaluation of models in the testing phase could yield misleading results. It is worth noting that during testing, models only receive input features, making the forecasting accuracy more dependable compared to the training phase.

In the testing phase, the stacking model outperforms all other models by a certain margin, as measured by a variety of performance metrics applied to all target variables. SVR stands out as the poorest performer across all three temperature forecasting schemes. In forecasting the average temperature, the stacking model has a reduced MSE range between 0.749 and 0.755 °C. The MAE values of the stacking model were 0.656 °C–0.662 °C, whereas MAPE values were 2.538%–2.560%. For the stacking model, the R-squared (R2) value is 0.925 when using the EWMA scheme, while it stands at 0.924 for both the lag and rolling window schemes. Based on the result, the LGBM model comes in second, followed by the RF model. The performance of SVR is comparatively worse than others in respect of all evaluation metrics, with MSE ranging from 0.827 to 0.887, MAE ranging from 0.703 to 0.717, MAPE ranging from 2.700 to 2.791%, and R2 ranging from 0.911 to 0.917.

In the context of predicting Tmax, the stacking model exhibits MSE values between 1.839 and 1.859, MAE values between 1.022 and 1.025, mean absolute percentage error (MAPE) values from 3.381 to 3.394%, and R2 values from 0.733 to 0.736. For Tmin, all models perform adequately, with the stacking model displaying superior performance. It attains an MSE value of 0.0535 and a R2 value of 0.965 across all schemes. Additionally, the model achieves MAE values ranging from 0.536 to 0.537 and MAPE values ranging from 2.510% to 2.515%.

However, the RF model outperforms other models in the training phase (Tables 9, 10, 11). This indicates a tendency towards overfitting. It means that the RF model captures noise or random variations specific to the training data, which may not translate well to new, unseen data. On the contrary, the stacking model exhibits a minimal disparity between its performance in the training and testing phases. This indicates that the model was adeptly shielded against overfitting. Stacking models, by design, possess the potential to generalize more effectively compared to individual models. In this framework, a meta-learner is integrated to learn how to combine the predictions of base models, ultimately enhancing overall performance on unseen data. In terms of forecasting T, the stacking model demonstrated MSE improvement rates of 13.2%, 8%, and 7.8% across the three schemes compared to SVR. Additionally, the MAE improvement rates stood at 5.5%, 4.8%, and 4.7%. When it comes to forecasting Tmax, the MSE improvement rates ranged from 19.6 to 36%, and the MAE increment rates ranged from 4.9 to 11%. Similarly, for Tmin, the MSE improvement rates varied from 16.4 to 25.8%, and the MAE increment rates ranged from 7.7 to 12.1%.

To visually represent the predictions made by the algorithm, Fig. 6 displays both the observed data from BMD and the predicted values generated by the stacking model for the most recent 365 samples. Subgroups (a), (b), and (c) correspond to the three schemes: lag-based, rolling window, and EWMA. The graphs clearly illustrate the stacking model's accurate portrayal of time-series variations when compared to the observed data. This reaffirms the model’s capacity to generalize the actual temperature profile. Additionally, Figure (b) highlights that when forecasting maximum temperature, the model’s performance is relatively lower. This could be attributed to potential challenges in capturing more pronounced temporal trends or patterns specific to maximum temperature data. From Figs. 6a–c, it can be seen that the three statistical schemes exhibit no significant differences in their performance when forecasting the three temperatures (T, Tmax, and Tmin). A comparison of the existing machine learning and other approaches to the prediction of temperature is summarized in Table 12.

Fig. 6
figure 6

Forecasting of temperature, maximum temperature, and minimum temperature using stacking model for three feature engineering statistical schemes: a lag-based, b rolling window, and c EWMA

Table 12 Performance comparison of the proposed model with the existing models

The scatter plots of the stacking model and base models presented in Fig. 7 with sub-groups (a), (b), and (c) for three different schemes. It revealed that the maximum clustered points were close to the diagonal line in case of average temperature and minimum temperature forecasting which confirms that the prediction results of these two weather parameters were highly coincidental with the experimental data. The maximum temperature-predicted data points were comparatively scattered. R2 values of the stacking model and the base models mentioned in Table 11 also offered the same indication. The R2 values of the models were 0.948 or higher in minimum temperature forecasting. R2 more than 0.75 was a very good fit, and R2 equivalent 0.64 to 0.74 was a good fit [6]. All ML methods perform well with accuracy (more than 68.2%). Among them, the stacking model has the best performance with accuracy more than 96.5%. There was no significant difference in evaluation metrics with respect to three feature engineering schemes.

Fig. 7
figure 7

Scatter plots showing the association between actual value and predicted value of the target variables obtained by the stacking model for three feature engineering statistical schemes: a lag-based, b rolling window, and c EWMA

Figure 8 visually represents the prediction errors during the testing phase using a probability density curve. This curve is constructed using the Gaussian kernel density function, ensuring that the errors follow a normal distribution. For each target variable, the graphic depicted a distinct error distribution over the range of schemes and models. The figure demonstrated that the probability density curve derived using the stacking model was a superior match, which was also verified by the standard deviation values provided in Table 13. The stacking model consistently exhibits a lower standard deviation of predicted data compared to the other models in all scenarios. Specifically, for the EWMA scheme, the standard deviations are 0.866, 1.355, and 0.731 for T, Tmax, and Tmin, respectively.

Fig. 8
figure 8

Probability density functions showing the prediction error of the target variables obtained by the stacking model for three feature engineering statistical schemes: a lag-based, b rolling window, and c EWMA

Table 13 Standard deviation of predicted value of the target variables

It is noteworthy to mention that the choice between statistical schemes like EWMA (Exponentially Weighted Moving Average), rolling window, and lag values depends on the specific characteristics of the data. Here, the EWMA approach considerably enhances the predictive capability of the stacking model, particularly in the forecasting of T and Tmin. Optimal accuracy is achieved in forecasting the Tmax when the rolling window scheme is employed. It can be said that the rolling window and EWMA schemes might be more adaptable to changes or variations in the data patterns. They may have a smoother response to shifts in the underlying trends, which can be beneficial when dealing with unseen data.

5.2 Computation time analysis

As depicted in Table 14, the Ridge model exhibited the shortest computational time. However, the RF model, owing to its multitude of finely tuned hyperparameters, displayed a moderately lower level of prediction accuracy. In contrast, the LGBM model demonstrated significantly shorter execution times compared to its counterparts. The stacking model, a key recommendation of this research, necessitated a relatively longer computational time. Nonetheless, this increased duration was accompanied by the model achieving the highest level of performance among all evaluated methods. This highlights the trade-off between computational efficiency and predictive accuracy, underscoring the stacking model's potential as a robust forecasting tool.

Table 14 Time (second) for estimation of the models (the least estimation time is boldfaced)

5.3 Statistical test analysis

The non-parametric Friedman test has been conducted due to the shortage of comparison methods. The Friedman test, a two-way statistical assessment of rank variance, assumes that all machine learning algorithms perform equally under the null hypothesis (H0) in relation to performance metrics like MSE, MAE, MAPE, and R2. Conversely, the alternative hypothesis (HA), suggests that at least one approach exhibits a notable difference in performance. Friedman test results on the performance metrics are given in Table 15. Based on the outcomes presented in Table 10, the Friedman test dismisses the null hypothesis for all temperature conditions. This is indicated by the p-values being below the designated significance level (α = 0.05) and the FF values surpassing the critical value of 19.675. Consequently, the null hypothesis of the Friedman test is rejected, providing evidence of a substantial distinction between the models under comparison. As the Friedman test is significant, the average ranks are determined and presented in Table 16.

Table 15 Results of Friedman tests
Table 16 Rankings of the algorithms using the Friedman test

The Friedman test, however, was not enough for making such a distinction. Therefore, a Nemenyi test was run to compare the six models’ accuracy levels and determine whether or not there was statistically significant variance. The crucial difference (CD) for the Nemenyi test was 2.176. This means that there was a statistically significant difference in performance across models if the difference between their average rankings was more than 2.176. Figure 9 shows the Nemenyi test's CD diagram for all dependent variables.

Fig. 9
figure 9

CD diagram of the average performance rankings

The diagram illustrates a clear distinction between the six models, categorized based on their average ranking. For temperature prediction, the top-performing methods fall into two groups: stacking, LGBM, and RF form one category, while LR, Ridge, and SVR comprise the other. In terms of maximum temperature forecasting, the superior models are categorized into stacking, RF, and LGBM on the one hand, and LR, Ridge, and SVR on the other. Finally, for minimum temperature prediction, the standout models are categorized into two groups. One category contains stacking, LR, and Ridge, and the other category contains LGBM, RF, and SVR.

5.4 Trend analysis

Table 17 presents the outcomes of the Mann–Kendall (MK) and Sen’s slope tests applied to the daily average, maximum, and minimum temperatures in 2021, based on both observed (BMD) and projected data. The BMD data reveals an upward trend (↑) in average temperatures from February to April (summer), with monthly increments ranging from 0.080 to 0.183 °C per day. Conversely, a downward trend (↓) is observed from November to January (winter), accompanied by a reduction in Sen’s slope ranging from − 0.200 to − 0.107 °C per day. No discernible trend (−) is noted from May to October (autumn). The predicted data, obtained by the stacking model using the EWMA scheme, is capable of mirroring these findings from the BMD data, though there are slight discrepancies, particularly in October. Notable trends in maximum average temperature are noted in February, March, April, and August, with temperature increases of 0.799, 0.200, 0.063, and 0.067 °C per day, respectively. Conversely, a decreasing trend was observed from December to January. The remaining months showed no substantial trends. The stacking model with a rolling window scheme yielded the most favorable outcomes for maximum temperature. The model effectively captured monthly trends comparable to those observed in the BMD data. Additionally, Sen’s slope values are nearly identical in both cases.

Table 17 Mann–Kendall trend analysis results of daily temperature, maximum temperature, and minimum temperature for year 2021

Nevertheless, the minimum average temperature exhibited an upward trend from February to May, with temperature increments ranging from 0.048 to 0.272 °C per day. Likewise, a negative trend was observed, leading to a drop in the minimum temperature from October to January. June to September did not display any notable trends. The projected data generated by the stacking model using the EWMA scheme mirrored the trends observed in the BMD data. Sen's slope values closely align in both cases. The annual trend analysis was conducted, and the findings indicate a rise in minimum temperature throughout the year 2021 in Cox’s Bazar. However, no discernible trends were observed for both average and maximum temperatures.

6 Conclusion

Predicting the weather is a challenging and intricate endeavor. In this study, a perceptible stacking model was implemented to estimate daily air temperature parameters, which include average, maximum, and minimum temperature in Cox’s Bazar. The results showed that the suggested stacking model performed better overall than the other optimal base models across all metrics, including MSE, MAE, MAPE, and R2. In terms of forecasting T, the stacking model demonstrated MSE improvement rates of 13.2%, 8%, and 7.8% across the three schemes compared to SVR. Additionally, the MAE improvement rates stood at 5.5%, 4.8%, and 4.7%. When it comes to forecasting Tmax, the MSE improvement rates ranged from 19.6 to 36%, and the MAE increment rates ranged from 4.9 to 11%. Similarly, for Tmin, the MSE improvement rates varied from 16.4 to 25.8%, and the MAE increment rates ranged from 7.7 to 12.1%. Among various statistical approaches, EWMA and rolling window schemes take the lead compared to the lag-based ones. Furthermore, statistical tests validate the stacking model’s performance. Ultimately, the comprehensive annual trend analysis conducted provides invaluable insights into future trends, offering crucial information on whether they are set to ascend or remain stable. Therefore, the suggested stacking model is sufficient and a reasonable match for the data, which can be effectively integrated into web or mobile applications in various sectors, including agriculture and renewable energy planning.

While this study has shed light on the dynamics of temperature forecasting, it is important to acknowledge its limitations. The stacking model can be computationally expensive as it requires training multiple base models and a meta-model while dealing with large datasets (7,670 samples). Training the base models alongside the meta-model consumes more time compared to training a single model, which can be impractical for applications requiring real-time or time-sensitive responses. The model can be susceptible to overfitting, particularly in situations with limited sample sizes. Furthermore, the reliability of BMD meteorological data is compromised by instrument-related issues, data management errors, transparency deficits, and the potential for errors, inconsistencies, and data gaps, all of which undermine trust in its integrity. Addressing these constraints in future research promises to generate a more robust and reliable model for real-life applications. Future research could delve into more granular data sources, such as high-resolution satellite data, to address the spatial limitations we encountered. Additionally, to generate a temperature prediction challenge that involves forecasts over several time steps and encompasses various locations, we will carry out experiments at numerous diverse sites with the impact of temperature during various seasons. Advanced machine learning and deep learning methods can be adopted, which may help mitigate the uncertainties associated with temperature prediction. Collaborative efforts with meteorological experts could further enhance the accuracy of temperature forecasts.