1 Introduction

At the turn of the twenty-first century, the availability of vast amounts of data, powerful computers equipped with graphics processing units (GPUs), and renewed scholarly interest in emerging methodologies became key factors in the development of machine learning [1]. Although many AI approaches date back to the 1960s and have been studied extensively since then, the last few years are widely regarded as a "golden era" of AI and machine learning, a term that did not gain currency until the 1990s. In recent decades, the volume of available data and the capacity to compute have grown significantly, leading to the current scenario [2]. The integration of machine learning into health, agriculture, and environmental problem-solving empowers professionals in these fields to leverage data-driven insights for better decision-making, improved resource allocation, and enhanced problem-solving capabilities [3,4,5,6]. As technology continues to advance, machine learning will continue to play a pivotal role in addressing complex challenges and driving innovation in these important domains.

Significant progress has been made in machine learning (ML), leading to the development of innovative methods such as autoregressive neural networks, and these developments have begun to have a positive impact on forecasting [7,8,9]. ML-based methods performed well in the past M4 and M5 forecasting competitions: the winning M4 entries were based on neural networks and gradient boosting, and these strategies also outperformed the benchmarks in estimating future uncertainty [10].

Artificial neural networks (ANN) have been applied to solar radiation (SR) prediction [11]. The best SR ANN model uses backpropagation (BP) with Malaysian meteorological data and is trained with the Levenberg-Marquardt, scaled conjugate gradient (SCG), and Bayesian regularization (BR) algorithms. The BP model built with weather data is the most accurate, with wind-related inputs having a negligible impact on SR prediction compared with temperature and relative humidity. Overall, the BR-trained ANN models outperformed the others.

In deep learning (DL) [12], an LSTM-MVO hybrid intelligence model was created to forecast and assess air pollution from a combined-cycle power plant (CCPP). The approach uses a long short-term memory model to predict CO2 and SO2 emissions, with the multi-verse optimizer (MVO) improving the LSTM's accuracy. The model was tested on data from an Iranian power plant; the May-September 2019 data include wind speed, air temperature, NO2, and SO2.

Zhao et al. [13] use deep learning to investigate how lockdowns affected the air quality index (AQI). Social and spatiotemporal effects are considered alongside historical pollutant concentrations and climate. Their SAC approach, which combines temporal and spatial autocorrelation, explicitly accounts for past observations and nearby cities. The deep learning analysis indicated that the lockdowns reduced AQI by 25.88 in Wuhan and 20.47 in Shanghai, and that accounting for these effects lowered prediction errors, improving AQI projections by 47% for Wuhan and 67% for Shanghai.

Integrated AQI forecasting models improve urban air pollution control, public health infrastructure, and residents' travel planning. Such integrated models predict decomposed frequency subsequences: ELM and WOA-LSTM predict the different high-low and trend sequences separately. On Beijing data, the decomposition improves RMSE, MAPE, and PA by 8.55%, 10.36%, and 6.1%, respectively [14].

Azhari et al. [15] cover the most significant advancements in the field from 2011 through 2021. They examined 155 studies after a thorough search of major scientific databases, characterizing them by machine learning method, assessment metrics, predictor factors, predicted values, and geographic distribution. A satellite-based study [16] analyzes India's COVID-19 lockdowns and the associated AQI changes, covering cutting-edge statistical and deep learning methods for short-term AQI forecasting.

To prevent overfitting and the local-optima trap, Aarthi et al. [17] suggest balanced spider monkey optimization (BSMO) for feature selection. The Central Pollution Control Board (CPCB) provided air quality data for four Indian cities: Bangalore, Chennai, Hyderabad, and Cochin. Missing data are handled with min-max normalization, and a CNN deepens the input dataset. Based on the balancing factor, BSMO selects pertinent features for a bi-directional long short-term memory (Bi-LSTM) model, which then predicts the air quality time series of the four cities.

An IVLSTM-MCMR air quality prediction model is presented as a novel approach [18]. The model comprises IVLSTM and MCMR modules. The IVLSTM module strengthens the inner structure of the VLSTM and reduces the parameters that slow convergence, and a novel historical-knowledge technique guarantees consistency during training. The MCMR module uses a multichannel data input model (MC) with improved linear-similarity dynamic time warping to choose the IVLSTM input data, and a multi-route output model (MR) outputs the results of several target stations with different attributes via separate routes to incorporate the MC findings.

Deep learning and machine learning are popular because they excel at capturing and computing complex dependencies in multidimensional data better than approaches based on domain expertise alone. Gated recurrent units and an attention temporal graph convolutional network are used in [19]. Several studies use support vector machines and neural networks to predict air quality index (AQI) contamination from CPCB pollution data; the suggested ML model predicts and compares Delhi AQI data [20]. Duan et al. [21] investigate Chongqing weather data, split the sample into training and test sets, and apply the wavelet transform to verify a Naive Bayes model. They then compare Naive Bayes, SVM, XGBoost, bagging, and random forest, finding that the Naive Bayes model reliably assesses Chongqing's city air quality.

However, few existing works can reliably forecast such data, especially the AQI. Therefore, this study developed an MLP-LSTM hybrid model to contribute to AQI forecasting, and its predictive ability was investigated within the context of artificial intelligence. The main challenge with these types of data is implementing the preprocessing needed to make the data trainable. The most significant contributions of this article are as follows:

  1. Preprocessed the dataset to speed up the model's training and testing processes.

  2. Developed and built the MLP-LSTM hybrid model.

  3. Applied the forecasting prediction model to the selected dataset.

  4. Examined the efficiency of the proposed model and compared its findings with earlier research.

The remainder of this paper is organized as follows. The related work section briefly reviews pertinent past research. The methodology section then describes the research approach, the characteristics of the dataset, and the proposed model. The outcomes of the proposed model are evaluated in Sect. 6. Finally, the results of our model are compared with those of existing models and are found to be superior.

2 Related Work

Asghari et al. [22] used a hybrid method to estimate PM10 air pollution in Tehran, combining data from the Aghdasiyeh and Mehrabad weather stations. The method employed an artificial neural network with backpropagation (BP) and 11 inputs for daily PM10 prediction, together with a hybrid BP-GA approach combining a genetic algorithm (GA) with BP. Compared with a basic BP neural network, BP-GA achieved a higher R2 value of 0.55, indicating better accuracy. However, the model's reliability decreased over longer time frames because long-term data fluctuations affected network performance; further improvements are necessary to address these limitations and enhance the model's effectiveness for longer-term predictions. Zhao et al. [23] established STCNN-LSTM, a spatiotemporal collaborative prediction model for regional air quality that combines CNN and LSTM for improved accuracy. The model uses a Relevance Data Cube to analyze and visualize air quality dimensions and achieves an R2 value of 0.70, though potential limitations include model complexity, challenges in data preprocessing, and the need for further validation with diverse datasets. Pang et al. [24] compared three machine learning techniques for PM2.5 concentration forecasting (MART, DFNN, and LSTM); the LSTM model exhibited superior performance, effectively capturing temporal relationships in the data with an RMSE of 8.91 µg m−3 and an MAE of 6.21 µg m−3. However, the model only predicted 75% of pollution levels and explained 80% of the variability (R2 = 0.8) in PM2.5 concentrations, so while the LSTM approach shows promise, there may still be room for improvement in accurately forecasting and reducing air pollution with machine learning techniques.

Zhang et al. [25] compared linear regression, LSTM-FCN, and LightGBM for correcting the GRAPES-3 km model's temperature forecast in Shaanxi, China, evaluating forecast accuracy, RMSE, MAE, and R2. RMSE and MAE were reduced by 33% and 32%, respectively, and R2 improved by 40%. All three machine learning techniques exceeded 78% accuracy, with LightGBM reaching 84%. Kardhana et al. [26] used Sadewa data and an LSTM-RNN to estimate the Katulampa Barrage water level, testing spatial extents of 4×4, 8×8, 16×16, and 28×28 Sadewa cells (0.05 degrees each) covering Katulampa, the largest of which crosses Java north-to-south. The LSTM-RNN accurately predicted the water level: using the 4×4 and 8×8 extents with recurring t-24 h data, 24-h model predictions kept R2 above 0.80. Won et al. [27] used model-driven methods to develop a physical model with interconnected inland river and flood control systems. The rainfall-runoff model was calibrated against gauging stations and pump stations in August 2020, and flood alerts were examined using model-driven rainfall scenarios. Urban flood forecasting and warning systems were built with ANN, LSTM, stacked LSTM, and bidirectional LSTM; a 30-min water level forecast from the bidirectional LSTM achieved an R2 of 0.9.

Ouma et al. [28] studied three sub-basins, each represented by a discharge station. Each LSTM and WNN hidden layer comprises 30 neurons. The LSTM and WNN accurately predicted basin runoff with R2 values of 0.89 and 0.88, respectively; the MAE for the expected monthly rainfall trend was 9-11 mm and the RMSE 15-21 mm. Both models reached their lowest RMSE in about the same number of epochs, with the WNN taking somewhat longer. Tuerxun et al. [29] evaluated the proposed MBES-LSTM model for EMD-based wind power forecasting on several benchmark functions and used it to estimate the LSTM's parameters. A revised bald eagle search method tuned the LSTM hyper-parameters, optimizing the relevant parameters and yielding improvements of 0.09, 0.07, and 0.01 in RMSE, MAE, and R2, respectively. Shen et al. [30] proposed a framework to forecast daily streamflow in China's Hanjiang River Basin. During the validation period, the integrated framework/DBN model had an NSE of 0.91 and an R2 of 0.93, providing higher streamflow prediction accuracy than the single data-driven model; its peak flood error was 4.6% lower than the standalone DBN model's.

Mani et al. [31] predicted AQI with linear regression and time series methods. A multiple linear regression (MLR) model built on CPCB Chennai data predicts AQI, with sensors used to verify the model parameters, and RMSE, COD, and MAE used for validation; with k-fold cross-validation, the MLR model achieved 92%. An ARIMA model forecasts AQI, using the untimed data for parameter evaluation, and produced a 15-day AQI forecast at 95% confidence, with 80% of tests passing. Liu et al. [32] created an ST-CCN-IAQI model that accounts for weather and multi-source air pollution, using dilated convolution and stacked temporal attention to capture time-dependent network features, with Bayesian optimization to fine-tune the hyper-parameters. Against Shanghai air-monitoring baselines (AR, MA, ARMA, ANN, SVR, GRU, LSTM, and ST-GCN), the ST-CCN-IAQI's RMSE and MAE decreased by 24.95% and 16.87% for a single station; across all nine stations, RMSE and MAE were 9.84 and 7.52, and R2 was 0.90. Rahimpour et al. [33] predicted Orumiyeh's one-day-ahead AQI with HSD and HTPD models. For the HSD models, CEEMDAN was combined with GRNN and ELM on AQI data; the CEEMDAN IMFs were divided into nine variational modes before forecasting IMF1, and GRNN- and ELM-based HTPD variants predicted the IMFs. CEEMDAN-ELM and CEEMDAN-GRNN produced the HSD AQI predictions, while the CEEMDAN-VMD-GRNN HTPD model achieved R2 = 0.74, RMSE = 5.45, and MAE = 3.87; overall, the HSD models predicted AQI worse than the HTPD models. Liu et al. [34] projected Zhangdian District air quality using a random forest model with real-time meteorological and emissions data; its MAE outperforms LSSVM, DT, and BP neural networks. Model inversion reduces waste gas emissions using weather and air quality data, suggesting that 2019 industrial waste gas emissions should be limited to 5687.5 million cubic meters daily. This methodology reduces air pollution risk by adjusting enterprise production capacity based on weather forecast inversion.

Fan et al. [35] address O3 and PM2.5 forecasts from a WRF-based dataset for the Pacific Northwest (PNW), where the Kennewick site issues erroneous O3 warnings. Their unified ML2 model reached R2 = 0.79 for low-O3 conditions and reduced the normalized mean bias (NMB) by 7.6%, while still capturing AQI and high-O3 events. Both ML1 and ML2 predict low PM2.5, and in fire and cold episodes ML2 captured high PM2.5 better with lower NMB; by contrast, AIRPACT roughly doubles PNW wildfire PM2.5 estimates. Overall, the ML models estimate O3 and PM2.5 at northwest stations. A comparative analysis [36] was conducted of deep learning models (LSTM, GRU) and a statistical model for forecasting air pollutants (NO2, O3, SO2, PM2.5, PM10) using a publicly available dataset from a monitoring station in Belfast, Northern Ireland. The deep learning models consistently outperformed the statistical model, achieving the lowest RMSE (0.59) and the highest R2 score (0.86).

In Table 1, a summarized version of the description of the related work is presented.

Table 1 Summarized description of related work

3 Methodology

In the data collection stage, this study describes the dataset instances and data sources in considerable depth. After that, the unnecessary-data-removal stage was completed, which included removing data columns with unnecessary values and deleting any null values present. Pair plots, density maps, and correlation analysis are provided in the data visualization section so that the dataset can be better understood, and a cluster analysis was performed to examine the structure of the data. The dataset was then prepared for the simulation in the data preprocessing step, which included integer encoding, floating-point conversion, the train-test split, min-max scaling, and reshaping the data into 3D. Finally, a hybrid MLP-LSTM technique was developed to estimate the air quality index and train the model, followed by an ablation study to find the best-fitting configuration. After this, the findings were assessed by making predictions and computing the MSE, MAE, RMSE, MAPE, and R2 scores, both overall and for six important features. Figure 1 illustrates the full process of the proposed Hybrid MLP-LSTM model, with each phase represented by a block.

Fig. 1
figure 1

Illustrates the full process of the proposed Hybrid MLP-LSTM model, with each phase represented by a block. Block A dataset instances and data source, B unnecessary data removal, C data visualization, D data preprocessing, E cluster analysis, F proposed model, G train model, H result analysis

3.1 Dataset Description

The Beijing Multi-Site Air-Quality (BMSAQ) Dataset [37] is publicly accessible and was acquired from the UC Irvine Machine Learning Repository. The hourly air-contaminant data were collected at 12 nationally managed monitoring locations and gathered by the Beijing Municipal Environmental Monitoring Center. The meteorological data at each air quality monitoring site are matched with those of the nearest weather station operated by the China Meteorological Administration. The data span March 1, 2013 to February 28, 2017. Six major air pollutants and six relevant climatic variables are recorded hourly at the various sites throughout Beijing. A summarized description of the dataset is presented in Table 2.

Table 2 Summarized description of six important features

3.2 Unnecessary Data Removal

To improve the accuracy of the air quality index prediction algorithm, certain columns were removed from the dataset. The wind direction and station columns were excluded based on their limited impact on the prediction, as determined by domain knowledge and previous research findings. Additionally, the date column was removed since there were already date integer columns available, and the AQI column was eliminated due to duplicate numeric AQI columns. These column removals helped streamline the dataset, eliminating redundancy and allowing the algorithm to focus on the most relevant features for accurate air quality index prediction.

Furthermore, any instances with null values were also removed to ensure the integrity and reliability of the data used for prediction. The dataset after the removal of unnecessary data is shown in Fig. 2, which plots the density values of the remaining attributes in different colors; the x-axis represents the instance index and the y-axis the attribute values.

Fig. 2
figure 2

Dataset after unnecessary data removal
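To make this step concrete, the following is a minimal sketch of the clean-up, assuming pandas and one of the UCI per-site CSV files with the UCI column names ('wd' for wind direction, 'station'); the 'date' and duplicate 'AQI' column names are hypothetical stand-ins for the redundant columns described above.

```python
import pandas as pd

# Load one of the 12 per-site files (file name assumed from the UCI repository).
df = pd.read_csv("PRSA_Data_Aotizhongxin_20130301-20170228.csv")

# Drop the wind direction and station columns (limited impact on prediction),
# plus the redundant date and duplicate AQI columns described in the text.
df = df.drop(columns=["wd", "station", "date", "AQI"], errors="ignore")

# Remove any instances containing null values.
df = df.dropna().reset_index(drop=True)
print(df.shape)
```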

3.3 Data Visualization

The instances and characteristics of the dataset are shown through the dataset correlation, the pair plot of the dataset, and the density of each attribute, so that they may be better understood. The correlation provides a clearer picture of the linear relationships among the continuous variables; pair plots help identify a suitable set of features by explaining the relationships among the variables; and density plots allow examination of how the dataset's variables are distributed. Figure 3 shows the (a) dataset correlation, (b) dataset pair plot, and (c) data density.

Fig. 3
figure 3figure 3

a Dataset correlation b dataset pair plot c data density

To acquire insights and comprehend the relationships and distribution of variables inside a dataset, several data analyses and visualizations are utilized.

Dataset correlation, as shown in Fig. 3a, measures the statistical relationship between variables: how closely related two variables are and the direction of their relationship. Correlation values range from −1 to +1, where −1 represents a strong negative correlation, +1 a strong positive correlation, and 0 no correlation. In the figure, it can be observed that PM2.5 has the strongest positive correlation among all other attributes.

Pair plots, depicted in Fig. 3b, provide a visual representation of the relationship between pairs of variables. Each pair is represented as a scatter plot, with one variable’s values on the x-axis and the other variable’s values on the y-axis. This graphical representation helps identify patterns, trends, and correlations between variables.

Dataset density, as illustrated in Fig. 3c, showcases the distribution of data values across the dataset. It reveals how the values are spread or concentrated within a certain range. Density plots, often depicted as smooth curves, visualize the shape, peaks, and tails of the distribution, providing insights into the central tendency and spread of the data.
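A minimal sketch of these three visualizations follows, assuming the cleaned DataFrame `df` from the previous step and seaborn/matplotlib; the plotted column subset and sample size are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# (a) Correlation heatmap: pairwise Pearson correlations in [-1, +1].
sns.heatmap(df.corr(numeric_only=True), cmap="coolwarm")
plt.title("Dataset correlation")
plt.show()

# (b) Pair plot on a sample of rows (the full hourly data would be slow to draw).
sns.pairplot(df[["PM2.5", "PM10", "SO2", "NO2"]].sample(1000, random_state=0))
plt.show()

# (c) Density plots showing the distribution shape of each variable.
df[["PM2.5", "PM10", "SO2", "NO2"]].plot(kind="density", subplots=True,
                                         layout=(2, 2), sharex=False)
plt.show()
```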

3.4 Cluster Analysis

In this study, two different clustering algorithms, namely K-means and Agglomerative Clustering, were employed to determine the optimal number of clusters in the dataset. For K-means, the optimal number of clusters was identified as 4, with a corresponding silhouette score of 902,920,072,962.847. On the other hand, Agglomerative Clustering also yielded 4 as the optimal number of clusters, with a silhouette score of 1,124,108,699,292.519. These results suggest that both algorithms agree on the optimal number of clusters and provide relatively high silhouette scores, indicating well-separated and compact clusters. Figure 4 shows the (a) K-means clustering elbow method, (b) Agglomerative clustering elbow method, and (c) the optimal number of clusters.

Fig. 4
figure 4

a K-means clustering elbow method, b agglomerative clustering elbow method, and c the optimal number of clusters
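A sketch of the cluster analysis under stated assumptions (scikit-learn, a subsample for tractability) is shown below. Note that scikit-learn's silhouette_score is bounded in [−1, 1], so the much larger magnitudes reported above presumably come from a different score, such as the elbow-method inertia computed here.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

X = df.sample(5000, random_state=0).to_numpy()  # subsample size is an assumption

# Elbow method: inertia (within-cluster sum of squares) versus k.
ks = range(2, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()

# Scores for the reported optimum k = 4 under both algorithms.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
ag = AgglomerativeClustering(n_clusters=4).fit_predict(X)
print(silhouette_score(X, km), silhouette_score(X, ag))
```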

3.5 Data Pre-processing

3.5.1 Integer Encoding

Integer encoding replaced the categorical variable originally included in the dataset with numbers, assigned on a purely arbitrary basis. This enables the models to be specified more expediently. Table 3 shows a short inspection of the integer encoding output.

Table 3 Integer encoding output
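For instance, a minimal integer-encoding sketch with pandas, assuming any remaining categorical columns are of object dtype:

```python
import pandas as pd

# Replace each category with an arbitrary integer code, column by column.
for col in df.select_dtypes(include="object").columns:
    df[col], _ = pd.factorize(df[col])
```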

3.5.2 Floating Conversion

Floating-point conversion makes the model's processing times more reasonable for data such as ours, which contain both very large numbers and very small values. Table 4 shows a quick look at the floating conversion output.

Table 4 Floating conversion output
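A one-step sketch of this conversion, assuming all remaining columns are numeric after integer encoding:

```python
import numpy as np

# Cast everything to 32-bit floats: one compact dtype for both the very
# large and the very small values, which speeds up training.
df = df.astype(np.float32)
```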

3.5.3 Min–Max Scaling

The Min-Max Scaler was used to rescale the dataset to a given range: it adjusts each feature's values to fit within the specified range without affecting the overall shape of the distribution they were drawn from. Table 5 shows a quick scan of the min-max scaling output.

Table 5 Min–max scaling output
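A sketch with scikit-learn's MinMaxScaler, scaling to the default [0, 1] range, follows. Note that the section order implies fitting on the full dataset before splitting; fitting on the training split alone would be the leakage-free variant.

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale each feature to [0, 1]; the shape of each distribution is
# preserved, only its scale and offset change.
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(df.to_numpy())
```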

3.6 Train Test Split

After the normalization process, two arrays are created to hold the test dataset and the training dataset separately. This is done to ensure compatibility with scikit-learn libraries, which typically expect input data in the form of arrays rather than lists. Converting the data into arrays enables effective utilization of these libraries. To train the model, 85% of the data is used for training purposes, while the remaining 15% is used for testing. This split helps evaluate the model's performance on unseen data and assess its generalization capabilities. The train-test split curve, as depicted in Fig. 5, visualizes the distribution of the data between the train and test sets, and shows the proportion of data allocated for each.

Fig. 5
figure 5

Train and test split curve
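A sketch of the split and the 3D reshaping mentioned in Sect. 3 is given below, assuming the last scaled column is the prediction target and shuffle=False to respect the hourly ordering (both assumptions).

```python
from sklearn.model_selection import train_test_split

X = scaled[:, :-1]  # input features (assumed layout)
y = scaled[:, -1]   # target column (assumed layout)

# 85% training / 15% testing, preserving temporal order.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, shuffle=False)

# Reshape to (samples, timesteps, features) with one timestep, as the
# LSTM layers expect 3D input.
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
```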

4 Evaluation Metrics

This section gives an overview of the performance criteria applied in the proposed study to gauge how well the prediction models function, including the MSE, MAE, RMSE, MAPE, and R2 scores. The precision of the forecasts can be assessed on several such metrics: the smaller the error values, the more accurate the forecasts. Accuracy is therefore measured by examining the discrepancy between the expected output and the actual result. Each accuracy indicator can produce different results when testing and comparing various prediction models and strategies on the same dataset, which makes performance comparisons across metrics less predictable.

4.1 MSE

The mean squared error (MSE) of an estimator measures the average squared difference between estimated and actual values; it is the average of the squared errors. It is a risk function corresponding to the expected squared error loss. It is almost always strictly positive rather than zero, because of randomness or because the estimator fails to account for relevant information.

4.2 MAE

Mean absolute error is used to evaluate models, especially regression models. The MAE of a model on a test set is the average of the absolute values of the individual prediction errors over all test instances.

4.3 RMSE

The root mean square error (RMSE) represents the standard deviation of the residuals. Residuals measure the distance of the data points from the regression line, and standard deviation quantifies their dispersion. RMSE thus measures the difference between predicted and actual values and shows how closely the data fit the best-fit line.

4.4 MAPE

The mean absolute percentage error (MAPE) evaluates how precise a forecasting system is. It takes the absolute difference between the actual and predicted values, divides by the actual values, and averages, reporting accuracy as a percentage. MAPE is most often used for forecast errors and is easy to read because it is expressed in percentage terms.

4.5 R 2 Score

The R2 score, also called the coefficient of determination, evaluates a regression-based machine learning model. It measures the proportion of the variance in the actual dataset samples that is explained by the model's predictions.

To contrast the results with other methods, the performance evaluation metrics mean squared error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), R2 score, and root mean square error (RMSE) were used as follows.

$$MSE=\frac{1}{n}\sum_{t=1}^{n}{({y}_{t}-{\widehat{y}}_{t})}^{2}$$
(1)
$$MAE= \frac{1}{n}\sum_{t=1}^{n}\left|{y}_{t}-{\widehat{y}}_{t}\right|$$
(2)
$$RMSE= \sqrt{\frac{1}{n}\sum_{t=1}^{n}{({y}_{t}-{\widehat{y}}_{t})}^{2}}$$
(3)
$$MAPE= \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{{y}_{t}-{\widehat{y}}_{t}}{{y}_{t}}\right|$$
(4)
$${R}^{2}=1-\frac{{\sum }_{t=1}^{n}{({y}_{t}-{\widehat{y}}_{t})}^{2}}{{\sum }_{t=1}^{n}{({y}_{t}-{\overline{y} }_{t})}^{2}}$$
(5)

where \(n\) is the number of prediction points, \({y}_{t}\) represents the actual value, \({\widehat{y}}_{t}\) the predicted value, and \({\overline{y} }_{t}\) the average of the actual values [38, 39].
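The five metrics translate directly into code; the sketch below implements Eqs. (1)–(5) with NumPy. Note that Eq. (4) divides by the actual values, so targets near zero inflate MAPE enormously, which is relevant to the very large MAPE values reported in Sect. 5.

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute MSE, MAE, RMSE, MAPE, and R2 as defined in Eqs. (1)-(5)."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)                       # Eq. (1)
    mae = np.mean(np.abs(err))                    # Eq. (2)
    rmse = np.sqrt(mse)                           # Eq. (3)
    mape = 100.0 * np.mean(np.abs(err / y_true))  # Eq. (4), unstable when y_t is near 0
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # Eq. (5)
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}
```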

5 Applied Models

To compare the performance of the proposed hybrid MLP-LSTM model, we applied three different models to the dataset. The MLP and LSTM models were also used separately to further show how well the hybrid MLP-LSTM forecasts the air quality index. Their application and performance are described below.

5.1 MLP Model

The MLP (multilayer perceptron) model, with an architecture consisting of three dense layers (with 10, 5, and 1 units) and a tanh activation function on the intermediate layer, was trained for 10 epochs. Training resulted in a mean squared error (MSE) of 0.00157, indicating a relatively low level of prediction error, and a mean absolute error (MAE) of 0.02782, representing the average absolute difference between the predicted and actual values. The root mean squared error (RMSE) was 42.3931, indicating the standard deviation of the prediction errors. The model achieved an R2 score of 0.41, meaning that it explains 41% of the variance in the target variable. While the model demonstrated some promise, there is considerable room for improvement, particularly in reducing the mean absolute percentage error (MAPE), which measures the percentage difference between the predicted and actual values. Table 6 shows the summarized description of the evaluation metrics of the MLP model.

Table 6 Summarized description of MSE, MAE, RMSE, MAPE, and R2 score
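A minimal Keras sketch of this baseline follows, assuming TensorFlow/Keras and the split from Sect. 3.6; placing tanh on the middle layer and the loss/optimizer/validation choices are assumptions consistent with the text.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

mlp = Sequential([
    Dense(10, input_shape=(X_train.shape[2],)),
    Dense(5, activation="tanh"),  # tanh on the intermediate layer (assumed placement)
    Dense(1),
])
mlp.compile(optimizer="adam", loss="mse", metrics=["mae"])
# The MLP takes 2D input, so the 3D sequences are flattened back to 2D here.
mlp.fit(X_train.reshape(len(X_train), -1), y_train, epochs=10, validation_split=0.1)
```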

5.2 LSTM Model

We applied an LSTM model with 128 units to the dataset to predict the air quality index. The model achieved strong results with an MSE of 0.00018 and an MAE of 0.00875. The MAPE score was extremely large (46,053,752,832.00000), an artifact that typically arises when actual values near zero appear in the MAPE denominator. The root mean squared error was 14.5032, indicating accurate predictions, and the R2 score of 0.92 suggests that the model explains 92% of the variance in the data, a strong fit. Overall, the LSTM model demonstrated good performance in predicting the air quality index. Table 7 shows the summarized description of the evaluation metrics of the LSTM model.

Table 7 Summarized description of MSE, MAE, RMSE, MAPE, and R2 score
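A corresponding sketch of the standalone LSTM baseline; the Dense(1) regression head, loss, and epoch count are assumptions.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

lstm = Sequential([
    LSTM(128, input_shape=(X_train.shape[1], X_train.shape[2])),  # 128 units per the text
    Dense(1),  # regression head (assumed)
])
lstm.compile(optimizer="adam", loss="mse")
lstm.fit(X_train, y_train, epochs=10, validation_split=0.1)
```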

5.3 Proposed Model

We propose an MLP-LSTM hybrid model with a total of six layers: an MLP block and an LSTM block of three layers each. The first layer of the MLP block is dense, with output shape (None, 1, 10) and 190 parameters. Layer 2 is also dense, with output shape (None, 1, 5) and 55 parameters. Layer 3 is dense as well, with output shape (None, 1, 1) and 6 parameters.

The fourth layer, opening the LSTM block, is an LSTM layer with output shape (None, 128) and 66,560 parameters. Layer 5 is a dense layer with output shape (None, 100) and 12,900 parameters. Layer 6, also dense, has output shape (None, 1) and 101 parameters, as shown in Fig. 6.

Fig. 6
figure 6

Architecture of the proposed model. The layers are divided into two sections, where the MLP section is denoted by blue and the LSTM section by orange
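The six layers above translate into the following Keras sketch. Assuming 18 input features and a single timestep, `model.summary()` reproduces exactly the reported output shapes and parameter counts (190, 55, 6, 66,560, 12,900, 101); the tanh activation on the first layer is an assumption.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, LSTM

hybrid = Sequential([
    Dense(10, activation="tanh", input_shape=(1, 18)),  # MLP block -> (None, 1, 10), 190 params
    Dense(5),                                           # (None, 1, 5), 55 params
    Dense(1),                                           # (None, 1, 1), 6 params
    LSTM(128),                                          # LSTM block -> (None, 128), 66,560 params
    Dense(100),                                         # (None, 100), 12,900 params
    Dense(1),                                           # (None, 1), 101 params
])
hybrid.summary()  # verifies the shapes and parameter counts listed above
```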

5.3.1 Train Model

The proposed MLP-LSTM hybrid model was trained with the Adam optimizer, with amsgrad set to False, a learning rate of 0.001, beta 1 of 0.9, and beta 2 of 0.9. The model was trained with a variety of epoch counts to obtain the optimum results. Figure 7 shows the pseudo code.

Fig. 7
figure 7

Pseudo code of proposed model
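In code, the reported training configuration looks as follows; the batch size is an assumption not stated in the text.

```python
from tensorflow.keras.optimizers import Adam

# Adam exactly as reported: lr = 0.001, beta_1 = 0.9, beta_2 = 0.9, amsgrad = False.
opt = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.9, amsgrad=False)
hybrid.compile(optimizer=opt, loss="mse")
history = hybrid.fit(X_train, y_train, epochs=10, batch_size=64,  # batch size assumed
                     validation_data=(X_test, y_test))
```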

5.3.2 Hyperparametric Exploration

During the training phase, the following hyperparameter exploration was carried out. We adjusted hyperparameters such as the learning rate, beta values, decay, momentum, and rho to find the combination that maximizes the model's performance. The Adam optimization algorithm with a learning rate of 0.001, beta 1 of 0.9, beta 2 of 0.9, and amsgrad set to False achieved the highest R2 score of 0.95. The SGD optimization algorithm with a learning rate of 0.01, decay of 1e-5, momentum of 0.9, and Nesterov set to True achieved an R2 score of 0.92. Lastly, the RMSprop optimization algorithm with a learning rate of 0.01 and rho of 0.9 achieved an R2 score of 0.90. Table 8 describes the hyperparameter fine-tuning.

Table 8 Description of hyperparameter fine-tuning for modeling
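A hedged sketch of this sweep follows; the third configuration is written as RMSprop because rho is an RMSprop parameter, and the `decay` argument follows the legacy Keras optimizer API (TensorFlow ≤ 2.10).

```python
from sklearn.metrics import r2_score
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.optimizers import SGD, Adam, RMSprop

def build_hybrid():
    # Fresh copy of the Sect. 5.3 architecture for each optimizer trial.
    return Sequential([
        Dense(10, activation="tanh", input_shape=(1, 18)),
        Dense(5), Dense(1), LSTM(128), Dense(100), Dense(1),
    ])

candidates = {
    "adam": Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.9, amsgrad=False),
    "sgd": SGD(learning_rate=0.01, decay=1e-5, momentum=0.9, nesterov=True),
    "rmsprop": RMSprop(learning_rate=0.01, rho=0.9),
}
for name, optimizer in candidates.items():
    model = build_hybrid()
    model.compile(optimizer=optimizer, loss="mse")
    model.fit(X_train, y_train, epochs=10, verbose=0)
    print(name, r2_score(y_test, model.predict(X_test).ravel()))
```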

Throughout the experimentation process, different epoch sizes were tested to assess their impact on model performance. A maximum R2 score of 0.95 was achieved with 10 epochs, indicating a strong correlation between predicted and actual values. The lowest root mean square error (RMSE) of 11.95 was obtained with 30 epochs, demonstrating reduced average deviation. Mean square error (MSE) remained constant at 0.00016 across different epochs, indicating consistent accuracy. Additionally, the best mean absolute error (MAE) and mean absolute percentage error (MAPE) values of 0.00745 and 0.33, respectively, were achieved with 300 epochs, highlighting improved accuracy in absolute and percentage differences. The hyper-parametric exploration is shown in Table 9.

Table 9 Description of hyperparametric exploration

5.3.3 Prediction

After carrying out the ablation study, we found that the optimal number of epochs for the model is 10. We then made predictions on the test set and computed the following metrics: MSE, MAE, RMSE, MAPE, and R2 score. The actual data and the predictions are presented in Fig. 8.

Fig. 8
figure 8

Prediction and actual data

5.3.4 Proposed Model’s Performance

The obtained results indicate a good fit of the model for the analysis conducted. The mean squared error (MSE) of 0.00016 suggests a favorable fit, while the mean absolute error (MAE) of 0.00746 signifies good accuracy. The root mean square error (RMSE) of 13.45 indicates an overall good fit, and the mean absolute percentage error (MAPE) of 0.42 is favorable. Finally, the coefficient of determination R2 of 0.95 reflects a strong correlation between predicted and actual values. Collectively, these results demonstrate the effectiveness of the model and its ability to accurately represent the analyzed data. Considering execution time as well, the overall performance was best with the 10 epochs described earlier. The MSE, MAE, RMSE, MAPE, and R2 scores are condensed in Table 10, and Fig. 9 shows the evaluation metrics on the x-axis and their values on the y-axis.

Table 10 Summarized description of MSE, MAE, RMSE, MAPE, and R2 score
Fig. 9
figure 9

Model performance according to the evaluation metrics

5.3.5 Feature-Wise Performance

We selected the following six key features: PM2.5, PM10, SO2, NO2, CO, and O3, and generated predictions for each. The actual and predicted values are displayed in Fig. 10 for (a) PM2.5, (b) PM10, (c) SO2, (d) NO2, (e) CO, and (f) O3.

Fig. 10
figure 10figure 10

The actual and prediction values of a PM2.5, b PM10, c SO2, d NO2, e CO, and f O3

6 Result Analysis

6.1 Discussion on Feature-Wise Performance

Table 11 shows that our model achieves its highest R2 score, 0.93, for the PM2.5 feature. The R2 values of the NO2 and CO features are identical at 0.91, while PM10 and SO2 both reach an R2 of 0.88, and O3 has the lowest R2 at 0.85. SO2 achieved the best MAE value of 0.00538, and NO2 obtained the best MAPE value of 0.17. Comparing Tables 10 and 11, our model performed better on the overall forecasting task than on the prediction of the individual variables.

Table 11 Summarized description of MSE, MAE, RMSE, MAPE, and R2 score of the 6 important features

6.2 Applied Model’s Performance Comparison

We compared the efficacy of three distinct models: MLP, LSTM, and the MLP-LSTM hybrid. The MLP model, which consisted of dense layers, obtained an MSE of 0.00157, an MAE of 0.02782, an RMSE of 42.3931, a MAPE of 112,912,818,176.00000, and an R2 value of 0.41. Using long short-term memory units, the LSTM model obtained an MSE of 0.00018, an MAE of 0.00875, an RMSE of 14.5032, a MAPE of 46,053,752,832.00000, and an R2 value of 0.92. However, the Hybrid MLP-LSTM model exhibited the greatest performance. This model incorporated the advantages of the MLP and LSTM architectures, being composed of dense layers and an LSTM layer, and it outperformed the other two models across all performance metrics, with an MSE of 0.00016, an MAE of 0.00746, an RMSE of 13.45, a MAPE of 0.42, and an impressive R2 score of 0.95. The capacity of the Hybrid MLP-LSTM model to capture both local and temporal dependencies in the data contributes to its superior performance: the MLP layers learn nonlinear relationships and feature representations, while the LSTM layer models the sequential nature of the data effectively. This combination improves the accuracy and reliability of predictions. The Hybrid MLP-LSTM model thus provides the greatest performance among the three evaluated models and proves to be a potent and efficient method for predicting the target variable in the provided dataset, with enhanced accuracy and the ability to capture significant temporal patterns.

The Hybrid MLP-LSTM model outperformed the other models in predictive performance for several reasons. First, the MLP component is well suited to capturing non-linear relationships and learning complex feature representations through its dense layers; it effectively extracts pertinent features from the input data, enabling the model to capture intricate patterns and correlations within the dataset. Second, the LSTM component's ability to model sequential dependencies enhances the model's capacity to capture temporal patterns and long-term dependencies: LSTM units are equipped with a memory mechanism that lets them retain and effectively use past information, which is especially crucial for time series data, where the order and sequence of observations play a central role. By combining the strengths of the MLP and LSTM architectures, the hybrid model leverages the MLP's capability for non-linear feature extraction and the LSTM's ability to capture temporal dependencies, learning effectively from both the data's local patterns and its temporal dynamics. Its superior performance results from this enhanced capacity to capture intricate relationships, both within individual observations (local patterns) and across multiple observations over time (temporal patterns), allowing more accurate predictions that take into account both the immediate and historical contexts of the data.

Overall, the ability of the Hybrid MLP-LSTM model to combine non-linear feature extraction with sequential modeling provides a comprehensive and effective method for capturing the intricate patterns and dependencies present in the dataset, resulting in its superior performance compared to the other models. Table 12 shows the summarized description of the applied model’s performance.

Table 12 Summarized description of applied model’s performance

7 Performance Comparison with Existing Works

The MLP-LSTM model stands out in comparison to other models due to its unique combination of multi-layer perceptron (MLP) and long short-term memory (LSTM) neural network architectures. Unlike traditional models, the MLP-LSTM model excels in capturing both nonlinear and temporal dependencies in the data, allowing it to effectively model complex spatiotemporal relationships. This makes it well-suited for forecasting tasks related to air quality.

Compared to the hybrid methods used by Asghari et al. [22] and Zhao et al. [23], the MLP-LSTM model offers a simpler and more streamlined approach by integrating MLP and LSTM architectures within a single model. This avoids the need for separate hybridization techniques such as backpropagation with a genetic algorithm (BP-GA) or combining multiple models like CNN and LSTM. Furthermore, the MLP-LSTM model demonstrates superior performance compared to other machine learning techniques, as seen in the study conducted by Pang et al. [24]: it effectively captures the temporal relationships in the data, resulting in lower root mean square error (RMSE) and mean absolute error (MAE) values for PM2.5 concentration forecasting. When compared to models like linear regression, LSTM-FCN, and LightGBM, as evaluated by Zhang et al. [25], the MLP-LSTM model showcases improved accuracy, as evidenced by reduced RMSE and MAE and improved R2 values. It achieves high forecasting accuracy with an advantage in interpretability and the ability to handle complex temporal patterns.

Traditional models often rely on simplistic linear or statistical methods that struggle to capture complex nonlinear patterns and interactions present in real-world data. They require manual feature engineering, which can be time-consuming and subjective, potentially missing important features. These models may also struggle to handle temporal dependencies and capture long-term patterns effectively, lacking the ability to retain and utilize historical information. Additionally, they face challenges in integrating diverse data types, handling uncertainties, and scaling to large datasets. Moreover, computational constraints can hinder their performance, especially when dealing with complex techniques or real-time forecasting. These limitations collectively contribute to their inability to achieve the required level of performance in forecasting tasks that demand accuracy and adaptability in dynamic and complex environments.

While each model has its strengths and limitations, the MLP-LSTM model’s ability to effectively capture nonlinear and temporal dependencies, its simplicity compared to hybrid methods, and its superior performance compared to other machine learning techniques make it a promising choice for accurate and reliable forecasting in environmental and related domains. Table 13 shows the summarized performance comparison of existing work.

Table 13 Summarized performance comparison with existing works

8 Conclusion

In this study, a hybrid MLP-LSTM model was constructed to produce forecasting predictions, and it was evaluated on the Beijing Multi-Site Air-Quality Data Set. Although the dataset contains multi-site information, the target value defines whether the air quality is safe, moderate, or unhealthy, so the model can target multiple sites according to their air quality. For this specific setting, the model achieved an MSE of 0.00016, an MAE of 0.00746, an RMSE of 13.45, a MAPE of 0.42, and an R2 of 0.95, demonstrating that the model functions as expected. Traditional modeling approaches for forecasting, by contrast, fall short of the requisite level of performance. In this context, future research will focus on feature selection and on suitable strategies for tuning hyper-parameters to achieve the desired results, as well as on automatically reshaping, adding, and removing layers to evolve the neural network architecture. In addition, the MLP-LSTM model will be applied to further air and water quality datasets, so the model could be compatible with many kinds of forecasting tasks.