1 Introduction

Gathering and analyzing data on the circumstances in an urban region or neighborhood allows for what is called “urban profiling”. Using a mix of quantitative data, visualizations, and maps, urban profiling can better represent the requirements, weaknesses, and capabilities of urban systems and the people who live in them. By presenting information in a geographical context, policymakers may better comprehend interdependencies between different sectors, pinpoint regions of a city that are underserviced or vulnerable, and describe possibilities for a positive transformation of metropolitan areas for the benefit of all residents. To guarantee that data collection results in transformational change and builds upon a shared vision for sustainable urban development, urban profiling is often paired with activities aimed at bolstering the skillsets of local actors (Loose and S. M. 2020).

As one of the most important macroeconomic measures, Gross Domestic Product (GDP) is crucial for gauging a country’s or region’s economic progress. Future macroeconomic goals and economic regulatory strategies are heavily influenced by this indicator (Götz & Knetsch 2019), and the per-person economic situation can be gauged by looking at GDP (Sa’adah & Wibowo 2020). As economies develop rapidly, GDP forecasting has become a major area of study. Forecasting growth in a nation’s GDP requires highly involved models. These sometimes include dozens or even hundreds of variables, each of which has at least one parameter that must be fitted to data or assigned based on assumptions. Financial indicators (interest rates, fiscal policies, public debt), socio-economic indicators (employment, labor productivity, population age, schooling, and so on), and global trade indicators (trade openness, raw material prices, exchange rates) are only a few examples of these variables. Complex models with a large number of variables give the impression that all the factors influencing economic development have been accounted for, promising accurate predictions (Tacchella et al. 2018).

GDP forecasting has traditionally depended heavily on statistical analysis derived from economic censuses. These techniques include trend extrapolation, data regression, and time series approaches (De Cola & Mongelli 2018). Extrapolation techniques based on “empirical analysis” are usually time- and labor-consuming, and the findings of economic censuses lag substantially behind. In addition to being influenced by a wide range of exogenous variables, such as the level of economic development, the focus of public policy, the local climate, and the average income of its citizens, GDP also displays interesting intrinsic properties, such as volatility, feedback, and periodicity. Owing to these obstacles, conventional techniques of GDP forecasting have encountered difficult challenges. In the age of big data, an alternative approach is to incorporate data-driven models and AI algorithms into GDP forecasting. A data-driven model naturally prompts us to mine and choose the relevant influencing variables to effectively construct a GDP forecasting model (Wu et al. 2021). Numerous artificial intelligence techniques have been used to predict the most significant economic indicators, including GDP, Consumer Price Indices (CPI), unemployment rate, energy consumption, exports, and interest rate. These techniques include Artificial Neural Networks (ANN), Genetic Programming (GP), Support Vector Regression (SVR), Adaptive Neuro-Fuzzy Inference System (ANFIS), and other approaches (Cicceri et al. 2020).

This paper introduces urban profiling areas, which typically refer to specific regions or neighborhoods within urban areas that are analyzed and characterized based on various socio-economic, demographic, and infrastructural factors. We focus on urban profiling areas because the heterogeneity within urban regions can provide valuable insights for GDP prediction. By examining different urban profiling areas, we aim to capture the variations in economic activities, development patterns, and other relevant factors that can influence GDP. The dataset utilized in this manuscript consists of input features describing the urban profiling areas of different countries along with other relevant features, and the target feature is GDP; township-level GDP is therefore excluded from the training dataset. The relationships between world countries and the concept of urban profiling areas are explored through an extensive dataset encompassing diverse aspects such as population, region, area size, infant mortality, and more. The dataset, spanning the years 1970 to 2017 and derived from US government sources and Kaggle, comprises 227 instances with 20 features. The study aims to forecast GDP in these urban profiling areas, with 70% of the dataset utilized for training and the remaining 30% reserved for testing. The research methodology involves leveraging this comprehensive dataset to analyze and understand the factors influencing GDP in urban profiling areas, and the paper evaluates the performance of the forecasting model in these areas, shedding light on the intricate relationships between the specified features and economic outcomes in diverse regions. The research gap addressed in this paper is the absence of dedicated models for predicting GDP in urban profiling areas: while existing GDP prediction models often generalize outcomes for entire urban regions or countries, the unique characteristics and heterogeneity within urban areas have not been thoroughly addressed.
The need for a model tailored to urban profiling areas arises from the recognition that these areas, defined by specific socio-economic, demographic, and infrastructural factors, can offer valuable insights that more generic models might overlook. Existing models may not fully capture the nuanced variations in economic activities, development patterns, and other influential factors within urban settings. Therefore, the research gap lies in the absence of models explicitly designed for the dynamic and diverse landscape of urban profiling areas. The proposed hybrid model, PC-LSTM-RNN, is specifically designed to address the nuances of urban profiling areas. Unlike generic models, it considers the heterogeneity within urban regions, providing a more accurate and granular prediction of GDP. Furthermore, the model employs a feature selection process based on Pearson correlation analysis, which ensures that only the most relevant features, specific to the urban profiling areas, are considered for predicting GDP. By focusing on key attributes, the model enhances its predictive accuracy in the context of these unique regions. The paper goes beyond traditional evaluation metrics and employs a comprehensive set, including Mean Squared Error (MSE), Mean Absolute Error (MAE), Median Absolute Error (MedAE), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and the determination coefficient (R2). This allows for a thorough assessment of the proposed model’s performance, providing a nuanced understanding of its predictive capabilities in the context of urban profiling areas. By addressing the specific characteristics of urban profiling areas and introducing a model tailored to these regions, the proposed approach aims to bridge the existing research gap and contribute significantly to the field of GDP prediction. In this paper, the transfer learning technique employed is the “parameter transfer approach” (Wang et al. 2022). This approach utilizes parameters acquired during the pre-training of a model on a source domain (Model A on dataset A) and applies them to initialize or fine-tune a model for a different target domain (Mishra et al. 2020), specifically for GDP prediction on dataset B. The process includes training Model A on dataset A, capturing relevant information pertaining to the source domain, which may include economic and regional features. Subsequently, the learned parameters, encompassing weights and biases, are extracted from Model A after pre-training and used to initialize a new model (the target model) for the task of predicting GDP on dataset B. The methodology takes a multi-faceted approach, extracting pertinent features from the source domain and integrating them with temporal data from the target domain during training.

Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) are employed for time-series analysis, combining features from both datasets. The contributions of this paper are as follows:

  • Proposing a hybrid model based on Pearson Correlation-Long Short-Term Memory-Recurrent Neural Network (PC-LSTM-RNN) for predicting GDP in urban profiling areas.

  • Selecting the features most relevant to the target variable based on Pearson correlation, analyzing 227 instances and 20 features from the standard Kaggle World Countries dataset with urban profiling areas.

  • Utilizing a parameter transfer approach to improve GDP prediction on dataset B by fine-tuning the parameters and combining features from both the source and target domains.

  • Evaluating the proposed model against the state of the art using several predictive evaluation metrics, including MSE, MAE, MedAE, RMSE, MAPE, and the determination coefficient (R2).

The remainder of the paper is organized as follows. Section 2 presents related work on GDP prediction techniques in different cities. Materials and methods are introduced in Sect. 3 to describe the proposed framework. Section 4 presents the results and discussion. Section 5 concludes the study.

2 Related work

In the realm of GDP prediction through machine learning, a diverse array of studies has been conducted, each contributing unique insights and methodologies. Velidi (2022) delves into global GDP, exploring its yearly growth rate and emphasizing the importance of understanding the current status and significance of the factors influencing a country’s GDP, given its inherent volatility over time. Sa’adah and Wibowo (2020) propose deep learning models, LSTM and RNN, demonstrating their efficacy in handling Indonesia’s GDP issues and achieving accuracy ranging from 80 to 90%. The study envisions the models’ potential in anticipating changes, even amid unforeseen events such as the Covid-19 tragedy.

Yoon (2021) focuses on forecasting Japan’s real GDP growth using random forest and gradient boosting techniques, showcasing the models’ performance through evaluation metrics such as MAPE and RMSE. Jovic et al. (2019) approach GDP estimation from a different angle, considering currency exchange rates. Their study employs the ANFIS model, demonstrating its effectiveness in predicting GDP from exchange rate variables.

Li et al. (2022) present a three-stage multi-factor feature selection and deep learning framework for forecasting regional GDP. Their approach involves a Temporal Convolutional Neural Network (TCN), a Feature Crossing Algorithm (FCA), and Boruta RF, leading to improved prediction accuracy. Laygo-Matsumoto and Samonte (2021) discuss the application of machine learning algorithms with macroeconomic data to forecast Philippine GDP, with the Gradient Boosting model yielding the most favorable results.

Richardson and Mulder (2018) investigate various ML systems’ capability to predict New Zealand’s quarterly GDP growth. Their study reveals that Neural Networks (NNs), support vector machine regression, and boosted trees outperform the naive autoregressive benchmark. Maccarrone et al. (2021) test the accuracy of different models in projecting future U.S. real GDP, highlighting the self-predictive potential of the KNN technique. Muchisha et al. (2021) construct and compare ML approaches for real-time GDP growth forecasting in Indonesia, with Random Forest emerging as the top-performing model. Moving beyond GDP prediction, several studies address time series forecasting using various machine learning techniques. Ortega-Bastida et al. (2020) propose a method combining an Auto Encoder (AE) with Non-Causal (NC) filtering and ε-Support Vector Regression (SVR) for time series forecasting in Spain. Lai (2022) introduces a time series forecasting approach using Particle Swarm Optimization (PSO) and an Elman Neural Network (Elman NN) in Sichuan, achieving low Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE) values. Abonazel and Abd-Elftah (2019) utilize the Autoregressive Integrated Moving Average (ARIMA) model for time series forecasting in Egypt, achieving low Mean Squared Error (MSE). Jönsson (2020) employs the K-Nearest Neighbor (KNN) algorithm for time series forecasting in Sweden, obtaining low MAE and MSE values. Qureshi et al. (2020) propose Extreme Gradient Boosting (XGBoost) for time series forecasting in Canada, reporting low RMSE, MAE, and MSE values. Hossain et al. (2021) deploy the Random Forest Regressor for time series forecasting in Bangladesh, achieving low MSE, MAE, and RMSE values. Finally, Cicceri et al. (2020) introduce the Nonlinear Autoregressive with Exogenous Variables (NARX) model for time series forecasting in Italy, demonstrating low MSE and high accuracy. Table 1 summarizes previous machine learning and deep learning techniques applied to GDP prediction.

Table 1 The most recent approaches for predicting the GDP in some countries

3 Materials and methods

In this section, we discuss in detail the proposed Pearson Correlation (PC) based sequential hybridization between LSTM and RNN. First, the data should be preprocessed; therefore, we performed median imputation and data normalization on the applied data (Tarek et al. 2023). To improve the quality of the raw data, detection of missing data, imputation, and outlier elimination were also used. Preprocessing steps for missing data, outlier detection, data reduction, data conversion, data scaling, and data segmentation are discussed in terms of their applications (Fan et al. 2021). To prepare the applied GDP dataset, this paper employs the median imputation and data normalization techniques described below.

3.1 Median imputation

Initially, missing values are handled in one of two ways (Donders et al. 2006): (1) applying missing-value imputation, or (2) discarding the data samples with missing values. For imputation, univariate methods rely on mean/median imputation, forward/backward imputation, and moving-average imputation (Van Buuren et al. 2006), whereas multivariate imputation relies on KNN-based and regression-based methods, as shown in Fig. 1. This paper uses univariate median imputation for the missing values, as sketched after Fig. 1.

Fig. 1

General block diagram of missing-value imputation
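As an illustration, the following minimal sketch applies univariate median imputation with scikit-learn’s SimpleImputer; the toy DataFrame and its column names are illustrative, not the actual dataset.

```python
# A minimal sketch of univariate median imputation with scikit-learn's
# SimpleImputer; the toy DataFrame and its column names are illustrative.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Literacy (%)": [99.0, np.nan, 71.2],
                   "Birthrate":    [10.9, 46.6, np.nan]})

imputer = SimpleImputer(strategy="median")     # each NaN becomes its column's median
df[df.columns] = imputer.fit_transform(df)
print(df)
```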

3.2 Data normalization

Normalization is a data preprocessing step that rescales or transforms data so that each feature contributes consistently. It addresses two major data concerns that have stymied the advancement of machine learning approaches: the presence of dominating features and of outliers (Tarek et al. 2023). Numerous strategies for data normalization within a specified range have been investigated utilizing statistical measures derived from raw data (Singh & Singh 2020). These normalization methods include standard deviation, minimum, maximum, scaling, Sigmoid, and Tanh based normalization. To stabilize the learning strategy, the input parameters of the proposed method are normalized using the min-max approach, which rescales the values to the range (0, 1) (Patro & Sahu 2015). The min-max normalization is computed as:

$$z_n=\frac{f_n-f_{min}}{f_{max}-f_{min}}$$
(1)

where \(z_n\) is the normalized value of the \(n\)th enrolled feature and \(f_n\) is its input value. \({f}_{min}\) and \({f}_{max}\) are the minimum and maximum values of the input feature, respectively.
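As a minimal NumPy sketch of Eq. (1) (in practice, scikit-learn’s MinMaxScaler performs the same rescaling; the example values are illustrative):

```python
# A minimal NumPy sketch of Eq. (1): rescale a feature vector to [0, 1].
import numpy as np

def min_max_normalize(f: np.ndarray) -> np.ndarray:
    """Rescale a feature vector following Eq. (1)."""
    f_min, f_max = f.min(), f.max()
    return (f - f_min) / (f_max - f_min)

# Example: a feature whose raw values span very different scales.
feature = np.array([10.0, 250.0, 4000.0, 125.0])
print(min_max_normalize(feature))   # approximately [0. 0.060 1. 0.029]
```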

3.3 Pearson correlation

Pearson correlation coefficient is a commonly used indicator to measure the correlation between two data sets and can reflect the degree of correlation between them (Benesty et al. 2008). The Pearson correlation coefficient of two random variables \(X=\left\{x_1,x_2,\cdots,x_n\right\}\) and \(Y=\left\{y_1,y_2,\cdots,y_n\right\}\) (n is the dimension of X and Y) is computed as:

$$r=\frac{\sum_{i=1}^n\left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)}{\sqrt{\sum_{i=1}^n\left(x_i-\overline{x}\right)^2}\sqrt{\sum_{i=1}^n\left(y_i-\overline{y}\right)^2}}$$
(2)

The correlation is strongest when the value of the coefficient equals 1 or −1. Specifically, it is judged (in absolute value) by the following rules: when the coefficient is within the range of 0.8 to 1, it is strongly correlated; 0.6 to 0.8 is highly correlated; 0.4 to 0.6 is moderately correlated; 0.2 to 0.4 is weakly correlated; and 0.0 to 0.2 is lowly correlated or uncorrelated (Adler & Parmryd 2010; Wu et al. 2021).
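A sketch of PC-based feature selection following these rules is given below; it assumes a pandas DataFrame `df` whose target column is labeled "GDP ($ per capita)" as in the Kaggle dataset, and the 0.4 threshold (“moderate or stronger”) is our illustrative choice, not a value fixed by the paper.

```python
# A sketch of Pearson-correlation feature selection; column name and
# threshold are illustrative assumptions.
import pandas as pd

def select_by_pearson(df: pd.DataFrame, target: str, threshold: float = 0.4):
    """Return the features whose |r| with the target meets the threshold."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr[corr.abs() >= threshold].index.tolist()

# selected = select_by_pearson(df, "GDP ($ per capita)")
```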

3.4 Long short term-memory (LSTM)

LSTM is an RNN variant proposed to naturally include time dependency in NN topologies. The basic concept is to extend the capabilities of traditional ANN approaches by dynamically introducing a sequential framework. LSTM has achieved outstanding success owing to its capacity to learn both short- and long-term dependencies of the problem, and it is also built to cope with the vanishing gradient problem from which most RNN designs suffer. The primary information-processing units of LSTM are referred to as “cells,” which can be viewed as more advanced versions of the neurons in an ordinary MLP. A cell has several gates that keep and control the flow of information for an indeterminate time period. This capability allows LSTM to determine whether information is important in both the long and short terms (Elshewey et al. 2023). As a result, it is well suited to any form of sequential problem. Equations (3)–(8) (Arpit et al. 2019; Assaad & Fayek 2021) define an LSTM cell.

$${inp}_{t}= Sigmoid\left({W}_{i}\left[{hid}_{t-1};{is}_{t}\right]+ {b}_{inp}\right)$$
(3)
$${forg}_{t}= Sigmoid\left({W}_{f}\left[{hid}_{t-1};{is}_{t}\right]+ {b}_{forg}\right)$$
(4)
$${cell}_{t} =\tanh\left({W}_{s}\left[{hid}_{t-1};{is}_{t}\right]+ {b}_{cell}\right)$$
(5)
$${out}_{t}= Sigmoid\left({W}_{o}\left[{hid}_{t-1};{is}_{t}\right]+ {b}_{out}\right)$$
(6)
$${state}_{t}= {forg}_{t}\odot {state}_{t-1}+ {inp}_{t} \odot {cell}_{t}$$
(7)
$${hidden}_{t}= {out}_{t}\odot \tanh\left({state}_{t}\right)$$
(8)

where \({inp}_{t}\), \({forg}_{t}\), \({cell}_{t}\), and \({out}_{t}\) are the input, forget, cell, and output gates at the current state, respectively. In the above equations, “is” and “hid” represent the state input and the hidden-state values, respectively; t is the current time step; \(\odot\) is element-by-element multiplication (the Hadamard product); \(Sigmoid\) is the sigmoid activation function; and W and b represent the weights and biases of the learned LSTM gates, respectively. The cell state is \({state}_{t}\) and the hidden state is \({hidden}_{t}\). Cell memory is controlled by the forget gate \({forg}_{t}\), and the hidden state \({hidden}_{t}\) is obtained from \({state}_{t}\) through the output gate. The structure of a standard LSTM neural network is shown in Fig. 2 (Arpit et al. 2019; Nosair et al. 2022).

Fig. 2

The general structure of the LSTM architecture
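For concreteness, the following minimal NumPy sketch implements one LSTM cell step exactly as in Eqs. (3)–(8); the weight shapes and dictionary keys are our own illustrative choices.

```python
# A minimal NumPy sketch of one LSTM cell step, mirroring Eqs. (3)-(8);
# shapes are illustrative (hidden size h, input size d).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(is_t, hid_prev, state_prev, W, b):
    """W maps gate name -> (h, h + d) matrix; b maps gate name -> (h,) vector."""
    concat = np.concatenate([hid_prev, is_t])          # [hid_{t-1}; is_t]
    inp_t  = sigmoid(W["i"] @ concat + b["inp"])       # Eq. (3), input gate
    forg_t = sigmoid(W["f"] @ concat + b["forg"])      # Eq. (4), forget gate
    cell_t = np.tanh(W["s"] @ concat + b["cell"])      # Eq. (5), candidate cell
    out_t  = sigmoid(W["o"] @ concat + b["out"])       # Eq. (6), output gate
    state_t  = forg_t * state_prev + inp_t * cell_t    # Eq. (7), Hadamard products
    hidden_t = out_t * np.tanh(state_t)                # Eq. (8)
    return hidden_t, state_t
```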

3.5 The proposed PC-LSTM-RNN

RNNs are particularly well suited to problems involving sequence analysis due to their recurrent connections. Whereas in a typical NN all inputs and outputs are assumed to be independent of one another, RNNs are designed to take advantage of sequential information (Zaremba et al. 2014). This paper aims to predict the upcoming sequences of the applied data by discerning the preceding statistics. RNNs are applied iteratively to each element in a sequence, with the outcome depending on previous calculations (Mikolov et al. 2010). The memory stores information about previous calculations, and RNNs use this information over arbitrarily long sequences in a fixed number of steps. This work uses LSTM with one layer of RNN since it is more computationally efficient (Sherstinsky 2020). LSTM was developed to address a difficulty that regular RNNs face, namely the vanishing gradient problem (Yu et al. 2019). We used a sequential input strategy to feed the model characteristics in a predetermined order. By building on the strengths of RNNs and LSTMs, the model was able to capture temporal dependencies via this sequential presentation. Combining the memory retention of RNNs and LSTMs with this sequential input strategy lets the model better capture the dependencies in the data and produce more informed predictions. Therefore, in the proposed model we enter all inputs in order, and the RNN and LSTM layers analyze the provided country data by leveraging its inherent order. As data enters sequentially, one country at a time, they can capture crucial dependencies across features, even between non-adjacent entries. Imagine analyzing the USA after Canada: the model remembers Canada’s features (smaller population, similar GDP, cold climate) and can learn that similar GDPs do not guarantee identical population sizes or climates, even for geographically distant countries. This sequential processing empowers the model to learn patterns across countries with urban profiling areas, such as France (population 65 million) following Germany (population 83 million) in Europe. Ultimately, RNNs and LSTMs excel at uncovering both short- and long-term dependencies within and between features, allowing them to generalize these patterns to new, unseen countries. The general structure of the proposed PC-LSTM-RNN model is shown in Fig. 3, and its steps are presented in Algorithm 1. The hyperparameter values of the trained model are shown in Table 2. We utilize three gates (forget, input, and output), with 128 neurons in each layer. The batch size is 64, the learning rate is 0.001, the number of epochs is 50, the optimizer is Adam, the number of time steps is 16, and the activation function is the linear activation function. The layers in the LSTM memory block are fully connected. A hedged sketch of this architecture is given below.
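The sketch below is our reading of the described architecture and the hyperparameters of Table 2, not the authors’ released code; `n_features` matches the six PC-selected features reported in Sect. 4.2.

```python
# A hedged Keras sketch of a stacked LSTM + single-RNN-layer regressor
# consistent with the hyperparameters listed in Table 2.
import tensorflow as tf
from tensorflow.keras import layers, models

time_steps, n_features = 16, 6   # 6 PC-selected features (see Sect. 4.2)

model = models.Sequential([
    layers.Input(shape=(time_steps, n_features)),
    layers.LSTM(128, return_sequences=True),    # LSTM memory block
    layers.SimpleRNN(128),                      # the single RNN layer
    layers.Dense(1, activation="linear"),       # linear output for GDP regression
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="mse", metrics=["mae"])
# model.fit(X_train, y_train, epochs=50, batch_size=64, validation_split=0.1)
```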

Fig. 3

The general structure of the proposed PC-LSTM-RNN

Table 2 Hyperparameter values of the proposed PC-LSTM-RNN model

Algorithm 1 The proposed PC-LSTM-RNN model

The “parameter transfer approach” (Olivas 2009; Wang et al. 2022) entails leveraging the parameters acquired from pre-training a model on a source domain, specifically Model A on dataset A, and applying them to initialize or fine-tune a model for GDP prediction on dataset B. To implement this approach, Model A is first trained on dataset A, capturing information relevant to the source domain, including economic and regional features. The learned parameters, including weights and biases, are then extracted from Model A after pre-training and utilized to initialize a new model (the target model) for predicting GDP on dataset B. This process involves incorporating country-specific features from dataset B and integrating them with the existing architecture of the target model initialized with parameters from Model A. The methodology adopts a multi-faceted approach, extracting pertinent features from the source domain and integrating them with temporal data from the target domain during training. In this paper, Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) are employed for time-series analysis, combining features from both datasets. The parameter transfer approach therefore optimizes model performance for GDP prediction on dataset B by integrating knowledge from both the source and target domains, incorporating country-specific features, applying regularization and cross-validation, and employing ensemble methods. This comprehensive process aims to leverage the strengths of both domains for improved accuracy in GDP forecasts.
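A minimal sketch of this parameter transfer step follows, assuming `model_a` is a pre-trained Keras model such as the one sketched in Sect. 3.5; the layer-freezing schedule is our illustrative assumption, not the authors’ exact recipe.

```python
# A minimal sketch of the parameter transfer approach: initialize the
# target model from Model A's learned parameters, then fine-tune on B.
import tensorflow as tf
from tensorflow.keras import models

target_model = models.clone_model(model_a)        # same architecture for dataset B
target_model.set_weights(model_a.get_weights())   # initialize from Model A's parameters

# Freeze the recurrent layers and fine-tune only the regression head first.
for layer in target_model.layers[:-1]:
    layer.trainable = False
target_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                     loss="mse")
# target_model.fit(X_b_train, y_b_train, epochs=50, batch_size=64)
```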

3.6 Evaluation metrics

To demonstrate the performance of the proposed model (PC-LSTM-RNN), five error analysis criteria are introduced: Mean Squared Error (MSE), Median Absolute Error (MedAE), Mean Absolute Error (MAE), Root-Mean-Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). Furthermore, we determine the coefficient of determination (R2), as shown in Table 3, using Eqs. (9)–(14) (Hassan et al. 2022; Shams et al. 2017).

Table 3 Evaluation indicators used in this paper
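For reference, all six indicators can be computed directly with scikit-learn (version ≥ 0.24 for MAPE); a sketch, assuming `y_true` and `y_pred` are 1-D arrays of actual and predicted GDP values:

```python
# A sketch computing the six evaluation indicators with scikit-learn.
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             median_absolute_error,
                             mean_absolute_percentage_error, r2_score)

def report(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MSE":   mse,
        "MAE":   mean_absolute_error(y_true, y_pred),
        "MedAE": median_absolute_error(y_true, y_pred),
        "RMSE":  np.sqrt(mse),
        "MAPE":  mean_absolute_percentage_error(y_true, y_pred),
        "R2":    r2_score(y_true, y_pred),
    }
```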

4 Experiments and results

4.1 Dataset and statistical analysis

The data source for this paper comprises two datasets (dataset A and dataset B). Dataset A contains information on various aspects, including population, region, area size, infant mortality, and more. It is sourced from US government data spanning 1970 to 2017 and was obtained from the Kaggle website, comprising 227 instances and 20 features. 70% of the dataset was used for training, while the remaining 30% was reserved for testing. The objective of this research is to forecast GDP in urban profiling areas for each country. The key characteristics of the dataset are provided in Table 4. For access to the data, we refer to the following link: “https://www.kaggle.com/datasets/fernandol/countries-of-the-world”. Dataset B is presented in Sect. 4.2.2.
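A sketch of loading the Kaggle dataset and performing the 70%/30% split described above is given below; the local file name, target column label, and random seed are illustrative assumptions.

```python
# A sketch of the 70/30 train/test split on the Kaggle dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("countries_of_the_world.csv")
X = df.drop(columns=["GDP ($ per capita)"])
y = df["GDP ($ per capita)"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)    # 70% training / 30% testing
```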

Table 4 Description of the used dataset features

As indicated in Table 4, the country variables offer valuable insights into connections with “urban profiling areas.” Direct connections are established through variables such as population density, coastline, net migration, GDP, and economic sectors, which impact the size, characteristics, and prosperity of urban areas. Additionally, indirect connections involve factors such as climate, infant mortality, literacy, and birth and death rates, which exert indirect influences on the dynamics of urban development.

The heatmap serves as a visual representation to elucidate the correlations and facilitate data visualization across all features. This includes a comprehensive examination of both the input features and the target variable (GDP), allowing for a detailed analysis of the relationships and patterns within the dataset. The varying intensities of color in the heatmap provide insights into the strength and direction of correlations, contributing to a more nuanced understanding of the interdependencies among the different elements of the dataset. Figure 4 shows the heatmap analysis for the dataset features: correlations of 1 (e.g., along the diagonal) are shown in green, while values approaching −1 are shown in red. Table 5 presents common statistical measures for the features of this dataset, including the mean, standard deviation (std), maximum (max), minimum (min), count, and percentile values of the input features.

A violin plot is used to visualize the distribution of a statistical variable. It is similar to a box plot, but in addition to the aggregate statistics (minimum, 1st quartile, median, 3rd quartile, and maximum) it also shows the density of the data at various values. Figure 5 uses a violin plot to visualize the target distribution. This plot helps identify the distribution of the target variable and reveal any outliers or unusual patterns in the data. A boxplot is a standard visualization tool that shows how a data set is divided by its quartiles. It is a quick and easy way to summarize the central tendency, spread, and skewness of a data set, and it can also be used to identify potential outliers. In Fig. 6, a box plot is used to examine the overall distribution of the features in the dataset. This plot helps identify features with skewed distributions, spot potential outliers, and understand the overall distribution of the dataset. A sketch of how these plots can be produced follows.
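The sketch below assumes `df` holds the numeric dataset features and the target column label is as in the Kaggle dataset; styling choices are illustrative (the RdYlGn colormap matches the green/red coloring described above).

```python
# A sketch of the exploratory plots discussed above: heatmap, violin, box.
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sns.heatmap(df.corr(numeric_only=True), cmap="RdYlGn", vmin=-1, vmax=1,
            ax=axes[0])                                  # feature correlations
sns.violinplot(y=df["GDP ($ per capita)"], ax=axes[1])   # target distribution
sns.boxplot(data=df, ax=axes[2])                         # per-feature spread
plt.tight_layout()
plt.show()
```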

Fig. 4

Heatmap analysis for the features

Table 5 Statistical analysis for the features
Fig. 5

Violin plot for the target feature

Fig. 6

Box plot for distribution analysis of the features

Table 6 lists the correlation values of the features that are correlated with the target feature. Figure 7 shows a bar plot of the 40 countries with the highest GDP per capita. Figure 8 is a histogram that displays the distribution of the features. Histograms are commonly used for distribution analysis because they provide a visual representation of the underlying data distribution, allowing us to see the overall shape and characteristics of the distribution of a particular feature; this is crucial for understanding the central tendency and variability of the data. Figure 9 is a pair plot that shows the distribution of the features selected using Pearson correlation. A pair plot, also known as a scatterplot matrix, is a grid of scatterplots in which each variable is plotted against every other variable in the dataset. It is a useful tool for visualizing relationships between multiple variables and examining the distribution of each variable along the diagonal. These visualizations help identify the shape of the distribution, the central tendency of the data, potential outliers, and patterns or correlations between features.

Table 6 The correlation values of the features that are correlated with the target feature
Fig. 7

Bar plot for 40 countries with highest GDP per capita

Fig. 8

Histogram for distribution analysis of the features

Fig. 9

Pair plot for distribution analysis of the selected features using Pearson correlation

4.2 Results and discussion

The experiments were developed and executed using Python 3.8 on Jupyter Notebook (version 6.4.6) with an Intel Core i5 CPU and 16 GB of RAM under 64-bit Microsoft Windows 10. The features selected by PC are Phones (per 1000), Birthrate, Infant mortality (per 1000 births), Agriculture, Service, and Literacy (%). In the proposed model, we utilized 128 hidden units, a batch size of 64, a learning rate of 0.001, 50 epochs, the Adam optimizer, 16 time steps, and a linear activation function. To evaluate the performance of the proposed model (PC-LSTM-RNN) for predicting GDP in urban profiling areas, six regression models (Random Forest (RF) regressor, KNN regressor, Gradient Boosting (GB) regressor, Lasso regressor, Elastic Net (EN) regressor, and Ridge Regression (RR)) are used for comparison on the same dataset. These techniques are applied with Pearson Correlation (PC), and their results are compared with those of the proposed PC-LSTM-RNN for predicting GDP in urban profiling areas. The performance of these regression models was evaluated using the MSE, MAE, RMSE, MAPE, MedAE, and R2 metrics.

4.2.1 Dataset A

For dataset A, Table 7 presents the performance measures of these recommended regressors. Figure 10 investigates the relationship between the actual and predicted GDP for the PC-RR, PC-EN, PC-Lasso, PC-GB, PC-KNN, and PC-RF models, respectively. Figure 11 shows the performance of the proposed model (PC-LSTM-RNN) using the actual GDP and the predicted one. As shown in Fig. 11, this mapping fits a line, which indicates accurate prediction of the GDP values. Finally, Fig. 12 presents the MSE and MAE vs. the number of epochs using PC-LSTM-RNN.

Fig. 10

The relation between the actual GDP vs. predicted GDP for (a) PC-RR, (b) PC-EN, (c) PC-Lasso, (d) PC-GB, (e) PC-KNN, (f) PC-RF performance

Fig. 11

Actual GDP vs. predicted GDP for PC-LSTM-RNN performance

Fig. 12

The training and testing (a) Mean squared error and (b) mean absolute error vs number of epoch using PC-LSTM-RNN model

According to Table 7, the proposed model (PC-LSTM-RNN) achieves the best values, with 4.2 × 10−33, 3.9 × 10−17, 6.5 × 10−17, 4.3 × 10−15, 0.00001, and 99.99% for MSE, MAE, RMSE, MAPE, MedAE, and R2, respectively. PC-RF yields the weakest results, with 0.00018, 0.0079, 0.0135, 0.913, 0.0061, and 95.20% for MSE, MAE, RMSE, MAPE, MedAE, and R2, respectively. For the PC-KNN model, the MSE, MAE, RMSE, MAPE, MedAE, and R2 are 0.00017, 0.0068, 0.0131, 0.727, 0.0057, and 95.40%, respectively. For the PC-GB model, they are 0.00011, 0.0063, 0.0106, 0.722, 0.0052, and 97%, respectively. The PC-Lasso model achieves 3.5 × 10−5, 0.0041, 0.0059, 0.475, 0.0033, and 98.90% for MSE, MAE, RMSE, MAPE, MedAE, and R2, respectively. The MSE, MAE, RMSE, MAPE, MedAE, and R2 for PC-EN are 3.06 × 10−5, 0.0038, 0.0053, 0.448, 0.0031, and 99.10%, respectively. Finally, PC-RR gives 1.12 × 10−5, 0.0023, 0.003, 0.272, 0.0018, and 99.40% for MSE, MAE, RMSE, MAPE, MedAE, and R2, respectively.

Table 7 Comparison of the proposed work with some other models

Several recent studies use the same dataset as this study. Khan (2021) investigated the influence of GDP on other factors and assessed whether these factors reciprocally affect GDP, employing visualization as a method for exploring the countries dataset. This approach served not only to showcase relationships within the dataset but also to elucidate and represent the data in an understandable manner; visualization enables the data to be explained effectively and the impact of various factors on each other to be assessed. Furthermore, Padmawar et al. (2021) used LR and RF machine learning models to forecast macroeconomic data. Through an optimization process, the RF algorithm exhibited strong performance, achieving an 86% accuracy rate in predicting actual GDP per capita, and the RF classifier outperformed LR, providing more precise forecasts. Evaluation metrics such as MAE and RMSE were used to measure performance, yielding 2356.153 and 3302.030, respectively, with R2 = 0.868. Gharte et al. (2022) introduced a model leveraging Logistic Regression (LR), Random Forest (RF), and Gradient Boosting, yielding accuracies of 82%, 87%, and 89%. Their performance evaluation, employing Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), resulted in values of 2362.935 and 3469.360, respectively. Notably, the Gradient Boosting algorithm achieved an R2 value of 0.889 for predicting GDP per capita. This paper attains an R2 value of 0.9999, demonstrating exceptional performance. This high-quality model is then implemented in the “GDP Estimation Tool,” facilitating the estimation and forecasting of a country’s GDP based on inputting relevant attributes.

4.2.2 Dataset B

We applied another dataset (dataset B), available at https://www.kaggle.com/datasets/imbikramsaha/indian-gdp/data, to verify that there is no overfitting. This dataset consists of 61 instances and 4 features. In this study, we first trained model A on dataset A, as detailed in Sect. 4.1 (“Dataset and statistical analysis”). Subsequently, we applied this model to forecast GDP for an alternate dataset, dataset B, encompassing historical GDP growth data for India spanning 1961 to 2021. This dataset includes GDP in billion USD, per capita income in USD, and year-on-year GDP growth.

Our approach leverages the country features inherent to Model A together with the India-specific GDP data. This strategy adopts a multi-faceted methodology: extracting pertinent economic and regional features from dataset A, integrating them with the temporal data from dataset B, and employing transfer learning via the parameter transfer approach with additional layers tailored for GDP prediction. For the time-series analysis, we trained the RNN and LSTM layers on the amalgamated dataset. To address overfitting, we implemented regularization techniques and identified the optimal model through cross-validation. Finally, we employed ensemble methods to enhance prediction accuracy. To evaluate the performance of the proposed PC-LSTM-RNN model for predicting GDP, six regression models, namely the RF regressor, KNN regressor, GB regressor, Lasso regressor, EN regressor, and RR regressor, are used for comparison. These models are applied with Pearson Correlation (PC), and their results are compared with those of the proposed PC-LSTM-RNN for predicting GDP. The performance of these regression models was evaluated using the MSE, MAE, RMSE, MAPE, MedAE, and R2 metrics. Table 8 presents the performance measures of these models. According to Table 8, the proposed model (PC-LSTM-RNN) achieves the best values, with 0.0017, 0.0028, 0.0047, 0.0031, 0.0021, and 99.18% for MSE, MAE, RMSE, MAPE, MedAE, and R2, respectively. PC-RF yields the weakest results, with 0.0049, 0.0057, 0.0073, 0.0071, 0.0058, and 96.16% for MSE, MAE, RMSE, MAPE, MedAE, and R2, respectively. For the PC-KNN model, the MSE, MAE, RMSE, MAPE, MedAE, and R2 are 0.0043, 0.0051, 0.0068, 0.0065, 0.0049, and 97.53%, respectively. For the PC-GB model, they are 0.0038, 0.0047, 0.0062, 0.0060, 0.0043, and 97.85%, respectively. The PC-Lasso model achieves 0.0032, 0.0043, 0.0057, 0.0055, 0.0039, and 98.14% for MSE, MAE, RMSE, MAPE, MedAE, and R2, respectively. The values for PC-EN are 0.0027, 0.0037, 0.0051, 0.0050, 0.0034, and 98.57%, respectively. Finally, PC-RR gives 0.0024, 0.0031, 0.0049, 0.0047, 0.0029, and 98.85% for MSE, MAE, RMSE, MAPE, MedAE, and R2, respectively.

Table 8 Comparison of the proposed PC-LSTM-RNN model with some other models on another dataset

Regarding the concern about the R2 value of 99.18%, it is indeed exceptionally high. Such a result often hints at the possibility of overfitting, particularly in scenarios where there is a domain shift between the datasets used. Our primary concern lies in ensuring that the model not only fits the training data effectively but also generalizes well to new, unseen data. To tackle this, we carefully designed our approach to accommodate the potential differences between datasets. Specifically, we leveraged dataset B to focus on predicting Indian GDP, incorporating the nuanced features of model A to enhance the accuracy of our predictions, as in Algorithm 1. In this work, we have introduced Figs. 13 and 14 to bolster our argument that overfitting is not a significant concern. These visual representations serve as evidence supporting the model’s ability to perform consistently across different datasets and affirm our efforts to mitigate potential overfitting issues. The figures illustrate the model’s performance metrics and its generalization capacity, thereby adding robustness to our findings.

Fig. 13

The training and testing (a) Mean squared error and (b) mean absolute error vs. number of epoch using PC-LSTM-RNN model

Fig. 14

Actual GDP vs. predicted GDP for PC-LSTM-RNN performance

Figure 13 presents the MSE and MAE vs. the number of epochs using PC-LSTM-RNN. Finally, Fig. 14 shows the performance of the proposed model (PC-LSTM-RNN) using the actual GDP and the predicted one. As shown in Fig. 14, this mapping fits a line, which indicates accurate prediction of the GDP values.

In addressing the absence of specific variables in the new dataset (dataset B) during the validation experiments, a thoughtful and strategic approach was taken. Notably, variables such as Phones (per 1000), Birthrate, Infant mortality (per 1000 births), Agriculture, Service, and Literacy (%) were missing. To overcome these gaps, feature imputation techniques were employed, wherein missing values were estimated or filled based on the available dataset information. Various imputation methods, including mean imputation, regression imputation, or more sophisticated machine learning-based techniques, were utilized. Moreover, a conscious decision was made to selectively use available features during validation, ensuring consistency between the original training dataset and the new dataset. Adaptations to the model architecture or training process were considered due to the absence of specific variables, aiming to optimize the model’s performance with the available features. Thorough performance evaluation encompassed metrics such as Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared, assessing not only predictive accuracy but also the model’s generalization across datasets with differing feature sets. Cross-validation techniques were applied to enhance robustness, involving the splitting of data into multiple folds for training and validation across various subsets. In summary, the validation approach in the context of missing variables incorporated feature imputation, selective feature use, model adaptation, and comprehensive performance evaluation to address challenges arising from differences in variable availability between the original and new datasets, ensuring the model’s effectiveness and generalization capacity.

5 Conclusion and future work

GDP has a key role in assessing the health of local, regional, and global economies in urban profiling regions. This study proposes a new model for estimating GDP in urban profiling regions called Pearson Correlation-Long Short-Term Memory-Recurrent Neural Network (PC-LSTM-RNN). The LSTM is utilized to handle the vanishing gradient problem faced by the conventional RNN. Experiments were performed to evaluate the suggested approach’s performance, and its results were compared with the RF, KNN, GB, Lasso, EN, and RR machine learning models to demonstrate its viability. Six evaluation criteria are used to assess performance. The experimental findings demonstrated that, compared with the other regression models utilized in this paper, the proposed PC-LSTM-RNN provides the best outcomes, with the best R2 value of 99.99%. The model leverages Pearson correlation to identify crucial features strongly correlated with the target GDP variable. The study utilizes two distinct datasets, Dataset A and Dataset B. Dataset A, comprising 227 instances and 20 features, is divided into 70% for training and 30% for testing. Dataset B, with 61 instances and 4 features, encompasses historical GDP growth data for India spanning 1961 to 2021. To enhance the predictive performance of GDP, we employ a parameter transfer approach, fine-tuning the parameters learned from Dataset A on Dataset B. In future work, we plan to utilize fuzzy rules for the uncertainty variables as well as a neutrosophic implementation to handle the limitations of the vanishing gradient and overfitting issues. To address these challenges in LSTM networks, a combination of techniques can be employed: gradient clipping with an appropriate threshold mitigates the vanishing gradient problem and ensures stable weight updates during training; dropout between LSTM layers helps prevent overfitting by randomly deactivating a fraction of input units during each training iteration; and batch normalization normalizes the input and recurrent activations, contributing to more stable and accelerated training. Careful consideration of model complexity, regularization techniques, and early stopping can further enhance the network’s robustness, providing a well-balanced solution to the vanishing gradient and overfitting issues. A hedged sketch of these mitigations follows.
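The sketch below shows how these mitigations could be combined in Keras: gradient clipping via the optimizer’s `clipnorm`, dropout between the recurrent layers, and early stopping; the clipping threshold, dropout rate, and patience are illustrative, not tuned values.

```python
# A hedged Keras sketch of the mitigations named above.
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

model = models.Sequential([
    layers.Input(shape=(16, 6)),
    layers.LSTM(128, return_sequences=True),
    layers.Dropout(0.2),                      # regularization between recurrent layers
    layers.SimpleRNN(128),
    layers.Dense(1, activation="linear"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                                 clipnorm=1.0),   # gradient clipping
              loss="mse")
early_stop = callbacks.EarlyStopping(patience=5, restore_best_weights=True)
# model.fit(X_train, y_train, epochs=50, batch_size=64,
#           validation_split=0.1, callbacks=[early_stop])
```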