Introduction

During recent decades, energy, along with other production factors, has played a decisive role in the economic growth of countries, and its importance continues to increase (Golpîra et al., 2022). Given the depletion of fossil fuels, optimizing energy consumption is becoming ever more vital (Ahmadi, 2022a). The ever-increasing dependence on energy has tied this sector to other economic sectors and has made the speed of economic growth and development depend on the level of energy consumption (Amini et al., 2022). Consequently, global economic growth and industrialization over recent decades have increased energy demand and consumption. Residential and commercial buildings should be designed so that their occupants enjoy the greatest comfort and safety while using energy optimally (Li et al., 2013; Karki et al., 2023). Nowadays, heating, ventilation, and air conditioning (HVAC) systems play a central part in the global rise in energy consumption (Yu et al., 2010), accounting for roughly 50% of building energy use worldwide (Wu et al., 2012). Heating and cooling loads contribute 30–40% of a building’s energy usage, so reducing energy consumption in buildings is vital (Ilbeigi et al., 2023). World energy consumption is predicted to increase by 53%, from 505 quadrillion Btu in 2008 to 700 quadrillion Btu in 2035 (IEA, 2011). Various factors play a central role in building energy consumption, such as architectural design, climate, and energy systems. Energy usage is expected to rise by 34% over the next 20 years, about 1.5% per year on average. By 2030, the residential sector will account for 67% of total energy use, while the non-domestic sector will use only 33%. Climate change and global warming are further factors that must be considered.
According to the Intergovernmental Panel on Climate Change (IPCC), greenhouse gas emissions increased by almost 70% between 1970 and 2004 (Ashrafi et al., 2019). Buildings account for around 30% of these emissions. Over the same period, CO2 emissions rose annually by about 2.5% for commercial buildings and 1.7% for residential buildings (Zhai & Helman, 2019). The International Energy Agency (IEA) states that to keep global warming under 2 °C, it is imperative to cut CO2 emissions by 77% by 2050. Consuming less energy is therefore crucial for coping with climate change (Peivandizadeh & Molavi, 2023).

According to the available statistics, residential buildings are the biggest energy consumers in the country. The numerous factors affecting energy-consumption behavior in residential buildings have turned forecasting and auditing energy consumption into an important challenge for consumption-optimization institutions (Khorasgani et al., 2023). Forecasting is an issue of interest to scientists and researchers because of its real-world applications. From a scientific point of view, forecasting means reducing the error of a predicted result by making use of the existing observations.

Over the last two decades, different predictive methods have been employed to anticipate energy usage in buildings. They can be grouped into three categories: engineering methods, statistical methods, and artificial intelligence methods. Some of the most important algorithms include linear regression (LR), logistic regression, artificial neural networks (ANN), and decision trees (ID3) (Wangdong et al., 2015; Fallah et al., 2023).

In recent years, buildings with low energy consumption have received a lot of attention. Most of the research has focused on the architectural features of the building (construction techniques) as well as alternative sources of energy (Yang et al., 2005; Tang et al., 2012). Since the characteristics of energy consumption in the residential sector are complex and interrelated, conceptual models are needed to evaluate the technological and economic effects and the efficiency of renewable energy suitable for home use. The study by Li et al. investigated the use of the support vector machine (SVM) for predicting hourly cooling load in a building. The SVM model was applied to an office building located in Guangzhou, China, and its performance was compared to that of a traditional back-propagation (BP) neural network model. The simulation results demonstrated that the SVM model outperformed the BP neural network model in terms of prediction accuracy and generalization ability. The study concluded that SVM is an effective method for predicting building cooling load (Li et al., 2009).

Kang et al. (2022) propose a method for optimizing ice-based thermal energy storage (TES) in commercial buildings through an integrated load prediction and optimized control approach. The approach includes a cooling load prediction model with mid-day modification and a rule-based control strategy based on a time-of-use tariff. In a commercial complex in Beijing, the proposed approach achieves a mean absolute error of 389 kW and an energy cost-saving rate of 9.9%. The approach significantly enhances the cooling system’s efficiency and automation (Kang et al., 2022).

Gao et al. (2020) propose CEEMDAN-SVR, an ensemble prediction model that integrates CEEMDAN and SVR to accurately forecast heat load for district heating systems. The model outperforms other modern algorithms and is suitable for heat load forecasting on multiple time scales. The study used heat load data from three buildings in Xingtai City during the 2019–2020 heating season (Gao et al., 2020).

Moayedi and Mosavi (2021) aim to identify the most efficient predictive model for heating load (HL) approximation by combining a multi-layer perceptron (MLP) network with six hybrid algorithms. The BBO-MLP, ALO-MLP, and ES-MLP variants exhibit the best performance based on overall scores, which makes them a promising, highly efficient alternative to traditional HL analysis methods (Moayedi and Mosavi, 2021).

Tran et al. (2020) propose ENMIM, an ensemble model for estimating energy consumption in residential buildings. ENMIM combines two supervised learning machines and incorporates a symbiotic organism search for optimal tuning parameters. The evaluation demonstrates that ENMIM outperforms other benchmark models in terms of predictive accuracy. The developed self-tuning ensemble model is a promising alternative for energy management planning and has potential applications in various disciplines due to its superior accuracy compared to other artificial intelligence techniques (Tran et al., 2020).

Dinmohammadi (2023) proposes a PSO-optimized random forest classification algorithm to identify factors contributing to residential heating energy consumption. The study introduces a causal inference method to explore the factors influencing energy consumption, identifying a clear causal relationship between water pipe temperature changes, air temperature, and building energy consumption. The research findings can inform decisions around efficient heating management systems in residential buildings to reduce energy bills (Dinmohammadi et al., 2023).

Li and Youming (2023) propose a method to improve building energy performance using natural ventilation. Their approach uses a natural-ventilation strategy as the air conditioning system operation strategy and applies support vector regression and particle swarm optimization to find the optimal solution. The method saves up to 43% of annual thermal energy consumption compared to commonly used building designs and has the least total air conditioning hours, according to a case study (Li & Youming, 2023).

Zhao and Liu (2018) propose a data-preprocessing method based on the Monte Carlo Method (MCM) to improve short-term cooling load forecasting accuracy using weather forecast data. Their approach uses a 24-h-ahead Support Vector Machine (SVM) model for load prediction and reduces the Mean Absolute Percentage Error (MAPE) of load prediction from 11.54 to 10.92%. Sensitivity analysis shows that the 1-h-ahead temperature at the prediction moment has the greatest impact on the prediction results among the selected weather parameters (Zhao et al., 2018).

Khajavi and Amir (2023) propose a machine-learning model to predict heating energy consumption in residential buildings. Their model combines Support Vector Regression (SVR) with six meta-heuristic algorithms to optimize hyper-parameters (Khajavi & Amir, 2023).

The study by Jang, Han, and Leigh found that incorporating operation pattern data improved the accuracy of non-residential building energy consumption prediction models using LSTM networks (Jang, 2022b).

Hossain et al. review energy management in commercial buildings, discuss solutions for improving building energy efficiency and explore future trends and issues for developing an effective building energy management system that supports sustainable development goals (Hossain et al., 2023).

In this context, Swan and Ugursal (2009) reviewed the literature on the different techniques used in modeling energy consumption in the residential sector, presenting a critical review of each technique with a focus on its strengths, weaknesses, and objectives. Regression analysis has long been one of the most common modeling techniques for predicting energy consumption. But since the nature of energy consumption in the world, and especially in Iran, is non-linear, non-linear models (branches of artificial intelligence including genetic algorithms, neuro-fuzzy networks, and artificial neural networks) have been proposed and discussed as a path-breaking option for predicting energy consumption. One of the strengths of the proposed structure is the study of the effect of data preprocessing on the model’s efficiency. In other research, Rafiei and colleagues presented a multi-layer perceptron neural network approach to predict electricity consumption; the network was trained on data preprocessed with time-series techniques (Khoshnevisan et al., 2014). Wang et al. (2021) presented a review of research on smart buildings, investigating the analytic hierarchy process as well as investment considerations and smart-building evaluation techniques. Hanc and colleagues (Hanc et al., 2019) researched a welfare-index model that can be used to evaluate very tall residential buildings; in that article, AHP was used to systematize building welfare indicators. Zarghami and colleagues (Fatourehchi & Zarghami, 2020) presented an integrated design and evaluation model to optimize the residential energy system by combining linear programming (LP) and hierarchical analysis.
Wang and Lee used AHP in a multi-criteria analysis for choosing smart-building systems, prioritizing and determining the importance weights of the criteria. The current research considers the necessity of energy management, modeling, and forecasting of the cooling and heating energy consumption of residential buildings in order to audit these buildings and determine their energy labels. Therefore, this research uses linear regression, decision tree, logistic regression, and neural network techniques to predict energy consumption from residential-building data. The paper is organized as follows: the first section introduces the prediction of heating and cooling loads in buildings in detail. “Methods description” presents the methods employed to predict heating and cooling loads in buildings, including linear regression, logistic regression, decision tree (ID3), and neural networks. “CRISP process for analyzing cooling and heating load of buildings: a structured data mining approach” discusses the data used in this paper, the linear regression algorithm, and the different steps of the process. “Classification methods for evaluating influential factors on heating and cooling loads” presents the classification methods and discusses the factors involved, including orientation, glazing area distribution, glazing area, wall area, relative compactness, and surface area. “Results” discusses and compares the results of the various algorithms employed. Some suggestions are proposed in Sect. 6, and “Conclusion” presents the conclusion.

Methods description

This section introduces linear regression methods and three classification algorithms: decision tree, regression, and neural network. Data classification methods in data mining are based on the target variable and involve classifying data into two or more classes using the discretization method, after which a learning model is built to predict the results.

Linear regression is a statistical method used to examine the relationship between a dependent variable and one or more independent variables. The method assumes a linear relationship between the dependent variable and the independent variables and aims to find the best-fit line that minimizes the sum of the squared errors between the predicted and actual values.

Decision tree, on the other hand, is a tree-like model that represents decisions and their possible consequences. The model is built by recursively partitioning the data into subsets based on the values of the independent variables and assigning a decision to each subset.

Regression analysis is a statistical technique used to model and analyze the relationship between variables. It is used to predict the value of one variable based on the value of another variable.

Neural network is a computational model inspired by the structure and function of biological neural networks. It is composed of a large number of interconnected processing elements that work together to solve complex problems. Neural networks are used for classification, regression, and other tasks in machine learning.

Linear regression

Linear regression is a type of regression analysis that involves predicting one variable from one or more other variables. In this type of linear predictive function, the dependent variable, or the variable to be predicted, is estimated as a linear combination of independent variables. This means that each of the independent variables is assigned a coefficient during the estimation process. The resulting equation represents a line that best fits the data and can be used to predict the value of the dependent variable for given values of the independent variables. Linear regression is widely used in various fields, such as economics, finance, engineering, and the social sciences, to model the relationship between variables and make predictions or forecast future values (Ahmadi et al., 2020). It is a powerful tool for analyzing data and identifying trends, patterns, and correlations. However, it is important to note that linear regression assumes a linear relationship between the variables and may not be suitable for data that exhibit complex or non-linear patterns (Maulud, 2020b). The linear regression model is defined as follows: the dependent variable is modeled as a linear combination of the independent variables, plus a constant value that is also obtained in the estimation process. Simple linear regression involves only one independent variable, while multiple linear regression involves more than one independent variable. In multivariate linear regression, multiple dependent variables are predicted instead of just one. The estimation process aims to select the coefficients of the linear regression model in a way that best fits the available data. This means that the predictions made by the model should be as close as possible to the actual values observed in the data. Therefore, one of the most important issues in linear regression is to minimize the difference between the predicted values and the actual values (Abdollahzadeh, 2022):

$$Y = w_{0} + w_{1}x_{1} + w_{2}x_{2} + \dots + w_{n}x_{n}$$
(1)

The linear regression model is represented by Eq. (1), where \(x_{i}\) are the independent variables in the dataset and \(Y\) is the dependent variable to be predicted. The regression coefficients \(w_{i}\) are unknown parameters that need to be estimated from the available data (Dehghani & Larijani, 2023a). The model assumes that the dependent variable \(Y\) is a linear combination of the independent variables \(x_{i}\), with each independent variable multiplied by its corresponding regression coefficient \(w_{i}\). The constant term \(w_{0}\) represents the intercept of the regression line. The estimation process aims to find the values of \(w_{0}, w_{1}, w_{2}, \dots, w_{n}\) that best fit the data, which means minimizing the difference between the predicted values and the actual values observed in the data. Linear regression is a powerful tool for analyzing the relationship between variables and making predictions, but it is important to ensure that the assumptions of the model are met and that the results are interpreted correctly (Kim et al., 2023).
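As a minimal illustration of Eq. (1), the coefficients can be estimated by ordinary least squares; the sketch below uses synthetic, noise-free data (all values are illustrative), so the true coefficients are recovered exactly.

```python
import numpy as np

# Minimal sketch of fitting Eq. (1) by ordinary least squares.
# X and y stand in for the building features and the load values.
rng = np.random.default_rng(0)
X = rng.random((20, 3))                      # 20 samples, 3 features x1..x3
y = 2.0 + X @ np.array([1.5, -0.7, 3.0])     # true w0=2.0 and w1..w3 known here

# Append a column of ones so the intercept w0 is estimated jointly.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)   # minimizes the sum of squared errors

print(w)  # ≈ [2.0, 1.5, -0.7, 3.0] since the data are noise-free
```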

Logistic regression

Logistic regression is a statistical technique used to analyze the effect of quantitative or qualitative independent variables on a binary dependent variable. It is similar to linear regression analysis, with the difference that in logistic regression the dependent variable is qualitative and binary, whereas in linear regression the dependent variable is quantitative. In logistic regression, qualitative independent variables must either be binary or be converted into binary dummy variables. The logistic regression model estimates the probability of occurrence of a certain class of the dependent variable based on the values of the independent variables. The model assumes a relationship between the independent variables and the log odds of the dependent variable, which are modeled as a linear combination of the independent variables. The log odds are then transformed into probabilities using the logistic function. The logistic regression model is widely used in fields such as medicine, the social sciences, and business to model binary outcomes and make predictions. It is a powerful tool for analyzing the relationship between variables and identifying the factors that contribute to a certain outcome. However, it is important to ensure that the assumptions of the model are met and that the results are interpreted correctly (Nhu, 2020c):

$$\mathrm{logit}\left(p\right)=\mathrm{log}\left(\frac{p}{1-p}\right)={B}_{0}+{B}_{1}{x}_{1}+\dots +{B}_{d}{x}_{d}$$
(2)

The logistic regression model is represented by Eq. (2), where \(x_{i}\) are the independent variables, \(B_{0}, B_{1}, \dots, B_{d}\) are the regression coefficients to be estimated, and \(p\) is the probability of observing a value of 1 for the dependent variable given the values of \(x_{i}\). The logistic function transforms the log odds into probabilities, which range from 0 to 1.
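As a sketch of Eq. (2), the probability of class 1 can be computed from the log odds via the logistic (sigmoid) function; the coefficients below are hypothetical values, not fitted ones.

```python
import numpy as np

# Sketch of Eq. (2): the linear combination gives the log odds, and the
# logistic (sigmoid) function maps them back to a probability in (0, 1).
def predict_proba(x, B):
    log_odds = B[0] + np.dot(B[1:], x)       # B0 + B1*x1 + ... + Bd*xd
    return 1.0 / (1.0 + np.exp(-log_odds))   # inverse of the logit

B = np.array([-1.0, 0.8, 0.5])               # hypothetical B0, B1, B2
x = np.array([1.2, 0.4])                     # one illustrative sample
p = predict_proba(x, B)
print(round(p, 4))                           # probability of class 1
```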

Decision tree

The ID3 tree algorithm is a top-down decision tree algorithm that selects the best feature at each node based on the amount of information gained from the training data. The algorithm starts by selecting the feature with the highest information gain as the root node. Then, for each possible value of the selected feature, a branch is created, and the training examples are sorted accordingly. The process is repeated recursively until all examples are classified, resulting in a deep tree structure (Naseri et al., 2022). However, this can lead to overfitting, especially when the training data contains noise or has a small number of examples.

Both ID3 and C4.5 tree algorithms use the purity criterion to select features for splitting the tree. The C4.5 tree algorithm improves on ID3 using the Gain Ratio criterion to select features and by adding a pruning method to the tree after it has been constructed. The pruning method helps to prevent overfitting by removing branches that do not improve the accuracy of the tree (Maydanchi et al., 2023).

The C5.0 tree algorithm is a successor of the C4.5 algorithm that can handle numerical data as well as categorical data. It also includes a pruning method to prevent overfitting and stops building the tree when the number of samples falls below a certain threshold.

Overall, decision trees such as the ID3, C4.5, and C5.0 algorithms are powerful tools for classification and prediction tasks. However, it is important to carefully select the features and tune the parameters to avoid overfitting and obtain accurate results (Chen et al., 2016).
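The splitting criterion at the core of ID3, the information gain of a feature, can be sketched on a toy dataset (feature values and labels below are illustrative):

```python
import math
from collections import Counter

# Sketch of the ID3 splitting criterion: the information gain of a feature,
# computed over a list of (feature_value, class_label) pairs.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows):
    labels = [label for _, label in rows]
    base = entropy(labels)
    # Partition by feature value, then subtract the weighted child entropies.
    by_value = {}
    for value, label in rows:
        by_value.setdefault(value, []).append(label)
    remainder = sum(len(part) / len(rows) * entropy(part)
                    for part in by_value.values())
    return base - remainder

# A perfectly separating feature recovers all the entropy (gain = 1 bit here).
data = [("low", "no"), ("low", "no"), ("high", "yes"), ("high", "yes")]
print(information_gain(data))  # 1.0
```

ID3 would compute this gain for every candidate feature and split on the one with the highest value.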

Neural network

Artificial neural systems are computational models that are inspired by the structure and function of natural neural systems, such as the human brain (Chen et al., 2021b). The key component of these models is their data processing framework, which consists of a large number of interconnected components called neurons. These neurons work together to solve specific problems by performing computations on the input data. By training on sample data, artificial neural systems learn to extract hidden information or patterns within the data (Chen et al., 2020).

Each neural network contains an input layer, where each node corresponds to one of the input variables, and a hidden layer, which is connected to multiple nodes in the input layer (Chen et al., 2018). The neurons in the hidden layer perform computations on the input data, and the output of the hidden layer is passed to the output layer, which produces the final output of the neural network (Chen et al., 2021a). To simulate a classification method using neural networks, we must specify the number of neurons in the input layer and have the same number of neurons in the output layer as there are classes.

In practice, the neuron with the highest value in the output layer is taken as the predicted class for that particular data sample. This process is known as feedforward propagation, and it is the basis for many neural network architectures, including multi-layer perceptrons and convolutional neural networks. Artificial neural systems have shown promising results in various applications, including image recognition, speech recognition, and natural language processing. However, training neural networks can be a challenging task, and it requires careful selection of network architecture, training data, and hyper-parameters (Duan et al., 2020). Deep learning has proved superior in producing stable results with lower errors and higher generalization compared to other techniques, making it an effective approach for modeling high nonlinearity in optimization tasks (Kaveh and Khalegi, 1998; Kaveh et al., 2021). Kaveh and Kavaninzadeh (2023) use meta-heuristic algorithms to optimize the parameters of two artificial neural network structures for prediction.

Moreover, overfitting can occur if the model is too complex or if the training data are insufficient. Therefore, the performance of neural networks depends on careful design, training, and evaluation. Further research is needed to improve their performance and to better understand their underlying mechanisms. Overall, artificial neural systems are powerful tools for solving complex problems in various fields, and they have the potential to revolutionize many areas of research and development (Runge and Radu, 2019). Figure 1 represents neural network topology.
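Feedforward propagation with an argmax over the output layer, as described above, can be sketched as follows; the weights here are random placeholders rather than trained parameters, and the layer sizes are illustrative.

```python
import numpy as np

# Minimal sketch of feedforward propagation through one hidden layer.
# A real model would learn W1, b1, W2, b2 by training on sample data.
rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    hidden = relu(W1 @ x + b1)        # hidden-layer activations
    return W2 @ hidden + b2           # one output score per class

W1, b1 = rng.normal(size=(5, 8)), np.zeros(5)   # 8 inputs -> 5 hidden units
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)   # 5 hidden -> 3 classes

x = rng.random(8)                                # one sample with 8 features
scores = forward(x, W1, b1, W2, b2)
predicted_class = int(np.argmax(scores))         # neuron with the highest value
print(predicted_class)
```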

Fig. 1
figure 1

Neural network topology (Kashani et al., 2023)

CRISP process for analyzing cooling and heating load of buildings: a structured data mining approach

In this study, the CRISP process was used to analyze the data and investigate the cooling and heating load of a building. The process consists of six major steps: data extraction; statistical information extraction; discovering the project requirements and selecting an appropriate model; executive steps such as preprocessing and data cleaning if needed; analysis of influential characteristics and classification; and evaluation and application of the model.

The first step involves extracting the necessary data from various sources, such as sensors and other relevant sources. The second step involves extracting statistical information from the data to understand the underlying patterns and relationships in the data. The third step involves identifying the project requirements and selecting an appropriate data mining model based on the requirements.

The fourth step involves performing necessary preprocessing and data cleaning to ensure that the data are suitable for analysis. This step may include identifying and removing missing data and outliers, and normalizing the data to bring it to a common scale.

The fifth step involves analyzing the influential characteristics or features that affect the cooling and heating load of the building. Feature selection techniques may be used to identify the most relevant features.

The final step involves applying the selected model to the data and evaluating its performance. Performance metrics such as accuracy, precision, recall, and F1-score may be used to evaluate the model.

In conclusion, the CRISP process provides a structured approach to implementing data mining projects. By following this process, we can analyze data and investigate the cooling and heating load of a building, which can be used to improve the building’s energy efficiency and reduce energy costs.
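Assuming each stage is implemented as a separate function, the steps above can be sketched as a minimal pipeline; every name and body here is an illustrative placeholder.

```python
# Minimal sketch of the CRISP steps as a pipeline; each function body is a
# placeholder to be replaced with a real implementation.
def extract_data():                      # step 1: data extraction
    return [[1.2, 3.4], [5.6, 7.8]]

def extract_statistics(data):            # step 2: statistical information
    return {"n_samples": len(data), "n_features": len(data[0])}

def preprocess(data):                    # step 4: cleaning / normalization
    return [row[:] for row in data]

def select_features(data):               # step 5: influential characteristics
    return data

def evaluate_model(data):                # step 6: modeling and evaluation
    return {"accuracy": None}            # filled in by a real model

data = extract_data()
report = evaluate_model(select_features(preprocess(data)))
print(extract_statistics(data)["n_samples"])  # 2
```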

UCI database for predicting building heating and cooling load

In this study, the dataset used for analyzing the cooling and heating load of buildings was obtained from the UCI repository, a standard data mining repository. The dataset covers 12 types of residential buildings with the same volume but with variations in parameters such as glazing area, glazing area distribution, and orientation. It includes 768 samples and 10 features, of which 8 are predictor variables and 2 are target variables. The study aims to predict the heating and cooling loads of the buildings as real values. The dataset was selected for its suitability and availability for the study.
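A sketch of loading this dataset is shown below. The file name is an assumption (adjust the path to wherever the dataset was downloaded), and the two hard-coded rows merely stand in for the 768 real samples.

```python
import pandas as pd

# Feature naming follows the UCI energy-efficiency dataset convention.
columns = {
    "X1": "relative_compactness", "X2": "surface_area",
    "X3": "wall_area",            "X4": "roof_area",
    "X5": "overall_height",       "X6": "orientation",
    "X7": "glazing_area",         "X8": "glazing_area_distribution",
    "Y1": "heating_load",         "Y2": "cooling_load",
}

# With the file available locally, loading would look like:
# df = pd.read_excel("ENB2012_data.xlsx").rename(columns=columns)
# Two illustrative rows stand in for the 768 real samples here:
df = pd.DataFrame([[0.98, 514.5, 294.0, 110.25, 7.0, 2, 0.0, 0, 15.55, 21.33],
                   [0.62, 808.5, 367.5, 220.50, 3.5, 3, 0.4, 5, 16.64, 16.03]],
                  columns=list(columns.values()))
print(df.shape)  # (2, 10): 8 predictors plus 2 targets per sample
```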

Statistical analysis of building characteristics for data mining of heating and cooling load: insights from box and frequency plots

In the statistical analysis step, two types of charts were used to evaluate and investigate data errors: a box plot and a frequency plot.

The box plot (Fig. 2) displays the minimum, maximum, median, and interquartile range of each feature. Any outliers in a feature would appear as points beyond the whiskers. The box plot shows that none of the features have outliers.

Fig. 2
figure 2

The degree of dependency between the building features

The frequency plot (Fig. 3) displays the frequency of the data and shows that the distribution of data is uniform.

Fig. 3
figure 3

Box diagrams of the dataset

Overall, these charts provide valuable insights into the characteristics of the dataset and help identify any potential data errors or outliers.
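The outlier check that a box plot performs visually can be sketched numerically with the interquartile-range rule; the feature values below are illustrative stand-ins for one column of the dataset.

```python
import numpy as np

# Sketch of the box-plot outlier rule: values outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] would be flagged as outliers.
def iqr_outliers(feature):
    q1, q3 = np.percentile(feature, [25, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return feature[(feature < low) | (feature > high)]

feature = np.array([2.5, 3.0, 3.5, 4.0, 4.5, 5.0])  # uniform-like, no outliers
print(len(iqr_outliers(feature)))  # 0
```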

Using Pearson correlation coefficient to evaluate feature performance in linear regression

Linear regression is a widely used method for modeling the relationship between two variables, with one variable seen as the dependent variable and the other as the independent variable. To evaluate the performance of each feature and its influence on linear regression, we used the correlation coefficient. The correlation coefficient measures the ability to predict the value of one variable in terms of the other. The closer the correlation coefficient is to 1 or -1, the more linear the relationship between the two variables. A value close to 1 indicates a direct relationship, while a value close to -1 indicates an inverse relationship (El-Sisi et al., 2022).

One of the most popular ways to measure the dependence between two quantitative variables is by calculating the Pearson correlation coefficient. For instance, if \(Y=a + bX\), the Pearson correlation coefficient between X and Y can be calculated using the following formula (Cen et al., 2017):

$$\rho \left( {X,Y} \right) = \frac{E\left[ {\left( {X - E\left( X \right)} \right)\left( {a + bX - E\left( {a + bX} \right)} \right)} \right]}{\left[ {V\left( X \right)\,b^{2} V\left( X \right)} \right]^{0.5}}$$
(3)

where E denotes the mathematical expectation, and V denotes the variance. For a random sample of size n of random variables X and Y (Xi, Yi), the Pearson sample correlation coefficient is calculated using the following formula (Moshtaghi Largani & Lee, 2023):

$$r\left(x,y\right)=\frac{n\sum x_{i}y_{i}-\sum x_{i}\sum y_{i}}{\left[n\sum x_{i}^{2}-\left(\sum x_{i}\right)^{2}\right]^{0.5}\left[n\sum y_{i}^{2}-\left(\sum y_{i}\right)^{2}\right]^{0.5}}$$
(4)

In this formula, Σ denotes the sum of the values, and the square root of the variances is used to normalize the coefficient.
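Eq. (4) can be sketched directly from the raw sums and checked against numpy's built-in implementation; the sample values are illustrative.

```python
import numpy as np

# Sketch of Eq. (4): the sample Pearson coefficient computed from raw sums,
# checked against numpy's built-in corrcoef.
def pearson(x, y):
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = (np.sqrt(n * np.sum(x**2) - np.sum(x)**2)
           * np.sqrt(n * np.sum(y**2) - np.sum(y)**2))
    return num / den

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])          # roughly linear in x
print(round(pearson(x, y), 4))
print(round(np.corrcoef(x, y)[0, 1], 4))    # matches the formula above
```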

Using the correlation coefficient in linear regression, we were able to evaluate the performance of each feature and understand the relationship between the variables (Chen et al., 2022). This information was critical in predicting the dependent variable’s value based on the independent variable. Overall, the correlation coefficient was a valuable tool for analyzing the relationship between variables in linear regression (Molina-Solana et al., 2017). Correlations between features are shown in Fig. 2.

The effective weight of the features using the correlation coefficient can be seen in Table 1.

Table 1 Uncovering the influence of features on cooling and heating load property through correlation coefficients

The study examined the impact of different features on the heating and cooling load properties of 12 building models (Fig. 3). To evaluate the performance of each feature, weights between zero and one were calculated. The results indicated that building height had the most significant impact on heating and cooling loads, with a weight of 1; this feature directly affected the heating load of all 12 building models. It was followed by the roof area, with a weight of 0.96. On the other hand, the building location and orientation features were found to be the least efficient for predicting heating and cooling loads, while the method of glazing distribution had very little effect. Regression equations were developed for the heating and cooling loads. The equation for the heating load is

$$Y = 97.337 - 70.788X_{1} - 0.088X_{2} + 0.045X_{3} + 4.284X_{5} + 0.122X_{6} + 14.818X_{7} + 14.818X_{8}$$
(5)

The equation for the cooling load is

$$Y = 83.932 - 64.773X_{1} - 0.082X_{2} + 0.055X_{3} - 0.011X_{4} + 4.17X_{5} + 19.933X_{7} + 0.204X_{8}$$
(6)
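The two fitted equations can be applied directly once the predictor values are known. The sketch below simply encodes the reported coefficients; the assumption that the inputs are supplied as a list in the order X1 through X8 is ours, not the paper's:

```python
def heating_load(x):
    """Predicted heating load from Eq. (5); x = [X1, ..., X8].
    X4 does not appear in Eq. (5), so it is unpacked but unused."""
    X1, X2, X3, X4, X5, X6, X7, X8 = x
    return (97.337 - 70.788 * X1 - 0.088 * X2 + 0.045 * X3
            + 4.284 * X5 + 0.122 * X6 + 14.818 * X7 + 14.818 * X8)

def cooling_load(x):
    """Predicted cooling load from Eq. (6); X6 does not appear."""
    X1, X2, X3, X4, X5, X6, X7, X8 = x
    return (83.932 - 64.773 * X1 - 0.082 * X2 + 0.055 * X3
            - 0.011 * X4 + 4.17 * X5 + 19.933 * X7 + 0.204 * X8)
```

With all predictors at zero, each function returns its intercept (97.337 and 83.932, respectively), which is a quick sanity check on the transcription.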

To validate the accuracy of the models, the coefficient of determination (\(R^{2}\)), the root mean square error (RMSE), and the mean absolute error (MAE) were calculated.

The \(R^{2}\) value indicates the proportion of the variance in the dependent variable that is explained by the independent variables; values greater than 0.6 indicate that the independent variables explain a substantial share of the changes in the dependent variable. The formula for \(R^{2}\) is as follows (Sasani et al., 2023):

$${R}^{2} =1-\frac{S{S}_{res}}{S{S}_{tot}}$$
(7)

\(S{S}_{res}\) is the residual sum of squares and is equal to

$$S{S}_{res}=\sum_{i=1}^{n}({y}_{i}-{\widehat{y}}_{i}{)}^{2}$$
(8)

\(S{S}_{tot}\) is the total sum of squares and is equal to

$${\text{SS}}_{{{\text{tot}}}} = \mathop \sum \limits_{i = 1}^{n} ({\text{y}}_{{\text{i}}} - \overline{y})^{2}$$
(9)

where \(y_{i}\) is the actual value of the dependent variable in the ith sample. \(\hat{y}_{i}\) is the predicted value by the model for the ith sample. \(\overline{y}\) is the mean of the dependent variable in all samples.
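Eqs. (7)–(9) combine into a short function. This is a generic sketch of the metric, not the study's own code:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination, Eqs. (7)-(9): R^2 = 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)                                  # mean of actual values
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))       # Eq. (8)
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)                     # Eq. (9)
    return 1 - ss_res / ss_tot

# A perfect prediction gives R^2 = 1
print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))  # → 1.0
```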

RMSE, or root mean square error, is one of the most commonly used statistical parameters for assessing predictions. This parameter shows how much the predicted model differs from the actual data; the lower the RMSE value, the better the model. The formula for calculating RMSE is as follows (Dehghani and Larijani, 2023):

$$RMSE = \sqrt {\frac{{\sum\limits_{i = 1}^{n} {\left( {P_{i} - O_{i} } \right)^{2} } }}{n}}$$
(10)

where \(P_{i}\) is the predicted value for the ith data point, \(O_{i}\) is the actual value for the ith data point, and \(n\) is the number of data points. MAE, or mean absolute error, is another statistical parameter used to measure the difference between predicted and actual data. This parameter shows how far the predicted model is from the actual data; the lower the MAE value, the better the model. The formula for calculating MAE is as follows:

$$MAE=\frac{\sum_{i=1}^{n}\mid {P}_{i}-{O}_{i}\mid }{n}$$
(11)

where \(P_{i}\) is the predicted value for the ith data point, \(O_{i}\) is the actual value for the ith data point, and \(n\) is the number of data points.
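Both error metrics, Eqs. (10) and (11), are one-liners in code; this generic sketch mirrors the formulas above:

```python
import math

def rmse(pred, obs):
    """Root mean square error, Eq. (10)."""
    n = len(pred)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / n)

def mae(pred, obs):
    """Mean absolute error, Eq. (11)."""
    n = len(pred)
    return sum(abs(p - o) for p, o in zip(pred, obs)) / n

print(mae([0, 0], [3, 4]))  # → 3.5
```

Because RMSE squares the residuals before averaging, it penalizes large individual errors more heavily than MAE does, which is why the two are usually reported together.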

The study provides valuable insights into the impact of different features on heating and cooling load properties and offers regression equations that can accurately predict these properties. These findings can be useful for architects, engineers, and building designers in creating more energy-efficient buildings. Figure 4 represents a comparative diagram of three statistical indicators (Table 2).

Fig. 4
figure 4

Histogram charts of the dataset

Table 2 RMSE, R2 and MAE values in linear regression

Classification methods for evaluating influential factors on heating and cooling loads

The study used the information gain index with entropy, a weighting index method of data mining, to assess the performance and influence of each feature on heating and cooling loads. This method calculates the relationship between variables based on entropy, or information gain, in the dataset and assigns a weight to each variable as an output. The closer a weight is to 1, the greater the impact of that feature on the target variable.

The information gain formula is defined based on Shannon's entropy and is calculated as the difference between the entropy of the entire dataset and the entropy of the dataset after splitting based on a feature. The entropy formula is defined as the sum of the probability of each class label multiplied by the logarithm of that probability (Duan et al., 2017):

$$\mathrm{Information\;Gain}\left(A\right) = \mathrm{Entropy}\left(D\right) - \mathrm{Entropy}_{A}\left(D\right)$$
(12)

In the information gain index with entropy, A represents the feature being evaluated, and D is the training dataset. Shannon's entropy formula is used to calculate the entropy of the dataset, which is defined as the measure of impurity or unpredictability in the data. The formula takes into account the number of classes (C) in the training data, the probability (Pi) of an example belonging to the ith class, the number of members (v) in the characteristic feature domain A, and Dj which represents a subset of the original data. The size of the dataset is denoted by |D| (Cen et al., 2018).

The entropy formula is calculated as the sum of the probability of each class label multiplied by its logarithm. A lower entropy value indicates a more homogeneous dataset, while a higher value indicates a more diverse dataset. By calculating the entropy of the dataset before and after splitting based on a feature, we can determine the information gain, which represents the reduction in entropy or the increase in homogeneity achieved by splitting the dataset based on that feature:

$$\mathrm{Entropy}\left(D\right) = -\mathop \sum \limits_{i = 1}^{c} P_{i} \times \log \left(P_{i}\right), \qquad \mathrm{Entropy}_{A}\left(D\right) = \mathop \sum \limits_{j = 1}^{v} \frac{\left|D_{j}\right|}{\left|D\right|} \times \mathrm{Entropy}\left(D_{j}\right)$$
(13)

First, the results of the weighted average of the indicators are reported. The weights are numbers between 0 and 1; the closer a weight is to 1, the greater the impact of that feature on the heating and cooling loads (Iraji et al., 2023).
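Eqs. (12) and (13) can be sketched in a few lines for a discrete feature; the toy labels below are hypothetical and only illustrate the mechanics of the index:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, Eq. (13), using log base 2."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Information gain of a discrete feature A, Eq. (12):
    Entropy(D) - sum_j |Dj|/|D| * Entropy(Dj)."""
    n = len(labels)
    split_entropy = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        split_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - split_entropy

# A feature that perfectly separates the two classes recovers the full
# dataset entropy as gain (hypothetical toy data):
labels = ["high", "high", "low", "low"]
feature = ["tall", "tall", "short", "short"]
print(information_gain(feature, labels))  # → 1.0
```

A pure subset has entropy 0, so a perfectly separating feature achieves the maximum possible gain, matching the "weight close to 1" interpretation above.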

In this study, decision trees, regression, and neural network techniques were employed for the classification step to anticipate building heating loads. An intelligent method of cross-validation was used to validate and split the data into training and experimental parts, ensuring that all data are used for training and validation.
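The cross-validation idea described above can be sketched as a simple k-fold split. The study does not state its exact fold count or shuffling scheme here, so this stdlib sketch is generic, not a reproduction of the study's setup:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal, contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n, k):
    """Yield (train_indices, test_indices) pairs so every sample is used
    once for validation and k-1 times for training."""
    folds = k_fold_indices(n, k)
    for i, test in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test
```

Each iteration trains on k−1 folds and validates on the held-out fold, which is what guarantees that all data participate in both training and validation.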

The accuracy of the classification models was evaluated using a confusion matrix, which shows the performance of the classifier based on input datasets from different classes of the classification problem. The matrix includes four categories: true positive, false positive, true negative, and false negative. True positive represents the number of records whose real class is positive and correctly recognized by the classifier, while false positives refer to the number of records whose true class is negative but were incorrectly classified as positive. True negatives indicate the number of records whose true class is negative and correctly identified by the classifier, and false negatives refer to the number of records whose actual class is positive but were incorrectly identified as negative by the classifier.

The evaluation criterion is based on the accuracy or classification rate, which represents the percentage of samples correctly classified by the classifier in either the positive or negative class (Table 3). The accuracy is calculated using the values from the confusion matrix as follows (Hashemi et al., 2023):

Table 3 Confusion matrix
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
(14)

Accuracy is thus calculated by adding the true positives and true negatives and dividing the result by the total number of samples, which includes true positives, true negatives, false positives, and false negatives. Using this evaluation criterion, we can compare the performance of different classification models and select the best one for predicting building heating and cooling loads.

Results

As shown in Table 4, the decision tree method achieved the highest classification accuracy, with 98.96% accuracy for predicting heating load and 93.24% accuracy for predicting cooling load. The logistic regression and neural network techniques also showed reasonably high accuracies, with 93.36% and 95.31% accuracy for predicting heating load, and 90.63% and 91.93% accuracy for predicting cooling load, respectively.

Table 4 Performance comparison of classification techniques for forecasting heating and cooling loads in buildings

The results indicate that the decision tree method is more effective in predicting heating load than cooling load, as the accuracy for heating load prediction was higher and had less error. These findings suggest that decision tree classification could be a valuable tool in predicting building heating and cooling loads, which can help architects, engineers, and building designers create more energy-efficient buildings. Figures 5 and 6 present a comparison of the results.

Fig. 5
figure 5

RMSE, R2 and MAE comparison

Fig. 6
figure 6

Accuracy comparison in three machine-learning methods

Conclusion

In this study, linear regression and classification techniques were used to evaluate and predict the heating and cooling loads of a building using a dataset of 768 samples with 8 predictor characteristics and 2 target variables. The correlation coefficient method was used in linear regression, and the information gain entropy method was used in classification techniques to check the influence of factors on the data. The weight of the influence of each factor on the heating load of the building was reported, and the results indicated that the height and area of the roof, followed by the area of the floor and the relative compactness, are among the most important influencing factors. It was also found that the location and orientation of the building do not affect the heating and cooling loads.

In the classification step, the dataset was divided into training and testing parts, and three classification techniques (decision tree, logistic regression, neural network) were applied to the data. The decision tree technique achieved the highest classification accuracy, with 98.96% accuracy for the heating load and 93.24% for the cooling load, as determined from the overall accuracy and confusion matrix results.

Overall, the results of this study provide valuable insights into the factors that influence the heating and cooling loads of a building and the effectiveness of different data mining techniques in predicting these loads. These insights can be used to improve the energy efficiency of buildings and reduce energy costs. Future studies can focus on exploring the effects of human behavior on energy usage and utilizing quantum computing to improve the accuracy and speed of prediction models for enhancing building energy efficiency.