1 Introduction

Human-made climate change is in full swing and revealing first negative effects (Larsen et al. 2020). The United Nations’ Paris Agreement declared ambitious climate goals and aims to decrease energy end-use below 1990 levels by 2030 (Boden et al. 2017). However, current efforts are not sufficient to achieve the intended goals, and additional steps are therefore necessary (European Environment Agency 2019). Residential single- and two-family buildings are among the largest single energy-consuming sectors in Germany, accounting for 11% of the overall final energy consumption, 84% of which relates to heating and hot water production, with similar figures for many other countries (Cao et al. 2016; Federal Ministry for Economic Affairs and Energy 2018). Moreover, 64% of German residential buildings were erected before 1979 and were thus subject to less strict construction codes than today, which offers considerable energy savings potential when conducting retrofits (Federal Statistical Office of Germany 2011). Nevertheless, retrofit measures on these buildings are carried out only sparsely, and the retrofit rate – the percentage of buildings that undergo retrofits in one year – is too low to reach the climate goals (Achtnicht and Madlener 2014).

In this vein, Energy Performance Certificates (EPC) have been designed to support achieving the climate goals in the EU and particularly in Germany (European Parliament and the Council 2002). EPCs are issued by qualified auditors and are intended to increase the retrofit rate by providing general information about buildings, their Final Energy Performance (FEP) – the annual amount of energy required for space and water heating, cooling, and ventilation per square meter effective building area – and possible retrofit measures (Arcipowska et al. 2014). For EPCs to achieve their full effect, accurate prediction of the FEP is important for deciding on purposeful retrofit measures, as uncertainty and incomplete information are substantial investment barriers (Amecke 2012). However, today’s most frequently used and legally prescribed Energy Quantification Methods (EQM) are hotly debated in the research community, as they exhibit low prediction accuracy (Hardy and Glew 2019). The prescribed engineering EQM is based on physical laws to calculate thermal dynamics and energy behavior (Zhao and Magoulès 2012) and requires detailed information on building components, gathered by auditors during on-site inspections (Arcipowska et al. 2014). If the input data quality is low, e.g., because the insulation materials are not known and cannot be determined with reasonable effort, the result will also be erroneous.

To enhance the prediction accuracy, data-driven EQMs were introduced in research and obtained promising results in preliminary studies (Sutherland 2020). They learn underlying dependency structures from available data without relying on expert knowledge of building physics or precise information on building components (Amasyali and El-Gohary 2018). This allows data-driven EQMs to potentially overcome the shortcomings of engineering EQMs. However, there is a lack of studies on data-driven EQMs in residential buildings considering heating energy with a focus on long-term (annual) energy prediction, as required for EPCs (Amasyali and El-Gohary 2018). Furthermore, most studies are based on simulated building and energy data, which limits their practical applicability and the validity of the findings (Wei et al. 2018). It is therefore unclear whether data-driven methods can outperform the engineering EQM with respect to annual energy prediction of residential buildings necessary for EPCs, and, if so, which data-driven EQMs are particularly suited. Even though different EQMs have been applied in several case studies (Buratti et al. 2014; Tsanas and Xifara 2012), to the best of our knowledge no benchmarking of different EQMs on the same underlying real-world data has been performed, which is nonetheless essential for full comparability and transparency of the algorithms’ performance in practice. Thus, we formulate our guiding research question as follows:

Which of the investigated energy quantification methods yields the highest accuracy for predicting final energy performance of real-world residential single- and two-family buildings in Germany?

In this sense, our goal on the methodological level is not to explain the underlying causality, but to predict energy consumption, allowing us to benchmark prediction accuracy (Shmueli and Koppius 2011). Since the computational performance of data-driven methods generally exceeds that of engineering methods after initial training, we focus on the prediction accuracy, i.e., effectiveness, and not efficiency.

We address the research question by implementing and tuning several machine learning algorithms – Artificial Neural Network (ANN), D-vine copula quantile regression, Extreme Gradient Boosting (XGB), Random Forest (RF), and Support Vector Regression (SVR) – on an extensive first dataset containing 25,000 real-world single and two-family buildings in Germany. We subsequently calculate the output accuracy (predictive power) by predicting the FEP of 345 additional buildings from a second dataset and comparing the prediction with the actual metered energy consumption. As the second dataset was gathered by qualified energy auditors and also encompasses the FEP stated in the EPCs based on the prescribed engineering EQM, we can further compare the data-driven EQMs to the engineering EQM. To ensure robust results and to comply with state-of-the-art machine learning practices, we benchmark the machine learning algorithms against each other in depth based on nested cross-validation on both building datasets, which is not possible for the engineering EQM due to data restrictions. By stratifying the Performance Evaluation Measures (PEM) based on a third dataset which contains information on the German building stock, we ensure representativeness.

Even though the applied solutions and the respective problem in this research are technically known, we argue that we contribute an improvement to existing solutions in terms of Gregor and Hevner (2013) for the following reasons: (1) We are among the first to compare existing solutions (i.e., different EQMs) in terms of solution maturity within a new application domain of the annual FEP prediction for residential buildings, filling the research gap of missing data at the residential building level and the application of data-driven EQMs. (2) Because data-driven EQMs must be designed for specific applications to unleash their full potential (Mosavi et al. 2019), existing knowledge about the performance of data-driven EQMs on non-residential buildings cannot be transferred to the residential building stock directly. Especially in countries like Germany, which have a very high percentage of single- and two-family buildings (Federal Statistical Office of Germany 2011), the improvement of the quantification of the energy efficiency of buildings is relevant to advance towards the set climate goals.

The remainder of this study is structured in seven sections: Sect. 2 summarizes the theoretical background of EPCs, previous research on EQMs, and PEMs to assess the EQMs’ prediction accuracy. Section 3 presents the methodology and the study design for the benchmarking process. The datasets and pre-processing procedure are then introduced in Sect. 4. In Sect. 5 we display the model training as well as the model optimization and present the results in Sect. 6. We discuss the results and provide managerial and policy implications as well as limitations and prospects for further research in Sect. 7 before the final Sect. 8 concludes.

2 Problem Context and Theoretical Background

2.1 Energy Performance Certificates

The European Parliament and Council passed a directive in 2002 that declares the need for EPCs to improve the energy performance of buildings, aiming to inform owners, occupants, and property developers about the energetic building state and related operating costs (European Parliament and the Council 2002). EPCs are issued by qualified auditors and illustrate the energy performance of individual buildings as well as further information like building age, energy source of the heating system, recommendations for energetic retrofit measures, or the building’s position in an energy efficiency ranking scheme, which allows different buildings to be compared (Poel et al. 2007).

Both literature and practice discuss manifold aspects of EPCs (Li et al. 2019). Besides the extent to which EPCs influence the real estate market and the impact and relevance of EPCs for retrofit and purchasing decisions, the energy performance gap is a major challenge of EPCs (Pasichnyi et al. 2019). The energy performance gap describes the phenomenon that the actually metered FEP differs significantly from the predicted FEP, with studies depicting deviations of up to 287% (Calì et al. 2016; Wilde 2014). Many studies are dedicated to the gap’s existence, causes, and solutions to minimize it (Burman et al. 2014; Herrando et al. 2016; Menezes et al. 2012). One possible solution to address the energy performance gap is the use of data-driven EQMs instead of engineering EQMs (Foucquier et al. 2013). Another option to minimize this gap is a demand-consumption comparison, which is regulated in addendum 1 of DIN V 18599 for retrofitting consulting, but is not part of official EPCs (Beuth Verlag GmbH 2010). The norm defines key figures and correlations in order to approximate the calculated demand to the measured consumption step by step and thus to minimize the performance gap by improving retrofitting decisions (Bigalke and Marcinek 2016).

In Germany, the Energy Saving Ordinance forms the regulatory framework for EPCs with the FEP as target measure (Deutscher Bundestag 2013). EPCs for residential buildings concentrate mostly on space and water heating, as cooling and controlled ventilation systems are not common in Germany (Federal Ministry for Economic Affairs and Energy 2018). Thus, in our research, we focus on the FEP for space and water heating. Broadly speaking, an EPC is issued either by metering (measured EPC) or by calculations (calculated EPC). Measured EPCs reflect the actually metered annual consumption of all energy sources that have contributed to the heating, ventilation, and cooling of a house within the last three consecutive years, thus implicitly including occupant behavior. Calculated EPCs reflect the energy demand and determine the FEP by means of a technical analysis of a multitude of building parameters prescribed by the Energy Saving Ordinance. To collect the required information to carry out calculated EPCs, on-site inspections of qualified auditors are needed. The German engineering norm DIN V 18599 is the standard calculation scheme to determine the FEP of buildings (DIN e.V. and Beuth Verlag 2016). For residential buildings the norm DIN V 4108–6 can also be applied in combination with DIN V 4701–10 or DIN V 4701–12 (Deutscher Bundestag 2013). The current guidelines necessitate calculated EPCs for nearly two-thirds of all residential buildings in Germany. As a large part of these buildings was constructed before the heat insulation ordinance of 1977, thus offering great energy savings potential, we focus on calculated EPCs in the following (Federal Statistical Office of Germany 2011).

For a better understanding of the following sections, we describe necessary calculation rules of EPCs. The FEP is related to the effective building area \({A}_{e}\) [m²], which does not correspond to the more common living space \({A}_{l}\) [m²] (Deutscher Bundestag 2013). The effective building area includes areas that are heated indirectly like corridors or stairways, and thus turns out to be larger than the living space. According to national legislation, the effective building area depends on the heated building volume and the story height, but can also be approximated with the living space and the factor \({f}_{c}\) using Eq. (1) (Footnote 1). The factor \({f}_{c}\) is used for approximating the effective area \({A}_{e}\) with the living space \({A}_{l}\), which is more commonly available for tenants or homeowners. The conversion factor \({f}_{c}\) is 1.35 for buildings which contain no more than two apartment units with a heated basement, and 1.2 for all other buildings (Deutscher Bundestag 2013).

$${A}_{e}={f}_{c}\cdot {A}_{l}.$$
(1)

To meaningfully compare buildings from different locations, i.e., with different climatic conditions, the FEP is weather-rectified by referring to the climate of the reference location of Potsdam in a test reference year (Deutscher Bundestag 2013). To extract weather effects, the broadly accepted and normatively formalized method of climate factors \(\left(CF\right)\) based on heating degree days (\(HDD\)) is established in research (You et al. 2014). A degree day is defined as the difference between an indoor comfort temperature (\({\tau }_{I}\)) and the average daily outdoor temperature (\({\tau }_{i}\)). The \(HDD\) equal the sum of degree days over a certain period of \(N\) days, counting only days on which \({\tau }_{i}\) does not exceed the heating limit (\({\tau }_{L}\)) (e.g., 15 °C in Germany for existing buildings (Olonscheck et al. 2011)), as depicted in Eq. (2) (Baltuttis et al. 2019):

$$HDD \left({\tau }_{L},{\tau }_{I}\right)=\sum_{i=1}^{N}{1}_{{\tau }_{L}\ge {\tau }_{i}}({\tau }_{I}-{\tau }_{i}) .$$
(2)

The indicator function \({1}_{{\tau }_{L}\ge {\tau }_{i}}\) takes the value 1 if the average outdoor temperature \({\tau }_{i}\) is below or equal to the heating limit \({\tau }_{L}\) and is 0 for all other cases. By calculating the \(HDD\) for two locations (\(X, Y\)), the climate factor \(CF\) can be derived according to Eq. (3):

$$CF=\frac{HDD \left(X\right)}{HDD \left(Y\right)} .$$
(3)
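Eqs. (2) and (3) can be illustrated with a minimal Python sketch. The 15 °C heating limit is taken from the text; the indoor comfort temperature of 20 °C, the function names, and the temperature series are assumptions made only for this example.

```python
from typing import Sequence


def heating_degree_days(daily_mean_temps: Sequence[float],
                        heating_limit: float = 15.0,
                        indoor_temp: float = 20.0) -> float:
    """Eq. (2): sum of (tau_I - tau_i) over all days with tau_i <= tau_L.

    indoor_temp = 20.0 is an assumed comfort temperature, not a value
    given in the text.
    """
    return sum(indoor_temp - t for t in daily_mean_temps if t <= heating_limit)


def climate_factor(hdd_x: float, hdd_y: float) -> float:
    """Eq. (3): ratio of the HDD at location X to the HDD at location Y."""
    if hdd_y <= 0:
        raise ValueError("HDD of location Y must be positive")
    return hdd_x / hdd_y


# Hypothetical daily mean outdoor temperatures in deg C (illustrative only).
temps_x = [2.0, 5.0, 16.0, 15.0]   # reference location X
temps_y = [0.0, 10.0, 18.0, 14.0]  # building location Y

hdd_x = heating_degree_days(temps_x)  # the 16.0 deg C day is not counted
hdd_y = heating_degree_days(temps_y)  # the 18.0 deg C day is not counted
cf = climate_factor(hdd_x, hdd_y)
```

Days above the heating limit contribute nothing to the sum, which is exactly the role of the indicator function in Eq. (2).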

Based on the climate factor \(CF\), the measured consumption of location \(Y\) can be adjusted to the weather conditions of location \(X\). With the help of the climate factor and the effective building area, we can calculate the FEP of a building from any location the same way it is given in EPCs by rectifying the final energy demand or measured consumption \(C\) using Eq. (4). For EPCs the HDDs of location \(X\) refer to the climatic conditions of the reference location of Potsdam and the corresponding test reference year (Deutscher Bundestag 2013). This enables us to compare buildings’ energy performance independently of their location, size, and weather-related temperature effects.

$$FEP=C\cdot \frac{CF}{{A}_{e}}.$$
(4)
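As a worked illustration of Eqs. (1) and (4), the following Python sketch derives the effective building area from the living space and then rectifies a consumption value. The conversion factors 1.35 and 1.2 come from the text; the function names, the unit kWh, and all input values are hypothetical.

```python
def effective_area(living_space_m2: float, small_with_heated_basement: bool) -> float:
    """Eq. (1): A_e = f_c * A_l.

    f_c is 1.35 for buildings with at most two apartment units and a
    heated basement, and 1.2 for all other buildings.
    """
    f_c = 1.35 if small_with_heated_basement else 1.2
    return f_c * living_space_m2


def final_energy_performance(consumption: float,
                             climate_factor: float,
                             effective_area_m2: float) -> float:
    """Eq. (4): FEP = C * CF / A_e."""
    if effective_area_m2 <= 0:
        raise ValueError("effective building area must be positive")
    return consumption * climate_factor / effective_area_m2


# Hypothetical building: 120 m2 living space, heated basement,
# 24,300 kWh metered annual consumption, climate factor 1.1.
a_e = effective_area(120.0, small_with_heated_basement=True)
fep = final_energy_performance(24300.0, 1.1, a_e)
```

With these illustrative inputs the effective area is about 162 m² and the FEP about 165 kWh per m² and year.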

2.2 Energy Quantification Methods

Quantifying buildings’ energy performance is a challenging task with multiple influencing factors like building geometry, occupant behavior, thermal properties, or weather (Wei et al. 2018). Accordingly, the field of EQM research is diverse and methods differ significantly regarding their level of detail and purpose (Wang et al. 2012). Common dimensions to distinguish the scope of EQM studies are building types, prediction time horizon, and the scope of energy performance (Amasyali and El-Gohary 2018). Thereby, most studies currently focus on the prediction of overall energy performance for commercial and/or educational buildings with an hourly time horizon (Wei et al. 2018). In their extensive literature reviews, where they examined collectively over 200 articles, Amasyali and El-Gohary (2018), Bourdeau et al. (2019), and Wei et al. (2018) independently conclude that there is a lack of research for residential buildings and specifically for long-term annual energy prediction. Especially, the combination necessary for EPCs in residential buildings has not been sufficiently analyzed by means of data-driven EQMs. This also holds true for 2019 onwards, as indicated in Table 1. Real-world applications and data are necessary to obtain reliable results, because synthetic data from simulation models use simplifications and required input parameters are often not available (Wei et al. 2018). Nonetheless most studies currently use synthetic data instead. There are many reasons for the lack of large and reliable real-world datasets for residential buildings, as collecting data for residential buildings is a difficult and time-consuming task. The building stock is extremely diverse (Bourdeau et al. 2019), and the data sources are not standardized, which requires extensive questionnaires and tools for data collection. In addition, parameters and terms are often interpreted differently, making it difficult to align datasets (Carpino et al. 2019). 
With our study we directly address this research gap, focusing on residential buildings, using real-world data, and predicting annual heating energy performance.

Table 1 Recent studies (2019–2021) of data-driven energy quantification methods and energy prediction (list not conclusive)

In general, EQMs are categorized into engineering methods, data-driven methods, and hybrid methods combining the former (Foucquier et al. 2013). In literature there is no consistent terminology for EQMs. As a generic term, methods or approaches are often used, both for engineering and for data-driven methods (Bourdeau et al. 2019). For data-driven methods, depending on the research domain, the term (machine learning) algorithm is widely established (Amasyali and El-Gohary 2018). In this study we use the terminology “methods” when referring to data-driven, engineering, or hybrid EQMs in general and “machine learning algorithms” when referring to individual instances of data-driven EQMs, e.g., RF or ANN. Even though hybrid methods try to exploit the advantages of engineering as well as data-driven methods while simultaneously minimizing their disadvantages, the necessary knowledge about both EQMs as well as computational inefficiencies poses a great challenge, which makes the hybrid methods less attractive (Wei et al. 2018). Thus, in our study we focus on engineering and data-driven EQMs. Engineering EQMs model the thermal behavior of heat flows in buildings based on physical laws (Amasyali and El-Gohary 2018). Figure 1 displays exemplarily the heat flows considered in engineering EQMs. These include, for example, transmission heat losses \({H}_{T}\) through the building shell (e.g., walls, windows, roof, etc.), ventilation heat losses \({H}_{V}\), caused by airing or leakages in the building shell, solar heat gains \({Q}_{S}\), and internal heat gains \({Q}_{i}\) (e.g., electrical consumers or heat radiated by occupants). The heating energy demand \({Q}_{h}\) provided with a heating system is consequently calculated from the heat losses, to ensure a constant room temperature. In addition, the demand for hot water heating \({Q}_{tw}\) must be calculated and the heating system’s efficiency considered (Ettrich 2008).

Fig. 1 Generic illustration of heat flows considered in engineering EQMs to calculate the heating energy demand (own illustration based on Ettrich (2008))

Over the past 50 years, different types of engineering EQMs varying in model complexity and prediction accuracy were developed (Zhao and Magoulès 2012). For the case of calculated EPCs from Germany, quasi-steady-state methods are prescribed by the Energy Saving Ordinance (Eicker et al. 2018). Generally, engineering EQMs require detailed information about all building components and the building’s environment, like external climate conditions, geometrical data, building construction, material properties, or operation (Zhao and Magoulès 2012). Especially for existing buildings, the required information and parameters are hardly accessible and thus costly and time-consuming to collect (Wang et al. 2012). Furthermore, engineering EQMs are widely debated with regard to their prediction accuracy, as they reveal high energy performance gaps, as highlighted in Sect. 2.1.

In contrast to engineering EQMs, data-driven EQMs do not require detailed knowledge about building physics and technical aspects, but use machine learning algorithms to predict building energy performance by learning from available data (Amasyali and El-Gohary 2018). Data-driven EQMs require algorithm training, testing, and validation (Bourdeau et al. 2019). In addition, upfront effort has to be put into data collection and pre-processing (Kaymakci et al. 2021). Data-driven EQMs have shown convincing results in research regarding prediction accuracy and have surpassed engineering EQMs in several studies (Wei et al. 2018). Researchers agree that data-driven EQMs designed for a particular application achieve the highest degree of accuracy (Mosavi et al. 2019). Yet, a major limitation of data-driven EQMs is data availability and data quality (Foucquier et al. 2013).

ANN, SVR, and decision trees (or RF and XGB as decision tree ensembles) are the three most used machine learning algorithms for predicting building energy performance (Amasyali and El-Gohary 2018). Even though Bourdeau et al. (2019) and Amasyali and El-Gohary (2018) indicate that SVM and ANN may be the best performing data-driven EQMs to predict building energy performance, there is no consistent picture in the literature yet as to which EQM performs best in terms of prediction accuracy (Ahmad et al. 2018; Aydinalp et al. 2004; Wei et al. 2018). Different advantages and disadvantages of data-driven EQMs, like dealing with incomplete data, complexity of the models’ training process, or computation speed, are discussed. Particularly interesting is the novel D-vine copula quantile regression. Copulas are essentially d-dimensional distribution functions, which can also be used for energy quantification or prediction. They are especially suited for complex prediction tasks, as copulas are able to capture complex dependence patterns even in the tails of the distributions (Czado 2019; Nelsen 2010). So far, copulas have been applied to various fields of study and have delivered promising results (Kraus and Czado 2017; Schallhorn et al. 2017; Töppel et al. 2019).

2.3 Performance Evaluation Measures

Predictive analytics requires empirical predictive models and methods for evaluating their predictive power – PEMs (Shmueli and Koppius 2011). In literature several PEMs are broadly discussed. Amasyali and El-Gohary (2018) provide an overview of the most commonly-used PEMs for predicting building energy consumption. As the most widely used PEMs they mention the Coefficient of Variation (CV), the Mean Absolute Percentage Error (MAPE), the Root-Mean-Square Error (RMSE), and the Mean Absolute Error (MAE). Table 2 gives an overview of the respective PEMs, including their formal definitions, units, value ranges, and optima.

Table 2 Overview of the most common Performance Evaluation Measures in analogy to Amasyali and El-Gohary (2018) and the Mean-Squared Error used for model learning

\({F}_{i}\) and \({A}_{i}\) are the predicted and actual values for the FEP for an instance \(i\), \(N\) is the sample size, and \(\stackrel{-}{A}\) is the mean of all actual values \({A}_{i}\). Each PEM exhibits different characteristics, leading to different outcomes of prediction accuracy. Outlier sensitivity is an important characteristic, as high deviations between predicted and actual values are not beneficial for EQMs. Furthermore, a unitless measure provides intuitive interpretation and understanding of the PEMs for readers not familiar with this subject. Both characteristics help explain why the CV is the most commonly used PEM and why it is recommended for energy consumption prediction models by the American Society of Heating, Refrigerating, and Air-Conditioning Engineers (American Society of Heating, Refrigerating and Air-Conditioning Engineers 2002). As the selection of the best suited PEM is not trivial, comparing several PEMs is preferable (Botchkarev 2019). Therefore, in this study, despite focusing primarily on the CV, we additionally provide information on the other three PEMs.
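The four PEMs can be sketched in a few lines of Python. Since Table 2 is not reproduced in this text, the sketch assumes the commonly used definitions, with the CV taken as the RMSE divided by the mean of the actual values (often written CV(RMSE)) and both CV and MAPE expressed in percent. The function names and sample values are hypothetical.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error, in the unit of the target."""
    return sum(abs(f - a) for a, f in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-Mean-Square Error, in the unit of the target."""
    mse = sum((f - a) ** 2 for a, f in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((f - a) / a) for a, f in zip(actual, predicted)) / len(actual)


def cv(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of Variation in percent, assumed here as RMSE / mean(actual)."""
    mean_actual = sum(actual) / len(actual)
    return 100.0 * rmse(actual, predicted) / mean_actual


# Hypothetical FEP values (illustrative only).
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

scores = {
    "MAE": mae(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "MAPE": mape(actual, predicted),
    "CV": cv(actual, predicted),
}
```

Because the squared term in the RMSE weights the single large error of 30 more heavily than the MAE does, the RMSE and the CV react more strongly to outliers, which is the sensitivity discussed above.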

3 Methodology and Study Design

To address the research question and benchmark different EQMs, a suitable methodology and study design are necessary. Benchmarking is a well-known and often used term recognized as an essential instrument for improving product and organizational performance, even if benchmarking activities may vary strongly today (Ketter et al. 2015). To meaningfully structure the benchmarking of different EQMs, we derived a seven-step process illustrated in Fig. 2, which is based on the Cross Industry Standard Process for Data Mining (CRISP DM) and the guidelines by Müller et al. (2016) for conducting big data analysis.

Fig. 2 Derived process to benchmark energy quantification methods for predicting building energy performance (own illustration based on Wirth and Hipp (2000))

Generally, the CRISP DM provides a standardized process to increase business understanding by applying data mining methods in six steps: “Business Understanding”, “Data Understanding”, “Data Preparation”, “Modeling”, “Evaluation”, and “Deployment” (Wirth and Hipp 2000). We explain our derived process steps in the following:

“Business Understanding and Benchmarking Problem”: We extend the initial first stage of “Business Understanding” with our main objective of solving the benchmarking problem of different EQMs. In addition, we modify the intention of the business understanding to collect domain-specific knowledge about building energy performance and EQMs, which is necessary for the benchmarking problem, as we do not intend to get deeper business insights by applying data mining methods. We presented domain-specific knowledge in Sect. 2, providing the theoretical background for EPCs and EQMs. As benchmarking candidates, we choose the legally required standard engineering EQM and a selection of data-driven EQMs. We use the three most commonly used machine learning algorithms in literature for predicting the energy performance of buildings, namely ANN, SVR, and RF (Amasyali and El-Gohary 2018). In addition, we consider the ensemble learning algorithm XGB and D-vine copula quantile regression, which showed promising results in recent case studies (Schallhorn et al. 2017; Touzani et al. 2018). With this selection, we can investigate a wide range of models from simpler models like RF to more complex models like SVR. After selecting our benchmarking candidates, we modify the CRISP DM again by introducing our target measure, the FEP, before proceeding with data understanding (Footnote 2).

“Data Understanding”: This step was not modified. In our study, we have a training dataset and a separately collected validation dataset at our disposal, which are explained in Sect. 4.

“Data Preparation”: This step was not modified either. We prepare the data such that they are available in high quality and can be further used appropriately. For this purpose, we apply the two-stage LANG approach to check for semantic and syntactic data constraints in Sect. 4 (Zhang et al. 2019).

“Modeling and Evaluation”: In these steps we implement, train, and tune our EQMs (cf. Sect. 5). With the trained models we predict the FEP for each building in the validation dataset, which allows us to meaningfully compare the different EQMs based on the PEMs in Sect. 6. Thereby, we conduct two benchmarking analyses. (1) We train the data-driven EQMs on the first dataset and benchmark them against the engineering EQM on the out-of-sample second dataset encompassing the FEP calculated by the energy auditors according to the normative framework. (2) We subsequently benchmark only the data-driven EQMs based on nested cross-validation on all available data to get the most robust results while complying with state-of-the-art machine learning techniques. Since the calculated FEP is not given for the first dataset, we do not include the engineering EQM in the second benchmarking analysis.

“Deployment”: This step largely coincides with the step of deployment in the original CRISP DM. We discuss our results and present derived implications for policy, research, and commercial application in Sect. 7.

Last, our results contribute to solving the benchmarking problem defined together with our research question in the first step and thus close the process cycle. Further iterative rounds of the process could be used to adapt single process steps for further insights.

4 Data and Pre-processing

In this study, we used three real-world datasets to derive the target measure FEP for the benchmarking of the EQMs. The first dataset comprises 25,000 single- and two-family buildings from Germany with 74 attributes containing information on the building characteristics, e.g., physical building attributes and geometry, the installed heating system, the location, and the annual metered thermal energy consumption (Footnote 3). Information about the occupants is not available. This dataset serves as training and test data for the data-driven EQMs. The second dataset originates from two German energy consulting companies that employ qualified energy auditors and includes 345 additional single- and two-family buildings with 35 attributes each, which were collected during on-site inspections by the employed auditors in the period between 2016 and 2018. Next to the metered annual thermal energy consumption, the dataset also contains the calculated annual energy demand from EPCs, which represents the engineering EQM. We therefore use this second dataset as validation data for the benchmarking against the engineering EQM. The calculation rules and specifics for the creation of EPCs are updated frequently (Platten et al. 2019). Comparing EPCs correctly requires that they follow the same calculation rules. The calculated EPCs in this dataset were each created according to the standard DIN V 4108–6 in combination with DIN V 4701–10. As there were no normative changes concerning the FEP during the period of the survey, the dataset does not need to be adjusted (Beuth Verlag GmbH 2004, 2016). The third dataset is a statistical survey from the German micro census 2011, which represents the household and building stock of Germany (Federal Statistical Office of Germany 2011). This dataset will later be used for stratification purposes to ensure representative results.

To calculate the target measure FEP we had to make some assumptions. Following Eq. 4, the FEP is calculated from the consumption, the climate factor, and the effective building area. Since the latter two were not directly included in the datasets, we assumed that each building contains a heated basement and applied Eq. (1) with \({f}_{c}=1.35\) to derive the effective building area. We further retrieved the mean climate factor over the period the datasets were gathered from historical data by mapping the buildings to the nearest weather station based on the zip code. Finally, we inserted these values in Eq. 4 to calculate the FEP.

To ensure high data quality, we cleansed the training and validation datasets. First, we reduced the attributes to the intersection between the two datasets. This is necessary because otherwise we would train the EQMs on data we cannot provide for validation. Nonetheless, the datasets shared a large intersection in the most important attributes, containing identical or similar attributes that could be easily converted. Second, we excluded attributes lacking explanatory power for the FEP, like identification numbers, as well as attributes with few entries. We also deleted faulty or contradictory data entries, e.g., when the roof is recorded as older than the building itself. Third, we eliminated outliers in the attributes living space and final energy consumption, using the thresholds of Metzger et al. (2019). The resulting datasets contained 20,348 and 330 data entries, respectively, with a total of 15 attributes, illustrated in Table 3.

Table 3 Input parameters for data-driven Energy Quantification Methods

Some data-driven EQMs require further processing steps to increase their prediction accuracy. Because these processing steps are not identical for all EQMs, we further processed the data algorithm-specifically. For the ANN, this involved normalizing all numerical attributes to [0,1] and one-hot encoding all non-numerical attributes, i.e., introducing a binary dummy variable for \(n-1\) instantiations (Jovanović et al. 2015). For SVR, we only performed one-hot encoding, while no further pre-processing is required for the RF and XGB. For the copula, we applied continuous convolution to each attribute (Nagler 2018a, b).
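
The two transformations described for the ANN can be sketched as follows. This is a minimal Python illustration, not the R code used in the study; the helper names and the toy attribute values are hypothetical. It shows min-max normalization to [0,1] and one-hot encoding with \(n-1\) dummy variables, where the first (alphabetically sorted) level serves as the reference category.

```python
def min_max_scale(values):
    """Scale numerical values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_drop_first(values):
    """Encode a categorical attribute with n-1 binary dummy variables."""
    levels = sorted(set(values))
    kept = levels[1:]  # the first level is the implicit reference category
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# Hypothetical toy attributes
living_space = [90.0, 120.0, 150.0, 210.0]
energy_source = ["gas", "oil", "gas", "wood"]

scaled = min_max_scale(living_space)
columns, dummies = one_hot_drop_first(energy_source)
print(scaled)
print(columns, dummies)
```

Dropping one level avoids perfectly collinear dummy columns; a building with the reference category is encoded as all zeros.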

To ensure representativeness of our study, we post-stratified our results with regard to building age based on the third dataset according to the German building stock (see Footnote 4). Stratification describes a sampling procedure in which representativeness with regard to a desired attribute is ensured by sampling from the different subpopulations in the respective proportions (Bowley 1925). Post-stratification takes place after data collection. We post-stratify our results by adjusting the PEM to the German building stock. First, we calculate the PEM for each subpopulation (in our case, each building age class), then calculate a weighted average according to the building age class distribution in the German building stock. This method is used with great success in various fields of study (Bowley 1925; Heinisch 1965; Miratrix et al. 2012). Table 4 shows the percentages of the overall German building stock and our datasets, illustrating why post-stratification is necessary. Henceforth, when we refer to our PEMs, we use the stratified PEMs when applicable.
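
The weighting step can be sketched in a few lines. The Python snippet below is an illustration under stated assumptions: the age classes, PEM values, and census shares are invented placeholders, not the values from Table 4, and the function name is hypothetical.

```python
def post_stratify(pem_by_class, population_share):
    """Weighted average of per-class PEMs using population shares as weights."""
    total = sum(population_share[c] for c in pem_by_class)
    weighted = sum(pem_by_class[c] * population_share[c] for c in pem_by_class)
    return weighted / total


# Hypothetical per-class CV values and census shares (placeholders only)
cv_by_age_class = {"before 1949": 0.40, "1949-1978": 0.35, "after 1978": 0.30}
census_share = {"before 1949": 0.25, "1949-1978": 0.45, "after 1978": 0.30}

stratified_cv = post_stratify(cv_by_age_class, census_share)
print(round(stratified_cv, 4))
```

Dividing by the sum of the shares keeps the result a proper weighted mean even if a class is missing from the sample.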

Table 4 Distribution of building age classes in Germany (census) and in our datasets

Table 5 summarizes the individual pre-processing steps.

Table 5 Methods used for data pre-processing

5 Model Fitting and Tuning

As mentioned, the results for the engineering EQM are only available for the validation dataset. This in turn means that benchmarking the engineering EQM is only possible on this dataset and, consequently, that benchmarking against the engineering EQM by applying nested cross-validation on the larger training dataset is not possible. Thus, to obtain the most reliable results and make the best use of our available data, we conducted two benchmarking analyses. In the first analysis, we applied cross-validation on the training dataset and evaluated the prediction accuracy on the validation dataset for all EQMs including the engineering EQM, while in the second analysis we further benchmarked only the data-driven EQMs against each other based on nested cross-validation on all data. We implemented and tuned each algorithm in the statistical programming language R.

For the first analysis, we applied cross-validation on the training dataset with hyperparameter tuning based on genetic algorithms (Friedrichs and Igel 2005; Goldberg 2012). To this end, we defined value ranges for all relevant hyperparameters and randomly initialized a population. For each hyperparameter specification, we trained a model and evaluated its fit based on the CV, handing the best performing specifications over to the next generation. Additionally, new hyperparameter specifications were added by crossbreeding and mutating the more successful specifications while ensuring parameter constraints (e.g., integer values for the hidden layers or non-negativity constraints). This procedure was applied over 200 generations, or until an early stopping criterion indicated no further improvement in CV. Once we identified the best performing hyperparameter specification, we trained a model with the tuned hyperparameters on the entire training dataset so as not to lose any information before evaluating its prediction accuracy on the validation dataset.
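
The tuning loop described above can be sketched as follows. This is a simplified Python illustration, not the authors' R implementation: the search space, population size, elite size, mutation rate, and patience are hypothetical, and `toy_cv_error` is a stand-in for the expensive step of training a model and computing its cross-validated CV.

```python
import random

# Hypothetical search space: name -> (low, high, must_be_integer)
SPACE = {"n_trees": (50, 500, True), "learning_rate": (0.01, 0.3, False)}


def clip(name, value):
    """Enforce parameter constraints (bounds and integrality)."""
    low, high, is_int = SPACE[name]
    value = min(max(value, low), high)
    return int(round(value)) if is_int else value


def random_individual(rng):
    return {k: clip(k, rng.uniform(lo, hi)) for k, (lo, hi, _) in SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: each hyperparameter is taken from one parent."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SPACE}


def mutate(ind, rng, rate=0.3):
    child = dict(ind)
    for k, (lo, hi, _) in SPACE.items():
        if rng.random() < rate:
            child[k] = clip(k, child[k] + rng.gauss(0, 0.1 * (hi - lo)))
    return child


def toy_cv_error(ind):
    """Stand-in for 'train model, return CV'; minimal at (300, 0.1)."""
    return ((ind["n_trees"] - 300) / 450) ** 2 + ((ind["learning_rate"] - 0.1) / 0.29) ** 2


def genetic_search(evaluate, pop_size=20, n_elite=5, max_generations=200,
                   patience=15, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    best, best_score, stale = None, float("inf"), 0
    for _ in range(max_generations):
        ranked = sorted(population, key=evaluate)
        score = evaluate(ranked[0])
        if score < best_score - 1e-12:
            best, best_score, stale = ranked[0], score, 0
        else:
            stale += 1
            if stale >= patience:  # early stopping: no further improvement
                break
        elite = ranked[:n_elite]   # best specifications survive unchanged
        children = [mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
                    for _ in range(pop_size - n_elite)]
        population = elite + children
    return best, best_score


best, score = genetic_search(toy_cv_error)
print(best, score)
```

Because the elite is carried over unchanged, the best score can never deteriorate from one generation to the next.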

For the second analysis, i.e., to benchmark only the data-driven EQMs in depth, we proceeded mostly in the same way, with the sole difference that we applied nested cross-validation on all data instead, which had not been possible before due to missing results for the engineering EQM in the larger training dataset. This two-stage approach allowed us to compare the data-driven methods against the engineering EQM while still obtaining robust results for the benchmarking of the data-driven methods.

In what follows, we cover method-specific details on the model tuning process. However, because a comprehensive introduction to all relevant hyperparameters of the different algorithms is beyond the scope of this manuscript in terms of both content and space, we refer to the literature for thorough explanations and only provide the information necessary to reproduce this study. To foster comparability, we used MSE where applicable for model training and CV for (outer-)fold performance evaluation. The respective tables in the Appendix A1 (available online via http://link.springer.com) show the final set of hyperparameters and their value ranges during the tuning process.

Random Forest: For the RF we used the R package “randomForest” (Breiman et al. 2018). Because we apply regression, we fitted each individual tree minimizing the MSE as error metric instead of the information gain used for classification.

Extreme Gradient Boosting: For the XGB we used the R package “xgboost” (Chen et al. 2020) and proceeded similarly to the RF. We again used regression minimizing the MSE.

ANN: For the ANN we used the R packages “keras” and “tensorflow” (Falbel et al. 2020a, b). We fitted the individual models using Adam as optimizer, with rectified linear units as activation functions for the hidden layers and a linear output function. The model was trained to minimize the MSE for up to 500 epochs, with early stopping if there was no significant improvement on the test data, to avoid overfitting.

SVR: For the SVR we used the R package “e1071” (Meyer et al. 2019). We used radial basis kernel functions and applied Epsilon-regression. This procedure prioritizes good model fit over simple solutions, which is in line with the overall goal of this study. We then fitted the individual models according to the underlying optimization function.

Copula: For the copula we used the R packages “vinereg” and “VineCopula” (Nagler et al. 2019; Nagler 2019). For the copula there are no hyperparameters to be tuned in the classical sense. Instead, we applied a parsimonious forward selection algorithm by Kraus and Czado (2017), which sequentially builds up the model using the Akaike Information Criterion based on conditional loglikelihood as termination criterion. The algorithm thereby automatically fixes the tree sequences in the vine copula structure. Once the termination criterion threshold is no longer breached when adding variables, the algorithm stops. The resulting variable selection and tree structure can be found in the Appendix A2.

6 Results

6.1 Benchmarking against the Engineering Energy Quantification Method

The prediction accuracy of the different EQMs measured by the PEMs is presented in Fig. 3. Here, the EQMs are depicted on the x-axis, while the y-axis indicates the magnitude of the PEMs. Focusing first on the CV, we notice that the engineering EQM lags significantly behind with a CV of 0.614, while the data-driven EQMs provide results in approximately the same range between 0.33 and 0.35. This means that the prediction of the engineering EQM deviates roughly 60% on average from the mean actual FEP. The XGB shows the highest prediction accuracy with a CV of 0.329, which equals a decrease in error of almost 50%. To ensure robustness, we validated these results by means of further PEMs (see Footnote 5). The general tendency remains the same, with only minor variations in the exact outcomes. The only notable difference occurs for the MAPE, where the ANN shows the highest prediction accuracy, reducing error by more than 50% compared to the engineering EQM. However, the difference is slight and minor variations in the order of EQMs were expected. Table 6 provides detailed numeric values.
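
To make the reported figures easier to follow, the sketch below computes two PEMs and the relative error reduction. It is a Python illustration under assumptions: the CV is taken to be the root mean squared error divided by the mean metered value, and the MAPE the mean absolute percentage error, which are the common definitions but are not restated in this section. The toy data are hypothetical; only the CV values 0.614 and 0.329 come from the text.

```python
import math


def cv_rmse(actual, predicted):
    """Assumed CV definition: RMSE divided by the mean of the actual values."""
    n = len(actual)
    rmse = math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)
    return rmse / (sum(actual) / n)


def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction."""
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical metered and predicted FEP values in kWh/(m2a)
metered = [100.0, 150.0, 125.0, 125.0]
predicted = [110.0, 140.0, 130.0, 120.0]
print(cv_rmse(metered, predicted), mape(metered, predicted))

# Relative reduction in CV from the engineering EQM (0.614) to the XGB (0.329)
reduction = (0.614 - 0.329) / 0.614
print(f"{reduction:.1%}")
```

With the reported values, the relative reduction is about 46%, which the text summarizes as "almost 50%".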

Fig. 3 Performance evaluation measures for the different energy quantification methods

Table 6 Performance evaluation measures for the different energy quantification methods

Next, we have a closer look at the individual predictions of the EQMs. Figure 4 shows scatterplots in which we compare the predicted and the metered (weather rectified) FEP for each EQM. The x-axes show the predicted values, while the y-axes show the metered values. The blue circles represent the buildings in the validation dataset. For easier interpretation, we provide an angle bisector and a regression line. Ideally, we want all observations to lie on the angle bisector.

Fig. 4 Scatterplots of predicted and metered final energy performance for the different energy quantification methods

The engineering EQM exhibits the highest standard deviation in the errors with \({\sigma }^{Eng}=55.45\,\mathrm{kWh}/(\mathrm{m}^{2}\mathrm{a})\), for a mean metered FEP of 126.44 \(\mathrm{kWh}/(\mathrm{m}^{2}\mathrm{a})\) over the whole dataset. The engineering EQM is followed by the SVR with \({\sigma }^{SVR}=43.99\,\mathrm{kWh}/(\mathrm{m}^{2}\mathrm{a})\) and ANN with \({\sigma }^{ANN}=43.28\,\mathrm{kWh}/(\mathrm{m}^{2}\mathrm{a})\). The copula and the RF on the other hand exhibit slightly less standard deviation in the errors with \({\sigma }^{Cop}=42.84\,\mathrm{kWh}/(\mathrm{m}^{2}\mathrm{a})\) and \({\sigma }^{RF}=42.72\,\mathrm{kWh}/(\mathrm{m}^{2}\mathrm{a})\). The XGB has the smallest standard deviation of \({\sigma }^{XGB}=42.07\,\mathrm{kWh}/(\mathrm{m}^{2}\mathrm{a})\). At the same time, the engineering EQM and the RF both overestimate the FEP on average by 50 and 4 \(\mathrm{kWh}/(\mathrm{m}^{2}\mathrm{a})\), respectively, while the ANN, SVR, and XGB underestimate the FEP on average by 15, 6, and 5 \(\mathrm{kWh}/(\mathrm{m}^{2}\mathrm{a})\), respectively. The copula underestimates the FEP on average very slightly by 0.05 \(\mathrm{kWh}/(\mathrm{m}^{2}\mathrm{a})\). Again, we notice that there is high unexplained variance which could stem from different factors like occupant behavior and cannot be explained by the EQMs based on the building characteristics alone.
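
The two quantities discussed here, the mean over- or underestimation and the standard deviation of the errors, can be computed as in the following Python sketch. It assumes that the error is defined as predicted minus metered FEP and that the sample standard deviation is used; the section does not state which estimator the study applied, and the numbers are hypothetical.

```python
import statistics

# Hypothetical metered and predicted FEP values in kWh/(m2a)
metered = [100.0, 150.0, 125.0, 125.0]
predicted = [130.0, 150.0, 140.0, 100.0]

# Assumed sign convention: positive error means overestimation
errors = [p - m for p, m in zip(predicted, metered)]

bias = statistics.mean(errors)     # mean over-/underestimation
spread = statistics.stdev(errors)  # sample standard deviation of the errors
print(bias, spread)
```

A method can have a bias close to zero and still a large spread, which is the pattern the text reports for the copula.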

To obtain a more complete picture, we disaggregated the predictions for different instantiations of the variables building age and living space, and analyzed whether there are significant differences. The idea behind this is that systematic errors might have been made when one of the variables takes extreme values, e.g., a very poor prediction accuracy for old buildings. For better readability, we aggregated the variables into building age classes and living space bins. For the building age, we chose the building age classes from the census to obtain comparability with other studies. For the living space bins, we took the deciles as separators for a total of ten living space bins (see Footnote 6). Figure 5 shows the results for the building age classes on the left-hand side and for the living space bins on the right-hand side. The figures are structured analogously, with the x-axes indicating the instantiations of the variables and the y-axes indicating the CV.

Fig. 5 Coefficient of Variation for the different Energy Quantification Methods for instantiations of the variables building age on the left-hand side, aggregated into building age classes, and living space on the right-hand side, aggregated into living space bins

For the building age classes, the XGB, copula, and RF show slightly higher prediction accuracy, as expected based on the aggregated results. While the data-driven EQMs produce similar results throughout all building age classes, the prediction accuracy of the engineering EQM increases towards newer buildings until 1990. This is in line with findings in the literature for England and Wales of higher measurement errors for buildings with lower energy efficiency (Crawley et al. 2019), i.e., typically older buildings erected under less strict regulations, which can be explained by the underlying data quality. As mentioned, engineering EQMs require exact inputs and expert knowledge to produce viable outcomes. For older buildings, these are often not available, especially when the construction methods or building materials used are unknown. The final increase in CV in the last two building age classes is partially explained by the CV being a relative PEM. Stricter building construction regulations came into force in Germany from 1977, followed by further tightening, leading to lower overall FEP, which in turn yields higher CVs for the same absolute error (Deutsche Energie-Agentur GmbH 2016). For the living space, we notice an overall trend towards more accurate predictions for larger buildings. However, this trend is less pronounced than for the building age classes and therefore does not allow for conclusions. Again, the XGB and RF show superior prediction accuracy for most living space bins.

Last, we evaluated the individual over- and underestimations for different building age classes, as the literature describes a general overestimation bias of FEP for older and an underestimation for newer buildings (Greller et al. 2010). Figure 6 reports the results, whereby the x-axis again depicts the building age classes and the y-axis depicts the mean prediction error as the difference between the predicted/calculated and the metered values. We note that the EQMs indeed overestimate the FEP for older buildings; however, we cannot confirm an underestimation for newer buildings. Analogously, in Fig. 4 the slope of the regression line for the engineering EQM is lower than that of the angle bisector. This supports the findings of Cozza et al. (2020), who found a lower actual consumption for energy inefficient residential buildings and a higher actual consumption for efficient residential buildings in the Swiss building stock. The greater overestimation for older buildings with poorer energy efficiency urgently needs to be addressed, so that negative publicity and unreliable statements for buildings in need of renovation do not prevent investments in retrofitting measures. The lower estimation error for data-driven EQMs may result from the fact that the training dataset contains measured energy consumption and thus implicitly considers occupant behavior. Following Greller et al. (2010), the higher deviations in older buildings could be due to, on average, more savings-conscious user behavior in less energy-efficient buildings, which is associated with a greater willingness to forgo thermal comfort than assumed in the calculation standards.

Fig. 6 Mean prediction error of the Final Energy Performance for the building age classes

6.2 Benchmarking the Data-driven Methods

In this subsection we present the results for the in-depth benchmarking of the data-driven EQMs only. Because we applied nested cross-validation with five outer folds and ten inner folds, we obtain as a result not one but five tuned models per algorithm which performed best for their respective outer folds. To still present the results in a clear and understandable way, we aggregated the prediction accuracies by calculating the mean PEMs. Figure 7 presents an overview over the prediction accuracies of the different EQMs measured by the PEMs.
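
The structure of the nested cross-validation can be sketched as follows. The Python snippet only generates the index splits for five outer and ten inner folds as stated above; the shuffling, the seed, and the function names are assumptions, and the model tuning and scoring that would use these splits are omitted.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and deal them into k folds of near-equal size."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=10, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold."""
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [idx for j, fold in enumerate(outer_folds) if j != i for idx in fold]
        # Inner folds are built from the outer training data only, so the
        # outer test fold never influences hyperparameter tuning.
        inner_folds = k_fold_indices(outer_train, inner_k, rng)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [idx for n, fold in enumerate(inner_folds) if n != m for idx in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


splits = list(nested_cv_splits(100))
print(len(splits), len(splits[0][2]))
```

Each outer fold yields one tuned model, which is why five tuned models per algorithm result.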

Fig. 7 Mean performance evaluation measures for the different data-driven energy quantification methods over the five outer folds

We notice that the differences in prediction accuracy almost completely vanish when we use nested cross-validation for performance evaluation instead of the validation set. When aggregated, the accuracies differ by less than 1% regarding CV. We further notice that the overall prediction accuracy increases slightly for most PEMs. Both effects are to be expected, as the repeated evaluation procedure yields more robust results and allows for in-sample training. The XGB and SVR slightly outperform their competitors in most cases. Table 7 further reports on the exact values and the standard deviations given in brackets. The standard deviations in the results reveal that the ANN mostly exhibits the highest standard deviation in prediction accuracy, thus its results should be treated with more caution. RF on the other hand scores very consistently.

Table 7 Mean performance evaluation measures for the different data-driven energy quantification methods over the five outer folds (standard deviation given in brackets)

Last, we provide some insights into variable importance to increase the explainability of the models. However, Shmueli and Koppius (2011) state that explanation and prediction should be best thought of as separate modeling goals. Consequently, any model trying to encompass both will have to compromise. This means that the following analyses should be interpreted with caution, as our goal was prediction and not explanation. To derive the variable importance for each of the models, we used the method initially proposed by Breiman (2001). The importance is derived by permuting the predictor variables and measuring the decrease in accuracy. Figure 8 shows the results for the five most important variables of the data-driven EQMs, with higher values corresponding to higher importance. A complete enumeration of all variables and their respective importance for each algorithm can be found in the appendix (Figure A 2).
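
The permutation idea can be illustrated with a small sketch. The Python code below is a model-agnostic variant on a fixed dataset, measured as the increase in MSE; Breiman's original formulation operates on the out-of-bag samples of a random forest, and the exact variant used in the study is not specified in this section. The toy model, the data, and the function names are hypothetical.

```python
import random


def mse(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=20, seed=0):
    """Mean increase in MSE when a single predictor column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importance = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[col] for r in rows]
            rng.shuffle(column)  # break the link between predictor and target
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, column)]
            increases.append(mse(model, permuted, targets) - baseline)
        importance.append(sum(increases) / n_repeats)
    return importance


def toy_model(row):
    """Hypothetical fitted model that uses only the first predictor."""
    return 3.0 * row[0] + 0.0 * row[1]


rows = [[float(i), float((i * 7) % 10)] for i in range(10)]
targets = [3.0 * r[0] for r in rows]

importance = permutation_importance(toy_model, rows, targets)
print(importance)
```

In this toy setting, shuffling the unused second predictor leaves the error unchanged, so its importance is zero, while shuffling the first predictor increases the error.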

Fig. 8 Variable importance plots for the data-driven energy quantification methods

We notice that the living space is highly important for all data-driven methods. This is explained by changing heating behavior and usage patterns of rooms depending on the available living space. Because the number of residents in single- and two-family houses does not generally increase with the living space, the utilization of all rooms decreases. For example, rooms are used as storage rooms, for sports, or as repair shops and are not necessarily heated. Since the data-driven EQMs were trained on measured consumption, they could learn this correlation. Next to the living space, the energy source is consistently important. This is also not surprising, as the heating system is of central importance for the overall energy efficiency. The remaining variables are less consistent in their importance. Nonetheless, we notice similarities between the two tree-based algorithms XGB and RF, as well as between the ANN and SVR, which both use one-hot encoding. For the copula, the importance can be inferred to a certain degree from the tree structure for the bivariate copula building blocks. Moreover, due to the parsimonious forward selection algorithm applied for model fitting, the copula uses fewer variables (cf. Figure A 1 and Figure A 2).

7 Discussion

Our results show that the energy performance gap generally holds true for single- and two-family buildings in Germany. The engineering EQM produces approximately the values for the energy performance gap as expected in literature. The data-driven EQMs are also in the expected range but exhibit a considerably lower error. The lack of literature for our specific benchmarking problem of predicting annual heating energy performance in residential buildings does not allow a holistic discussion of the accuracy gap between data-driven and engineering EQMs. Nevertheless, compared to the results of Neto and Fiorelli (2008), who compared an engineering EQM with an ANN for time series prediction of energy consumption of buildings, the data-driven EQMs in our study show an even greater advantage in terms of prediction accuracy. In their study the ANN achieved a 3-percentage point advantage, whereas our data-driven EQMs achieve almost a 30-percentage point advantage over the engineering EQM. However, our analyses do not confirm previous findings in literature that ANN and SVR possess generally better prediction accuracy for building energy performance than less complex machine learning algorithms like RF (Amasyali and El-Gohary 2018). Rather, XGB exhibited the highest prediction accuracy for most analyses conducted, closely followed by SVR and RF. ANN, on the other hand, performed worst to second worst among the tested data-driven EQMs. However, the differences in prediction accuracy were slight and the standard deviations indicate that these results should be treated with caution. Consequently, we refrain from stating that one data-driven EQM is particularly suited for this task. Nonetheless, this supports that each application requires a specifically designed EQM to reach its highest accuracy, and that there is no strictly dominant EQM (Mosavi et al. 2019). 
Because our data-driven EQMs rely solely on few attributes which are relatively easy to grasp compared to the engineering EQMs, we argue that data-driven EQMs exhibit further advantages regarding their handling and applicability. Thus, using data-driven EQMs instead of engineering EQMs saves money and time while simultaneously increasing prediction accuracy.

Our results have several managerial and policy implications. First, they provide clear guidelines for policymakers. The current state of the low-carbon transition paths requires higher retrofitting rates for residential buildings to still reach the climate goals. Therefore, we advocate to revise the current legislation to allow for data-driven EQMs instead of the prescribed engineering EQM with significantly worse prediction accuracy. This potentially raises the residential building retrofitting rate by decreasing the uncertainty of energy efficiency measures, thereby removing investment barriers and contributing to achieving the climate goals. Two different applications are conceivable at present, either the direct replacement of the engineering EQM, or the complementary application used for transitional quality assurance of the engineering EQM to check for outliers or incorrect data. The verification could be automated and thus be realized cost-efficiently and without human involvement. The quality assurance can be rolled out nationwide and increase confidence in the EPC, thus offering a more reliable foundation for decision-making. Potential challenges are the acceptance and ensured quality of the underlying models. Homeowners may perceive unfair treatment if EPCs depicting low energy efficiency are issued based on calculation methods that are not or hardly comprehensible such as black-box approaches, as this reduces the resale value of houses. Using more explainable methods, like RF or XGB might mitigate this challenge. However, there is a whole field of Explainable Artificial Intelligence discussed controversially in literature (Rudin 2019). In addition, inexplicable miscalculations can arise for the data-driven methods, resulting in highly distorted results. 
We argue, however, that the currently prescribed methods are also highly error-prone if not performed correctly, therefore data-driven methods are to be preferred, due to the generally significantly higher prediction accuracy. When putting data-driven EQMs into a use case perspective, a distinction must be made between EPCs for existing and new buildings. Data-driven EQMs learn from available data, limiting their suitability for creating EPCs for new buildings. Since the construction rate in Germany is comparatively low and the energy saving potentials in existing buildings are much greater, as well as the determination of consumption being more costly and error-prone, the focus should be placed on this use case (Deutsche Energie-Agentur GmbH 2016). Second, we suggest the usage of data-driven EQMs for other applications as well, such as asset management, city planning, insurance, etc., to enhance their business models with more economic decision-making, minimization of risk, and higher profits. The energy efficiency evaluation of buildings is a central element in many areas and can be decisive for the economic success of companies (Bozorgi 2015). To collect cost-efficient information is particularly relevant for the initial energy evaluation of real estate if EPCs are not yet at hand, as energy-efficient buildings yield higher returns and higher rents than energy-inefficient buildings (Cajias and Piazolo 2013). Insurance companies could enhance claim prediction models, or asset management companies could optimize their portfolios with data-driven investment strategies. However, both should be extremely careful with the implementation since miscalculations in investment portfolios are comparatively worse than miscalculations in EPCs. 
Third, our results imply that more focus should be put onto the benchmarking of different machine learning algorithms, as for our specific use case XGB almost consistently yielded better results than the algorithms ANN and SVR which are favored in literature. Most literature investigated focused on one machine learning algorithm only and disregarded comparisons and benchmarks. This, however, results in a limited generalizability of their results.

Naturally, this research is beset with some limitations. First, we focused on annual heating FEP of German residential buildings. Other results might hold true for, e.g., commercial, or industrial buildings, as well as for other geographical regions or time horizons. Second, because the validation dataset was gathered by qualified energy auditors, there might be a systematic selection bias in the individual data points. However, the fact that we validated out of sample, i.e., that the data-driven EQMs could not learn this potential systematic bias, suggests that the relative improvement over the engineering EQM is presumably even more substantial than this study predicts. Third, several important building characteristics were missing in the dataset, e.g., upper floor insulation and basement insulation. More importantly, we also have no information on socio-economic factors or occupant behavior. This leaves a large margin of variance in the data unexplained. Fourth, for the calculation of the target measure in accordance with the current norms, some assumptions were made regarding basement availability and heating. We approximated the effective building area for all buildings where only the living space was given, but did not find any signs in our analysis that this approximation would lead to higher errors. In contrast, in the case of the buildings that are approximated by the living space, the errors in the building energy performance are consistently smaller. Nevertheless, future research could start here by training and analyzing on a complete dataset also including this information. Moreover, for the rectification of weather effects, we used the mean of the climate factor for each weather station over the period the datasets were gathered, because the datasets did not contain the exact year of data collection, but a span of seven years. These assumptions and simplifications could possibly lead to minor deviations in the final outcomes. 
In addition, the measured consumption could have been further rectified with regard to room and heating threshold temperatures that deviate from the standard assumption, vacancies, or measurement inaccuracies for non-network-bound energy sources (e.g., wood pellets or heating oil) if corresponding data were at hand (Bigalke and Marcinek 2016). Fifth, there exist further EQMs that were not considered in this study, so we cannot state a final recommendation. Nevertheless, these EQMs can also be benchmarked by applying our methodology adapted from the CRISP DM cycle. We are convinced that our derived process is generally applicable in the context of benchmarking and can be used in the future for comparison and benchmarking in various situations. Also, even though we tried to provide a comparable basis for all EQMs, changing individual steps and spending more time on the optimization procedures could have yielded improvements in prediction accuracy.

However, these limitations give rise to new research potential. One natural direction includes gathering additional high-quality data points, which include all necessary building characteristics as well as occupant behavior. However, this procedure might prove cumbersome. Another direction includes examining further EQMs as well as tuning them to a higher extent. In particular for the copula, we expect the more general R-vines to perform significantly better. To the best of our knowledge, no implementations of R-vine quantile regression exist in any statistical programming language, but promising theoretical advances have been recently made. Also, the focus on only one country may be relaxed, incorporating other geographical areas with different characteristics of buildings, climate conditions, and other normative frameworks for EPC calculation to assess whether our findings are generalizable for these areas and circumstances. This could also be an interesting task for transfer or federated machine learning to take advantage of decentralized datasets for large scale machine learning. All in all, further research is necessary in this field, as current research is scarce. This is most likely due to scarce publicly available and processable data as highlighted in literature (Carpino et al. 2019). Since most institutions with the necessary database are state-regulated, we suggest that policymakers enter into cooperation with scientific institutions, since a sufficiently large and high-quality database is essential to obtain reliable and more generally valid results from which to derive meaningful long-term political incentive mechanisms to curb climate change. In the same course of the structured recording of large quantities of quality-assured data, data on occupant behavior should be recorded. This would make it possible to analyze the causes of the significant differences between measured and calculated EPCs as well as between the different EQMs. 
Based on the obtained knowledge, more precise statements can be made about energy consumption and savings after potential retrofit measures. This in turn enables investment decisions to be taken on a sound basis, while at the same time reducing barriers to energy efficiency investments by minimizing the investment risk (Ahlrichs et al. 2020). In addition, a large high-quality database might allow to reproduce our results and benchmark further EQMs more systematically over all regions in Germany, to essentially mitigate the major drawbacks of our study. Our research also contributes to the theoretical body of knowledge by identifying potential for improvement in the currently established methods and benchmarking multiple EQMs in terms of predictability. Regarding the classification of Shmueli and Koppius (2011), this corresponds to role six (assessing predictability of empirical phenomena) and peripherally touches role four (comparing existing methods).

8 Conclusion

In this study, we benchmarked different Energy Quantification Methods (EQM) for residential buildings, applying a derived process based on the CRISP DM. In doing so, we are among the first to focus on the interface of predicting heating Final Energy Performance for residential buildings, based on real-world data with annual energy predictions.

More precisely, we compared Artificial Neural Networks, D-vine copula quantile regression, Extreme Gradient Boosting, Random Forest, and Support Vector Regression with the engineering EQM currently established by German law. We used an extensive real-world dataset of 25,000 German single- and two-family buildings for model training and testing and another out-of-sample dataset of 345 additional buildings for validation, also containing Energy Performance Certificates issued by qualified auditors, which represent the engineering EQM. Our results provide strong evidence that the data-driven EQMs outperform the engineering EQM by a large margin, reducing the prediction error by almost 50%. We additionally benchmarked only the data-driven EQMs against each other based on nested cross-validation. In contrast to existing literature, Extreme Gradient Boosting exhibits the highest prediction accuracy for most cases, closely followed by Support Vector Regression, which is favored in literature, and Random Forest. To ensure robustness of our results, we examined several Performance Evaluation Measures and analyzed two variables – the building age and the living space – in more detail to account for potential systematic biases. Despite minor variations, the general tendency holds, indicating robust results. We conclude that data-driven EQMs in general are more suitable for residential building energy quantification. Therefore, we advocate revising the current legislation to allow for the use of data-driven EQMs in Energy Performance Certificates for existing buildings.