Introduction

Having reliable real estate price indices is pivotal for several reasons. Firstly, index volatility is an important input in determining the cost of capital of real estate (Geltner et al., 2017), which helps investors, underwriters and policymakers determine interest rates. Secondly, reliable indices allow one to (re-)appraise real estate portfolios mark-to-market, by multiplying (historic) book values by the corresponding index returns (Francke & Van de Minne, 2021a). This is of obvious interest to investors, but also to owner-occupiers in determining their household wealth, and it can also support the appraisal of commercial and residential property values for property tax purposes. Finally, real estate indexing would allow for derivative trading (Deng & Quigley, 2008), especially in the absence of index revisions. However, producing real estate indices is a nontrivial task, as real estate is a heterogeneous asset class that transacts infrequently (Francke & Van de Minne, 2021b).

In the real estate literature, several methods have been proposed to compile transaction price indices. Hedonic pricing models and repeat sales are two of the most popular ones. The hedonic pricing model is the “original” method of price index construction (Malpezzi, 2002), and will be the focus of this paper. The hedonic pricing model has its origins in appraising farmland values (Haas, 1922a; 1922b; Wallace, 1926), and has been used for making constant quality indices in the automobile industry (Court, 1939). Later examples of literature using hedonic pricing models include Griliches (1961), Lancaster (1966), and Rosen (1974). The hedonic pricing model assumes that the price of a commodity is the aggregate of the individual contributions of each of its characteristics.

Classic hedonic approaches employ linear models, estimated by Ordinary Least Squares (OLS) or its generalizations, for index construction. These linear models have the convenience of the parameter vector β, which expresses a per-unit linear relationship between each independent variable and the dependent variable. Nevertheless, the dependence on estimating a parameter vector β constrains the use of a wide family of other regression algorithms, such as Machine Learning (ML) approaches, that are non-linear or non-parametric. This paper proposes a (hedonic) framework which admits the usage of parametric as well as non-parametric/non-linear models to construct price indices.

The proposed method can be viewed as a model that predicts the price of a property (\(\hat {Y}\)) sold in the current period (t) as if it had been sold in the previous period (t − 1). The difference between the estimated price (\(\hat {Y}_{t-1}\)) and the actual price (\(Y_{t}\)) at the property level can be decomposed into a market price change and some noise. More specifically, we first estimate some (linear/non-linear/non-parametric) model using only transactions in (or up to and including) period t − 1. In a second step, we predict the values of all properties sold in period t, using the model estimated in the first step. In the final step, we take the out-of-sample prediction errors for transactions in period t, and compute their average in order to obtain the price index change. To find the returns in subsequent periods, we simply redo these three steps for the remaining time periods (\(t,t+1,\dots ,T-2,T-1\)). This methodology is called the chained “Paasche” price index (Geltner et al., 2017). Note that this three-step framework is agnostic as to which model and estimation technique is used in the first step, and it can include, but is not limited to, ML techniques.
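For concreteness, the three-step procedure can be sketched as a short loop. The sketch below is a minimal illustration rather than the exact implementation used in this paper: it assumes a pandas DataFrame with hypothetical columns log_price and year plus numeric characteristics, and any estimator exposing fit and predict (the OLS benchmark or an ML model) can be plugged in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def chained_index(df, feature_cols, model_factory, first_year, last_year):
    """Return {year: index level}; the base year (first_year) is set to 1.0."""
    index = {first_year: 1.0}
    for t in range(first_year, last_year):
        train = df[df["year"] == t]                 # step 1: fit on period t only
        test = df[df["year"] == t + 1]              # transactions observed in period t + 1
        model = model_factory()
        model.fit(train[feature_cols], train["log_price"])
        y_hat = model.predict(test[feature_cols])   # step 2: predict t + 1 sales at the period-t price level
        delta = float(np.mean(test["log_price"].to_numpy() - y_hat))  # step 3: mean log prediction error
        index[t + 1] = index[t] * np.exp(delta)     # chain the return onto the previous level
    return index

# Example (hypothetical data and columns):
# levels = chained_index(sales, ["log_area", "age"], LinearRegression, 2000, 2019)
```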

One of the many reasons to have a price index methodology that incorporates other (regression) algorithms such as ML is that these permit a more flexible relationship between variables when compared to linear models (Varian, 2014). This flexibility increases our expectations of better approximating the unknown, complex data generating process that underlies real estate property prices. In other words, it might improve our ability to find the “true” price change between periods that is less affected by noise.

Following this line of thought, this article proposes a new methodology that generalizes the chained hedonic approach in a way that any ML regression algorithm can be used to compile property price indices. Hence, this paper contributes to the existing literature on non-linear price indices, such as McMillen and Dombrow (2001), and it connects the fields of index number theory, econometrics, and ML. In fact, to the best of our knowledge, this paper is the first to estimate price indices using ML approaches. Our proposed framework therefore opens up the field of chained hedonic pricing models to numerous new possibilities.

The data used in this study have been provided by Real Capital Analytics (RCA) and consist of 29,998 individual transactions of commercial real estate properties in the New York metropolitan area, covering the period from 2000 to 2019. We construct an aggregate price index as well as indices per property type.

In general, the results show that using individual predictions of real estate transaction prices is a viable solution for building price indices. Another finding is that Out-of-Time prediction accuracy is higher for the ML algorithms when compared to OLS. However, ML algorithms are more dependent on the data used for estimating the models, and have less stability when applied to smaller data sets. Additionally, the bias and variance trade-off from the ML algorithms has an important role in this methodology, as bias affects the index being estimated. A solution for regression algorithms that exhibit estimation bias is the use of the double imputation method.

We perform a stress test to determine sensitivity to sample size and to examine how much leverage the sample size has on the proposed methodology. This test is performed by sampling 50%, 25%, 10%, and 5% of the available transactions 30 times for each percentage level, totalling 120 samples. The number of repetitions was chosen to ensure that the hyperparameters would be estimated 30 times.

Additionally, we allow the model that predicts property prices in period t to be estimated on multiple periods up to and including period t − 1. Two approaches for determining the optimal window size are tested: a Rolling Window (RW) with windows of 2 through 8 years and an Expanding Window (EW). These tests determine the relation between model complexity (gaining observations at the cost of additional year variables) and model accuracy for each of the algorithms used.

Chained index results for the full sample show that all ML algorithms have a lower Root Mean Squared Error (RMSE) and a higher R2 than OLS. Regarding volatility and first-order autocorrelation (ACF1), the ML algorithms yield results similar to OLS; none of the indices exhibit extreme volatility or a negative ACF1. The trade-off between gaining observations through expanding or rolling windows and the cost of including time fixed effects affects the algorithms differently. Another finding is that in the stress test the ML algorithms have higher errors than OLS when the data are restricted. This is most evident at the 5% data restriction level, which suggests a higher data dependence for the ML algorithms.

Methodology

This paper presents a single imputation Chained Paasche approach to building real estate price indices (Balk et al., 2013). Linear and non-linear models are used for imputing the value of a real estate property in a different time period. The linear model, estimated by ordinary least squares (OLS), is used as a benchmark. For each period, hyperparameter tuning, model estimation and price prediction are performed. A separate model is therefore estimated (trained) with each algorithm for each period.

Hyperparameter tuning is performed to minimize the risk of overfitting the data, which would otherwise lower the out-of-sample generalization power. To best capture the price dynamics and to avoid overfitting, an optimization of a vector of hyperparameters λ is required. Therefore, a five-fold cross-validation with random search is performed to estimate the best set of hyperparameters for each regression algorithm. This optimization is explained in more detail in the Section labeled “Training and bias-variance trade-off”.

Chained Index

The chained hedonic approach produces an index for a long sequence of periods by linking together price changes estimated over shorter time intervals. This approach controls for changes in the pool of real estate properties available in period t by specifying a “representative property”. This allows the indexing process to explicitly relate property price change either to changes in the implicit prices of its characteristics or to general market conditions and omitted longitudinal variables (Geltner et al., 2017).

According to Geltner et al. (2017), the representative property can be specified relatively freely within the bounds of the sample characteristics. As a result, the chained hedonic index is a more powerful approach for exploring asset price dynamics. In this study, the representative property is the average real estate property in period t.

The chained hedonic index effectively requires separate, purely cross-sectional regression algorithms to be run on each single index period (t), hence, constituting an out-of-sample estimation. The index number formula can be defined so as to establish a Laspeyres, Paasche or Fisher index. In this research, only the Paasche index is applied.

The hedonic price modelling approach is susceptible to omitted variable bias or model misspecification as it is not possible to control for and observe every real estate characteristic. Therefore, the resulting price indices are constant quality with respect to the observed characteristics.

The Chained Paasche Index (CP) is calculated each period by taking log sale prices \(Y_{t}\) and characteristics \(X_{t}\) in period t for training and then predicting log prices of properties transacted in period t + 1 as if they had been transacted in period t, \(\hat {Y}^{t}_{t+1} | (X_{t+1}, \mathcal {F}_{t})\), where the superscript in \(\hat {Y}^{t}_{t+1}\) denotes the price level date t, the subscript denotes the transacted properties in period t + 1, and \(\mathcal {F}_{t}\) is the training set, given by \((Y_{t},X_{t})\). All transactions in period t are used for training the model, and all transactions in period t + 1 for out-of-sample predictions with price level date t. Note that \(\hat {Y}^{t}_{t+1} | (X_{t+1},\mathcal {F}_{t})\) does not use time fixed effects, as the data used for estimation (training) are from a single period, t. Thus, within the training data set, time is a constant.

The average difference between observed log sale prices in period t + 1 and the corresponding predictions with price level date t is an estimate of the log price change between period t and t + 1, denoted by \(\hat{\delta}_{t+1}\),

$$\hat{\delta}_{t+1} = \frac{1}{n_{t+1}}\sum\limits_{i \in S_{t+1}} \left( {Y_{i,t+1}-\hat{Y}^{t}_{i,t+1} | (X_{t+1},\mathcal{F}_{t})}\right),$$
(1)

where \(S_{t+1}\) denotes the set of transactions in period t + 1, and \(n_{t+1}\) its corresponding number.

The Paasche Single Imputation index can be calculated as

$$\text{CP}_{t+1}=\text{CP}_{t} \times \exp(\hat{\delta}_{t+1} ).$$
(2)
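As a small numerical illustration of Eqs. (1) and (2), with made-up values rather than results from the data:

```python
import numpy as np

y_obs = np.array([15.2, 14.8, 16.1])   # observed log prices of sales in period t + 1 (illustrative)
y_hat = np.array([15.0, 14.9, 15.8])   # their predictions at the period-t price level (illustrative)

delta_hat = np.mean(y_obs - y_hat)     # Eq. (1): average log prediction error, here 0.1333...
cp_t = 1.25                            # index level already computed for period t
cp_t1 = cp_t * np.exp(delta_hat)       # Eq. (2): index level for period t + 1
```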

As Balk et al. (2013) point out, an alternative to the single imputation approach is to use the predicted (fitted) values \(\hat {Y}^{t+1}_{t+1} | (X_{t+1}, \mathcal {F}_{t+1})\) instead of the observed prices \(Y_{t+1}\). This approach is known as double imputation. The estimated log price change between period t and t + 1 is then given by

$$\hat{\delta}_{t+1} = \frac{1}{n_{t+1}}\sum\limits_{i \in S_{t+1}} \left( {\hat{Y}^{t+1}_{i,t+1} | (X_{t+1}, \mathcal{F}_{t+1}) - \hat{Y}^{t}_{i,t+1} | (X_{t+1}, \mathcal{F}_{t})}\right).$$
(3)

It can be argued that double imputation is a better way to construct indices, as biases derived from omitted variables in the model would at least partially offset each other (Balk et al., 2013). For the present study, both single and double imputation have been investigated. For simplicity, this paper focuses on the single imputation approach. Brief results of double imputation are presented in the Section labeled “Double and single imputation”.
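A minimal sketch of the double imputation return of Eq. 3, assuming model_t and model_t1 are estimators already fitted on the period-t and period-(t + 1) samples, respectively; any systematic model bias then enters both terms and partially cancels:

```python
import numpy as np

def delta_double(model_t, model_t1, X_t1):
    """Eq. (3): mean difference between fitted values (model trained on t + 1)
    and imputed values (model trained on t) for the period t + 1 transactions."""
    fitted = model_t1.predict(X_t1)    # \hat{Y}^{t+1}_{t+1}
    imputed = model_t.predict(X_t1)    # \hat{Y}^{t}_{t+1}
    return float(np.mean(fitted - imputed))
```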

Additionally, we allow the model used to predict property prices in period t to be estimated on multiple periods up to and including period t − 1. We test two variants of the CP model: (1) Rolling Window (RW) and (2) Expanding Window (EW) models. Both variants have the benefit of using more observations to train the model. For each additional period in the training window, a time dummy variable is included in the model, increasing the model complexity (dimensionality). The window approach has the objective of analyzing the trade-off between model complexity and accuracy.

RW models are estimated based on all observations in periods t − 1 and t (a 2-period window), so \(\mathcal {F}_{t} = \{(Y_{t},X_{t}), (Y_{t-1}, X_{t-1})\}\), and then predict the prices of properties sold in period t + 1 as if they had been sold in period t. EW models are estimated based on all observations available from the first period in the time series (2000) until period t and then predict the prices of real estate sold in period t + 1 as if they had been sold in period t. In this approach, the length of the training window varies (expands) over time, so \(\mathcal {F}_{t} = \{(Y_{t},X_{t}), (Y_{t-1}, X_{t-1}) ,\ldots , (Y_{1}, X_{1})\}\). Both window schemes are sketched below.
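The sketch below illustrates how the training set \(\mathcal{F}_{t}\) can be assembled under the three schemes, assuming a hypothetical year column; when more than one year enters the training window, year dummies serve as the time fixed effects (and, when predicting period t + 1 at the period-t price level, the predicted observations would be assigned the period-t dummy):

```python
import pandas as pd

def training_window(df, t, scheme="cp", m=2, first_year=2000):
    """Select the training set F_t for the chained (cp), rolling (rw) or expanding (ew) scheme."""
    if scheme == "cp":                       # single period t
        years = [t]
    elif scheme == "rw":                     # m-period rolling window ending in t
        years = list(range(t - m + 1, t + 1))
    else:                                    # "ew": expanding window from the first period
        years = list(range(first_year, t + 1))
    train = df[df["year"].isin(years)].copy()
    if len(years) > 1:                       # multi-year windows require time fixed effects
        train = pd.get_dummies(train, columns=["year"], drop_first=True)
    return train
```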

Machine Learning Algorithms and comparison metrics

There are several studies in the real estate literature that implement machine learning algorithms. For example, Kok et al. (2017) use machine learning to construct an Automated Valuation Model (AVM). Most of these studies benchmark traditional hedonic pricing models to Artificial Neural Networks (Tay and Ho, 1992; Do & Grudnitski, 1992; Evans et al., 1992; Worzala et al., 1995; McGreal et al., 1998; Nghiep & Al, 2001; Wong et al., 2002; Peterson & Flanagan, 2009).

As there are countless ML regression algorithms that allow us to predict the value of next period’s transaction prices \(\hat {Y}^{t}_{t+1} | (X_{t+1},\mathcal {F}_{t})\), a preselection of algorithms is necessary for practicality. The preselection phase consisted of applying several ML algorithms, e.g. Lasso (Tibshirani, 1996), Random Forest (Breiman, 2001), and k-Nearest Neighbours (Altman, 1992), to the same trial dataset and selecting the four with the lowest RMSE. The algorithms which performed best in the preselection phase were Support Vector Regression (SVR) (Drucker et al., 1997), Extreme Gradient Boosting Tree (XGBT) (Chen & Guestrin, 2016), Neural Networks Using Model Averaging (avNNet) (Ripley, 1996) and Cubist (Quinlan et al., 1992). These algorithms are among the most popular in the ML domain and are applied in a variety of fields.

The linear model estimated by OLS is selected as our benchmark for two reasons. First, it is the most popular regression model. Second, it is the norm for chained indices (Balk et al., 2013). SVR, XGBT, avNNet and Cubist are non-linear ML regression algorithms. Specifically, SVR is a variation of Support Vector Machine (SVM) for regression, XGBT and avNNet are algorithms that can be used for both regression and classification, while Cubist is an algorithm used exclusively for regression.

The estimated indices are assessed on two criteria: Regression model accuracy and index quality. RMSE and (adjusted) R2 are among the most popular metrics for measuring the performance of regression algorithms. According to Steurer et al. (2019), these measures are used by several researchers for evaluating regression algorithms in the AVM field. Hence, these are the selected metrics for assessing the regression algorithms that form the indices.

For assessing the quality of price indices, Guo et al. (2014) argue that volatility and first-order autocorrelation (ACF1) of the index returns are more appropriate measures. Volatility is defined as the standard deviation of the index returns. The authors argue that noise (cross-sectional random error) in the index adds to volatility, beyond and in addition to the true volatility. Thus, noise generates excess or erroneous volatility in the index. This excess volatility reduces the ACF1 to the point that a pure noise index would yield an ACF1 of -0.5. Hence, extreme volatility and a negative ACF1 are indicative of poor index quality. Finally, the four factors established to compare the estimated price indices are: out-of-sample RMSE, R2, volatility and the ACF1 of the index returns.
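Both index-quality measures are straightforward to compute from an index series; the sketch below uses log returns and a simple lag-1 correlation as an approximation of the ACF1:

```python
import numpy as np

def index_quality(levels):
    """Volatility (std. dev. of index returns) and first-order autocorrelation of the returns."""
    returns = np.diff(np.log(np.asarray(levels, dtype=float)))
    volatility = returns.std(ddof=1)
    acf1 = np.corrcoef(returns[:-1], returns[1:])[0, 1]   # lag-1 correlation of the return series
    return volatility, acf1

# A pure-noise index would tend toward an ACF1 of about -0.5, so extreme volatility
# combined with a negative ACF1 signals a low-quality index.
```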

Training and bias-variance trade-off

The training of an ML algorithm is marked by the trade-off between bias (central tendency) and variance (deviations from the central tendency) of the predicted variable (Geman et al., 1992; Webb, 2000). The objective is to have the least biased model with the highest generalization power. Hyperparameters are settings used to control how flexible the resulting model should be when fitting the training data. Refinements in ML implementation emphasize stable out-of-sample performance to explicitly guard against overfitting (Gu et al., 2018). Choosing the best set of hyperparameters minimizes the occurrence of overfitting and helps preserve out-of-sample generalization.

Cross-Validation (CV) is a sampling technique used during training for hyperparameter optimization, which results in a nearly unbiased estimator of the generalization error given a finite sample (Schölkopf & Smola, 2002). When employing k-fold CV, the lowest bias can be achieved by setting k to n, where n equals the number of observations in the training data set. This type of CV is also known as Leave One Out Cross-Validation (LOOCV), and can be performed at the cost of increasing the error variance (Hastie et al., 2009; Smola & Schölkopf, 2004). In contrast, setting k to 10 or 5 yields the best compromise between bias and variance (Breiman & Spector, 1992; Kohavi, 1995).

Considering the trade-off between bias and variance, in this paper we opt for a 5-fold CV with hyperparameter values selected via random search. This choice favors the balance between bias and variance. As a result, it cannot be asserted that all forward predictions are mean unbiased. Note, however, that the proposed methodology allows the hyperparameter tuning to be set up to favor either less bias or less variance.

Consider \(\mathcal{G}_{X}\) as a natural distribution with parameters Θ and \(\mathcal{X}\) as an i.i.d. sample set drawn from \(\mathcal{G}_{X}\). The learning algorithm operates as a function that maps the data set \(\mathcal{X}\) to a function f which minimizes some expected loss \(\mathbb{E}_{x \sim \mathcal{G}_{X}}\left[\mathcal{L}(x;f)\right]\). Bergstra and Bengio (2012) argue that the ultimate goal of a standard learning algorithm is to find the function f. A learning algorithm usually produces f through the optimization of a training criterion in connection with the set of parameters Θ, e.g. the mean and variance of \(\mathcal{G}_{X}\). For the present work, the selected training criterion is RMSE.

The learning algorithm has additional features called hyperparameters (λ) which control the shape of the function f. The actual learning algorithm is the one obtained after selecting λ, which can be denoted \(\mathcal{A}_{\lambda}\), where \(f = \mathcal{A}_{\lambda}(X^{(\text{train})})\) for training set \(X^{(\text{train})}\). Identifying the best set of values for the hyperparameters λ is called hyperparameter optimization. The objective is to choose the λ that minimizes the generalization error of the resulting out-of-sample predictions,

$$\left( \hat{Y}^{t}_{t+1}\mid X_{t+1}, \mathcal{F}_{t}, \hat{\lambda}_{t}\right).$$
(4)

Essentially, there are no efficient algorithms to perform this optimization exactly.

Additionally, it is not possible to evaluate the expectation over the unknown natural distribution \(\mathcal{G}_{X}\). Hence, cross-validation is employed to estimate the expectation of \(\mathcal{L}(x;f)\). Cross-validation replaces the expectation with the mean over a validation set \(X^{(\text{valid})}\), whose elements are drawn i.i.d. from \(\mathcal{G}_{X}\):

$$\lambda^{*} \approx \underset{\lambda\in{\Lambda}}{\arg\min}\; \underset{x \in X^{(\text{valid})}}{\text{mean}}\; \mathcal{L}\left(x; \mathcal{A}_{\lambda}\left(X^{(\text{train})}\right)\right)$$
(5)
$$\equiv \underset{\lambda\in{\Lambda}}{\arg\min}\; {\Psi}(\lambda)$$
(6)
$$\approx \underset{\lambda\in\left\{\lambda^{(1)},\ldots,\lambda^{(S)}\right\}}{\arg\min}\; {\Psi}(\lambda)\equiv \hat{\lambda}$$
(7)

Equation 6 expresses the hyperparameter optimization problem in terms of a hyperparameter response function, Ψ. Hyperparameter optimization is the minimization of Ψ(λ) over λ ∈Λ. Different data sets, tasks, and learning algorithm families yield different sets Λ and functions Ψ. As the information about Ψ and Λ is scarce, the currently dominant strategy for finding a good λ is to choose some number (S) of trial points \(\left \{\lambda ^{(1)},\ldots,\lambda ^{(S)}\right \}\), to evaluate Ψ(λ) for each one, and to return the \(\lambda^{(i)}\) that works best as \(\hat {\lambda }\). This strategy is represented by Eq. 7.

The trial points \(\left \{\lambda ^{(1)},\ldots,\lambda ^{(S)} \right \}\) are randomly selected, hence the name random search. Bergstra and Bengio (2012) suggest that random search is better than alternatives such as grid search, because it has a higher chance of finding the global optimum. The ML regression algorithms explored in this research have different numbers of hyperparameters.
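For illustration, the search described above maps directly onto standard tooling; the sketch below uses scikit-learn's RandomizedSearchCV with a gradient-boosting regressor standing in for the boosted-tree learner, and the hyperparameter ranges are purely illustrative rather than those used in this paper:

```python
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative hyperparameter distributions (not the paper's settings)
param_distributions = {
    "n_estimators": randint(100, 1000),
    "max_depth": randint(2, 8),
    "learning_rate": uniform(0.01, 0.29),
}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(),
    param_distributions=param_distributions,
    n_iter=50,                              # S random trial points lambda^(1), ..., lambda^(S)
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # RMSE as the training criterion
)
# search.fit(X_train, y_train); search.best_params_ then plays the role of lambda-hat
```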

Data

The data used in this study consist of individual commercial real estate transactions in New York over the period 2000–2019. The sample is divided into five regions, four property types and four building periods. For each property, the data contain the price in USD and the area in square feet. The data are provided by Real Capital Analytics (RCA), a well-established commercial real estate transaction data provider in the United States. Founded in 2000, RCA now captures over 90% of all transactions of “investable” real estate.

Observations in the bottom and top 1% of the distribution \(\log (\text {price})/ \log (\text {area})\) have been removed. The main reason for doing so is that data entry errors and outliers tend to be concentrated at the extremes of the distribution (Steurer et al., 2019). After performing the previous step and removing any observation with missing values from the original 31,727 observations, the resulting data set is composed of 29,998 observations.
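A minimal sketch of this cleaning step, assuming hypothetical column names price and area:

```python
import numpy as np
import pandas as pd

def clean_sample(df):
    """Trim the bottom and top 1% of log(price)/log(area) and drop incomplete records."""
    ratio = np.log(df["price"]) / np.log(df["area"])
    lo, hi = ratio.quantile([0.01, 0.99])
    trimmed = df[(ratio >= lo) & (ratio <= hi)]
    return trimmed.dropna()
```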

The yearly summary by transaction date is available in Table 1. A few observations can be made from this table. First of all, the average square footage of properties sold decreased substantially in the first four years. This is mostly explained by the fact that the data collection process improved in that period. This is also reflected in the relatively low number of observations in the same period. Secondly, note that the Great Financial Crisis (GFC, 2007 – 2009) is clearly visible as well. The average transaction price was $11M during the trough of the GFC, considerably lower than the $27M long-run average. The number of observations also fell by almost half between 2008 and 2009, from 1,354 to 719.

A categorical variable for building period has been created by splitting the data into quartiles of the building date. Information about the categorical variables is available in Table 2. Note that over 70% of the transactions happen in just Manhattan and the NYC Boroughs (Brooklyn, Queens, Bronx and Staten Island). Likewise, over 40% of the transactions are income producing apartment complexes.
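Assuming a hypothetical year_built column, the building-period variable can be sketched as a quartile split:

```python
import pandas as pd

def add_building_period(df):
    """Add a four-level categorical variable based on quartiles of the building date."""
    df = df.copy()
    df["building_period"] = pd.qcut(df["year_built"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
    return df
```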

Table 1 New York sample summary statistics
Table 2 Categorical variables summary

Results

This section first presents the chained index estimates, followed by stress tests that analyze the sensitivity of the results to the sample size. The Section labeled “Optimal Window Size” contrasts estimation results using rolling and expanding windows. Finally, an analysis of the bias-variance trade-off confirms that double imputation is preferable for regression algorithms that suffer from estimation bias.

Chained index results

Figure 1 shows the yearly CP indices. It shows that SVR (Fig. 1a) and Cubist (Fig. 1d) indices have higher cumulative price changes compared to OLS. Also, XGBT (Fig. 1b) and avNNet (Fig. 1c) have almost identical results as OLS for most of the time, but are slightly lower in the last five years.

Fig. 1

New York Chained Paasche Price Indices. OLS = Ordinary Least Squares; XGBT = Extreme Gradient Boosting Tree; SVR = Support Vector Regression; avNNet = Neural Networks Using Model Averaging

Figure 2 shows the yearly CP index returns. Although the indices differ in levels (see Fig. 1), these differences are not as evident when looking at the index returns (first differences). Among all the regression algorithms, the most notable differences in the returns are displayed by SVR (Fig. 2a) and Cubist (Fig. 2d), both of which yield overall higher returns than OLS.

Fig. 2

New York Chained Paasche Price Returns. OLS = Ordinary Least Squares; XGBT = Extreme Gradient Boosting Tree; SVR = Support Vector Regression; avNNet = Neural Networks Using Model Averaging

Figure 3 and Table 3 show that overall, the non-linear models have a lower RMSE and a higher R2 when compared to OLS. This is expected, as non-linear models are more flexible and usually yield a better fit to the data, especially when the data is complex and sparse as is the case for real estate. Table 3 shows that for all the regression algorithms, the index returns have a non-negative first-order autocorrelation and do not exhibit extreme volatility.

Fig. 3

New York Chained Paasche Indices RMSEs. OLS = Ordinary Least Squares; XGBT = Extreme Gradient Boosting Tree; SVR = Support Vector Regression; avNNet = Neural Networks Using Model Averaging; RMSE = Root Mean Squared Error

Table 3 Chained Paasche performance summary

Since the commercial real estate market in New York is distinct for each property type, an index per property type has been calculated. Figure 9a exhibits the index for apartments, Fig. 9b for retail, Fig. 9c for offices and Fig. 9d for industrial. Figure 10a, b, c and d present the respective RMSE values for the underlying algorithms.

As the regression algorithms used to estimate the property type indices were trained on a restricted data set, the RMSEs for the ML algorithms are higher than for the indices based on the full sample. Apartment is the most prevalent property type; accordingly, the ML algorithms show a lower RMSE than the benchmark for this type, with the only exception being SVR. XGBT and avNNet produce an index almost identical to the one generated by the benchmark, while SVR and Cubist yield much higher index values.

Industrial properties are the type with the fewest transactions. The RMSE for this category is around the same level for all the algorithms used, with ML being below the benchmark in some years and above in others. The variation of the RMSE values is higher for the ML algorithms when compared to the benchmark. For this property type, XGBT has a very similar index to the benchmark, whereas SVR and Cubist have much higher index values. avNNet is the only regression algorithm that exhibits an index with a lower value than the benchmark.

Office and Retail have similar numbers of transactions and the RMSE of the ML algorithms is usually higher than the benchmark. SVR, XGBT and avNNet generated indices with values lower than the benchmark. Cubist produces an index that has lower values than the benchmark up until 2015, but after that the index values are higher.

The restriction per property type imposed in the training sample limits the performance of the ML algorithms. This is more evident after examining the RMSE plots, which show no conclusive improvement when compared to the benchmark.

Stress Test

In order to check model and index stability, a stress test has been performed by sampling 50%, 25%, 10% and 5% of the available data, 30 times for each percentage level, so in total 120 samples. An index has been generated for each sample.
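A sketch of the sampling design, assuming the cleaned transaction data are held in a pandas DataFrame; each subsample is passed through the same index pipeline:

```python
import pandas as pd

def stress_samples(data: pd.DataFrame, fractions=(0.50, 0.25, 0.10, 0.05), n_repeats=30):
    """Yield (fraction, repetition, subsample) triples: 4 levels x 30 repetitions = 120 samples."""
    for frac in fractions:
        for rep in range(n_repeats):
            yield frac, rep, data.sample(frac=frac, random_state=rep)
```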

Figure 4 shows the average index return for each percentage level. Interestingly, all indices exhibit a higher volatility at the 5% level, detaching from the other percentage levels in years 2001, 2004 and 2011. Table 4 presents summary statistics of the volatility (standard deviation of the index returns) per percentage level. All ML algorithms have a similar behaviour, displaying almost no noticeable change in the index until the 5% mark. Only at the 5% level is there a significant index return change, with the index return exhibiting a higher volatility when compared to the full sample. It is possible to corroborate the results in Fig. 4 by examining the mean volatility of the 30 samples in Table 4, where, only at the 5% level, a significant increase in volatility of the returns is noticeable for all regression algorithms.

Figure 5 presents the average RMSE over all stress tests performed. The RMSE for the ML algorithms is noticeably more sensitive to data availability. Nevertheless, for several percentage levels, the RMSE for the ML algorithms is lower than the OLS benchmark. Starting from the 10% level, the RMSE of the ML algorithms increases, decoupling from the previous levels. At the 5% level all ML algorithms display a higher RMSE than the OLS benchmark.

Table 4 provides insight into how “stable” the index returns remain when using fewer observations (see Francke and Van de Minne (2017), who employ a similar stress test). On average, the mean volatility of the returns for OLS is lower than for the ML algorithms at all suppressed percentage levels, except for the 5% level, where XGBT and Cubist have a lower average return volatility. A common trend among all regression algorithms is that the volatility of the returns increases as more data are suppressed. The standard deviation of the return volatility also increases when more data are removed. Only at the 50% and 10% levels is the standard deviation of the return volatility from the benchmark lower than that of the ML algorithms.

Table 4 Summary statistics of the volatility from the 30 samples with 50%, 25%, 10% and 5% of the available observations
Fig. 4

Returns for Stress Tests. All the index returns are similar until the 10% level. At the 5% level, index returns exhibit a higher volatility, detaching from the other percentage levels in years 2001, 2004 and 2011. Numbers refer to the percentage of the sample size being used: 50, 25, 10 and 5 percent. When omitted, the full sample has been used

Fig. 5

RMSEs for Stress Tests. The RMSE for the ML algorithms is noticeably more sensitive to data availability, as it increases more sharply when data are restricted. Starting from the 10% level, the RMSE of the ML algorithms increases, decoupling from the previous levels. At the 5% level all ML algorithms display a higher RMSE than the OLS benchmark. Numbers refer to the percentage of the sample size being used: 50, 25, 10 and 5 percent. When omitted, the full sample has been used

Optimal Window Size

This subsection presents model results from both rolling (RW) and expanding windows (EW) samples. The main motivation for investigating alternative samples is to account for the trade-off between model complexity and out-of-sample accuracy.

A larger time window size is associated with more observations and more year control variables, thus adding more dimensions and complexity to the models. Up to an Optimal Window Size (OWS), the addition of an extra year is expected to have an overall benefit of increasing the out-of-sample accuracy, as the model fits the data better and becomes more generalizable. The balance between model complexity and out-of-sample accuracy can be explored by testing different window sizes and measuring the impact in terms of out-of-sample RMSE.

In our search for the optimal window size, we test windows of 2 to 8 years. Window sizes larger than 8 years would hamper this exercise by leaving too little data available for training. The test data used for the different window sizes must be identical to allow for a fair comparison of results. Therefore, for each window size m, the regression algorithms have been trained on data in the years t − m + 1 to t and used to predict year t + 1, where t + 1 = 2008,…,2019.
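A minimal sketch of this comparison, under the same hypothetical column names as before and with the time fixed effects of multi-year training windows omitted for brevity:

```python
import numpy as np

def mean_window_rmse(df, feature_cols, model_factory, m, test_years=range(2008, 2020)):
    """Average out-of-sample RMSE for an m-year rolling window over a fixed set of test years."""
    rmses = []
    for t1 in test_years:                               # t1 = t + 1, the prediction year
        train = df[df["year"].between(t1 - m, t1 - 1)]  # training years t - m + 1, ..., t
        test = df[df["year"] == t1]
        model = model_factory()
        model.fit(train[feature_cols], train["log_price"])
        errors = test["log_price"].to_numpy() - model.predict(test[feature_cols])
        rmses.append(np.sqrt(np.mean(errors ** 2)))
    return float(np.mean(rmses))

# rmse_by_m = {m: mean_window_rmse(sales, cols, make_model, m) for m in range(2, 9)}
```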

Fig. 6

RMSEs per Window Size. Each regression algorithm behaves differently, depending on the training window size. The best performing algorithms are avNNet with a window size of 2 years and XGBT with a window size of 7 years; RMSE = Root Mean Squared Error

Figure 6 presents out-of-sample RMSEs for the different algorithms and window sizes. The figure shows that each regression algorithm behaves differently, depending on the training window size. This shows that the out-of-sample model performance and corresponding indices are quite sensitive to the window length. Table 5 provides model and index performance statistics for the optimal window size for each algorithm. Table 6 presents the results for the EW approach, which has a lower performance when compared to RW. The results for RW and EW suggest that the optimal window size is bigger than two years and smaller than using all years as in EW. The best performing algorithms are avNNet with a window size of 2 years and XGBT with a window size of 7 years.

Table 5 Optimal window performance summary
Table 6 Expanding window performance summary

Double and single imputation

When comparing the results obtained using single and double imputation (Fig. 7) one can notice a significant cumulative difference of single minus double imputation for the ML algorithms. SVR and Cubist have the highest cumulative difference. Conversely, avNNet shows a moderate cumulative difference, while XGBT is the only ML algorithm that exhibits a negative difference between indices using the true or fitted values. OLS shows no cumulative difference because the fitted mean is equal to the true mean by definition (when a constant is included).

Fig. 7

Cumulative difference between single and double imputation. All ML algorithms exhibit a difference between single and double imputation, SVR and Cubist have the highest cumulative difference; Index difference = Single minus double imputation difference

As presented in the Section labeled “Training and bias-variance trade-off”, there is a trade-off between bias and variance that should be taken into consideration when selecting and evaluating ML algorithms. Algorithms with higher fitted variance have a tendency to overfit, which implies that the mean of the fitted values will be equal or virtually equal to the mean of the true (observed) values as the bias is close to zero.

Table 7 shows the difference between the true mean log price and the fitted mean log price. All ML algorithms display values different from zero, which indicates bias. Table 8, in turn, shows that all ML algorithms have a smaller difference between the true and fitted variance than the benchmark; that is, all ML algorithms have a higher fitted variance than the benchmark. Hence, Tables 7 and 8 display and quantify the bias-variance trade-off for this application.

The double imputation approach is preferred for the ML algorithms, as the bias in the fitted value \(\left (\hat {Y}^{t+1}_{t+1} | \mathcal {F}_{t+1}\right )\) and in the Out-of-Time prediction \(\left (\hat {Y}^{t}_{t+1} | (X_{t+1}, \mathcal {F}_{t})\right )\) will partially offset each other. Hill and Melser (2008) also suggest double imputation for similar reasons. For the purpose of price index construction, it is necessary to be aware of estimation bias, as it can be transferred into the index. A biased model can produce an index that over- or underestimates long-term price trends, especially over long time periods, as the index is built on prior index values and the bias accumulates.

Fig. 8

New York Chained Paasche Price Indices with Double Imputation. Starting from year 2013 all the indices built using the ML regression algorithms are below the benchmark, different from the single imputation approach, where SVR and Cubist had all their values above the benchmark

Figure 8 shows the price indices for the New York commercial real estate market using double imputation. Notice that after the year 2013 all the indices built using the ML regression algorithms are below the benchmark, different from the single imputation approach, where SVR and Cubist had all their values above the benchmark.

Additionally, one of the central assumptions of this paper is that the difference of the means is the periodic price change. With a biased estimator this difference consists of the periodic price change plus the bias. Hence, when using the proposed methodology, the bias-variance trade-off should be taken into consideration, as well as measures to attenuate or eliminate the estimation bias. As mentioned in the Section labeled “Training and bias-variance trade-off”, in the context of this paper, increasing the number of folds during cross-validation might be a way to mitigate estimation bias during training.

As a further check, Welch’s t-tests were performed on all actual/fitted price pairs to test the hypothesis that the two populations have equal means. The test results, such as t-values, p-values and confidence intervals, can be seen in the Appendix labeled “Welch’s t-Test Results”. Note that the algorithms which exhibit a higher difference between single and double imputation indices are the ones with lower p-values on average. However, at the 95% confidence level, none of the tests rejected the null hypothesis. This result is consistent with the two populations having equal means (Tables 9, 10, 11, 12 and 13).
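The check itself amounts to a two-sample t-test without the equal-variance assumption; a minimal sketch with made-up values:

```python
import numpy as np
from scipy.stats import ttest_ind

actual = np.array([15.2, 14.8, 16.1, 15.5])   # illustrative actual log prices
fitted = np.array([15.0, 14.9, 15.9, 15.6])   # illustrative fitted log prices
t_stat, p_value = ttest_ind(actual, fitted, equal_var=False)  # equal_var=False gives Welch's t-test
```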

Table 7 Difference between true and fitted log price means
Table 8 Difference between true and fitted log price variance

Conclusion

This paper presents a model-agnostic approach based on Out-of-Time individual transaction predictions to build price indices for commercial real estate using a variety of non-linear machine learning (ML) algorithms. The key innovation is the use of prediction error to measure time trends. The results obtained support the viability of using ML for constructing price indices. Overall, the non-linear ML algorithms yielded higher accuracy and lower volatility with non-negative first-order autocorrelation of index returns.

The comparison between single and double imputation shows that some of the ML algorithms display estimation bias. Using the proposed methodology for index construction requires attention to the bias and variance trade-off. The findings also highlight the importance of the hyperparameter selection phase in minimizing the introduction of bias while keeping the out-of-sample generalization power. Regression algorithms that exhibit estimation bias could use the double imputation method as a straightforward way to reduce the bias problem.

The stress tests show that linear models (OLS) generate overall more stable indices than the non-linear ML regression algorithms used in this paper when few training data are available, as linear models are less dependent on the number of observations. Also, looking at the index volatility in the stress test, OLS has lower values, on average, than the ML algorithms. Additionally, the variations of the loss function across the tests with 50, 25, 10 and 5 percent of the data are higher for the ML algorithms, especially at the 5% level. The RMSEs from the property type indices corroborate the idea that ML regression algorithms are more dependent on sample size. These indices were generated using regression algorithms estimated on the property type subsamples only. Property types with more observations, such as apartments, produce a lower RMSE than types with fewer observations, such as industrial real estate.

The analysis of the optimal window size for the Rolling Window (RW) approach demonstrates that the magnitude of the optimal window varies greatly across the different algorithms; the window sizes range from 2 to 8 years. The single year window corresponds to the Chained Paasche index (CP) and is ruled out, as the RMSE is higher than both the RW and the Expanding Window (EW).

Considering all the tests performed in this study, it is possible to conclude that in cases where more observations are available, even at the cost of adding more dimensions (controls or features), ML algorithms tend to produce better results than OLS. Cases where few observations or characteristics are accessible favor OLS, as it performs better in restricted data sets when compared to ML.