Abstract
In this paper we discuss the use and the potential advantages and disadvantages of machine learning driven models in rental guides. Rental guides are a formal legal instrument in Germany for surveying rents of flats in cities and municipalities, which are today based on regression models or simple contingency tables. We discuss if and how modern methods of machine learning outperform existing and established routines. We make use of data from the Munich rental guide and mainly focus on the predictive power of these models. We discuss the “black-box” character that makes some of these models difficult to interpret and hence challenging for applications in the rental guide context. Still, it is of interest to see how “black-box” models perform with respect to prediction error. Moreover, we study adversarial effects, i.e. we investigate robustness in the sense of how corrupted data influence the performance of the prediction models. With the data at hand we show that models with promising predictive performance are more vulnerable to corruptions than classic linear models including Ridge or Lasso regularization.
1 Introduction
Rental guides for flats are an official instrument in the German rental market, see e.g. Kauermann and Windmann (2016). Based on regular surveys, city councils issue the average rent for flats given the flat’s characteristics, like floor space and year of construction, and facilities such as a well-equipped kitchen or a high-standard bathroom. Given available survey data one is interested in constructing a prediction model for the rent per square meter, given the input variables, which we call features subsequently. Denoting the features as \(x\) and the rent per square meter as \(y\) we are interested in finding a good prediction model
\[y=f(x)+\varepsilon, \tag{1}\]
where \(\varepsilon\) is considered as noise, mirroring market price variation. We subsequently also refer to \(y\) as the response variable.
Before discussing the statistical approaches to tackle the function \(f(x)\) in (1) we want to add some more explanation about rental guides in Germany. These are used as an instrument to infringe on a landlord’s constitutional rights to his property (Art. 14 German Constitution). Therefore, a judge, in order to deny a landlord a rent increase, will need a solid base for his judgment. Rental guides aim to fulfill this purpose. This, in turn, seems to exclude “black box models” from the set of instruments that one should use. Moreover, there is a significant amount of vested interest involved in the process of rental guide creation, which of course increases the likelihood of data corruption. Hence, careful and detailed guidelines for rental guides are indispensable. With this being said, we can now approach rental guides formally as a prediction model. We also refer to Kauermann and Windmann (2016) and Fitzenberger and Fuchs (2017) for more details.
In times of increased usage of machine learning methods we can consider Eq. (1) as a supervised learning setting. Hence, we may take advantage of the toolbox of available machine learning algorithms to train or estimate a suitable prediction model \(f\). A classical model and in fact the commonly used model in practice (so far) is to build the prediction model \(f\) via regression techniques. One approach, developed by Aigner et al. (1993), is to fit a multiplicative-additive regression model in two stages. The more common strategy is to use an additive regression model, as discussed for instance in Fahrmeir et al. (2022). Regression models are one of the legally permitted models for rental guides, besides very simple models based on (contingency) tables, which we do not consider here in this article. Instead, we go beyond the legally permitted models and want to explore more advanced tools of machine learning as introduced, e.g. by James et al. (2017) and Hastie et al. (2017).
Regression models allow for interpretation due to their open box character. In contrast, more complex machine learning models offer extended flexibility but often make direct interpretation difficult, which gives them a “black-box” character. Still, one may achieve higher prediction accuracy. The recent developments in machine learning suggest investigating their potential use for rental guides. This is the scope of this paper. We use classical regression models including penalized regression and contrast these with regression trees (Breiman 1984) and ensemble models like averaging and boosting (Freund and Schapire 1999; Breiman 2001; Friedman 2001; Chen and Guestrin 2016) but also neural networks (Goodfellow et al. 2016). A comparison is given in terms of model performance and predictive power.
Besides interpretability, the question of robustness of these models comes to the fore. We thereby focus on adversarial effects, see e.g. Biggio et al. (2013); Szegedy et al. (2014); Biggio and Roli (2018); Madry et al. (2018) or Tsipras et al. (2019). Adversarial effects are changes of the input variables to a machine learning model that cause the model to make wrong predictions. We use the concept of adversarial risk proposed in Javanmard et al. (2020); Mehrabi et al. (2021) to quantify the robustness of machine learning based rental guides and compare these to adversarial effects in regression models.
The paper is organized as follows: Sect. 2 briefly describes the data at hand. In Sect. 3 we introduce all prediction models, emphasize some differences and explain their essential hyper-parameters. Sect. 4 introduces the notion of standard and adversarial risk, which is then applied to the rental data. The results of our data analysis are given in Sect. 5 and a conclusion follows in Sect. 6.
2 Data and Software
As a database we make use of the Munich rental guide data from 2019 containing \(n=3024\) sampled apartments, for which we include \(p=19\) selected features (Windmann and Kauermann 2019, Table 2.7). The features and their respective descriptions are listed in Table 1. The features are apparently not independent and their Pearson correlation coefficients are visualized in Fig. 1.
For developing and coding we use Python 3, see Van Rossum and Drake (2009). As IDE (Integrated Development Environment) we use Spyder 4.5.0, see Raybaut (2009). For building statistical models we use the statsmodels (Seabold and Perktold 2010) and scikit-learn (Pedregosa et al. 2011) APIs (Application Programming Interfaces) in their latest versions.
3 Prediction Models
In this section we briefly describe the different types of prediction models used in this paper. A short summary including the models’ standard performance hyper-parameters is provided in Table 3.
3.1 Regression Model
Given features \(\mathbf{x}=(x_{1},\ldots,x_{p})\), we predict the rent per square meter via the model
\[\hat{f}(\mathbf{x})=\hat{\beta}_{0}+\sum_{j=1}^{p}\hat{\beta}_{j}x_{j},\]
where \(\hat{\beta}_{0}\) denotes the intercept and \(\hat{\beta}_{0},\ldots,\hat{\beta}_{p}\) are estimated by the least squares method, i.e. by minimizing the residual sum of squares (RSS)
\[\textit{RSS}(\boldsymbol{\beta})=\sum_{i=1}^{n}\left(y^{(i)}-\mathbf{x}^{(i)}\boldsymbol{\beta}\right)^{2},\]
where \(\mathbf{x}^{(i)}=(1,x_{1}^{(i)},\ldots,x_{p}^{(i)})\) and \(\boldsymbol{\beta}=(\beta_{0},\ldots,\beta_{p})\), with superscript (\(i\)) referring to the observed data and \(i=1,\ldots,n\). We also write \(\hat{f}(\mathbf{x}):=f(\mathbf{x};\hat{\boldsymbol{\beta}})\) and \(\hat{\mathbf{y}}:=\mathbf{x}\hat{\boldsymbol{\beta}}\) as shorthand. In the application, the model is extended by including non-linearities for the metric covariates. To be explicit, we replace the linear fit by a spline-based fit using tools extensively described in Wood (2017).
Given our response variable \(y\), our inputs \(\mathbf{x}\) and our prediction model \(\hat{f}(\mathbf{x})\), the loss function for measuring errors between \(y\) and \(\hat{f}(\mathbf{x})\) is denoted by \(\ell(y,\hat{f}(\mathbf{x}))\). We use the quadratic loss
\[\ell(y,\hat{f}(\mathbf{x}))=\left(y-\hat{f}(\mathbf{x})\right)^{2}\]
for the applications in this paper.
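As a minimal illustration of the least squares fit and the quadratic loss, the following sketch solves the normal equations in plain Python. The data (floor space and a binary kitchen indicator) and all function names are hypothetical and chosen only for this example; the analysis in this paper relies on the libraries named in Sect. 2, not on this code.

```python
def fit_least_squares(X, y):
    """Return (beta_0, ..., beta_p) minimizing the residual sum of squares.

    X holds one feature row per observation, without the leading 1; the
    intercept column is added here. The normal equations
    (A^T A) beta = A^T y are solved by Gauss-Jordan elimination.
    """
    A = [[1.0] + [float(v) for v in row] for row in X]
    q = len(A[0])
    # Augmented matrix [A^T A | A^T y]
    M = [[sum(a[r] * a[c] for a in A) for c in range(q)]
         + [sum(a[r] * yi for a, yi in zip(A, y))]
         for r in range(q)]
    for col in range(q):
        # Partial pivoting for numerical stability
        pivot = max(range(col, q), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(q):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[r][q] / M[r][r] for r in range(q)]


def predict(beta, x):
    """Linear prediction beta_0 + sum_j beta_j * x_j."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def quadratic_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2


def rss(beta, X, y):
    return sum(quadratic_loss(yi, predict(beta, xi)) for xi, yi in zip(X, y))


# Hypothetical toy data: [floor space in square meters, kitchen indicator]
X_toy = [[30, 0], [50, 1], [70, 0], [90, 1], [110, 0]]
# Rents generated without noise from 20 - 0.05 * space + 2 * kitchen
y_toy = [18.5, 19.5, 16.5, 17.5, 14.5]

beta_hat = fit_least_squares(X_toy, y_toy)
print(beta_hat, rss(beta_hat, X_toy, y_toy))
```

Since the toy rents are generated without noise, the fit should recover the generating coefficients up to rounding error and the RSS should be close to zero.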
3.2 Regression Trees
Basically all tree-based methods arise from partitioning the feature space into a set of hyperrectangles and then fitting a simple model in each hyperrectangle; together these local fits give the final prediction model. Amongst several types of algorithms, here we focus on the most common one, the CART algorithm (Classification and Regression Trees). To be specific, regression trees, following Breiman (1984) and Hastie et al. (2017), are built by dividing the feature space \(\mathbb{R}^{p}\) into \(K\) distinct and non-overlapping hyperrectangles \(R_{1},\ldots,R_{K}\) that minimize the RSS given by
\[\textit{RSS}=\sum_{k=1}^{K}\sum_{i:\,\mathbf{x}^{(i)}\in R_{k}}\left(y^{(i)}-\hat{y}_{R_{k}}\right)^{2}, \tag{3}\]
where \(\hat{y}_{R_{k}}\) is the mean response for the observations within the \(k\)th hyperrectangle, i.e. \(\hat{y}_{R_{k}}=\sum_{x^{(l)}\in R_{k}}y^{(l)}/\left|R_{k}\right|\), with \(\left|R_{k}\right|\) the number of observations in \(R_{k}\). For computational runtime reasons one takes a top-down, greedy approach that is known as recursive binary splitting, see Breiman (1984). In order to perform recursive binary splitting, the algorithm needs a starting point. Therefore, we first need to find a feature \(x_{\tilde{p}}\) and a cutpoint \(s\) such that splitting the feature space into the regions \(\{x\;|\;x_{\tilde{p}}<s\}\) and \(\{x\;|\;x_{\tilde{p}}\geq s\}\) leads to the greatest possible reduction in RSS. We consider all features \(x_{1},\ldots,x_{p}\), and all possible values of the cutpoint \(s\) for each of the features, and then choose the feature and cutpoint such that the resulting tree has the lowest RSS. Once found, this feature is called the root of the tree. More precisely, for any \(\tilde{p}\) and \(s\), we define the pair of half-planes
\[R_{1}(\tilde{p},s)=\{x\;|\;x_{\tilde{p}}<s\}\quad\text{and}\quad R_{2}(\tilde{p},s)=\{x\;|\;x_{\tilde{p}}\geq s\}, \tag{4}\]
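and we seek the values of \(\tilde{p}\) and \(s\) that minimize
\[\sum_{i:\,\mathbf{x}^{(i)}\in R_{1}(\tilde{p},s)}\left(y^{(i)}-\hat{y}_{R_{1}}\right)^{2}+\sum_{i:\,\mathbf{x}^{(i)}\in R_{2}(\tilde{p},s)}\left(y^{(i)}-\hat{y}_{R_{2}}\right)^{2}, \tag{5}\]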
where \(\hat{y}_{R_{1}}\), \(\hat{y}_{R_{2}}\) are the mean responses for the observations in \(R_{1}(\tilde{p},s)\) and \(R_{2}(\tilde{p},s)\), respectively. The splitting procedure is then repeated on each of the two regions until a stopping criterion is reached. For instance, we may continue until no region contains more than five observations. This procedure often results in a very complex tree, which is likely to overfit. Therefore, one applies cost-complexity tree pruning, suggested by Breiman (1984), which removes a sequence of subtrees. This procedure is applied after fully growing the tree and is described as follows: A large tree \(T_{0}\) is grown and the splitting process is only stopped when some minimum node size (say 5) is reached. Then, this tree is “pruned” by finding a subtree \(T\subseteq T_{0}\). Let \(\left|T\right|\) denote the number of terminal nodes in tree \(T\) and let \(\textit{RSS}(T)\) be the residual sum of squares given in Eq. (3) for tree \(T\). We define the criterion
\[C_{\alpha}(T)=\textit{RSS}(T)+\alpha\left|T\right|. \tag{6}\]
For given \(\alpha\) we aim to find the subtree \(T_{\alpha}\subseteq T_{0}\) which minimizes \(C_{\alpha}(T)\). The tuning parameter \(\alpha\geq 0\) governs the tradeoff between the tree size and its goodness of fit to the data. Large values of \(\alpha\) result in smaller trees \(T_{\alpha}\), and conversely for smaller values of \(\alpha\). As the notation suggests, with \(\alpha=0\) the solution is the full tree \(T_{0}\). For each \(\alpha\) one can show that there is a unique smallest subtree \(T_{\alpha}\) that minimizes \(C_{\alpha}(T)\). To find \(T_{\alpha}\) we use weakest link pruning, that is, we successively collapse the internal node that produces the smallest per-node increase in RSS, and continue until we produce the single-node (root) tree. This gives a (finite) sequence of subtrees, and one can show that this sequence must contain \(T_{\alpha}\), see Breiman (1984); Hastie et al. (2017). The value of \(\alpha\) is chosen by cross-validation. The final prediction model is then given by the final tree \(T_{\hat{\alpha}}\).
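The first step of recursive binary splitting and the cost-complexity criterion can be sketched as follows. This is a simplified illustration under stated assumptions, not the CART implementation used in our experiments: candidate cutpoints are taken as midpoints between consecutive observed feature values, and the toy data and function names are hypothetical.

```python
def region_rss(ys):
    """RSS of a region around its mean response (0 for an empty region)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys)


def best_split(X, y):
    """Search all features and cutpoints; return (rss, feature index, cutpoint)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Assumed candidate cutpoints: midpoints between consecutive values
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            total = region_rss(left) + region_rss(right)
            if best is None or total < best[0]:
                best = (total, j, s)
    return best


def cost_complexity(rss_value, n_leaves, alpha):
    """Criterion C_alpha(T) = RSS(T) + alpha * |T|."""
    return rss_value + alpha * n_leaves


# Hypothetical toy data: [floor space, kitchen indicator] and rent
X_toy = [[30, 0], [40, 1], [50, 0], [80, 1], [90, 0], [100, 1]]
y_toy = [20, 20, 20, 10, 10, 10]

split = best_split(X_toy, y_toy)          # root split of the tree
root_rss = region_rss(y_toy)              # RSS of the single-node tree
for alpha in (1.0, 200.0):
    print(alpha,
          cost_complexity(root_rss, 1, alpha),   # unsplit root
          cost_complexity(split[0], 2, alpha))   # tree with one split
```

In this toy example the split on floor space separates the two rent levels perfectly, so the two-leaf tree is preferred for small \(\alpha\), while a sufficiently large \(\alpha\) favours the single-node tree.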
3.3 Random Forests
If we consider a regression tree as our statistical model, bagging regression trees, as proposed by Breiman (1996), is an aggregation over \(B\) bootstrap samples
\[\mathbf{x}^{*1},\mathbf{x}^{*2},\ldots,\mathbf{x}^{*B}, \tag{7}\]
sampled randomly from the original training data with replacement, all having the same size. Then for each bootstrap sample \(\mathbf{x}^{*b}\), a corresponding bootstrap replication regression tree \(T^{*b}\) is grown, for all \(b=1,\ldots,B\). Together all \(B\) regression trees \(T^{*1},T^{*2},\ldots,T^{*B}\) are combined into a final prediction model \(\hat{f}_{\text{bag}}(\mathbf{x})\). This model is given by the arithmetic mean of the predictions obtained from the \(B\) trees \(T^{*1},T^{*2},\ldots,T^{*B}\), i.e.
\[\hat{f}_{\text{bag}}(\mathbf{x})=\frac{1}{B}\sum_{b=1}^{B}T^{*b}(\mathbf{x}). \tag{8}\]
Random forests, see e.g. Breiman (2001) and James et al. (2017), are basically an extension of bagging. They fit regression trees \(\hat{T}^{*b}\) as base regressors on each of the \(b=1,\ldots,B\) bootstrap samples. However, when splitting each node during the construction of a tree (as described in Sect. 3.2), the best split is found from a random subset of the features (see hyper-parameter max_features in Table 3). Hence, instead of considering all \(p\) features \(x_{1},\ldots,x_{p}\), a split is chosen from a random sample of \(m<p\) features serving as split candidates. The final prediction model is then given by
\[\hat{f}_{\text{rf}}(\mathbf{x})=\frac{1}{B}\sum_{b=1}^{B}\hat{T}^{*b}(\mathbf{x}), \tag{9}\]
where \(\hat{T}^{*b}\), \(b=1,\ldots,B\), are the trees grown with random feature subsets.
These two sources of randomness, bootstrap sampling and random feature subsets, are intended to reduce the prediction error of random forests compared to a single tree. This will also be visible in our experiments later in the paper.
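The two sources of randomness can be illustrated with a deliberately small sketch. As a simplifying assumption, each base learner is a depth-one tree (a stump), so the random feature subset is drawn once per tree, which coincides with drawing it per node only because a stump has a single node. All names and data are hypothetical; setting `max_features=None` uses all features and thus corresponds to bagging.

```python
import random


def fit_stump(X, y, features):
    """Depth-one regression tree: best RSS split among the given features."""
    mean_all = sum(y) / len(y)
    # Fallback: constant prediction if no split improves the RSS
    best = (sum((v - mean_all) ** 2 for v in y), None, None, mean_all, mean_all)
    for j in features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            rss = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if rss < best[0]:
                best = (rss, j, s, ml, mr)
    return best[1:]


def predict_stump(stump, x):
    j, s, ml, mr = stump
    if j is None:
        return ml
    return ml if x[j] < s else mr


def fit_forest(X, y, n_trees=25, max_features=None, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = p if max_features is None else max_features
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        features = rng.sample(range(p), m)           # random feature subset
        stumps.append(fit_stump([X[i] for i in idx],
                                [y[i] for i in idx], features))
    return stumps


def predict_forest(stumps, x):
    """Arithmetic mean of the individual tree predictions."""
    return sum(predict_stump(t, x) for t in stumps) / len(stumps)


# Hypothetical toy data: [floor space, kitchen indicator] and rent
X_toy = [[30, 0], [40, 1], [50, 0], [80, 1], [90, 0], [100, 1]]
y_toy = [20, 21, 19, 11, 10, 12]

bagging = fit_forest(X_toy, y_toy, max_features=None, seed=1)
forest = fit_forest(X_toy, y_toy, max_features=1, seed=1)
print(predict_forest(bagging, [35, 0]), predict_forest(forest, [35, 0]))
```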
3.4 Boosting
A weak learner is defined to be a regression model whose predictions are only slightly correlated with the true response. Boosting, originally proposed by Schapire (1990) for classification tasks, answers the question whether a set of weak learners can be put together to form a strong learner. As in the above supervised learning problems, the goal is to find a function \(\hat{f}\) that best predicts the output variable \(y\) from the input variables \(x_{1},\ldots,x_{p}\). Let \(\ell(y,f(\mathbf{x}))\) be the \(\ell_{2}\)-loss. Then, we want to minimize this loss formally by
\[\hat{f}=\underset{f}{\arg\min}\;\mathbb{E}\left[\ell(y,f(\mathbf{x}))\right]. \tag{10}\]
The idea of boosting is to predict the response \(y\) by fitting an additive model
\[f(\mathbf{x})=\sum_{m=1}^{M}\gamma_{m}t_{m}(\mathbf{x}) \tag{11}\]
for \(M\in\mathbb{N}\), \(\gamma_{m}\in\mathbb{R}\) as weight and \(t_{m}(\mathbf{x})\in T\), where \(T\) denotes the class of base or weak learners, e.g. the class of regression trees. This model is optimized in a forward stage-wise manner, meaning that at each stage the new learner corrects the errors of its predecessors.
3.4.1 Steepest Descent
Unfortunately, choosing the best function \(t\) at each step for an arbitrary loss function \(\ell\) is a computationally infeasible optimization problem in general. Therefore, we restrict our approach to a simplified version of the problem. There are basically two methods. The first algorithm, AdaBoost, uses a specific exponential loss function and stumps, which are usually smaller than the trees built with gradient tree boosting (gradie) explained in the next section, see also Freund and Schapire (1996). In this paper we use a more flexible way of updating the model relative to its predecessors, called gradient descent. It is a first-order iterative optimization algorithm for finding a local minimum of a differentiable convex loss function by going in the opposite direction of the gradient at a point. This is the direction of the steepest descent (Cauchy 1847), which is given by the negative gradient \(-\mathbf{g}(f)\) of a function \(f\). The gradient \(\mathbf{g}(f)\) of a real-valued, \(p\)-dimensional function \(f\) is defined as
\[\mathbf{g}(f)=\sum_{j=1}^{p}\frac{\partial f}{\partial x_{j}}e_{j}, \tag{12}\]
where \(e_{1},\ldots,e_{p}\) denote the unit vectors.
3.4.2 Gradient Tree Boosting
Gradient Tree Boosting (Breiman 1997; Friedman 2001) uses regression trees of fixed size as base learners. It specializes the approach above to the case where the base learner \(t_{m}(\mathbf{x})\) is a \(J_{m}\)-terminal leaf regression tree. More precisely, let \(F\) be the set of regression trees. Each tree comes with a respective partition of the feature space \(R_{j},j=1,2,\ldots,J\) induced by the terminal nodes of the tree. In this case, at the \(m\)th step one fits a regression tree \(t_{m}(\mathbf{x})\) to the previous pseudo-residuals, obtained by steepest descent as
\[r_{m}^{(i)}=-\left[\frac{\partial\ell(y^{(i)},f(\mathbf{x}^{(i)}))}{\partial f(\mathbf{x}^{(i)})}\right]_{f=f_{m-1}},\quad i=1,\ldots,n, \tag{13}\]
where \(f_{m-1}\) represents the combined prediction from the ensemble of trees up to the \((m-1)\)th iteration in the gradient boosting process. These pseudo-residuals form a new data set \(\{(\mathbf{x}^{(i)},r_{m}^{(i)})\}_{i=1}^{n}\) on which \(t_{m}\) is fitted. Now, the output \(t_{m}(\mathbf{x})\) for input \(\mathbf{x}\) can be written as the sum
\[t_{m}(\mathbf{x})=\sum_{j=1}^{J_{m}}b_{jm}\mathbf{1}_{R_{jm}}(\mathbf{x}), \tag{14}\]
where \(b_{jm}=\bar{y}_{jm}\), the mean of the \(y^{(i)}_{jm}\) predicted in region \(R_{jm}\), and \(\mathbf{1}_{R_{jm}}(\mathbf{x})\) denotes the indicator function. We optimize this expression and replace the \(b_{jm}\)’s by calculating the one-dimensional \(\gamma_{jm}\)’s in each of the tree’s regions \(R_{jm}\). Hence we write
\[f_{m}(\mathbf{x})=f_{m-1}(\mathbf{x})+\sum_{j=1}^{J_{m}}\gamma_{jm}\mathbf{1}_{R_{jm}}(\mathbf{x}),\quad\gamma_{jm}=\underset{\gamma}{\arg\min}\sum_{i:\,\mathbf{x}^{(i)}\in R_{jm}}\ell\left(y^{(i)},f_{m-1}(\mathbf{x}^{(i)})+\gamma\right). \tag{15}\]
To control overfitting we take a sensible number of trees (parameter: n_estimators) and a further shrinkage by \(\nu\), the learning rate (parameter: learning_rate), which can be plugged into the update rule (15) as follows:
\[f_{m}(\mathbf{x})=f_{m-1}(\mathbf{x})+\nu\sum_{j=1}^{J_{m}}\gamma_{jm}\mathbf{1}_{R_{jm}}(\mathbf{x}). \tag{16}\]
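A minimal sketch of gradient tree boosting with shrinkage, under simplifying assumptions, may help to fix ideas: there is a single feature, the base learners are stumps, and the loss is quadratic, so that the pseudo-residuals are (up to a constant factor) the ordinary residuals and the leaf means of the residuals coincide with the optimal leaf values. Function names and data are hypothetical; this is not the implementation used for gradie or xgbreg.

```python
def fit_stump(x, r):
    """Least-squares stump on one feature: (cutpoint, left mean, right mean).

    Assumes that x contains at least two distinct values.
    """
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi < s]
        right = [ri for xi, ri in zip(x, r) if xi >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        rss = (sum((v - ml) ** 2 for v in left)
               + sum((v - mr) ** 2 for v in right))
        if best is None or rss < best[0]:
            best = (rss, s, ml, mr)
    return best[1:]


def boost(x, y, n_estimators=50, learning_rate=0.1):
    """Return the fitted model and the training MSE after each stage."""
    f0 = sum(y) / len(y)                       # initial constant model
    pred = [f0] * len(y)
    stumps, history = [], []
    for _ in range(n_estimators):
        # Quadratic loss: the negative gradient is proportional to y - f
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        s, gl, gr = fit_stump(x, residuals)
        stumps.append((s, gl, gr))
        # Shrunken update with learning rate nu
        pred = [pi + learning_rate * (gl if xi < s else gr)
                for xi, pi in zip(x, pred)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    model = {"f0": f0, "nu": learning_rate, "stumps": stumps}
    return model, history


def predict(model, xi):
    update = sum(gl if xi < s else gr for s, gl, gr in model["stumps"])
    return model["f0"] + model["nu"] * update


# Hypothetical toy data with one feature
x_toy = [1, 2, 3, 4, 5, 6]
y_toy = [1, 1, 1, 5, 5, 5]

model, history = boost(x_toy, y_toy)
print(history[0], history[-1], predict(model, 1), predict(model, 6))
```

With a learning rate below one, each stage removes only a fraction of the current residuals, so the training error decreases gradually rather than in a single step.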
3.4.3 Stochastic Gradient and Extreme Boosting
Another way to apply gradient boosting is to fit the trees only on subsamples. This is called Stochastic Gradient Boosting (Friedman 2001). The size of each tree can be controlled either by setting the tree depth via max_depth or by setting the number of leaf nodes via max_leaf_nodes, see Table 3.
In this paper we use an implementation of a gradient tree boosting algorithm (gradie) as described in Pedregosa et al. (2011). We further use extreme boosting (xgbreg) (Chen and Guestrin 2016), which uses a second-order Taylor approximation of the loss function to weight the leaves inside a tree, see Eq. (31) in the Appendix. The idea was proposed by Friedman et al. (2000). For parameter details see Table 3.
3.5 Nearest Neighbours
The principle behind the \(k\)-Nearest-Neighbour method (KNN) used for regression, see Altman (1992), is to find the \(k\) observations that are closest in distance to the new point \(\mathbf{x}^{(0)}\); this set is denoted by \(\mathcal{N}^{(0)}\). Then \(f(\mathbf{x}^{(0)})\) is estimated using the average of all the training responses in \(\mathcal{N}^{(0)}\), namely
\[\hat{f}(\mathbf{x}^{(0)})=\frac{1}{k}\sum_{i:\,\mathbf{x}^{(i)}\in\mathcal{N}^{(0)}}y^{(i)}. \tag{17}\]
The number of samples can be a user-defined constant (\(k\)-nearest neighbour learning), or vary based on the local density of points (radius-based neighbour learning). The distance can, in general, be any metric; the standard Euclidean distance is the most common choice. Sometimes it makes sense to assign more weight to the nearer neighbours. A common technique to achieve this is to define weights by the inverse of the distance between \(\mathbf{x}^{(0)}\) and a neighbour \(\mathbf{x}^{(k)}\), doing so for all \(\mathbf{x}^{(k)}\in\mathcal{N}^{(0)}\), see also Cunningham and Delany (2020).
Following Pedregosa et al. (2011), the choice of the right nearest neighbour algorithm depends on the size of the sample data and the feature space. For data sets that are small in both respects, the sample size and the feature space, one usually uses the brute-force algorithm. It computes the distances between all pairs of points in the data set. In our experiment we use this method, since the data set is relatively small. For details on the hyperparameters see Table 3. As the amount of data grows, calculating all these distances becomes computationally infeasible. In that case a method called k-d tree can be applied, see Mehlhorn (1988); Pedregosa et al. (2011); Bentley (1975) for details.
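The brute-force variant with optional inverse-distance weights can be sketched in a few lines. The function name and the toy data are hypothetical, and the handling of exact matches in the weighted case (returning the average of the matching responses) is an assumption made here to avoid division by zero.

```python
import math


def knn_predict(X, y, x0, k=3, weighted=False):
    """Brute-force k-nearest-neighbour regression with Euclidean distance."""
    # Distances from x0 to every training point, smallest first
    neighbours = sorted((math.dist(xi, x0), yi) for xi, yi in zip(X, y))[:k]
    if not weighted:
        return sum(yi for _, yi in neighbours) / len(neighbours)
    exact = [yi for d, yi in neighbours if d == 0.0]
    if exact:
        # Assumed convention: exact matches determine the prediction
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in neighbours]
    return sum(w * yi for w, (_, yi) in zip(weights, neighbours)) / sum(weights)


# Hypothetical toy data with one feature
X_toy = [[0], [1], [2], [10]]
y_toy = [1, 2, 3, 10]

print(knn_predict(X_toy, y_toy, [1], k=3))                     # plain average
print(knn_predict(X_toy, y_toy, [1.5], k=3, weighted=True))    # weighted
```

In the weighted call the two neighbours at distance 0.5 receive three times the weight of the neighbour at distance 1.5, which pulls the prediction towards their responses.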
3.6 Neural Networks
In this paper we focus on multilayer perceptrons (mlpreg) (Goodfellow et al. 2016) as a neural network regression model. Since the data flow only in the forward direction, this model is also known as a feed-forward neural network. The network defines a mapping \(\hat{y}=f(\mathbf{x};\hat{\boldsymbol{\beta}})\) with \(\hat{y}\) as predictor for given input variables \(\mathbf{x}\), where the values of the parameter \(\boldsymbol{\beta}\) are learned (estimated) from the data. Typically, one aims to minimize the squared prediction error \(E((\hat{y}-y)^{2})\). Such a network is usually represented by a composition of many different functions. For example, we might have three functions \(f^{[1]}\), \(f^{[2]}\), and \(f^{[3]}\) connected to form \(f(\mathbf{x})=f^{[3]}(f^{[2]}(f^{[1]}(\mathbf{x})))\). The function \(f^{[1]}\) is called the first layer, \(f^{[2]}\) the second layer and so on. The overall length of the chain gives the depth of the model. The final layer is called the output layer and the layers before it are the so-called hidden layers. The hidden layer functions themselves are multivariate but simple in their structure: they combine a weighted sum of the input with an activation function. One can write the \(j\)-th component of layer \(k\) as \(f^{[k]}_{j}(\mathbf{x}^{[k-1]};\mathbf{w}_{k,j},b_{k,j})=\phi(\mathbf{x}^{[k-1]^{T}}\mathbf{w}_{k,j}+b_{k,j})\), where \(\mathbf{x}^{[k-1]}\) is the (multivariate) output of the previous layer with \(\mathbf{x}^{[0]}=\mathbf{x}\). The weights \(\mathbf{w}_{k,j}\) and the intercepts \(b_{k,j}\) (the so-called biases) for \(k=1,2,\ldots\) are the parameters, which need to be determined from the data. The set of all parameters defines \(\boldsymbol{\beta}\) and leads to the trained model \(f(\mathbf{x};\hat{\boldsymbol{\beta}})\). The function \(\phi(\cdot)\) is called the activation function and is chosen in advance. With this setup we can now find optimal weights such that the prediction error is minimized.
To do so one can use cross-validation so that the model is trained on one part of the data and tested on the other part. We do not go into more technical detail here, since the field of neural networks has become vast and numerous introductory texts are available. We refer to Goodfellow et al. (2016) or Hastie et al. (2017) for more details, or to Borth et al. (2023).
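The layer-wise composition can be made concrete with a small forward pass. The sketch below only evaluates a network with given parameters and does not train it; the weights, the ReLU activation in the hidden layer and the identity activation in the output layer are assumptions chosen for illustration.

```python
def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def layer(x_prev, weights, biases, activation):
    """One layer: component j is activation(x_prev^T w_j + b_j)."""
    return [activation(sum(w * v for w, v in zip(w_j, x_prev)) + b_j)
            for w_j, b_j in zip(weights, biases)]


def forward(x, layers):
    """Compose the layers f^[1], f^[2], ... from the input to the output."""
    out = list(x)
    for weights, biases, activation in layers:
        out = layer(out, weights, biases, activation)
    return out


# Hypothetical network: 2 inputs -> 2 hidden units (ReLU) -> 1 output
network = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),   # hidden layer
    ([[2.0, 1.0]], [0.5], identity),                  # output layer
]

print(forward([3.0, 1.0], network))
print(forward([0.0, 0.0], network))
```

For the input (3, 1) the hidden units evaluate to 2 and 1, so the output is 2 · 2 + 1 · 1 + 0.5 = 5.5; for the zero input both hidden units are switched off and only the output bias 0.5 remains.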
4 Performance Measures
After having introduced the different models which will be used for constructing rental guides, we need to define the performance measures applied subsequently. We will thereby look not only at the prediction error but also at the robustness, utilizing the concept of standard and adversarial risk introduced in Javanmard et al. (2020); Mehrabi et al. (2021), restricted to the case of the \(\ell_{2}\)-loss.
4.1 Standard Risk
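We define the standard risk
\[\mathsf{SR}(f):=\mathbb{E}\left[\ell(y,f(\mathbf{x}))\right] \tag{18}\]
to be the prediction loss of an estimator \(f\) on an (uncorrupted) test data point \(\mathbf{x}\), where \((\mathbf{x},y)\sim\mathcal{P}\) is drawn from some common law \(\mathcal{P}\). An empirical estimate for \(\mathsf{SR}\) is given by
\[\widehat{\mathsf{SR}}(f)=\frac{1}{n}\sum_{i=1}^{n}\ell\left(y^{(i)},\hat{f}_{-i}(\mathbf{x}^{(i)})\right), \tag{19}\]
where \(\hat{f}_{-i}\) is the prediction model fitted on data omitting the \(i\)th observation.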
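The leave-one-out estimate can be sketched generically for any fitting routine. In the following illustration `fit` is a hypothetical callable that takes training data and returns a predictor; the mean predictor and the toy data are placeholders, not one of the models compared in this paper.

```python
def loo_standard_risk(X, y, fit):
    """Leave-one-out estimate of the standard risk under quadratic loss."""
    n = len(y)
    total = 0.0
    for i in range(n):
        # Fit the model without the i-th observation
        f_minus_i = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - f_minus_i(X[i])) ** 2
    return total / n


def fit_mean(X_train, y_train):
    """Placeholder model: always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean


# Hypothetical toy data
X_toy = [[0], [1], [2]]
y_toy = [1, 2, 3]

print(loo_standard_risk(X_toy, y_toy, fit_mean))
```

Here the three leave-one-out means are 2.5, 2 and 1.5, giving squared errors 2.25, 0 and 2.25 and hence an estimated standard risk of 1.5.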
4.2 Adversarial Risk
An adversarially attacked model \(\hat{f}_{\mathcal{S}}\) is a prediction model \(\hat{f}\) that is fed with data corrupted by elements of a predefined perturbation set \(\mathcal{S}:=\{\delta\in\mathbb{R}^{p}:\|\delta\|_{\ell_{2}}\leq\epsilon\}\subset\mathbb{R}^{p}\). The adversary has the power \(\epsilon\) to perturb each data point \(x^{(i)}\) by adding an element of \(\mathcal{S}\). The main idea of assessing the adversarial risk is to measure the robustness of the model. We quantify how much the prediction changes if a single or multiple input variables are false, i.e. perturbed from their original values.
We define the adversarial risk as
\[\mathsf{AR}(f,\mathcal{S}):=\mathbb{E}\left[\max_{\boldsymbol{\delta}\in\mathcal{S}}\ell\left(y,f(\mathbf{x}+\boldsymbol{\delta})\right)\right], \tag{20}\]
which is the expected prediction loss of a predictor \(f\) on an adversarially corrupted data point according to some attack or mistake model. Stated differently, the adversarial risk measures how badly the predictor \(f\) can perform in prediction when it is fed with adversarially corrupted data.
To motivate the adversarial attack in more detail, we first have to consider our two types of features, namely metrically scaled and binary (categorical) variables. For noising a metric feature \(x_{j}\), say, a value \(\delta\in\mathbb{R}\) is added to \(x_{j}\) in such a way that the sum \((x_{j}+\delta)\) is still inside the range of values of \(x_{j}\). This means the sum ranges between the minimum and the maximum realization of \(x_{j}\). For binary features \(x_{l}\) this goes analogously, except that the only possible values are 0 and 1. Hence, in the binary case, the perturbation is achieved by \(x_{l}+(-1)^{x_{l}}\). This defines the perturbation set \({\cal S}\). We additionally restrict the number of perturbed variables and define the sets \(\mathcal{S}_{k}\subset{\cal S}\) such that each element of \({\cal S}_{k}\) has exactly \(k\) entries which are unequal to zero. In other words, in \({\cal S}_{k}\) we perturb exactly \(k\) features and leave the remaining features unchanged. This leads to the adversarial risk
\[\mathsf{AR}(f,k):=\mathbb{E}\left[\max_{\boldsymbol{\delta}\in\mathcal{S}_{k}}\ell\left(y,f(\mathbf{x}+\boldsymbol{\delta})\right)\right]. \tag{21}\]
Evidently, \(\mathsf{AR}(f,{\cal S})=\max_{k=1,\ldots,p}\mathsf{AR}(f,k)\).
We estimate (21) through
\[\widehat{\mathsf{AR}}(f,k)=\frac{1}{n}\sum_{i=1}^{n}\max_{\boldsymbol{\delta}\in\mathcal{S}_{k}}\ell\left(y^{(i)},\hat{f}_{-i}(\mathbf{x}^{(i)}+\boldsymbol{\delta})\right), \tag{22}\]
where \(\hat{f}_{-i}\) is the prediction model fitted on data excluding the \(i\)th observation.
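For \(k=1\) the estimate can be sketched as follows. Two assumptions are made purely for illustration: the maximum over perturbations of a metric feature is approximated on an evenly spaced grid between its observed minimum and maximum, and the model is a hypothetical stand-in that ignores the training data so that the arithmetic stays transparent. Rows of `X` are expected to be lists, and at least one feature must vary.

```python
def perturbations_k1(x, bounds, binary, n_grid):
    """Yield copies of x in which exactly one feature has been changed."""
    for j, value in enumerate(x):
        if j in binary:
            candidates = [1 - value]          # x_j + (-1)^{x_j} flips 0 and 1
        else:
            lo, hi = bounds[j]
            step = (hi - lo) / (n_grid - 1)
            # Assumed approximation: grid over the observed range of x_j
            candidates = [lo + t * step for t in range(n_grid)]
        for c in candidates:
            if c != value:
                yield x[:j] + [c] + x[j + 1:]


def adversarial_risk_k1(X, y, fit, binary=(), n_grid=11):
    """Leave-one-out estimate of AR(f, 1) under quadratic loss."""
    p = len(X[0])
    bounds = [(min(r[j] for r in X), max(r[j] for r in X)) for j in range(p)]
    total = 0.0
    for i in range(len(y)):
        f_minus_i = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        # Worst-case loss over all single-feature perturbations
        total += max((y[i] - f_minus_i(x_adv)) ** 2
                     for x_adv in perturbations_k1(X[i], bounds, binary, n_grid))
    return total / len(y)


def fit_known_linear(X_train, y_train):
    """Hypothetical stand-in model that ignores the training data."""
    return lambda x: 10.0 + 0.1 * x[0] + 2.0 * x[1]


# Hypothetical toy data: [metric feature, binary feature]
X_toy = [[0, 0], [10, 1]]
y_toy = [10.0, 13.0]

ar = adversarial_risk_k1(X_toy, y_toy, fit_known_linear, binary={1})
print(ar)
```

In this example the stand-in model reproduces both responses exactly on uncorrupted data, while flipping the binary feature shifts each prediction by 2, so the worst-case squared loss per observation is 4.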
5 Results
In this section we compare the performance of the models introduced above with respect to the standard and adversarial risk, respectively. We abbreviate the different models as follows: linear regression model (linear), decision tree (decisi), bagging tree (baggin), random forest (random), gradient boosting (gradie), extreme-boosting tree (xgbreg), \(k\)-neighbours (\(k\)neigh), multi-layer perceptron (mlpreg).
5.1 Standard Risk
The standard risk for the different models is shown in Fig. 2, see also Table 2. For the standard risk we have in descending order decisi (2.70), baggin (2.61), \(k\)neigh (2.56), random (2.50), mlpreg (2.45), linear (2.44), gradie (2.42), xgbreg (2.40). Hence, the best results are obtained by the boosting approaches xgbreg and gradie followed by the classical linear model and the neural network mlpreg. Then we have \(k\)neigh and random with almost equal results. At the end we have baggin and decisi. The last two regression models are also almost equal since baggin is fitted using decisi as base regressors. It is reassuring to see that the “open box” regression model performs quite well.
5.2 Adversarial Risk
5.2.1 Model Robustness (global Robustness)
We describe the results for \(\boldsymbol{\delta}\in\mathcal{S}_{k}\) with \(k=1\). In other words, a single input variable is perturbed. The highest adversarial risk is obtained by baggin (4.17), followed by xgbreg (3.63). This is followed by decisi (3.62), gradie (3.26), \(k\)neigh (3.24), linear (3.22). The lowest adversarial risks are given by mlpreg (3.16) and random (3.10), see Fig. 3. Interestingly enough, linear regression is performing well, while some models with small standard risk suffer from larger adversarial risk.
To get a sense of the robustness of the algorithms against adversarial attacks, we also calculate the robustness measure \(\rho:=\mathsf{AR}/\mathsf{SR}\in\mathbb{R}\), the ratio between the adversarial risk and the standard risk. Usually we expect the adversarial risk to be higher than the standard risk for corrupted data. This means that we expect \(\rho\) to be always greater than or at least equal to 1.
For \(\rho\) in descending order we have baggin (1.59), xgbreg (1.51), gradie (1.35), decisi (1.34), linear (1.32). The models mlpreg (1.29) and \(k\)neigh (1.27) showed similar robustness. Finally, random (1.23) performed best in terms of robustness, see Fig. 4 and Table 2.
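As a plausibility check, the ratio \(\rho\) can be recomputed from the rounded risks reported in Sects. 5.1 and 5.2.1. Because the inputs are rounded to two decimals, the recomputed values may differ from the reported ones in the last digit (for baggin and random in this case); presumably the reported values are based on unrounded risks, which is an assumption on our side.

```python
# Standard and adversarial risks (k = 1) as reported in the text above
standard_risk = {
    "decisi": 2.70, "baggin": 2.61, "kneigh": 2.56, "random": 2.50,
    "mlpreg": 2.45, "linear": 2.44, "gradie": 2.42, "xgbreg": 2.40,
}
adversarial_risk = {
    "baggin": 4.17, "xgbreg": 3.63, "decisi": 3.62, "gradie": 3.26,
    "kneigh": 3.24, "linear": 3.22, "mlpreg": 3.16, "random": 3.10,
}

# Robustness measure rho = AR / SR for every model
rho = {m: adversarial_risk[m] / standard_risk[m] for m in standard_risk}

# Models ordered from least robust (largest rho) to most robust
ranking = sorted(rho, key=rho.get, reverse=True)
for m in ranking:
    print(f"{m}: {rho[m]:.2f}")
```

The resulting ordering of the models agrees with the ordering stated above.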
5.2.2 Feature Robustness (local Robustness)
We can take a look at the adversarial risk and see how perturbing the different features contribute to the adversarial risk. That is we look at the quantities
for all \(\boldsymbol{\delta}\in{\cal S}_{k}\). This helps to identify which features cause a lack of robustness of the prediction. We do this for every feature and every learning model. This approach can also serve as a robust feature importance measure. Not surprisingly, we see that the different features have different effects on the models’ adversarial risk, see Fig. 5. We can also observe that even if one model performs better in terms of a lower adversarial risk, it may have a higher value of the robustness measure \(\rho\), i.e. be less robust; compare, for example, \(k\)neigh and mlpreg in Figs. 5, 6 and Table 2.
Our experiment also shows that in total, there is no model which outperforms all the other models in terms of standard and adversarial risk.
6 Conclusions
This paper studies the use and performance of machine learning regression models when used for rental guides. We also study the models’ behaviour under adversarial data corruption using data from the Munich rental guide. In Sect. 5 we compared different machine learning regression models which seem most suitable for explaining rental prices for flats in German rental guides. Even though boosting algorithms showed a prediction performance comparable to that of classic prediction models like linear regression, their black-box character remains: rental guides need differentiated and reliable (robust) information about which criteria influence the net rent, in both the positive and the negative direction.
In Sects. 5.1 and 5.2 we showed the standard and adversarial performance of the given models. In Sects. 4.2 and 5.2.1 we addressed adversarial regression by building prediction models on top of corrupted data. We also compared the (local) standard and adversarial risk of features which are highly influential for a model’s performance (see Fig. 5). We found that a model \(\hat{f}\) may have a lower adversarial risk than another model \(\hat{f}^{\prime}\), while \(\hat{f}^{\prime}\) may nevertheless be more robust than \(\hat{f}\) in terms of the robustness measure \(\rho=\mathsf{AR}/\mathsf{SR}\), see for example \(k\)neigh and mlpreg in Figs. 5 and 6 and Table 2. Local examples, meaning a feature-wise comparison, can be found in Table 4. For example, within the feature “specialEquipment” baggin shows a higher adversarial risk but lower robustness than xgbreg.
Concretely, in terms of both the risk measures \(\mathsf{SR}\) and \(\mathsf{AR}\) on the one hand and the robustness measure \(\rho\) on the other hand, random, mlpreg, linear, and gradie showed good predictive power with low adversarial risk and high robustness against adversarial attacks, see Fig. 5. The model \(k\)neigh showed an overall moderate performance and xgbreg showed a high adversarial risk.
Both models, decisi and baggin, showed a higher standard and adversarial risk than their competitors, but decisi has a robustness measure similar to that of xgbreg. The worst performance is obtained by baggin with the second highest standard risk and the highest adversarial risk, resulting in the highest value of \(\rho\), i.e. the lowest robustness. Additionally, it should be mentioned that several features are highly vulnerable when considered within the models baggin and decisi, see Table 2.
As a résumé, we see the resulting volatility of the models applied to (slightly) corrupted data as explained in Sect. 5.2.2. This conflicts with the two demands placed on prediction models in this context, namely being well explainable on the one hand and robust on the other hand. It is therefore recommended to search for a model that balances the trade-off between predictive performance and adversarial risk. This helps to reduce the danger of generating machine learning regression models with misleading explanations. Therefore, we conclude that the best way to overcome this issue is to use classic regression models. Classic regression models may be supported by machine learning tools to make them more effective. But machine learning models should not be used as a replacement for classic regression models in the rental guide context.
References
Aigner, Oberhofer, Schmidt (1993) Eine neue methode zur erstellung eines mietspiegels am beispiel der stadt regensburg. Wohnungswirtschaft und Mietrecht WM 1993(1/2/93):16–21
Altman NS (1992) An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46(3):175–185, http://www.jstor.org/stable/2685209
Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517, https://doi.org/10.1145/361002.361007
Biggio B, Roli F (2018) Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition 84(3):317–331, https://doi.org/10.1016/j.patcog.2018.07.023, http://arxiv.org/pdf/1712.03141v2
Biggio B, Corona I, Maiorca D, Nelson B, Srndic N, Laskov P, Giacinto G, Roli F (2013) Evasion attacks against machine learning at test time 7908(1):387–402, https://doi.org/10.1007/978-3-642-40994-3_25, http://arxiv.org/pdf/1708.06131v1
Borth D, Hüllermeier E, Kauermann G (2023) Maschinelles Lernen, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 19–49. https://doi.org/10.1007/978-3-662-66278-6_4
Breiman L (1984) Classification and regression trees. The Wadsworth statistics, probability series
Breiman L (1996) Bagging predictors. Machine Learning 24(2):123–140, https://doi.org/10.1007/BF00058655
Breiman L (1997) Arcing the edge. University of California, 486
Breiman L (2001) Random forests. Machine Learning 45(1):5–32, https://doi.org/10.1023/A:1010933404324
Cauchy A (1847) Méthode générale pour la résolution des systèmes d’équations simultanées. CR Acad Sci Paris 25:536–538, https://ci.nii.ac.jp/naid/10026863174/en/
Chen T, Guestrin C (2016) Xgboost: A scalable tree boosting system. CoRR abs/1603.02754, http://arxiv.org/abs/1603.02754
Cunningham P, Delany SJ (2020) k‑nearest neighbour classifiers: 2nd edition (with python examples). CoRR abs/2004.04523, https://arxiv.org/abs/2004.04523
Fahrmeir L, Kneib T, Lang S, Marx BD (2022) Regression: Models, methods and applications, 2nd edn. Springer eBook Collection, Springer, Berlin, Heidelberg, https://doi.org/10.1007/978-3-662-63882-8
Fitzenberger B, Fuchs B (2017) The residency discount for rents in Germany and the tenancy law reform act 2001: Evidence from quantile regressions. German Economic Review 18(2):212–236
Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: Proceedings of the Thirteenth International Conference on Machine Learning, Morgan Kaufmann, pp 148–156
Freund Y, Schapire RE (1999) A short introduction to boosting. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pp 1401–1406
Friedman JH (2001) Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29(5):1189–1232, https://doi.org/10.1214/aos/1013203451, https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-function-approximation-A-gradient-boostingmachine/10.1214/aos/1013203451.full
Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting (With discussion and a rejoinder by the authors). The Annals of Statistics 28(2):337–407, https://doi.org/10.1214/aos/1016218223
Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press, http://www.deeplearningbook.org
Hastie T, Tibshirani R, Friedman JH (2017) The elements of statistical learning: Data mining, inference, and prediction, 2nd edn, corrected at 12th printing 2017. Springer series in statistics, Springer, New York, NY
James G, Witten D, Hastie T, Tibshirani R (2017) An introduction to statistical learning: With applications in R, corrected at 8th printing. Springer texts in statistics, Springer, New York and Heidelberg and Dordrecht and London
Javanmard A, Soltanolkotabi M, Hassani H (2020) Precise tradeoffs in adversarial training for linear regression. CoRR abs/2002.10477, https://arxiv.org/abs/2002.10477
Kauermann G, Windmann M (2016) Mietspiegel heute: Zwischen Realität und statistischen Möglichkeiten. Wirtschafts- und sozialstatistisches Archiv : ASTA : eine Zeitschrift der Deutschen Statistischen Gesellschaft 10(4):205–223
Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A (2018) Towards deep learning models resistant to adversarial attacks. In: International Conference on Learning Representations, https://openreview.net/forum?id=rJzIBfZAb
Mehlhorn K (1988) Datenstrukturen und effiziente Algorithmen: Band 1: Sortieren und Suchen. Datenstrukturen und effiziente Algorithmen, Vieweg+Teubner Verlag, https://books.google.de/books?id=EmxIAQAAIAAJ
Mehrabi M, Javanmard A, Rossi RA, Rao A, Mai T (2021) Fundamental tradeoffs in distributionally adversarial training. CoRR abs/2101.06309, https://arxiv.org/abs/2101.06309
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830
Raybaut P (2009) Spyder-documentation. Available online at: pythonhosted.org
Schapire RE (1990) The strength of weak learnability. Machine Learning 5(2):197–227, https://doi.org/10.1007/BF00116037
Seabold S, Perktold J (2010) statsmodels: Econometric and statistical modeling with python. In: 9th Python in Science Conference
Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow IJ, Fergus R (2014) Intriguing properties of neural networks. In: Bengio Y, LeCun Y (eds) 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, http://arxiv.org/abs/1312.6199
Tsipras D, Santurkar S, Engstrom L, Turner A, Madry A (2019) Robustness may be at odds with accuracy. In: International Conference on Learning Representations, https://openreview.net/forum?id=SyxAb30cY7
Van Rossum G, Drake FL (2009) Python 3 Reference Manual. CreateSpace, Scotts Valley, CA
Windmann M, Kauermann G (2019) Mietspiegel für München 2019 - Statistik, Dokumentation und Analysen. Sozialreferat der Landeshauptstadt München
Wood SN (2017) Generalized Additive Models: An Introduction with R, Second Edition. Chapman & Hall / CRC Texts in Statistical Science, CRC Press, Portland, https://ebookcentral.proquest.com/lib/gbv/detail.action?docID=4862399
Funding
Open Access funding enabled and organized by Projekt DEAL.
Ethics declarations
Declaration of competing interest
The first author is involved in creating rental guides. This work consists of surveying rents of flats and doing statistical analyses on behalf of German cities and municipalities. He was not involved in creating the Munich rental guide. The second author was involved in the statistical analyses of the Munich rental guides in 2013, 2015, 2017, 2019, 2021, and 2023. The data used for the Munich rental guide 2019, which are used in this paper, have been collected by KANTAR (TNS Infratest). These involvements of both authors could be interpreted as a conflict of interest that may appear to influence the work reported in this paper.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
1.1 Tables
1.2 Algorithms
Algorithm 1 (Gradient boosting)
Algorithm 2 ((unregularized) XGBoost algorithm)
Algorithm 3 (Calculating the standard and adversarial risk in case of δ ∈ S1 using LOOCV)
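The algorithm listings themselves are not reproduced here. As a rough illustration of the idea behind Algorithm 3, the following Python sketch estimates a standard and an adversarial risk by leave-one-out cross-validation. It is not the authors' algorithm: the perturbation set S1 is not defined in this excerpt, so the sketch assumes a single numeric feature, an ordinary least squares fit, squared-error loss, and a perturbation \(\delta\) of that feature bounded by \(|\delta| \le \mathrm{eps}\). The function names `fit_line` and `loocv_risks` and the toy data are hypothetical.

```python
# Illustrative sketch only; not the paper's Algorithm 3.
# Assumptions: one feature, least squares, squared-error loss,
# and a feature perturbation delta with |delta| <= eps.

def fit_line(xs, ys):
    """Closed-form least squares fit of y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def loocv_risks(xs, ys, eps):
    """Return (standard risk, adversarial risk) estimated by LOOCV."""
    n = len(xs)
    standard = 0.0
    adversarial = 0.0
    for i in range(n):
        # Fit on all observations except the i-th one.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        residual = ys[i] - (a + b * xs[i])
        standard += residual ** 2
        # For a linear model, the worst case of
        # (y - a - b * (x + delta))^2 over |delta| <= eps
        # is (|residual| + |b| * eps)^2.
        adversarial += (abs(residual) + abs(b) * eps) ** 2
    return standard / n, adversarial / n


# Hypothetical toy data: floor space in m^2 and rent in EUR per m^2.
floor_space = [30.0, 45.0, 60.0, 75.0, 90.0, 120.0]
rent_per_sqm = [14.1, 12.9, 12.2, 11.4, 11.0, 10.1]

sr, ar = loocv_risks(floor_space, rent_per_sqm, eps=2.0)
print(f"standard risk: {sr:.4f}, adversarial risk: {ar:.4f}")
```

For a linear fit the worst-case perturbation has a closed form, which is why no search over \(\delta\) is needed in this sketch. For tree-based or neural models the inner maximization would typically have to be approximated, for example by evaluating a grid of admissible perturbations.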
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Trinkaus, O., Kauermann, G. Can machine learning algorithms deliver superior models for rental guides?. AStA Wirtsch Sozialstat Arch 17, 305–330 (2023). https://doi.org/10.1007/s11943-023-00333-x