1 Introduction

Rental guides for flats are an official instrument in the German rental market, see e.g. Kauermann and Windmann (2016). Based on regular surveys, city councils issue the average rent for flats given the flat’s characteristics, such as floor space, year of construction, and facilities like a well-equipped kitchen or a high-standard bathroom. Given available survey data, one is interested in constructing a prediction model for the rent per square meter given the input variables, which we subsequently call features. Denoting the features as \(x\) and the rent per square meter as \(y\), we are interested in finding a good prediction model

$$y=f(x)+\varepsilon$$
(1)

where \(\varepsilon\) is considered as noise, mirroring market price variation. We subsequently also refer to \(y\) as the response variable.

Before discussing the statistical approaches to tackle the function \(f(x)\) in (1), we want to add some more explanation about rental guides in Germany. They are used as an instrument that interferes with a landlord’s constitutional right to property (Art. 14 German Constitution). Therefore, a judge who denies a landlord a rent increase needs a solid basis for the judgment; rental guides aim to fulfill this purpose. This, in turn, seems to exclude “black box models” from the set of instruments that one should use. Moreover, there is a significant amount of vested interest involved in the process of rental guide creation, which of course increases the likelihood of data corruption. Hence, careful and detailed guidelines for rental guides are indispensable. With this being said, we can now approach rental guides formally as a prediction model. We also refer to Kauermann and Windmann (2016) and Fitzenberger and Fuchs (2017) for more details.

In times of increased usage of machine learning methods, we can consider Eq. (1) as a supervised learning setting. Hence, we may take advantage of the toolbox of available machine learning algorithms to train or estimate a suitable prediction model \(f\). A classical model, and in fact the model commonly used in practice (so far), is to build the prediction model \(f\) via regression techniques. One approach, developed by Aigner et al. (1993), is to fit a multiplicative-additive regression model in two stages. The more common strategy is to use an additive regression model, as discussed for instance in Fahrmeir et al. (2022). Regression models are one of the legally permitted models for rental guides, besides very simple models based on (contingency) tables, which we do not consider in this article. Instead, we go beyond the legally permitted models and explore more advanced machine learning tools as introduced, e.g., by James et al. (2017) and Hastie et al. (2017).

Regression models allow for interpretation due to their open-box character. In contrast, more complex machine learning models often make direct interpretation difficult: their extended flexibility comes with a “black-box” character, although it can yield higher prediction accuracy. The recent developments in machine learning suggest investigating their potential use for rental guides, which is the scope of this paper. We use classical regression models, including penalized regression, and contrast these with regression trees (Breiman 1984), ensemble models like averaging and boosting (Freund and Schapire 1999; Breiman 2001; Friedman 2001; Chen and Guestrin 2016), and neural networks (Goodfellow et al. 2016). A comparison is given in terms of model performance and predictive power.

Besides interpretability, the question of the robustness of these models comes to the foreground. We thereby focus on adversarial effects, see e.g. Biggio et al. (2013); Szegedy et al. (2014); Biggio and Roli (2018); Madry et al. (2018) or Tsipras et al. (2019). Adversarial effects are changes of the input variables to a machine learning model that cause the model to make wrong predictions. We use the concept of adversarial risk proposed in Javanmard et al. (2020); Mehrabi et al. (2021) to quantify the robustness of machine learning based rental guides and compare these to adversarial effects in regression models.

The paper is organized as follows: Sect. 2 shortly describes the data at hand. In Sect. 3 we introduce all prediction models, emphasize some differences and explain their essential hyper-parameters. Sect. 4 introduces the notion of standard and adversarial risk, which is then applied to the rental data. The results of our data analysis are given in Sect. 5 and a conclusion follows in Sect. 6.

2 Data and Software

As a database we make use of the Munich rental guide data from 2019, containing \(n=3024\) sampled apartments for which we include \(p=19\) selected features, see Windmann and Kauermann (2019, Table 2.7). The features and their respective descriptions are listed in Table 1. The features are apparently not independent, and their Pearson correlation coefficients are visualized in Fig. 1.

Table 1 List of features
Fig. 1 Pearson correlation coefficients between all selected features

For developing and coding we use Python 3, see Van Rossum and Drake (2009). As IDE (Integrated Development Environment) we use Spyder 4.5.0, see Raybaut (2009). For building statistical models we use the Statsmodels (Seabold and Perktold 2010) and scikit-learn (Pedregosa et al. 2011) APIs (Application Programming Interfaces) in their latest versions.

3 Prediction Models

In this section we shortly describe the different types of prediction models used in this paper. A short summary, including the models’ standard performance hyper-parameters, is provided in Table 3.

3.1 Regression Model

Given features \(\mathbf{x}=(x_{1},\ldots,x_{p})\), we predict the rent per square meter via the model

$$f(\mathbf{x};\hat{\boldsymbol{\beta}})=\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\ldots+\hat{\beta}_{p}x_{p}=\hat{\beta}_{0}+\sum_{j=1}^{p}x_{j}\hat{\beta}_{j},$$
(2)

where \(\hat{\beta}_{0}\) denotes the intercept and \(\hat{\beta}_{0},\ldots,\hat{\beta}_{p}\) are estimated by the least squares method, i.e. by minimizing the residual sum of squares (RSS)

$$\textit{RSS}(\boldsymbol{\beta})=\sum_{i=1}^{n}\left(y^{(i)}-\mathbf{x}^{(i)}\boldsymbol{\beta}\right)^{2},$$
(3)

where \(\mathbf{x}^{(i)}=(1,x_{1}^{(i)},\ldots,x_{p}^{(i)})\) and \(\boldsymbol{\beta}=(\beta_{0},\ldots,\beta_{p})\), with superscript (\(i\)) referring to the observed data and \(i=1,\ldots,n\). We also write \(\hat{f}(\mathbf{x}):=f(\mathbf{x};\hat{\boldsymbol{\beta}})\) and \(\hat{\mathbf{y}}:=\mathbf{x}\hat{\boldsymbol{\beta}}\) as shorthand. In the application, the model is extended by including non-linearities for the metrical covariates. To be explicit, we replace the linear fit by a spline-based fit using tools extensively described in Wood (2017).

Given our response variable \(y\), our inputs \(\mathbf{x}\) and our prediction model \(\hat{f}(\mathbf{x})\), the loss function for measuring errors between \(y\) and \(\hat{f}(\mathbf{x})\) is denoted by \(\ell(\mathbf{y},\hat{f}(\mathbf{x}))\). We use the quadratic loss

$$\ell(\mathbf{y},\hat{f}(\mathbf{x})):=\left(\mathbf{y}-\hat{f}\left(\mathbf{x}\right)\right)^{2},$$
(4)

for the applications in this paper.
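For illustration, the following is a minimal sketch of such a regression fit in Python, the software environment used in this paper. The data frame, the column names and the spline degrees of freedom are purely illustrative stand-ins for the Munich data, and patsy’s bs() B-spline basis is used as one possible way to realize the spline-based fit mentioned above; it is not necessarily the specification used by the authors.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the rental data; variable names are hypothetical.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "floor_space": rng.uniform(20, 160, n),
    "year": rng.integers(1900, 2019, n).astype(float),
    "good_kitchen": rng.integers(0, 2, n),
})
df["rent_sqm"] = (18.0 - 0.03 * df["floor_space"]
                  + 0.01 * (df["year"] - 1950)
                  + 1.5 * df["good_kitchen"]
                  + rng.normal(0, 2, n))

# Additive model: B-spline terms for the metric covariates, linear term for the
# binary feature; coefficients are estimated by least squares as in Eq. (3).
fit = smf.ols("rent_sqm ~ bs(floor_space, df=5) + bs(year, df=5) + good_kitchen",
              data=df).fit()
print(fit.params)
```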

3.2 Regression Trees

Basically all tree-based methods arise from partitioning the feature space into a set of hyperrectangles and then fitting a simple model (such as a constant) in each of the hyperrectangles; these piecewise fits together form the final prediction model. Amongst several types of algorithms, here we focus on the most common CART algorithm (Classification and Regression Trees). To be specific, regression trees, following Breiman (1984) and Hastie et al. (2017), are built by dividing the feature space \(\mathbb{R}^{p}\) into \(K\) distinct and non-overlapping hyperrectangles \(R_{1},\ldots,R_{K}\) that minimize the RSS given by

$$\sum_{k=1}^{K}\sum_{l\in R_{k}}\left(y^{(l)}-\hat{y}_{R_{k}}\right)^{2}$$
(5)

where \(\hat{y}_{R_{k}}\) is the mean response for the observations within the \(k\)th hyperrectangle, i.e. \(\hat{y}_{R_{k}}=\sum_{x^{(l)}\in R_{k}}y^{(l)}/\left|R_{k}\right|\). For computational runtime reasons one takes a top-down, greedy approach known as recursive binary splitting, see Breiman (1984). In order to perform recursive binary splitting, the algorithm needs a starting point. Therefore, we first need to find a feature \(x_{\tilde{p}}\) and a cutpoint \(s\) such that splitting the feature space into the regions \(\{x\;|\;x_{\tilde{p}}<s\}\) and \(\{x\;|\;x_{\tilde{p}}\geq s\}\) leads to the greatest possible reduction in RSS. We consider all features \(x_{1},\ldots,x_{p}\), and all possible values of the cutpoint \(s\) for each of the features, and then choose the feature and cutpoint such that the resulting tree has the lowest RSS. Once found, this feature is called the root of the tree. More precisely, for any \(\tilde{p}\) and \(s\), we define the pair of half-planes

$$R_{1}(\tilde{p},s):=\{x\;|\;x_{\tilde{p}}<s\}\text{ and }R_{2}(\tilde{p},s):=\{x\;|\;x_{\tilde{p}}\geq s\},$$
(6)

and we seek the value of \(\tilde{p}\) and \(s\) that minimize the equation

$$\sum_{i\;|\;x^{(i)}\in R_{1}(\tilde{p},s)}\left(y^{(i)}-\hat{y}_{R_{1}}\right)^{2}+\sum_{i\;|\;x^{(i)}\in R_{2}(\tilde{p},s)}\left(y^{(i)}-\hat{y}_{R_{2}}\right)^{2},$$
(7)

where \(\hat{y}_{R_{1}}\), \(\hat{y}_{R_{2}}\) are the mean responses for the observations in \(R_{1}(\tilde{p},s)\) and \(R_{2}(\tilde{p},s)\), respectively. The splitting procedure is then continued on each half-plane until a stopping criterion is reached. For instance, we may continue until no region contains more than five observations. This procedure often results in a very complex tree, which is likely to overfit. Therefore, a procedure called cost-complexity tree pruning, suggested by Breiman (1984), removes a sequence of subtrees. This procedure is applied after fully growing the tree and is described as follows: A large tree \(T_{0}\) is grown and the splitting process is only stopped when some minimum node size (say 5) is reached. Then this tree is “pruned” by finding a subtree \(T\subsetneq T_{0}\). Let \(\left|T\right|\) denote the number of terminal nodes in tree \(T\) and let \(\textit{RSS}(T)\) be the residual sum of squares given in Eq. (5) for tree \(T\). We define the criterion

$$C_{\alpha}(T)= \textit{RSS}(T)+\alpha\left|T\right|.$$

For given \(\alpha\) we aim to find the subtree \(T_{\alpha}\subseteq T_{0}\) which minimizes \(C_{\alpha}(T)\). The tuning parameter \(\alpha\geq 0\) governs the tradeoff between the tree size and its goodness of fit to the data. Large values of \(\alpha\) result in smaller trees \(T_{\alpha}\), and conversely for smaller values of \(\alpha\). As the notation suggests, with \(\alpha=0\) the solution is the full tree \(T_{0}\). For each \(\alpha\) one can show that there is a unique smallest subtree \(T_{\alpha}\) that minimizes \(C_{\alpha}(T)\). To find \(T_{\alpha}\) we use weakest link pruning, that is, we successively collapse the internal node that produces the smallest per-node increase in RSS, and continue until we produce the single-node (root) tree. This gives a (finite) sequence of subtrees, and one can show this sequence must contain \(T_{\alpha}\), see Breiman (1984); Hastie et al. (2017). Estimation of \(\alpha\) is achieved using cross-validation. The final prediction model is then contained in the final tree \(T_{\hat{\alpha}}\).
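The following sketch shows how this pruning scheme can be carried out with scikit-learn, which exposes \(\alpha\) as the hyper-parameter ccp_alpha. The synthetic data and the 5-fold cross-validation grid are illustrative assumptions, not the exact setup of the paper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the rental data.
X, y = make_regression(n_samples=500, n_features=19, noise=10.0, random_state=0)

# Candidate alphas from weakest-link pruning of the fully grown tree T_0.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Choose alpha by cross-validation (squared-error loss), then refit T_alpha.
cv_scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                             X, y, cv=5, scoring="neg_mean_squared_error").mean()
             for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, tree.get_n_leaves())
```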

3.3 Random Forests

If we consider a regression tree as our statistical model, bagging regression trees, as proposed by Breiman (1996), is an aggregation of \(B\) bootstrap samples

$$\left(\mathbf{x}^{*1},\mathbf{y}^{*1}\right),\left(\mathbf{x}^{*2},\mathbf{y}^{*2}\right),\ldots,\left(\mathbf{x}^{*B},\mathbf{y}^{*B}\right),$$

each sampled randomly with replacement from the original training data and all having the same size. Then for each bootstrap sample \(\mathbf{x}^{*b}\), a corresponding bootstrap replication regression tree \(T^{*b}\) is grown, for \(b=1,\ldots,B\). All \(B\) regression trees \(T^{*1},T^{*2},\ldots,T^{*B}\) are then combined into a final prediction model \(\hat{f}_{\text{bag}}(\mathbf{x})\), given by the arithmetic mean of the predictions obtained from the \(B\) trees, i.e.

$$\hat{f}_{\text{bag}}^{B}(\mathbf{x})=\frac{1}{B}\sum_{b=1}^{B}T^{*b}(\mathbf{x}).$$
(8)

Random forests, see e.g. Breiman (2001) and James et al. (2017), are basically an extension of bagging. They fit regression trees \(\hat{T}^{*b}\) as base regressors on each of the \(B\) bootstrap samples, \(b=1,\ldots,B\). However, when splitting each node during the construction of a tree (as described in Sect. 3.2), the best split is found from a random subset of the features (see hyper-parameter max_features in Table 3). Hence, instead of considering all \(p\) features \(x_{1},\ldots,x_{p}\), a split is chosen from a random sample of \(m<p\) features as split candidates. The final prediction model is then given by

$$\hat{f}_{\text{rf}}^{B}(\mathbf{x})=\frac{1}{B}\sum_{b=1}^{B}\hat{T}^{*b}(\mathbf{x}),$$
(9)

where each tree \(\hat{T}^{*b}\), \(b=1,\ldots,B\), is grown with the random feature subsampling described above.

These two sources of randomness decorrelate the individual trees and thereby typically decrease the prediction error of random forest prediction models. This will also be visible in our experiments later in the paper.
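A minimal sketch contrasting the two estimators in scikit-learn is given below; the synthetic data, the number of trees and the choice max_features=6 are illustrative assumptions only (Table 3 lists the hyper-parameters actually used).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the rental data.
X, y = make_regression(n_samples=500, n_features=19, noise=10.0, random_state=0)

# Bagging: B bootstrap trees, all p features considered at every split.
bag = BaggingRegressor(n_estimators=200, random_state=0)

# Random forest: additionally restricts every split to a random subset of m < p features.
rf = RandomForestRegressor(n_estimators=200, max_features=6, random_state=0)

for name, model in [("bagging", bag), ("random forest", rf)]:
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(name, round(mse, 2))
```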

3.4 Boosting

A weak learner is defined to be a regression model that is only slightly correlated with the true prediction. Boosting, originally proposed by Schapire (1990) for classification tasks, addresses the question of whether a set of weak learners can be combined to form a strong learner. As in the supervised learning problems above, the goal is to find a function \(\hat{f}\) that best predicts the output variable \(\mathbf{y}\) from the input variables \(x_{1},\ldots,x_{p}\). Let \(\ell(\mathbf{y},f(\mathbf{x}))\) be the \(\ell_{2}\)-loss. Then we want to minimize this loss formally by

$$\hat{f}=\text{argmin}_{f(\mathbf{x})}\mathbb{E}_{\mathbf{y},\mathbf{x}}[\ell(\mathbf{y},f(\mathbf{x}))].$$
(10)

The idea of boosting is to predict the response \(y\) by fitting an additive model

$$\hat{f}(\mathbf{x})=\sum_{m=1}^{M}\gamma_{m}t_{m}(\mathbf{x})$$
(11)

for \(M\in\mathbb{N}\), weights \(\gamma_{m}\in\mathbb{R}\) and \(t_{m}(\mathbf{x})\in T\), where \(T\) denotes the class of base or weak learners, e.g. the class of regression trees. This is optimized in a forward stage-wise manner, meaning that at each stage one corrects the errors of its predecessor.

3.4.1 Steepest Descent

Unfortunately, choosing the best function \(t\) at each step for an arbitrary loss function \(\ell\) is in general a computationally infeasible optimization problem. Therefore, we restrict our approach to a simplified version of the problem. There are basically two methods: The very first algorithm, called AdaBoost, uses a specific exponential loss function and small stumps, which are usually smaller than the trees built with gradient tree boosting (gradie), as explained in the next section, see also Freund and Schapire (1996). In this paper we use a more flexible way of updating the model relative to its predecessors, called gradient descent. It is a first-order iterative optimization algorithm for finding a local minimum of a differentiable convex loss function by moving in the opposite direction of the gradient at the current point. This is the direction of the steepest descent (Cauchy 1847), which is given by the negative gradient \(-\mathbf{g}(f)\) of a function \(f\). The gradient \(\mathbf{g}(f)\) of a real-valued, \(p\)-dimensional function \(f\) is defined as

$$\mathbf{g}(f):=\frac{\partial f}{\partial x_{1}}e_{1}+\cdots+\frac{\partial f}{\partial x_{p}}e_{p}=\begin{pmatrix}\frac{\partial f}{\partial x_{1}}\\ \vdots\\ \frac{\partial f}{\partial x_{p}}\end{pmatrix},$$
(12)

where \(e_{1},\ldots,e_{p}\) denote the unit vectors.

3.4.2 Gradient Tree Boosting

Gradient Tree Boosting (Breiman 1997; Friedman 2001) uses regression trees of fixed size as base learners. It specializes the approach above to the case where the base learner \(t_{m}(\mathbf{x})\) is a \(J_{m}\)-terminal leaf regression tree. More precisely, let \(F\) be the set of regression trees. Each tree comes with a respective partition of the feature space into regions \(R_{j}\), \(j=1,2,\ldots,J\), induced by the terminal nodes of the tree. In this case, at the \(m\)th step one fits a regression tree to the pseudo-residuals of the previous model, obtained via steepest descent as

$$r_{m}^{(i)}:=-g_{f_{m-1}}^{(i)}=-\left[\frac{\partial\ell(y^{(i)},f(\mathbf{x}^{(i)}))}{\partial f(\mathbf{x}^{(i)})}\right]_{f(\mathbf{x}^{(i)})=f_{m-1}(\mathbf{x}^{(i)})},$$
(13)

where \(f_{m-1}\) represents the combined prediction from the ensemble of trees up to the \((m-1)\)th iteration of the gradient boosting process. These pseudo-residuals form a new data set \(\{(\mathbf{x}^{(i)},r_{m}^{(i)})\}_{i=1}^{n}\) on which \(t_{m}\) is fitted. The output \(t_{m}(\mathbf{x})\) for input \(\mathbf{x}\) can then be written as the sum:

$$t_{m}(\mathbf{x})=\sum_{j=1}^{J_{m}}b_{jm}\mathbf{1}_{R_{jm}}(\mathbf{x}),$$
(14)

where \(b_{jm}\) is the mean of the pseudo-residuals \(r_{m}^{(i)}\) of the observations falling into region \(R_{jm}\), and \(\mathbf{1}_{R_{jm}}(\mathbf{x})\) denotes the indicator function. We then replace the \(b_{jm}\)’s by calculating the optimal one-dimensional \(\gamma_{jm}\)’s in each of the tree’s regions \(R_{jm}\). Hence we write

$$\begin{aligned}\hat{f}_{m}(\mathbf{x})&=\hat{f}_{m-1}(\mathbf{x})+\sum_{j=1}^{J_{m}}\gamma_{jm}\mathbf{1}_{R_{jm}}(\mathbf{x}),\\ \gamma_{jm}&=\text{argmin}_{\gamma}\sum_{x^{(i)}\in R_{jm}}\frac{1}{2}\left(y^{(i)}-\left(\hat{f}_{m-1}(\mathbf{x}^{(i)})+\gamma\right)\right)^{2}.\end{aligned}$$
(15)

To control overfitting we limit the number of trees (parameter: n_estimators) and apply a further shrinkage by \(\nu\), the learning rate (parameter: learning_rate), which can be plugged into the update rule (15) as follows:

$$\hat{f}_{m}(\mathbf{x})=\hat{f}_{m-1}(\mathbf{x})+\nu{\cdot}\sum_{j=1}^{J_{m}}\gamma_{jm}\mathbf{1}_{R_{jm}}(\mathbf{x}),\quad 0<\nu\leq 1.$$
(16)

3.4.3 Stochastic Gradient and Extreme Boosting

Another way to apply gradient boosting is to fit the trees only on subsamples. This is called Stochastic Gradient Boosting (Friedman 2001). The size of each tree can be controlled either by setting the tree depth via max_depth or by setting the number of leaf nodes via max_leaf_nodes, see Table 3.

In this paper we use an implementation of a gradient tree boosting algorithm (gradie) as described in Pedregosa et al. (2011). We further use extreme boosting (xgbreg) (Chen and Guestrin 2016), which uses a second-order Taylor approximation of the loss function to weight the leaves inside a tree, see Eq. (31) in the Appendix. The idea was proposed by Friedman et al. (2000). For parameter details see Table 3.
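As a minimal sketch (with illustrative hyper-parameter values, not those of Table 3), gradient tree boosting with shrinkage \(\nu\) and subsampling can be set up as follows; the xgboost call is shown only as a comment since it requires the separate xgboost package.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the rental data.
X, y = make_regression(n_samples=500, n_features=19, noise=10.0, random_state=0)

# Stochastic gradient tree boosting: nu = learning_rate, M = n_estimators,
# tree size via max_depth, and subsample < 1 for the stochastic variant.
gbr = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                max_depth=3, subsample=0.8, random_state=0)
mse = -cross_val_score(gbr, X, y, cv=5, scoring="neg_mean_squared_error").mean()
print(round(mse, 2))

# Extreme boosting would be configured analogously, e.g.:
# from xgboost import XGBRegressor
# xgb = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=3, subsample=0.8)
```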

3.5 Nearest Neighbours

The principle behind the \(k\)-Nearest-Neighbour method (KNN) used for regression, see Altman (1992), is to find the \(k\) observations that are closest in distance to a new point \(\mathbf{x}^{(0)}\); this neighbourhood is denoted by \(\mathcal{N}^{(0)}\). Then \(f(\mathbf{x}^{(0)})\) is estimated using the average of all the training responses in \(\mathcal{N}^{(0)}\), namely

$$\hat{f}(\mathbf{x}^{(0)})=\frac{\sum_{x^{(i)}\in\mathcal{N}^{(0)}}y^{(i)}}{\left|\mathcal{N}^{(0)}\right|}.$$
(17)

The number of neighbours can be a user-defined constant (\(k\)-nearest neighbour learning) or vary based on the local density of points (radius-based neighbour learning). The distance can, in general, be any metric measure; the standard Euclidean distance is the most common choice. Sometimes it makes sense to assign more weight to the nearer neighbours. A common technique to achieve this is to define weights by the inverse of the distance between \(\mathbf{x}^{(0)}\) and a neighbour \(\mathbf{x}^{(k)}\), doing so for all \(\mathbf{x}^{(k)}\in\mathcal{N}^{(0)}\), see also Cunningham and Delany (2020).

Following Pedregosa et al. (2011), the choice of a suitable nearest neighbour algorithm depends on the size of the data set and of the feature space. For data sets that are small in both respects, sample size and number of features, one usually uses the brute-force algorithm, which computes the distances between all pairs of points in the data set. In our experiment we use this method, since the data set is relatively small. For details on the hyper-parameters see Table 3. As the amount of data grows, calculating all these distances becomes computationally infeasible. Then a method called k-d tree can be applied, see Mehlhorn (1988); Pedregosa et al. (2011); Bentley (1975) for details.
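A minimal scikit-learn sketch of this estimator is given below; standardizing the features before computing Euclidean distances and the grid of neighbour counts are our own illustrative choices, not necessarily those used in the paper.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the rental data.
X, y = make_regression(n_samples=500, n_features=19, noise=10.0, random_state=0)

# Brute-force neighbour search with inverse-distance weights; features are
# standardized because the Euclidean distance mixes different units.
knn = make_pipeline(StandardScaler(),
                    KNeighborsRegressor(weights="distance", algorithm="brute"))
grid = GridSearchCV(knn,
                    {"kneighborsregressor__n_neighbors": [3, 5, 10, 20]},
                    scoring="neg_mean_squared_error", cv=5).fit(X, y)
print(grid.best_params_)
```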

3.6 Neural Networks

In this paper we focus on multilayer perceptrons (mlpreg) (Goodfellow et al. 2016) as a neural network regression model. Since the data flows only in the forward direction, this model is also known as a feed-forward neural network. The network defines a mapping \(\hat{y}=f(\mathbf{x};\hat{\boldsymbol{\beta}})\) with \(\hat{y}\) as predictor for given input variables \(\mathbf{x}\), where the values of the parameters \(\boldsymbol{\beta}\) are learned (estimated) from the data. Typically, one aims to minimize the squared prediction error \(E((\hat{y}-y)^{2})\). Such a network is usually represented by a composition of many different functions. For example, we might have three functions \(f^{[1]}\), \(f^{[2]}\), and \(f^{[3]}\) connected to form \(f(\mathbf{x})=f^{[3]}(f^{[2]}(f^{[1]}(\mathbf{x})))\). The function \(f^{[1]}\) is called the first layer, \(f^{[2]}\) is called the second layer and so on. The overall length of the chain gives the depth of the model. The final layer is called the output layer and the layers before the final layer are so-called hidden layers. The hidden layer functions themselves are multivariate but simple in their structure. They incorporate a weighted sum of the input combined with an activation function. One can write the \(j\)-th component of the functions as \(f^{[k]}_{j}(\mathbf{x}^{[k-1]};\mathbf{w}_{k,j},b_{k,j})=\phi(\mathbf{x}^{[k-1]^{T}}\mathbf{w}_{k,j}+b_{k,j})\), where \(\mathbf{x}^{[k-1]}\) is the (multivariate) output of the previous layer with \(\mathbf{x}^{[0]}=\mathbf{x}\). The weights \(\mathbf{w}_{k,j}\) and the intercepts \(b_{k,j}\) (the so-called biases) for \(k=1,2,\ldots\) are the parameters, which need to be determined in a data-driven way. The set of all these parameters defines \(\boldsymbol{\beta}\) and leads to the trained model \(f(\mathbf{x};\hat{\boldsymbol{\beta}})\). The function \(\phi(.)\) is a known activation function. With this setup we can now find optimal weights such that the prediction error is minimized. To do so one can use cross-validation, so that the model is trained on one part of the data and tested on the other part. We do not go into further technical detail here, since the field of neural networks has become vast and numerous introductory texts are available. We refer to Goodfellow et al. (2016) or Hastie et al. (2017) for more details, or to Borth et al. (2023).
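A minimal sketch of such a feed-forward network in scikit-learn is given below; the two hidden layers of 64 and 32 units, the ReLU activation and the standardization are illustrative assumptions, not the architecture reported in Table 3.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the rental data.
X, y = make_regression(n_samples=500, n_features=19, noise=10.0, random_state=0)

# Two hidden layers; the weights w and biases b (collected in beta) are learned
# by minimizing the squared prediction error, phi is the ReLU activation.
mlp = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(64, 32), activation="relu",
                                 max_iter=2000, random_state=0))
mse = -cross_val_score(mlp, X, y, cv=5, scoring="neg_mean_squared_error").mean()
print(round(mse, 2))
```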

4 Performance Measures

After having introduced the different models which will be used for constructing rental guides, we need to define the performance measures applied subsequently. We will thereby look not only at the prediction error but also at the robustness, utilizing the concept of standard and adversarial risk introduced in Javanmard et al. (2020); Mehrabi et al. (2021), here in the case of the \(\ell_{2}\)-loss.

4.1 Standard Risk

We define the standard risk

$$\mathsf{SR}(f):=\sqrt{\mathbb{E}_{y,x}\left[(y-f(\mathbf{x}))^{2}\right]}$$
(18)

to be the prediction loss of an estimator \(f\) on an (uncorrupted) test data point \(\mathbf{x}\), where \((\mathbf{x},y)\sim\mathcal{P}\) is drawn from some common law \(\mathcal{P}\). An empirical estimate for \(\mathsf{SR}\) is given by

$$\widehat{\mathsf{\mathsf{SR}}}(\hat{f})=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)}-\hat{f}_{-i}\left(x^{(i)}\right)\right)^{2}},$$
(19)

where \(\hat{f}_{-i}\) is the prediction model fitted on data omitting the \(i\)th observation.
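The leave-one-out estimate (19) can be computed directly, as in the following sketch; the linear model and the synthetic data are placeholders for the actual prediction models and the Munich data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for the rental data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Leave-one-out predictions f_{-i}(x^{(i)}), then the root mean squared error of Eq. (19).
pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
sr_hat = np.sqrt(np.mean((y - pred) ** 2))
print(round(sr_hat, 2))
```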

4.2 Adversarial Risk

An adversarially attacked model \(\hat{f}_{\mathcal{S}}\) is a prediction model \(\hat{f}\) fed with a corruption of the data coming from a predefined perturbation set \(\mathcal{S}:=\{\delta\in\mathbb{R}^{p}:\|\delta\|_{\ell_{2}}\leq\epsilon\}\subset\mathbb{R}^{p}\). The adversary has the power to perturb each data point \(x^{(i)}\) by adding an element of \(\mathcal{S}\), i.e. by at most \(\epsilon\) in \(\ell_{2}\)-norm. The main idea of assessing the adversarial risk is to measure the robustness of the model: we quantify how much the prediction changes if a single input variable or multiple input variables are false, i.e. perturbed from their original values.

We define the adversarial risk as

$$\begin{aligned}\mathsf{AR}(f,{\cal S}):=\sqrt{\mathbb{E}_{y,x}\left[\max_{\boldsymbol{\delta}\in\mathcal{S}}(y-f(\mathbf{x}+\boldsymbol{\delta}))^{2}\right]}\end{aligned}$$
(20)

which is the expected prediction loss of a predictor \(f\) on an adversarially corrupted data point according to some attack or mistake model. Stated differently, the adversarial risk measures how adversely the predictor \(f\) can perform in prediction when it is fed with adversarially corrupted data.

To motivate the adversarial attack in more detail, we first have to consider our two types of features, namely metrically scaled and binary (categorical) variables. For noising a metric feature \(x_{j}\), say, a value \(\delta\in\mathbb{R}\) is added to \(x_{j}\) in such a way that the sum \((x_{j}+\delta)\) is still inside the range of values of \(x_{j}\). This means the sum ranges between the minimum and the maximum realization of \(x_{j}\). For binary features \(x_{l}\) this works analogously, except that the only possible values are 0 and 1. Hence, in the binary case, the perturbation is achieved by \(x_{l}+(-1)^{x_{l}}\). This defines the perturbation set \({\cal S}\). We additionally restrict the number of perturbed variables and define the sets \(\mathcal{S}_{k}\subset{\cal S}\) such that each element of \({\cal S}_{k}\) has exactly \(k\) elements which are unequal to zero. In other words, in \({\cal S}_{k}\) we perturb exactly \(k\) features and leave the remaining features unchanged. This leads to the adversarial risk

$$\mathsf{\mathsf{AR}}(f,k):=\sqrt{\mathbb{E}_{y,x}\left[\max_{\boldsymbol{\delta}\in\mathcal{S}_{k}}(y-f(\mathbf{x}+\boldsymbol{\delta}))^{2}\right]}$$
(21)

Apparently, \(\mathsf{AR}(f,{\cal S})=\max_{k=1,\ldots,p}\mathsf{AR}(f,k)\).

We estimate (21) through

$$\widehat{\mathsf{AR}}(\hat{f},k):=\sqrt{\max_{\boldsymbol{\delta}\in\mathcal{S}_{k}}\left(\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)}-\hat{f}_{-i}\left(x^{(i)}+\boldsymbol{\delta}\right)\right)^{2}\right)},$$
(22)

where \(\hat{f}_{-i}\) is the prediction model fitted on data excluding the \(i\)th observation.
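A much simplified sketch of the estimate (22) for \(k=1\) is given below. To keep it short, the leave-one-out fits \(\hat{f}_{-i}\) are replaced by 5-fold cross-validation fits, the maximization over \(\boldsymbol{\delta}\in\mathcal{S}_{1}\) is approximated by a coarse grid of shifts per feature, and perturbed values are clipped to the observed feature range; none of these shortcuts are claimed to match the exact procedure used in the paper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the rental data.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
n, p = X.shape
x_min, x_max = X.min(axis=0), X.max(axis=0)

def cv_rmse_with_shift(delta):
    """Root mean squared error when every held-out point is shifted by delta
    (clipped to the observed feature range, as required for metric features)."""
    sq_errs = []
    for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[train], y[train])
        X_pert = np.clip(X[test] + delta, x_min, x_max)
        sq_errs.append((y[test] - model.predict(X_pert)) ** 2)
    return np.sqrt(np.mean(np.concatenate(sq_errs)))

# k = 1: perturb one feature at a time, scan a coarse grid of shifts, and take
# the worst case per feature (cf. Eq. (23)); the overall maximum estimates AR(f, 1).
ar_per_feature = []
for j in range(p):
    shifts = np.linspace(x_min[j] - x_max[j], x_max[j] - x_min[j], 11)
    worst = 0.0
    for s in shifts:
        delta = np.zeros(p)
        delta[j] = s
        worst = max(worst, cv_rmse_with_shift(delta))
    ar_per_feature.append(worst)

print("estimated AR(f, k=1):", round(max(ar_per_feature), 2))
```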

5 Results

In this section we compare the performance of the models introduced above with respect to the standard and adversarial risk, respectively. We abbreviate the different models as follows: linear regression model (linear), decision tree (decisi), bagging tree (baggin), random forest (random), gradient boosting (gradie), extreme-boosting tree (xgbreg), \(k\)-neighbours (\(k\)neigh), multi-layer perceptron (mlpreg).

5.1 Standard Risk

The standard risk for the different models is shown in Fig. 2, see also Table 2. For the standard risk we have, in descending order, decisi (2.70), baggin (2.61), \(k\)neigh (2.56), random (2.50), mlpreg (2.45), linear (2.44), gradie (2.42), xgbreg (2.40). Hence, the best results are obtained by the boosting approaches xgbreg and gradie, followed by the classical linear model and the neural network mlpreg. Then we have \(k\)neigh and random with almost equal results. At the end we have baggin and decisi. These last two models are also almost equal, since baggin is fitted using decisi as base regressors. It is reassuring to see that the “open box” regression model performs quite well.

Fig. 2 Standard risk (\(y\)-axis) of all models (\(x\)-axis)

5.2 Adversarial Risk

5.2.1 Model Robustness (global Robustness)

We describe the results for \(\boldsymbol{\delta}\in\mathcal{S}_{k}\) with \(k=1\). In other words, a single input variable is perturbed. The highest adversarial risk is obtained by baggin (4.17), followed by xgbreg (3.63), decisi (3.62), gradie (3.26), \(k\)neigh (3.24) and linear (3.22). The lowest adversarial risks are given by mlpreg (3.16) and random (3.10), see Fig. 3. Interestingly enough, linear regression performs well, while some models with small standard risk suffer from larger adversarial risk.

Fig. 3 Adversarial risk (\(y\)-axis) for \(\mathcal{S}_{1}\) of all models (\(x\)-axis)

To get a sense of the robustness of the algorithms against adversarial attacks, we also calculate the robustness measure \(\rho:=\mathsf{AR}/\mathsf{SR}\in\mathbb{R}\), the ratio between the adversarial risk and the standard risk. Usually we expect the adversarial risk to be higher than the standard risk for corrupted data. This means that we expect \(\rho\) to always be greater than or at least equal to 1.

For \(\rho\) we have, in descending order, baggin (1.59), xgbreg (1.51), gradie (1.35), decisi (1.34) and linear (1.32). The models mlpreg (1.29) and \(k\)neigh (1.27) showed almost equal robustness. Finally, random (1.23) performed best in terms of robustness, see Fig. 4 and Table 2.

Fig. 4 Ratio \(\rho\) (\(y\)-axis) of all models (\(x\)-axis)

Table 2 \(\mathsf{SR}\), \(\mathsf{AR}\) and \(\rho\)

5.2.2 Feature Robustness (local Robustness)

We can take a closer look at the adversarial risk and see how perturbing the different features contributes to it. That is, we look at the quantities

$$\begin{aligned}\mathsf{AR}(f,\boldsymbol{\delta}):=\sqrt{\mathbb{E}_{y,x}\left[(y-f(\mathbf{x}+\boldsymbol{\delta}))^{2}\right]}\end{aligned}$$
(23)

for all \(\boldsymbol{\delta}\in{\cal S}_{k}\). This helps to identify which features cause a lack of robustness of the prediction. We do this for every feature and every learning model. This approach can also serve as a robust feature importance measure. Not surprisingly, we see that the different features have different effects on the models’ adversarial risk, see Fig. 5. We can also observe that even if one model performs better in terms of a lower adversarial risk, it may have a higher ratio \(\rho\), i.e. less robustness; compare, for example, \(k\)neigh and mlpreg in Figs. 5 and 6 and Table 2.

Our experiment also shows that in total, there is no model which outperforms all the other models in terms of standard and adversarial risk.

Fig. 5 Adversarial risk of all models for \(\boldsymbol{\delta}\in\mathcal{S}_{1}\)

Fig. 6 Robustness of all models for \(\boldsymbol{\delta}\in\mathcal{S}_{1}\)

6 Conclusions

This paper studies the use and performance of machine learning regression models when used for rental guides. We also study the models’ behaviour under adversarial data corruption using data from the Munich rental guide. In Sect. 5 we compared different machine learning regression models which seem most suitable for explaining rental prices for flats in German rental guides. Even though the boosting algorithms showed prediction performance comparable to classical prediction models like linear regression, their black-box character remains: rental guides need differentiated and reliable (robust) information about which criteria influence the net rental price, in both the positive and the negative direction.

In Sects. 5.1 and 5.2 we showed the standard and adversarial performance of the given models. In Sects. 4.2 and 5.2.1 we addressed adversarial regression by building prediction models on top of corrupted data. We also compared the (local) standard and adversarial risk of features which are highly influential on a model’s performance (see Fig. 5). This means that a model \(\hat{f}\) may have a lower adversarial risk than another model \(\hat{f}^{\prime}\) in terms of its predictive power, but \(\hat{f}^{\prime}\) may still be more robust than \(\hat{f}\) when judged by the robustness measure \(\rho=\mathsf{AR}/\mathsf{SR}\), see for example \(k\)neigh and mlpreg in Figs. 5 and 6 and Table 2. Local examples, i.e. feature-wise comparisons, can be found in Table 4. For example, within the feature “specialEquipment”, baggin shows a higher adversarial risk but a lower robustness measure \(\rho\) than xgbreg.

Concretely, in terms of both the risk measures \(\mathsf{SR}\) and \(\mathsf{AR}\) on the one hand and the robustness measure \(\rho\) on the other hand, random, mlpreg, linear, and gradie showed good predictive power with low adversarial risk and high robustness against adversarial attacks, see Fig. 5. The model \(k\)neigh showed an overall moderate performance and xgbreg showed a high adversarial risk.

Both decisi and baggin showed higher standard and adversarial risk than their competitors, but decisi has a robustness measure similar to that of xgbreg. The worst performance is obtained by baggin, with the second highest standard risk and the highest adversarial risk, resulting in the highest robustness measure. Additionally, it should be mentioned that several features are highly vulnerable when considered within the models baggin and decisi, see Table 2.

In summary, we observe considerable volatility of the models when applied to (slightly) corrupted data, as explained in Sect. 5.2.2. This conflicts with the twin demands that prediction models be both well explainable and robust. It is therefore recommended to search for a model minimizing the trade-off between predictive performance and adversarial risk. This helps to reduce the danger of generating machine learning regression models with misleading explanations. We therefore conclude that the best way to overcome this issue is to use classic regression models. Classic regression models may be supported by machine-learning tools to make them more effective, but machine learning models should not be used as a replacement for classic regression models in the rental guide context.