1 Introduction

Real estate, one of the indispensable assets of the economy, directly affects many areas, from the financial to the legal systems. For the development of economies and sustainable growth, the most accurate pricing of real estate is crucial. Many economic activities, such as real estate trading, mortgage loans, investment, balance sheet and taxation, depend on real estate prices [1, 2]. Inaccuracy in property price predictions may result in unsuitable investment decisions [3].

The biggest obstacle to accurate real estate valuation is the heterogeneous structure of real estate data [4]. Highly variable characteristics of the real estate market make it challenging to predict real estate values. Real estate datasets include categorical (mostly binary and nominal) features as well as discrete and continuous (interval-scale) features. For example, while the number of rooms is a discrete measurement on the interval scale, the price of a property is a continuous measurement on the interval scale. On the other hand, whether a property is furnished is a binary categorical observation, and property type is a nominal categorical observation.

As detailed in Sect. 2, there are many different approaches to the property value/price prediction problem in the literature. The hedonic model used for price prediction is based on multiple linear regression [5]. Machine learning (ML) methods such as support vector machines (SVMs) [6] and artificial neural networks (ANNs) [7] have been proposed to improve the hedonic model’s performance. Fuzzy methods have also been introduced for property price prediction problems. The most frequently used fuzzy approaches are fuzzy neural networks (FNNs) and the adaptive neuro-fuzzy inference system (ANFIS) [8, 9]. Hybrids of these methods with clustering approaches are employed to segment the real estate market empirically for improved price prediction.

The main shortfall of the methods in the literature is that they are not designed to process categorical features in a way that preserves their categorical nature [10, 11]. Although there are attempts to adapt methods to categorical features by using different encoding strategies at the data preprocessing step, the gain in prediction performance over the standard implementation is unclear [10]. Most methods process categorical features as discrete measurements on the interval scale. If not paired with suitable approaches, disregarding the features’ categorical nature in this way results in a significant loss of accuracy, which translates into economic loss through sub-optimal decisions based on the predictions. Another issue with the recent literature is the lack of reproducibility, because the software code for implementing the proposed methods in practice is not provided. This limits the applicability of the methods in real estate price prediction and leaves practitioners with sub-optimal methods.

To tackle these problems, we aim to develop a method that handles both interval-scale and categorical features and yields a lower magnitude of prediction error and lower variability in prediction error than the frequently used methods. We propose a fuzzy regression functions (FRF) approach designed to handle categorical measurements along with interval-scale observations. The proposed approach involves a step that assigns a membership value to each property to represent its degree of belonging to each spatial segment, based on hierarchical clustering with the generalized Minkowski distance of [12]. This feature of our approach relaxes the requirement of expert knowledge for segmentation and is able to handle categorical information. The generalized Minkowski distance is also employed in other steps of FRFs to handle categorical features accurately. We develop all the computer code required to implement the proposed approach in R and make it available online for immediate use.

The proposed FRF approach, namely hierarchical fuzzy regression functions (HFRFs), is applied to real estate datasets from six different markets worldwide to predict real estate prices. The performance of the proposed HFRFs is assessed in terms of the magnitude of the absolute prediction error and the amount of variability in the prediction errors. We benchmark the performance of HFRFs against ANFIS, deep neural networks (DNNs), linear regression, and SVMs using the six real estate datasets. Since a DNN is a multi-layered version of an ANN, we do not consider ANNs separately in this study. HFRFs demonstrate a considerable improvement in real estate prediction performance. The computational cost of the proposed approach is assessed and found suitable for larger-scale valuation systems. The contributions of this study are: i) An FRF approach that handles categorical and interval-scale features accurately is proposed to predict real estate values. HFRFs are not limited to real estate datasets; they are readily applicable to any regression problem whose dataset has categorical and continuous features. ii) We relax the requirement of expert knowledge for segmentation by using hierarchical clustering that captures segmentation information straightforwardly. iii) A comparison of HFRFs, SVMs, ANFIS, linear regression, and DNNs for real estate value prediction is presented. Thus, the performance of the benchmark methods is also evaluated in this study. iv) Owing to the suitable computational cost of HFRFs, we provide practitioners with a method that can be applied to online valuation systems or large-scale prediction problems.

The rest of the article is organized as follows: Sect. 2 is devoted to the literature review. Section 3 presents datasets, descriptive analysis results, the details of FRFs and the proposed HFRF approach, and the goodness-of-fit measures. Section 4 demonstrates the performance and computational cost comparison of the proposed HFRF approach with benchmark methods. Section 5 concludes the article with discussions and recommendations.

2 Literature review

In the real estate market, a common approach to reducing data variability is to cluster the observations into more homogeneous sub-markets [13, 14]. The idea that housing markets should be divided into clusters according to certain characteristics and that these clusters should be included in the price prediction process is an approach advocated in the real estate literature [4, 15, 16]. Performing market segmentation prior to pricing model estimation avoids aggregation bias and ensures accurate parameter estimation and strong model fit [17, 18].

In the real estate market, a sub-market is a cluster in which pricing and related property characteristics differ from another sub-market [15]. It is argued that each sub-market should have its own unique price models [19]. Considering that the coefficient estimates will be biased in a single model representing the whole market, the model estimates for each sub-market are expected to have better pricing performance than an aggregated model [20]. Identifying sub-markets not only improves price prediction accuracy but also helps researchers better model temporal and spatial changes in prices, helps lenders accurately assess credit risk, and reduces home buyers’ search costs [4].

To identify housing market segmentation, two basic approaches are used: a priori information (experience-oriented) classification and data-driven methods. In a priori classification, sub-markets are constructed using expert insights into spatial divisions such as administrative areas, socio-economic characteristics, census tracts, zip code districts, and physical features [5, 21, 22, 23]. Although a priori information classification is intuitive and straightforward, the method is relatively inadequate in terms of accuracy, precision, and objectivity [24]. It is also demonstrated by [25] that market segmentation using administrative boundaries might not be effective in mass appraisal. Expert opinions on segmentation in the same area cannot be used as a widely accepted solution, given that real estate agents may differ in their judgment [20]. As [4] emphasize, consumers are guided by the relationship between price and general characteristics as well as the location of the properties when searching for a home. In contrast, data-driven methods are more objective and accurate and reflect the spatial and temporal dynamics of the real estate market [26].

In recent years, studies have focused on the empirical segmentation of the real estate market using various statistical and machine learning methods to overcome the arbitrariness and subjectivity of a priori information classification. Principal Component Analysis (PCA), factor analysis, and clustering methods are applied to identify sub-markets as data-driven approaches. Market segmentation is performed based on property type, structural features, neighborhood features, spatial features, or a combination of them. To explain the economic meaning of sub-market segmentation, spatial attributes and economic characteristics of properties are often used for clustering [27, 28]. Watkins [22] uses PCA to identify structurally differentiated market segments by obtaining the most common components in the housing stock. Then, the hedonic regression equation is estimated separately for each sub-market. The PCA-based identification of sub-markets advanced significantly with the use of cluster analysis. Clustering, as an unsupervised learning problem, is used to separate data samples into clusters without needing any prior knowledge of partitioning. Various crisp and fuzzy clustering algorithms are also used in the literature to identify housing sub-markets [27, 29, 30, 31].

Machine learning methods are used to identify, interpret, and analyze hugely complicated data structures and patterns. When machine learning techniques are utilized appropriately, they provide fast and reliable results and become a tremendous tool that assists decision-making processes [32]. Intelligent automatic valuation systems for real estate have been designed using various methods and models for mass appraisal in a certain area. When expert algorithms are compared to ML methods in predicting property values, ML methods outperform expert algorithms for Poland data [33].

Since SVMs can capture the nonlinear relationship patterns in regression tasks, they are suitable for the price prediction problem. Other optimization algorithms, such as genetic algorithms or particle swarm optimization, are used to find optimal parameter settings for SVM implementation for real estate price prediction [6, 34]. Another mainstream ML algorithm, the ANN, is also widely employed to predict real estate values. ANNs capture the nonlinear behavior of input variables well. Mach [35] compares the performance of ANNs with multiple regression modeling and observes a similar performance for Poland data. Cetkovic et al. [7] implement ANNs for France, the Czech Republic, and Lithuania using backpropagation with economic and social variables. They observe a satisfactory prediction performance with ANNs. Some of the variables considered in that study, such as gross domestic product (GDP), GDP per capita at market prices, and foreign direct investment, have high variations. ANNs with backpropagation are improved by employing a genetic algorithm for optimization, producing better prediction performance for the Chinese market [36]. In addition to ANNs, other ML methods, such as ElasticNet and XGBoost, have recently been employed to predict real estate prices. However, ANNs outperform both ElasticNet and XGBoost for Italy datasets [37]. A recent review of ANN and ML methods for real estate price prediction in comparison with the hedonic model is presented by [38]. We consider an improved and more generalized version of ANNs, namely DNNs, in this study.

Lee [11] highlights the importance of considering the nature of the categorical independent features and the weakness of neural networks in processing categorical features and proposes using entity embedding techniques in natural language processing to improve the performance of neural networks. The entity embedding technique leads to market segmentation for prediction using only one model. This is an important consideration since most real estate pricing applications include categorical features.

Overall, ML methods are promising for real estate price prediction and perform better than conventional regression-based models. However, their weaknesses include a lack of an explanation of the mechanism behind the predictions [11] and poor handling of categorical predictors [11]. Different encoding strategies such as one hot, CatBoost, Helmert, target, and ordinal encoding are used for categorical features at the data preprocessing step to improve the performance of ML methods [10]. Although the performance of different methods is compared under these encodings, the performance gain with encoding against the straightforward implementation is unclear.

In the fuzzy domain, FNNs are the most used technique for real estate value prediction, dating back to the end of the 1990s [39, 40]. Artificial neural networks are criticized for their black-box nature and low generalization capacity caused by overfitting [8, 40]. The hedonic price theory is used along with the FNNs. The straightforward implementation of FNNs involves getting crisp inputs, fuzzifying them, calculating fuzzy inference rules by using an ANFIS, and defuzzifying the output with centroid defuzzification. Liu et al. [40] get promising results and good generalization capability with this approach. Guan et al. [41] consider an ANFIS approach to property valuation for the US data and compare the accuracy of their ANFIS approach to multiple regression modeling. The ANFIS approach performs slightly worse than a multiple regression approach for the US data. Note that the dataset of [41] includes categorical variables such as basement type, wall type, and garage type. However, according to descriptive statistics reported in the article, these categorical features are treated as interval-scale measurements after coding. As a neural network-based method, ANFIS does not handle categorical features accurately. This can be the reason for getting better performance from the multiple regression approach. This result highlights the importance of using methods that can handle categorical features in property price prediction. Shi et al. [31] use the Fuzzy k-means (FKM) clustering method in housing market segmentation, and predictions in each cluster are obtained using ANFIS. This study concludes that data-driven market segmentation improves the accuracy of price prediction. Kusan et al. [42] propose a fuzzy logic model for property price prediction in Turkey. Their model is based on a wide range of continuous and categorical variables and shows a satisfactory performance in the presence of categorical variables. Sarip et al. [9] provide a comparison of ANN, ANFIS, and fuzzy least-squares regression (FLSR) methods in the predictions of property prices through simulation-based experiments. In the simulation experiments, some categorical variables are included. As a result, FLSR is identified as a promising method for property valuation. Grid partitioning and subtractive clustering are considered along with ANFIS to create optimum fuzzy rule base sets for forecasting house selling prices [43], and ANFIS with grid partitioning is found promising for forecasting house prices. There are also attempts to build automated valuation systems for real estate using fuzzy rule-based systems [44, 45, 46]. However, whether these systems distinguish the categorical nature of the categorical predictors is not clarified.

The previously introduced FRF approaches by [47, 48, 49] aim to improve the FRF method against the outliers in interval-scale features. The methods proposed in these studies are not recommended when categorical features are in the dataset; hence, they are unsuitable for real estate price prediction. In contrast, the current study specifically focuses on developing an FRF approach that provides better prediction accuracy for datasets with both interval-scale and categorical (nominal and/or ordinal) features.

Overall, most studies in the literature do not treat categorical variables differently from interval-scale measurements, resulting in the loss of accuracy in the real estate price prediction. This is the main limitation of the literature we address in this study.

3 Datasets and methods

3.1 Data description

Six diverse real estate datasets from Russia, Georgia (the USA), Taiwan, Riga (Latvia), Sao Paulo (Brazil), and Victoria (Australia) are considered in this study. The data sources are available in Table 1.

Table 1 Datasets and web links to data sources

The datasets contain continuous and categorical predictors with varying sample sizes. Table 2 contains information about each dataset’s features and the sample size. Since there are duplications and zero-variance features in the datasets given by the original data sources, we applied a preprocessing step. The resulting datasets are available at https://github.com/haydarde/HFRF. The type of categorical predictors is given in brackets in the first column of Table 2. The names of some predictors are changed to group them into one name across the datasets. We consider the location of each property by latitude and longitude information. The sale price is log-transformed to reduce the large range caused by the local currencies for datasets 1, 2, 4, 5, and 6. Since the Taiwan data have the unit area price, log-transformation is not applied. The most common features across the datasets are the total area and the number of rooms, seen in 4 and 5 out of 6 datasets, respectively. Out of 21 predictors, we have 2 nominal categorical, 5 binary, 7 discrete and 7 continuous predictors across the datasets. Thus, it is crucial to have methods that handle binary and nominal categorical features sufficiently.

Table 2 Features, sample size, number of predictors, and number of categorical predictors in each dataset

Box plots of log-sale price for each dataset are displayed in Fig. 1. The triangle in each box indicates the mean log-sale price, while the horizontal line in each box shows the median log-sale price. All datasets have almost similar log-sale price distributions. The variation of log-sale prices is the least for Victoria, while it is the highest for Russia. The Sao Paulo and Victoria datasets are slightly right-skewed with high sale prices.

Fig. 1 Box plots of log-sale price for each dataset

The box plots of the log-total area for Georgia, Riga, Russia, and Sao Paulo datasets are given in Fig. 2. While properties in Georgia have the highest variation in total area with a right-skewed distribution, the Russia dataset has the lowest variation in total area. The Sao Paulo dataset is notably right-skewed with properties with very large total areas.

Fig. 2 Box plots of the logarithm of the total area for Georgia, Riga, Russia, and Sao Paulo datasets

The histograms of the number of rooms for Georgia, Riga, Russia, Sao Paulo, and Victoria datasets are given in Fig. 3. All distributions of the number of rooms are right-skewed, except for Victoria, which has a left-skewed distribution.

Fig. 3 Histograms of the number of rooms for Georgia, Riga, Russia, Sao Paulo, and Victoria datasets

The bar plots of the categorical predictors in Russia, Georgia, Riga, Sao Paulo, and Victoria datasets are given in Fig. 4. All the categorical variables except the swimming pool and elevator predictors for Sao Paulo are highly imbalanced. In addition to having categorical predictors in the dataset, their imbalancedness adds another level of challenge to accurate modeling. A sufficient model should handle imbalanced categorical variables successfully in this case.

Fig. 4 Bar plots of the categorical variables for Russia, Georgia, Riga, Sao Paulo, and Victoria datasets

3.2 Fuzzy regression functions

The FRF technique was originally developed by [50]. Baser and Demirhan [47] introduced SVMs into FRFs to enhance their performance for regression. Then, the noise cluster approach of [51] was used along with SVMs and ANNs within FRF to improve the robustness of FRFs against outliers for regression [48]. This approach is called FRF with a noise cluster (FRFN). Then, [49] modified FRFNs by implementing robust clustering methods in the FRFN approach, yielding a modified FRFN approach (MFRFN) to further robustify FRFNs against the outliers in the dataset. MFRFN demonstrates a promising estimation accuracy for wind speed estimation [52]. In the time series setting, FRFs provide robust forecasting results against the outliers as well [53, 54]. However, no categorical predictor is considered in any of these works. In this study, we develop an FRF model, namely the hierarchical fuzzy regression functions (HFRF) model, that runs efficiently when binary, multi-class nominal, ordinal and/or interval-scale predictors are in the data.

The objective function to minimize in FRF implementation is

$$\begin{aligned} f(\varvec{\mu },\varvec{\nu },{\varvec{X}})= \sum _{i=1}^{n}\sum _{k=1}^{c}g_{k,i}(\mu _{k,i},d_{k,i}), \end{aligned}$$
(1)

where c is the number of clusters, n is the sample size, \({\varvec{X}}_{n\times (\ell +1)}\) is the matrix of \(\ell \) independent predictors and the dependent feature, \(\mu _{k,i}({\varvec{x}}_{i},\varvec{\nu })=\mu _{k,i}\in [0,1]\), with \(\sum _{k=1}^{c}\mu _{k,i}=1\) for all \(i=1,\dots ,n\), shows membership values, and \(d_{k,i}({\varvec{x}}_{i},\varvec{\nu }_{k}) =d_{k,i}\) is the Euclidean distance between the cluster centers and observations as the inner product norm:

$$\begin{aligned} d_{k,i} =||{\varvec{x}}_{i}-\varvec{\nu }_{k}||, \end{aligned}$$
(2)

with the \(\ell \times 1\) vector of cluster centers, \(\varvec{\nu }_{k}({\varvec{x}}_{i})=\varvec{\nu }_{k}\), \(k=1,\dots ,c\). In Eq. (1), the function \(g_{k,i}(\cdot )\) indicates the clustering method. For the fuzzy c-means (FCM) clustering, \(\mu _{k,i}\) is defined in Eq. (3),

$$\begin{aligned} \mu _{k,i}= \bigg \{\sum _{l=1}^{c}\bigg (\frac{d_{k,i}}{d_{l,i}}\bigg )^{2/(m-1)} \bigg \}^{-1} \end{aligned}$$
(3)

with the degree of fuzziness \(m>1.1\), and the center of cluster k is defined in Eq. (4):

$$\begin{aligned} \varvec{\nu }_{k}=\frac{\sum _{i=1}^{n}\mu _{k,i}^{m}{\varvec{x}}_{i}}{\sum _{i=1}^{n}\mu _{k,i}^{m}}, \end{aligned}$$
(4)

and \(g_{k,i}(\cdot )\) function is given in Eq. (5):

$$\begin{aligned} g_{k,i}(\mu _{k,i},d_{k,i})= \mu _{k,i}^{m}d_{k,i}^{2}, k=1,\dots ,c. \end{aligned}$$
(5)
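As an illustration of Eqs. (1) to (5), the following minimal Python sketch alternates the FCM membership update of Eq. (3) with the center update of Eq. (4) and evaluates the objective of Eqs. (1) and (5). It is a sketch under stated assumptions rather than the implementation used in this study: the function names, the toy data, the starting centers, the fixed number of iterations, and the choice \(m=2\) are all hypothetical, and memberships are stored as `U[i][k]` (observation first), so that each row sums to one over the clusters.

```python
import math

def fcm_memberships(X, centers, m=2.0):
    """Eq. (3): membership of observation i in cluster k, stored as U[i][k]."""
    expo = 2.0 / (m - 1.0)
    U = []
    for x in X:
        d = [math.dist(x, v) for v in centers]  # Eq. (2), Euclidean norm
        if any(dk == 0.0 for dk in d):
            # Observation coincides with a center: share membership among those centers.
            hits = [1.0 if dk == 0.0 else 0.0 for dk in d]
            U.append([h / sum(hits) for h in hits])
        else:
            U.append([1.0 / sum((dk / dl) ** expo for dl in d) for dk in d])
    return U

def fcm_centers(X, U, m=2.0):
    """Eq. (4): centers as averages of the observations weighted by mu^m."""
    n_features = len(X[0])
    centers = []
    for k in range(len(U[0])):
        w = [U[i][k] ** m for i in range(len(X))]
        centers.append([sum(wi * x[j] for wi, x in zip(w, X)) / sum(w)
                        for j in range(n_features)])
    return centers

def fcm_objective(X, U, centers, m=2.0):
    """Eqs. (1) and (5): sum of mu^m * d^2 over observations and clusters."""
    return sum(U[i][k] ** m * math.dist(X[i], centers[k]) ** 2
               for i in range(len(X)) for k in range(len(centers)))

# Hypothetical toy data with two well-separated groups.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
centers = [[0.0, 0.5], [5.5, 5.0]]  # illustrative starting centers
for _ in range(20):                 # fixed iteration count, for illustration only
    U = fcm_memberships(X, centers)
    centers = fcm_centers(X, U)
```

A practical implementation would stop when the objective or the memberships change by less than a tolerance instead of running a fixed number of iterations.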

The FRF implementation includes two stages: the fuzzy clustering stage with the training dataset and the fuzzy inference stage with the output of the fuzzy clustering stage on the testing dataset. Learning similarities between observations happens at the clustering stage, as it captures the clusters created by the similarities of observations in the dataset. This corresponds to the segmentation in the real estate valuation domain. Then, the membership values based on the distance between the observations and cluster centers are added to the training dataset as another predictor, and an SVM model is fitted at the inference stage. The fitted SVM model is then used to generate predictions with the test data at the last stage of FRFs [49]. However, FRFs rely on the Euclidean distance, which is not an appropriate distance measure for categorical features.

3.2.1 Hierarchical fuzzy regression functions

It is clear that when some of the predictors are categorical, using the Euclidean distance between the cluster centers and observations from the categorical predictors is unsuitable [12, 55]. The distance is used twice in FRFs. First, it is used at the clustering stage, and then, the distance between cluster centers and observations is fed into the inference stage. Therefore, using a distance metric suitable for mixed data types is crucial. One can either transform all categorical features into binary by dummy coding and use a distance for binary variables or use a Minkowski distance by considering the categorical variables with integer coding [12, 55, 56]. Considering that our FRF method consists of SVM implementation, keeping the variables numerical rather than binary is more beneficial.

The accuracy of the clustering algorithm at the first stage of FRFs significantly impacts the overall performance of the FRF implementation. Therefore, the clustering algorithm of the first stage needs to be suitable for mixed data types and less time-consuming as there are two stages of implementation. Specifically, capturing the similarity among the observations is also important for real estate valuation. Considering these characteristics, a hierarchical clustering algorithm is suitable by employing the generalized Minkowski distance of [12].

To handle nominal, binary, ordinal, and interval-scale features with FRFs, we consider using hierarchical clustering with the generalized Minkowski distance [12] at the first stage of FRFs and calculate the generalized Minkowski distances between observations and centers from the hierarchical clustering to feed additional data to the inference stage in the HFRF framework.

For two vectors \({\varvec{z}}=(z_{i})^{T}\) and \({\varvec{y}}=(y_{i})^{T},i=1,\dots ,n\) in \({\mathbb {R}}^{n}\), the Minkowski distance of order p is defined as in Eq. (6):

$$\begin{aligned} d^{p}_{z,y}=\bigg (\sum _{i=1}^{n}|z_{i}-y_{i}|^{p} \bigg )^{1/p}. \end{aligned}$$
(6)

The generalized version of the Minkowski distance is defined by [12] for continuous, discrete, quantitative, qualitative, and structural data types. However, since we are working with continuous, discrete, nominal, and ordinal data types, we establish a simpler form of it based on the observed ranges of the features, which is less time-consuming in implementation.

Let \({\varvec{X}}_{n\times \ell }=(x_{ij}),i=1,\dots ,n; j=1,\dots ,\ell \), where n is the number of observations and \(\ell \) is the number of features, be the data matrix and \(R_{j}\) denote

  • The absolute range of the jth column vector of \({\varvec{X}}\), \(|\max ({\varvec{x}}_{j})-\min ({\varvec{x}}_{j})|\), when \({\varvec{x}}_{j}\) is in interval scale (continuous or discrete), and

  • The number of levels, \(L({\varvec{x}}_{j})\), when \({\varvec{x}}_{j}\) is composed of nominal or ordinal measurements.

Then, the generalized Minkowski distance of order p between two observation vectors \({\varvec{x}}_{i}\) and \({\varvec{x}}_{r}\) is defined in Eq. (7):

$$\begin{aligned} d^{p}_{ir}({\varvec{X}})=\bigg [\sum _{j=1}^{\ell }\big (|x_{ij}-x_{rj}|/R_{j}\big )^{p} \bigg ]^{1/p}. \end{aligned}$$
(7)

We run hierarchical clustering at the clustering stage of HFRF with the distance matrix \({\varvec{D}}^{p}=(d^{p}_{ir}),i,r=1,\dots ,\lfloor \gamma \cdot n\rfloor \), where \(\lfloor \cdot \rfloor \) shows the floor function and \(\gamma \) is the proportion of the training sample, and obtain the cluster center \(\varvec{\nu }_{k}^{H}=(\nu ^{H}_{kj}),j=1,\dots ,\ell \) for each cluster \(k=1,\dots ,c\). Then, we compute the membership values of observation vectors to the clusters with Eq. (8), which follows from Eq. (3) with the distance of Eq. (7) between each observation and each cluster center:

$$\begin{aligned} \mu _{p,k,i}^{H}= \bigg \{\sum _{l=1}^{c}\bigg [\frac{\sum _{j=1}^{\ell }\big (|x_{ij}-\nu ^{H}_{kj}|/R_{j}\big )^{p} }{\sum _{j=1}^{\ell }\big (|x_{ij}-\nu ^{H}_{lj}|/R_{j}\big )^{p} }\bigg ]^{2/(p(m-1))} \bigg \}^{-1}. \end{aligned}$$
(8)
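The scaling by \(R_{j}\) and the membership computation can be sketched in Python as follows. This is an illustrative sketch, not the R code released with this study. It assumes that categorical features are integer coded, that the outer exponent of the distance is \(1/p\) as in the standard Minkowski distance, and that memberships are obtained by applying the FCM rule of Eq. (3) to these distances. The function names, the toy properties, and the two cluster centers are hypothetical.

```python
def feature_ranges(X, categorical):
    """R_j: observed range for interval-scale columns, number of levels for categorical ones."""
    R = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        if j in categorical:
            R.append(float(len(set(col))))        # number of observed levels
        else:
            R.append(float(max(col) - min(col)))  # zero-variance columns must be removed first
    return R

def gen_minkowski(x, y, R, p=2.0):
    """Range-scaled Minkowski distance of order p between two observations."""
    return sum((abs(a - b) / r) ** p for a, b, r in zip(x, y, R)) ** (1.0 / p)

def memberships(x, centers, R, p=2.0, m=2.0):
    """FCM-style memberships of one observation, based on the range-scaled distance."""
    d = [gen_minkowski(x, v, R, p) for v in centers]
    if any(dk == 0.0 for dk in d):
        hits = [1.0 if dk == 0.0 else 0.0 for dk in d]
        return [h / sum(hits) for h in hits]
    expo = 2.0 / (m - 1.0)
    return [1.0 / sum((dk / dl) ** expo for dl in d) for dk in d]

# Hypothetical properties: total area, rooms, furnished (0/1), property type (codes 1..3).
X = [[50, 2, 0, 1], [80, 3, 1, 2], [120, 4, 1, 3], [65, 2, 0, 1]]
R = feature_ranges(X, categorical={2, 3})
d01 = gen_minkowski(X[0], X[1], R, p=1.0)
centers = [[57.5, 2.0, 0.0, 1.0], [100.0, 3.5, 1.0, 2.0]]  # illustrative centers
mu = memberships(X[0], centers, R)
```

Because every term is divided by \(R_{j}\), a one-level difference in a binary feature contributes 0.5 to the sum regardless of the scale of the area or price columns, which keeps large-valued interval features from dominating the distance.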

The membership values from Eq. (8) are merged with the observation matrix of each cluster, \(\varvec{\Gamma }_{k}=({\varvec{X}}_{k}\,\vdots \,\mu _{p,k,i}^{H})\), and fuzzy regression functions for each cluster, \(f_{k}(\varvec{\Gamma }_{k},\varvec{\beta }_{k})\), are fitted using SVMs with radial kernel to get the parameter estimates \(\hat{\varvec{\beta }}_{k}\). The \(\hat{\varvec{\beta }}_{k}\) vectors are used to create the cluster-wise fitted values of each observation vector, \(\hat{{\varvec{y}}}_{k}=({\hat{y}}_{k,i}),i=1,\dots ,\lfloor \gamma \cdot n\rfloor \), in the training sample. Then, the final fitted values in the training set are calculated as the weighted averages of the cluster-wise fitted values, with weights corresponding to the membership values:

$$\begin{aligned} {\hat{y}}_{i}=\sum _{k=1}^{c}{\hat{y}}_{k,i}\mu _{p,k,i}^{H}/\sum _{k=1}^{c}\mu _{p,k,i}^{H}. \end{aligned}$$
(9)
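Eq. (9) amounts to a membership-weighted average of the cluster-wise predictions for a single observation. A minimal Python sketch is given below; the function name and the numbers are hypothetical and serve only to show the arithmetic.

```python
def combine_predictions(cluster_preds, memberships):
    """Eq. (9): weighted average of per-cluster fitted values for one observation."""
    total = sum(memberships)
    return sum(y * mu for y, mu in zip(cluster_preds, memberships)) / total

# Hypothetical log-price predictions from c = 3 cluster-wise models and their memberships.
y_hat = combine_predictions([12.1, 12.9, 11.5], [0.7, 0.2, 0.1])
# Dividing by the sum of the weights makes the result invariant to rescaling them.
y_hat_rescaled = combine_predictions([12.1, 12.9, 11.5], [7.0, 2.0, 1.0])
```

The prediction is pulled toward the model of the segment to which the property most strongly belongs, while the other segments still contribute in proportion to their membership values.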

For the observations in the test sample, we calculate Eqs. (7) and (8) with observations \({\varvec{x}}_{i},i=\lfloor \gamma \cdot n\rfloor + 1, \dots , n\) and find the predictions using the fitted SVM model’s parameter estimates \(\hat{\varvec{\beta }}_{k}\) for each cluster. Then, the final predictions are calculated using Eq. (9) for the observations \({\varvec{x}}_{i},i=\lfloor \gamma \cdot n\rfloor + 1, \dots , n\).

The workflow of HFRF is given in Fig. 5. Compared to the flow of FRF implementation, we employ hierarchical clustering with the generalized Minkowski distance to find the cluster centers and then utilize the generalized Minkowski distance for the membership degrees of the observations to the clusters. This improvement makes FRFs applicable to datasets that include categorical and interval-scale observations. The hierarchical clustering with the generalized Minkowski distance provides us with clusters that represent market segments, and the distances between the observations and the clusters relate each property to the segments.

Fig. 5 Flowchart of the HFRF implementation

3.3 Goodness-of-fit measures

In order to assess the prediction performance of the models, root-mean-squared error (RMSE) and mean absolute error (MAE) are used as defined in Eq. (10):

$$\begin{aligned} \text {RMSE} = \bigg (\sum _{i=1}^{N}({\hat{x}}_{i}-x_{i})^{2}/N\bigg )^{0.5} \quad \text {and}\quad \text {MAE} = \frac{1}{N}\sum _{i=1}^{N}|{\hat{x}}_{i}-x_{i}|, \end{aligned}$$
(10)

where \(|\cdot |\) shows absolute value, \(x_{i}\) is the observed value in either training or test sets of size N, \({\hat{x}}_{i}\) is the corresponding prediction by a model, and \({\bar{x}}\) is the mean of either training or test set. The scaled versions of RMSE and MAE, namely rRMSE and rMAE, are obtained by dividing them by \({\bar{x}}\) in Eq. (11):

$$\begin{aligned} \text {rRMSE} = \text {RMSE}/|{\bar{x}}| \quad \text {and}\quad \text {rMAE} = \text {MAE}/|{\bar{x}}|. \end{aligned}$$
(11)

rRMSE and rMAE provide an assessment independent of each market’s price level.

RMSE and MAE handle prediction errors differently and need to be considered simultaneously for a comprehensive evaluation of the models. MAE measures the average absolute difference between the observed and predicted values to depict the average magnitude of the model’s error. On the other hand, RMSE measures the variation in the errors, showing the degree of divergence of the errors. It inflates more quickly than MAE for larger errors. The mean magnitude of error needs to be considered together with the errors’ degree of variation to see whether the model generates consistently low errors. The rescaled versions of MAE and RMSE, rMAE and rRMSE, are also considered since MAE and RMSE are not directly comparable across different markets due to different price levels. However, although the rescaled versions are comparable across multiple markets, they do not reflect the actual magnitude of error for individual markets. Therefore, we investigate both regular and rescaled versions of MAE and RMSE in this study.
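The four measures in Eqs. (10) and (11) can be computed as in the short Python sketch below. The function name and the example values are hypothetical. The example is chosen so that one large error is present, which makes RMSE noticeably larger than MAE.

```python
import math

def error_metrics(observed, predicted):
    """RMSE and MAE (Eq. 10) and their rescaled versions (Eq. 11)."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    scale = abs(sum(observed) / n)  # |mean of the observed values|
    return {"RMSE": rmse, "MAE": mae, "rRMSE": rmse / scale, "rMAE": mae / scale}

# Hypothetical observed and predicted values; the errors are 1, 0, and -3.
metrics = error_metrics([10.0, 12.0, 14.0], [11.0, 12.0, 11.0])
```

Here the single error of size 3 dominates the squared sum, so RMSE exceeds MAE, which is the behavior described above for larger errors.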

4 Results

We implement the proposed HFRF along with DNN [57], ANFIS [58], SVMs with the linear and radial kernels (SVM-Lin and SVM-Rad) [59], and linear regression (LinReg) [60] on the six real estate datasets presented in Sect. 3.1 to assess the performance of the proposed HFRFs against these benchmark methods, which are frequently applied to the real estate pricing problem in the literature.

All methods are run with hyperparameter tuning via tenfold cross-validation with 80% \((\gamma =0.8)\) training and 20% test splits. Due to the skewed nature of the price data, the natural logarithm of real estate prices is used in all methods for all datasets except the Taiwan dataset, which, unlike the others, reports the price per unit area. No transformation is applied to the predictors.
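
The data preparation described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names `train_test_split` and `k_folds`, the random seed, and the toy prices are hypothetical, and we assume that the tenfold cross-validation is carried out inside the 80% training portion while the 20% test portion is held back for the final evaluation.

```python
import math
import random

def train_test_split(n, gamma=0.8, seed=0):
    """Shuffle indices 0..n-1 and return (train, test) with a gamma share for training."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(gamma * n))
    return idx[:cut], idx[cut:]

def k_folds(indices, k=10):
    """Partition the given indices into k nearly equal validation folds."""
    return [indices[i::k] for i in range(k)]

# Hypothetical prices; the log transform is applied to the response only.
prices = [120_000.0 + 5_000.0 * i for i in range(50)]
log_prices = [math.log(p) for p in prices]

train_idx, test_idx = train_test_split(len(prices), gamma=0.8)
folds = k_folds(train_idx, k=10)
for fold in folds:
    val_idx = fold                                     # held-out part of the training set
    fit_idx = [i for i in train_idx if i not in fold]  # used to fit one candidate setting
    # ... fit a model on fit_idx and score it on val_idx ...
```

Predictions made on the log scale can be mapped back to the price scale with `math.exp` when errors in currency units are needed.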

4.1 Hyperparameter tuning

The DNN implementation consists of three dense layers with dropout rates subject to tuning. The first dense layer has an L2 regularizer with a regularization factor of 0.0001. The activation function of the first layer is tuned over the softsign and ReLU activation functions. The next two layers have ReLU activation. The alpha parameter of all ReLU activation functions is tuned over 0.05 and 0.1. The last two layers have He normal initializers. The dropout rate of each layer is tuned over 0.05, 0.1, and 0.15, the number of units in each layer over 15, 25, and 40, and the batch size over 32, 64, and 128. Training runs for 200 epochs with an early-stopping callback that monitors the mean absolute error. Table 3 shows the final combinations of parameters after the hyperparameter tuning for all datasets.
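
To give a sense of the size of this search space, the candidate values listed above can be enumerated as a grid. The sketch below is an assumption-laden illustration: the dictionary keys are hypothetical names, and we assume that the dropout rate and the number of units are chosen independently for each of the three layers, which the text does not state explicitly.

```python
from itertools import product

first_activation = ["softsign", "relu"]
relu_alpha = [0.05, 0.1]
dropout_rates = [0.05, 0.1, 0.15]
units = [15, 25, 40]
batch_sizes = [32, 64, 128]
n_layers = 3

# Assumption: dropout rate and unit count vary independently per layer.
grid = [
    {"first_activation": act, "relu_alpha": alpha,
     "dropout": drop, "units": unit, "batch_size": batch}
    for act, alpha, drop, unit, batch in product(
        first_activation,
        relu_alpha,
        product(dropout_rates, repeat=n_layers),  # one rate per layer
        product(units, repeat=n_layers),          # one unit count per layer
        batch_sizes,
    )
]
print(len(grid))  # 2 * 2 * 3**3 * 3**3 * 3 = 8748
```

If a single dropout rate and a single unit count were instead shared by all layers, the grid would shrink to 2 × 2 × 3 × 3 × 3 = 108 combinations.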

Table 3 Implementation parameters for DNNs after the hyperparameter tune-up

The ANFIS models are fitted using the Caret and FRBS R packages [61, 62]. In the ANFIS implementation, the min and max functions are used as the t-norm and s-norm operators, respectively, together with the Zadeh implication function [61, 62]. Five layers are considered in the forward stage [62]: i) fuzzification with the Gaussian membership function; ii) inference using the t-norm operator; iii) calculation of the ratio of the rules’ strengths; iv) estimation of the parameters; and v) calculation of the overall output using the sum operator. The least squares method is used for parameter estimation in the backward stage. The step size of the gradient descent is set to 0.01. Hyperparameter tuning for the ANFIS implementation would require examining 405 combinations of t-norm, s-norm, and implication functions. While the computational cost of this tuning is 3.4 days for the smallest dataset, Dataset 3, it inflates to 962.1 days, or 2.6 years, for the largest dataset, Dataset 2. Therefore, the same setting is used across the six datasets.
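
The five forward-stage layers listed above can be sketched in a few lines of Python. This is not the FRBS code; it is a minimal illustration that assumes first-order Takagi-Sugeno rules with linear consequents, and the names `gaussian_mf`, `anfis_forward`, and the rule dictionary layout are hypothetical.

```python
import math

def gaussian_mf(x, mean, sigma):
    """Gaussian membership degree of x."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)

def anfis_forward(x, rules):
    """Forward pass for one input vector x.

    Each rule is a dict with
      'mf':   list of (mean, sigma) pairs, one per input
      'coef': linear consequent coefficients [b0, b1, ..., bd]
    """
    # Layer 1: fuzzification with Gaussian membership functions
    memberships = [
        [gaussian_mf(xj, m, s) for xj, (m, s) in zip(x, rule["mf"])]
        for rule in rules
    ]
    # Layer 2: rule firing strength with the min t-norm
    strengths = [min(mu) for mu in memberships]
    # Layer 3: ratio of each rule's strength to the total strength
    total = sum(strengths)
    weights = [w / total for w in strengths]
    # Layer 4: rule outputs from the (assumed linear) consequents
    outputs = [
        rule["coef"][0] + sum(b * xj for b, xj in zip(rule["coef"][1:], x))
        for rule in rules
    ]
    # Layer 5: overall output as the weighted sum
    return sum(w * f for w, f in zip(weights, outputs))

# Two hypothetical rules on a single input, placed symmetrically around 0.
rules = [
    {"mf": [(-1.0, 1.0)], "coef": [1.0, 0.0]},
    {"mf": [(1.0, 1.0)], "coef": [3.0, 0.0]},
]
prediction = anfis_forward([0.0], rules)
```

At the input 0 both rules fire with equal strength, so the output is the simple average of the two rule outputs. The backward stage, which estimates the membership and consequent parameters, is not covered by this sketch.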

SVMs are also implemented using the Caret package. The linear and radial basis kernels are considered. The regularization parameter C is tuned against RMSE for the linear kernel. For the radial basis kernel, C and \(\sigma \) parameters are tuned against RMSE by random search with 20 replications. The resulting optimal values of C and \(\sigma \) are given in Table 4 for each dataset.
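
A random search with 20 replications over C and \(\sigma \) can be sketched generically as below. The sampling ranges, the log-uniform sampling scheme, and the stand-in objective `toy_rmse` are assumptions made for illustration; in practice the score function would return the cross-validated RMSE of an SVM fitted with the given pair.

```python
import math
import random

def random_search(score_fn, n_iter=20, c_range=(1e-2, 1e3),
                  sigma_range=(1e-3, 1e1), seed=0):
    """Draw (C, sigma) log-uniformly and keep the pair with the lowest score."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        c = 10 ** rng.uniform(math.log10(c_range[0]), math.log10(c_range[1]))
        sigma = 10 ** rng.uniform(math.log10(sigma_range[0]),
                                  math.log10(sigma_range[1]))
        score = score_fn(c, sigma)
        if best is None or score < best[0]:
            best = (score, c, sigma)
    return best

def toy_rmse(c, sigma):
    """Stand-in objective (hypothetical); replace with cross-validated RMSE."""
    return (math.log10(c) - 1.0) ** 2 + (math.log10(sigma) + 1.0) ** 2

best_score, best_c, best_sigma = random_search(toy_rmse)
```

Sampling on a logarithmic scale is a common choice for these two parameters because their useful values typically span several orders of magnitude.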

Table 4 Implementation parameters for SVM-Rad after the hyperparameter tune-up

HFRFs are run with the authors’ R implementation of Fig. 5, available at https://github.com/haydarde/HFRF. The hierarchical clustering algorithm needs the number of clusters as its essential input. The number of clusters for each dataset is determined based on the silhouette (SIL) index [63]. The degree of fuzziness is tuned considering \(m=1.5,2,2.5,3\) for each dataset. The order of the generalized Minkowski distance is tuned through \(p=1.1,1.2,1.5,1.75,2,2.25,2.5\). After the tune-up, the final values of the number of clusters (c), the degree of fuzziness (m), and the order of the generalized Minkowski distance (p) used for HFRFs are given for each dataset in Table 5. The C and \(\sigma \) parameters of SVM-Rad at the inference stage of HFRFs are tuned against RMSE under each cluster for each dataset by random search with 20 replications. Table 6 shows the optimal values of C and \(\sigma \) for each dataset and the number of clusters shown in Table 5. For each dataset, only c rows of Table 6 are filled with the values of the corresponding C and \(\sigma \) hyperparameters.
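
The silhouette index used to choose the number of clusters can be sketched as follows. This is a simplified illustration and not the authors' code: it uses the ordinary Minkowski distance of order p on numeric features as a stand-in, whereas the generalized Minkowski distance of [12] also handles categorical features. The function names and the toy points are hypothetical.

```python
def minkowski(a, b, p=2.0):
    """Ordinary Minkowski distance of order p (stand-in for the generalized version)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def silhouette_index(points, labels, p=2.0):
    """Mean silhouette width for a labelling with at least two clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: width 0 by convention
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(minkowski(points[i], points[j], p) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(minkowski(points[i], points[j], p) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Two well-separated pairs of hypothetical points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_index(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_index(points, [0, 1, 0, 1])   # grouping that splits the pairs
```

In a tuning loop, the index would be computed for each candidate number of clusters, and the candidate with the largest value would be retained.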

Table 5 Implementation parameters for HFRFs for each dataset after the hyperparameter tune-up
Table 6 Implementation parameters for SVM-Rad under HFRFs for each cluster associated with each dataset after the hyperparameter tune-up

4.2 Prediction performance

Table 7 presents the RMSE, rRMSE, MAE, and rMAE results of the HFRF, DNN, SVM, and ANFIS methods for all datasets. The rescaled error measures rRMSE and rMAE help assess the variation and magnitude of errors in the test sets independently of the different price ranges across the countries considered in the study. In terms of rMAE, the proposed HFRFs produce the minimum absolute error for real estate price prediction for all datasets. This implies that HFRFs provide the lowest magnitude of error in the predicted prices. A useful method should also provide predictions with low variability in addition to a low magnitude of error. In terms of rRMSE, HFRFs also have the lowest variation in the prediction errors. Only for the Taiwan dataset does SVM-Rad produce an rRMSE very close to that of HFRF. DNN has the second-best rRMSE for Russia, and SVM-Rad has the second-best rRMSE for the other datasets. While ANFIS performs poorly among the considered methods, the hedonic model, LinReg, closely follows SVM-Lin.

Table 7 Goodness-of-fit results of HFRF, DNN, SVM, and ANFIS methods for all datasets

Since we apply a log-transformation to the price data for modeling and neutralize the impact of the mean price in each of the considered real estate markets by reporting the scaled error measures, the real magnitude of the improvement by HFRFs is not evident from Table 7. To assess the magnitude of the gain by HFRFs, we find the percent improvement by HFRFs over the second-best model in terms of MAE and report the impact of the percent gain in terms of the average price in each of the considered markets in Table 8.

Table 8 Percent gain in the magnitude of absolute error by HFRFs against the second-best method for each dataset and the impact of the gain on average price

The improvement in the magnitude of prediction error by the proposed HFRF method ranges between 3.8% in the Russian market and 12.5% in the Sao Paulo market across the compared markets. The proposed HFRF produces price predictions with 6.3% less error than the runner-up method in the Georgia market. In terms of the average price, the gains correspond to a 19,312 USD improvement in the price predictions’ margin of error for Georgia, USA, 11,340 USD for Sao Paulo, BR, and 47,418 USD for Victoria, AU. Overall, the proposed HFRF method provides a notable reduction in the magnitude of error in real estate price prediction.
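
The two quantities reported in Table 8 can be sketched as simple arithmetic. The exact mapping from the percent gain to currency units is not spelled out in the text; the sketch assumes that the impact is the percent gain applied to the average price, and all inputs below are hypothetical rather than values from Tables 7 or 8.

```python
def percent_gain(mae_runner_up, mae_hfrf):
    """Percent reduction in MAE relative to the runner-up method."""
    return 100.0 * (mae_runner_up - mae_hfrf) / mae_runner_up

def price_impact(gain_percent, average_price):
    """Gain expressed as a share of the average price (assumed interpretation)."""
    return gain_percent / 100.0 * average_price

# Hypothetical inputs for illustration only.
gain = percent_gain(mae_runner_up=0.200, mae_hfrf=0.180)
impact = price_impact(gain, average_price=250_000.0)
```

With these inputs the gain is about 10%, which corresponds to about 25,000 currency units at the assumed average price.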

Table 9 shows, based on rRMSE, the gain in the variability of the prediction errors achieved by the HFRF method for all datasets. In terms of rRMSE, the HFRF method is superior to all benchmark methods. Only for the Taiwan dataset is the rRMSE of HFRF very close to that of SVM-Rad. For the Riga, Georgia, Sao Paulo, and Victoria datasets, the gain in the variability of the prediction errors varies between 7% and 14.1%. The lowest gain of 2.8% is recorded for the variability of the prediction errors in the Russian market.

Table 9 Percent gain in the variability of the prediction error by HFRFs against the second-best method for each dataset

HFRFs are proposed to handle categorical variables better in a dataset with categorical and interval-scale measurements. The 7.7% improvement in error magnitude for the Taiwan dataset, which has only interval-scale variables, implies that HFRF is a promising method even if there is no categorical variable in the dataset. This is a desired feature of HFRFs. However, we do not observe a comparable improvement in the variability of the prediction errors by HFRF for the Taiwan dataset, which can be attributed to the absence of categorical variables in this dataset. The least improvement in error magnitude, 3.8%, is seen in the Russia dataset, which has a large sample size and only one categorical variable. In comparison, the Sao Paulo dataset has a similar sample size but three categorical variables, and HFRF provides a 12.5% improvement in the error magnitude and a 14.1% improvement in the prediction errors’ variability. Hence, as the number of categorical variables in a dataset increases, we can expect better gains in performance with HFRFs. When the number of categorical variables is low with a small-to-moderate sample, such as in the Victoria dataset, the gain with HFRFs is also very promising: 10.9% in error magnitude and 12.1% in error variability.

Figure 6 shows the actual and predicted log-sale prices with HFRF for all locations. The 45-degree lines in Fig. 6 display the case of perfect prediction. Generally, HFRF predictions align well with the 45-degree line for all locations. However, we observe outliers for some locations. For Victoria, one property is located far from the main body of observations in the scatter plot. Since this point is an outlier in the test set and is located in the direction of the trend, we can conclude that HFRF produces a correspondingly large prediction for this high-priced property. That is, HFRF learns from similar observations in the training set to produce a high price prediction for this property. Another notable observation from Fig. 6 is from the Taiwan market. Only one property price is significantly underestimated by HFRF, which is the only significant underestimation by HFRF in all six markets. This observation is located at the top of the main cluster of points in the scatter plot for Taiwan. Since this property is located in the same region as other top-priced properties in the Taiwan market, we do not anticipate a measurement error for this data point. The reason for HFRF’s underestimation is that this property is a single-story dwelling, while the other high-priced ones have more than 6 stories.

Fig. 6 Scatter plots of actual and predicted log-sale prices with HFRF for all locations. The red 45-degree line indicates a perfect match between predictions and actual observations

4.3 Computation time

Figure 7 shows the run times of a single run of the top three methods, HFRF, SVM-Rad, and DNN, together with the sample size and number of predictors in each dataset, to assess the applicability of the proposed HFRFs in practice. All runs are performed on a MacBook Pro computer with an Apple M1 Max chip and 64 GB of memory. When hyperparameters are tuned, the run times in Fig. 7 need to be multiplied by the size of the tuning grid. Among the top three methods, HFRFs are the most affected by increases in the sample size and the number of predictors. For small-to-moderate samples, such as the Taiwan and Victoria datasets, the run time of HFRFs is very close to that of SVM-Rad and DNNs. However, there is a notable difference between HFRFs and SVM-Rad for large samples such as Georgia or Sao Paulo. On the other hand, for the Riga data, which is a moderate sample, HFRF has a better run time. Overall, the reported run times for HFRF have no negative impact on the method’s applicability in practice.
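
The multiplication rule stated above can be sketched as a small helper. The function names are hypothetical and the workload is a stand-in. The grid size of 28 assumes a full grid over the four candidate values of m and the seven candidate values of p from Sect. 4.1, and ignores cross-validation folds and the per-cluster SVM tuning.

```python
import time

def time_single_run(fit_fn, *args):
    """Wall-clock seconds for one call of fit_fn."""
    start = time.perf_counter()
    fit_fn(*args)
    return time.perf_counter() - start

def projected_tuning_time(single_run_seconds, grid_size):
    """Projected tuning cost: one run per point of the tuning grid."""
    return single_run_seconds * grid_size

# Stand-in workload; replace with an actual model fit.
elapsed = time_single_run(sum, range(100_000))
projection = projected_tuning_time(elapsed, grid_size=4 * 7)
```

Such a projection is only a rough lower bound, since each grid point is usually evaluated on several cross-validation folds.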

Fig. 7 Run times of the top three best-performing methods, HFRF, SVM-Rad, and DNN, for all datasets

5 Conclusion

This study develops a method that performs satisfactorily for the regression task in the presence of categorical and interval-scale measurements in the dataset and focuses particularly on implementing the method for real estate price prediction, one of the important application areas where both types of measurements appear in the dataset. Most machine learning methods, especially distance-based methods, cannot handle the non-numerical information contained in the categorical variables and treat them as numerical, resulting in reduced performance. To address this issue, we use the generalized Minkowski distance along with hierarchical clustering, as introduced by [12], in the fuzzy regression functions method of [47] with support vector machines (SVMs) to handle the categorical variables better. This approach leads to the proposed hierarchical fuzzy regression functions (HFRF) method. SVMs with the radial kernel function are implemented at the inference stage of HFRFs to train the model. SVMs are trained with the training data and the membership degree of each observation to the clusters created by the hierarchical clustering with the generalized Minkowski distance. The clustering stage captures the information about similarities between observations. Specifically, it corresponds to market segmentation in the property price prediction problem. Since it is done within HFRFs without requiring user input, it relaxes the requirement for expert knowledge in segmentation. The SVMs’ hyperparameters are tuned for each dataset and cluster under the HFRF implementation to achieve the optimum performance for each market segment.

The HFRF method is applied to real estate pricing datasets from six diverse markets from different countries. The error magnitude and variability in prediction with HFRFs are benchmarked against linear regression, SVMs and adaptive neuro-fuzzy inference system (ANFIS) methods that are frequently used for real estate price prediction. In addition to these methods, deep neural networks (DNNs) are also considered in benchmarking as they are more general versions of ANNs with multiple layers. We can summarize the overall conclusions and recommendations of this study as follows:

  • The HFRF method produces 3.8% to 12.5% less prediction error than the second-best-performing SVMs with the radial kernel function (SVM-Rad).

  • The variation in the prediction errors is improved by between 2.8% and 14.1% for the datasets including at least one categorical variable. However, we did not record an improvement in the variability of prediction errors when there is no categorical variable.

  • While SVM-Rad is the second-best-performing method, DNNs perform close to it. However, ANFIS and linear regression methods do not deliver satisfactory real estate price prediction performance; hence, these methods are not recommended for use in practice. Consistent with the literature [41], linear regression methods perform better than ANFIS.

  • Performance gain with HFRFs depends on the sample size and the number of categorical variables. This translates into the balance between the amount of categorical information and the total information in the sample. As the weight of categorical information increases, HFRFs are expected to provide more gain in performance.

  • The computation time of HFRFs is sensitive to the sample size and the number of predictors. However, this does not pose a problem with the applicability of the HFRFs on large samples in a reasonable time frame.

  • HFRFs are strongly recommended when any categorical variable is in the dataset for regression tasks. HFRFs are still beneficial in the absence of any categorical variables in data. However, this study does not observe any gain in the variability of prediction errors by HFRF when there is no categorical variable in the dataset.

The study’s main limitation is that the conclusions given here are limited to the scenarios represented by the considered datasets. However, since the datasets are from six different markets with a sufficient variety of sample sizes, numbers of predictors, and numbers of categorical variables, the generalizability of the results is not a major concern.

Real estate price prediction datasets can also include outlier observations. Since we do not focus on the outlier problem in this study, we do not offer any conclusions on the performance of HFRF in the presence of outliers. Handling outliers in the presence of mixed predictor types is left for future work.