Identifying local differences with fused-MCP: an apartment rental market case study on geographical segmentation detection

As the variety and quality of spatial data increase in recent times, the potential to analyze local characteristics based on spatial data is getting stronger. Previous spatial analysis methods structuralize the spatial autocorrelation of data by the distances between data observation points and the contiguity of the data-observed regions. It is significant for the estimation of global characteristics of spatial data. However, these approaches are not suitable for identifying local differences from the data since they assume a smooth spatial autocorrelation structure. Generalized fused lasso, which can detect local differences in spatial data, has been proposed in machine learning studies. Its limitation is that the estimated parameters are biased toward zero; however, methods that overcome the limitation have also been proposed. Fused-MCP is one of those methods and is expected to be useful in spatial analyses. This study applies fused-MCP to spatial analyses. As an example of spatial analyses based on fused-MCP, this study analyzes the structure of geographical segmentation of the real estate market in central Tokyo. Fused-MCP is utilized to extract areas where the valuation standard is the same. The results reveal that the geographical segmentation displays hierarchal patterns. Specifically, the market is divided by municipalities, railway lines and stations, and neighborhoods. The case study confirmed the applicability of fused-MCP to spatial analyses.


Introduction
The recent development of information communication technologies and the progress of open data policies have increased the quality and variety of spatial data rich. Thus, the potential to analyze local characteristics based on spatial data is improving, and the methods for spatial analysis are in the spotlight.
Various methods have been proposed for spatial analysis; however, we limit the discussion to regression-based analyses. It is necessary to consider the spatial autocorrelation of data in the regression analyses. Therefore, three representative approaches-spatial regression models, spatial process models, and geographically weighted regression-have been developed. Spatial regression models (Anselin 1988), which include spatial error models, spatial lag models, and moving average models, aim to estimate common parameters of regression models in the target area from region-based independent and dependent variables, representing the structure of spatial autocorrelation by spatial weight matrices. The elements of a spatial weight matrix are set based on the proximity of regions, such as the contiguity of regions and the distance between regions. Spatial process models (Cressie 1991), which include ordinary kriging, universal kriging, and cokriging, aim to interpolate values at unobserved locations in the target area, representing the spatial autocorrelation structure based on the assumption of intrinsic stationarity or second-order stationarity and estimating the common parameters of regression models. Geographically weighted regression (Fotheringham et al. 2003) aims to estimate the heterogeneous parameters by locations, assuming that the location-based parameters are continuous in space. It is a weighted regression in which the weights are set by the functions of the distance between the location where parameters are estimated and the locations where data are observed.
The above-mentioned methods of spatial data analysis structuralize the spatial autocorrelation of data by the distances between data observation points and the contiguity of data observed regions. The modeling of spatial autocorrelation structure is significant for the estimation of global characteristics of spatial data. However, these approaches are not suitable for finding the local areas where the parameters are discontinuously different from their neighborhood areas since these approaches assume a smooth spatial autocorrelation structure.
Fused lasso is proposed by Tibshirani et al. (2005) as a method to detect change points in time-series data. It is based on the least absolute shrinkage and selection operator (lasso) (Tibshirani 1996). The lasso imposes l 1 regularization on an optimum parameter estimation problem, and it functions as a variable selection method. Fused lasso is an expansion of lasso; it penalizes the parameters themselves and the differences between consecutive parameters. Tibshirani and Taylor (2011) propose a generalized fused lasso, which expands the regularization of differences between consecutive parameters to the regularization of differences between adjacent parameters; the expansion enables one to consider the adjacency of regions in spatial analyses to detect the borders where adjacent parameters are not the same. The generalized fused lasso is applied to spatial analyses in Wang and Rodríguez (2014) and Inoue et al. (2018); it is confirmed to be a powerful method for detecting local differences.
However, Fan and Li (2001) point out that lasso estimators are biased, and several regularizations whose estimators are not biased have been proposed. One type of regularization is the adaptive lasso proposed by Zou (2006), which sets a different weight on each parameter in the l 1 penalty. The setting of adaptive weights requires initial parameter estimators to allocate small weights on the penalties of non-zero parameters and the large weights on the penalties of zero-parameters. Another type of regularization is the concave penalty function; the smoothly clipped absolute deviation (SCAD) penalty proposed by Fan and Li (2001) and the minimax concave penalty (MCP) proposed by Zhang (2010). However, the minimization of nonconvex penalized function is computationally challenging. The solution's uniqueness and the existence of local minimizers are not discussed for the SCAD penalties. Conversely, the MCP is defined to minimize the maximum concavity and is proven to have a unique path of the solution graph from the origin that all estimators are zero to an optimal fit, although there may be additional branches. Jing et al. (2018) expand the MCP to the fused-MCP, which penalizes the parameters themselves and the differences between adjacent parameters. Fused-MCP is expected to be useful in spatial analyses that aim to extract local differences.
Therefore, this study employs the fused-MCP in spatial analyses that aim to detect local differences in a target area. As an example of spatial analyses based on fused-MCP, this study introduces an analysis to identify the structure of geographical segmentation of an apartment rental market in central Tokyo and confirms the applicability of fused-MCP to the analysis.

Variable selection and parameter estimation with penalty functions
Suppose a linear regression model in which an n×1 dependent variable vector y depends on p explanatory variables of n×1 vectors x j (j = 1, …, p), and β j is a coefficient for explanatory variable x j , then where n is the number of observations, 1 ≡ 1 , … , p , 1 ≡ 1 , … , p T , and ε ⁓ N(0, σ 2 I n ). However, there are P (> p) candidates for explanatory variables x k (k = p + 1, …, P) that does not correlate with y. If there is no prior information on which regressors should be used in the model of Eq. (1), then the model is . This model presents a variable selection problem of finding out p explanatory variables with non-zero coefficients from P candidates. When the number of variables (1) = 1 1 + , (2) = 1 1 + 2 2 + = + , 1 3 P is small, it is possible to find a good set of explanatory variables based on the criteria, such as AIC (Akaike's Information Criterion) (Akaike 1973) and BIC (Bayesian Information Criterion) (Schwarz 1978). However, when the number of candidate explanatory variables, P, is large, the number of combinations of variables becomes enormous. Thus, the variable selection is not computationally feasible. Hence, to avoid the situation, the optimization methods with the regularization by continuous penalty functions are utilized. This section introduces lasso and its derivative, MCP.

Lasso
The lasso introduces l 1 regularization to the model (Tibshirani 1996). Here, we estimate the linear regression model of Eq. (2) by the lasso. Let ‖ · ‖ 2 denote the Euclidean norm (l 2 -norm) and ‖ · ‖ 1 denote the l 1 -norm. Then, the parameter estimation by lasso is formulated as where t is the positive lasso regularization parameter. The lasso promotes the sparsity of the model; it estimates that the parameters where the less substantial explanatory variables are zero. The solution of a constrained optimization problem of Eq.
(3) selects the substantial explanatory variables from among many candidates. Equation (3) is equivalent to where λ is a Lagrange multiplier that corresponds to t. When λ is large (and t is small), many parameters are estimated as zero. Tibshirani et al. (2005) expand the lasso to a fused lasso. It is a method to investigate the presence or absence of a difference between consecutive parameters by imposing a new regularization on the differences between consecutive parameters. Tibshirani and Taylor (2011) propose a generalized fused lasso and a path algorithm to estimate parameters. Sparse fused lasso is a type of generalized fused lasso that has the fused lasso penalty, the l 1 -norm on the difference between adjacent coefficients, and the l 1 -norm of the coefficients themselves. It is formulated as

Fused lasso
where C is the set of pairs of parameters that are defined to have an adjacency.
The l 1 -norm on the difference among adjacent parameters tends to estimate common parameters. Tibshirani and Taylor (2011) state that the sparse fused lasso gives an unbiased estimate of the number of nonzero fused groups. Hereafter, the sparse fused lasso of Eq. (5) is denoted as fused lasso in this study. Since the fused lasso has flexibility in setting the adjacency of parameters, it is easily applicable to spatial analyses by setting adjacency based on location. Tibshirani and Taylor (2011) show an example of applying fused lasso to spatial analyses. A linear regression model with a parameter of the constant term for each US state is estimated from the flu case data. Inoue et al. (2018) apply the fused lasso to an estimation of the linear regression model that represents real estate valuation to analyze the geographical segmentation of the real estate market. In addition to the linear regression models, the fused lasso is utilized in many types of models. Wang and Rodríguez (2014) utilize the fused lasso in a Poisson regression model without covariates to identify the spatial clusters of point events, and Choi et al. (2018) propose to utilize the fused lasso in a Poisson regression model with covariates and an estimation algorithm based on the Majorize-Minimization algorithm (MM algorithm) (Hunter and Li 2005). Parker et al. (2016) utilize it to estimate the spatial covariance function that is not spatially stationary. Sun et al. (2016) introduce a method to estimate spatial and temporal quantile function with a fused adaptive lasso by penalizing the difference between neighboring quantiles.

MCP and fused-MCP
As described above, the lasso-based variable selection has already been utilized in spatial analyses that aim to detect local differences and indicate its applicability. However, Fan and Li (2001) point out that the estimators by lasso are biased. Therefore, several penalty functions that not only have an ability in variable selection but also can estimate unbiased parameters have been proposed.
This section introduces one of the penalty functions, MCP (Zhang 2010), and its expansion, the fused-MCP (Jing et al. 2018). Zhang (2010) proposes the MCP and a penalized linear unbiased selection (PLUS) algorithm that estimates parameters in the minimization of the loss function with MCPs. The MCP is defined as with regularization hyperparameters γ > 0 and λ > 0. The MCP is obtained to minimize the maximum concavity defined by which is subject to

MCP
and where ̇(t; , ) = d (t; , ) dt . Equation (8) presents the condition that an MCP is not effective if t exceeds a threshold of γλ, and this leads to estimate unbiased parameters when t ≥ γλ. Equation (9) presents the condition that an MCP imposes the same penalty to the l 1 regularization term of ‖t‖ 1 ; thus, the MCP works as a variable selection method.
The parameter estimation of the model in Eq.
(2) by regularization using the MCP is Equation (10) is a non-convex optimization since the MCP is a concave function. However, since the minimum concavity of Eq. (7) is minimized to ( ) = 1∕ , the MCP provides sparse convexity to the broadest range.
A larger γ value leads to less unbiasedness and more concavity, and the MCP is equal to the l 1 penalty at γ = ∞ and is similar to the l 0 penalty as γ → 0+. Zhang (2010) proves that the MCP has selection consistency at the universal penalty level univ ≡ √ (2∕ n) log p. Jing et al. (2018) propose a fused-MCP, which penalizes both the parameters themselves and their adjacent difference by the MCP. This study focuses on the recovery of 1D and 2D-grid signal with noise and proposes models for each data type, as well as an algorithm to estimate the parameters. The model for denoising 1D signal is where y is an n × 1 vector of observed signal, θ is an n × 1 vector of true signal, and ε is an n × 1 vector of noise. Thus, the recovery of the signal, which is the estimation of θ, is defined as If λ 1 = 0, Eq. (12) is transformable to the minimization problem that is the same as that of the regularization of parameters themselves by the MCP. Thus, the PLUS algorithm can be applied to estimate the parameters. However, if λ 1 > 0, it is not possible to apply the PLUS algorithm. Jing et al. (2018) propose (8) (t; , ) = 0 ∀t ≥ , (9) (0 + ; , ) = ,

Fused-MCP
(11) = + , to solve the nonconvex optimization problem by the MM algorithm (Hunter and Lange 2004;Hunter and Li 2005) The MM algorithm is an iterative procedure that consists of majorization and minimization. Consider the problem that minimizes the concave function f (θ). At the m step, the current value of θ, θ (m) , is given. Then, find a surrogate function g (θ|θ (m) ) that is a convex function and majorizes f (θ): Then, minimize the surrogate function g (θ|θ (m) ) to find θ (m+1) . Since the following relationship, holds, the estimator approaches the local minima. If a surrogate function is set appropriately, the MM algorithm works efficiently. Jing et al. (2018) proposes a surrogate function of the MCP ρ (t; λ) that is a quadratic function with the minimum value at t = 0, The model for denoising the 2D-grid signal used in Jing et al. (2018) is where Y is an n × m matrix of an observed 2D-grid type signal (such as an image with noise), Θ is the n × m matrix of true signal to estimate the recovered image, and ε is the n × m matrix of disturbances at every pixel. Thus, the signal recovery based on MCP is written as where ‖·‖ F denotes the Frobenius norm. The model does not penalize parameters but their adjacent difference. The PLUS algorithm is also not applicable to this model; hence, the solution using the MM algorithm is proposed.

Estimation of a linear regression model by fused-MCP
This study applies the fused-MCP to the estimation of a linear regression model. The parameter estimation is formulated as This study implemented the estimation by the MM algorithm using the surrogate function of Eq. (15), following Jing et al. (2018).

Case study: identification of geographical segmentation of the apartment rental market in central Tokyo
As an example of the identification of local differences, this study introduces an analysis of geographical segmentation of the real estate market; it aims to detect the areas where the pricing is different from their neighbors. The real estate market is segmented by many aspects, including consumer types, property types, and environmental factors. It also forms submarkets in which the valuation bases differ from those in others. It is well-known that location plays a major role in the submarket formation. For example, people who value the accessibility to services of urban life prefer to live in city centers, and people who value commodious property and proximity to the natural environment prefer to live in the suburbs. The difference in consumer types between the city center and the suburbs affects property supply and valuation in both submarkets. Families who care about the education of their children choose school districts and municipalities where the level of education service is high. However, couples without a child do not consider educational service for their residential selection.
In this manner, not only the living environment but also the types of consumers and supplied residential properties are related to location. As a result, the real estate market is segmented geographically such that the value of attributes of real estate properties is different for each area. The valuation of attributes of real estate properties differs by neighborhoods, school districts, municipalities, and city center or suburbs. Therefore, the geographical segmentation in the real estate market has a hierarchical structure.

Previous studies on geographical segmentation of the real estate market
The geographical segmentation of the real estate market has attracted much research interest, and several approaches have been attempted to understand its structure.
One approach is the classification of the market based on the attributes of real estate properties. Bourassa et al. (1999), Wu and Sharma (2012), and Keskin and Watkins (2017) classify the real estate properties based on the principal component analysis of their attributes and analyze the relationship between the classified results and the locational characteristics.
Other approaches are based on the hedonic analysis derived from Rosen (1974). Kestens et al. (2006) utilized geographically weighted regression to estimate the location-dependent parameters of the real estate price model and discuss the spatial distribution of estimated parameters. However, most hedonic-based studies, such as Watkins (2001), Bourassa et al. (2003), Goodman and Thibodeau (1998, 2007, and Islam and Asami (2011), preset several types of geographical division units, estimated parameters for each unit in the settings of different geographical divisions, and selected the model that best fit the data to identify the structure of geographical segmentation. Their analyses are limited to the preset division structures, such as school districts, postal districts, and census tracts; they cannot identify the segmentation that is partially different from the preset division, and they do not consider the hierarchal division structure of real estate market. Kuroda et al. (2014), also based on the hedonic analysis, proposed to estimate the real estate price model and searched the division of submarkets by connecting the adjacent preset geographical units simultaneously by the simulated annealing. It is possible to find the division structure that is not fixed to the preset division. However, there are several limitations. The number of submarkets has to be given when executing the simulated annealing. Thus, if the number of geographical units for the preset division is large and there are many choices for the number of submarkets, it is quite difficult to find the best model that fits the data. Moreover, it is not possible to find the hierarchical division structure.
Thus, to overcome the limitation of previous researches, Inoue et al. (2018) applied the fused lasso for the estimation of a real estate price model. Similar to Kuroda et al. (2014), the structure of the geographical market segmentation is searched by connecting adjacent spatial units of the preset division. The difference is that it can analyze real estate price models with multiple levels of division structure and estimate parameters and division structure simultaneously by solving the optimization without relying on a randomized algorithm. A real estate price model with many regional explanatory variables that depend on different spatial resolutions is constructed, and areas where the valuation standard is the same are extracted by the implementation of the fused lasso. The proposed method was applied to rent data of apartments in the Tokyo metropolitan area, and the result reveals that the valuation standard is different from most of the municipalities. However, there are sets of municipalities where the valuation for the attributes of apartments are the same. The results also detect the neighborhood-level local disparity in the valuation; it, then, confirms that there is a hierarchical structure in the apartment rental market in the target area. However, Inoue et al. (2018) did not consider the lack of unbiasedness in the lasso estimation.
This study utilizes the fused-MCP for the estimation of the apartment rent model used in Inoue et al. (2018), although the target area is reduced due to the limitation in the computation power to solve the optimization problem with MCP.

Data and model of apartment rent
This study utilized apartment rent data from 2015 to 2016 collected by At Home Co., Ltd. The target area was central Tokyo, which consists of five municipalities (wards, "ku" in Japanese): Chiyoda, Chuo, Minato, Shinjuku, and Shibuya ( Fig. 1a).
High-rise condominiums whose number of floors exceeded 15 were excluded as their rents have a different pricing structure from the pricing of other apartments. If the rent data of apartments in the same building occupy the majority of data in the neighborhood, the estimated results of valuation in that neighborhood would be affected by the building. Thus, to avoid such issues, an apartment was randomly selected on the same floor in the same building each year. Consequently, the total number of records was 50,355. Figure 2 shows the spatial distribution of the apartment rent data.
The apartment rent model in this case study used the natural logarithm of monthly rent per square meter as the dependent variable. Five attributes of apartments, apartment age, area of property, walking time to the nearest station, floor number, and number of rooms were the explanatory variables to estimate the municipality-level parameters; in total, 25 parameters were set. The summary of attributes of apartments is shown in Table 1.
Three different levels of location factors that might affect the apartment rental market were considered: railway lines, the nearest railway stations, and neighborhoods ("cho" and "chome" in Japanese) (Fig. 1b). Dummy variables in this model  Table 2. Note that a station and a neighborhood whose average rent per square meter were the medians were selected as the reference, and dummy variables were not set for that station and neighborhood.
The following is the explanation of the hedonic apartment rent model settings in this study. Let y i(in m) denote a dependent variable that is the natural logarithm of rent of apartment i located in municipality m. Furthermore, β 0 denotes the intercept of the regression; β jm denotes the municipality-level regression coefficient of the j-th explanatory variables x ijm of apartment i in municipality m; M, L, S, and N denote sets of municipalities, railway lines, railway stations, and neighborhoods, respectively; x il Line , x is Stn , and x in Nbrhd denote the dummy variables for railway line l that stops at the nearest station, the nearest railway station s, and the neighborhood n of apartment i, respectively; and β l Line , β s Stn , and β n Nbrhd denote the parameters of railway line l, railway station s, and neighborhood n, respectively. Then, the apartment rent model was where ε i ⁓ N(0, σ 2 ). The total number of parameters was 651.
The following explanations are the settings of the regularization term. This study penalized all the parameters and the differences of municipality-level parameters between adjacent municipalities and between parameters of dummy variables of adjacent neighborhoods. However, the differences between parameters of connected railway lines and adjacent railway stations were not penalized since the parameters of railway line dummies and station dummies in this model were expected to represent the evaluation of accessibility as well as the environmental factors surrounding the station. It is natural to consider that the parameters of connected railway lines and parameters of neighbor railway stations are not necessarily the same.
The adjacency of municipalities was set if two municipalities shared borders. The adjacency of neighborhoods was set if two neighborhoods shared borders and belonged to the same municipalities and the same system of railway stations. Note that whether a certain neighborhood belongs to a certain system of railway  stations was decided based on the highest frequency of appearances in the nearest station attribute of the apartment data in the neighborhood. Since the apartments tend to be allocated to the major stations in their neighborhood, the shapes of systems of stations are different from those of Voronoi polygons whose generators are the stations. This setting constructs a hierarchical structure in the dummy variable settings and enables the analysis of the hierarchical structure of the apartment rental market from municipality to neighborhood levels. The numbers of pairs of adjacency municipalities and neighborhoods are shown in Table 3. Let y denote the vector of dependent variables. Furthermore, X denotes the matrix of explanatory variables; P denotes the set of all parameters; bs denotes the estimator of βs; P m denotes the set of types of municipality-level parameters; N m and N n denote sets of pairs of adjacent municipalities and neighborhoods, respectively; λ 1 , λ 2 , γ 1 , and γ 2 are the hyper-parameters. Equations (20) and (21) represent the estimation of parameters by the fused lasso and the fused-MCP, respectively.
The numeric explanatory variables in Eqs. (20) and (21) were scaled to have zero mean and unit variance. It is intended to adjust the scale of parameters; if the scale of parameters is different (the effect of penalties is different for the parameters), then it would lead to biased estimators.
When estimating parameters, many hyperparameter settings were tested. We selected the estimation results with the minimum AIC value, assuming that the disturbances of the apartment rent model of Eq. (19) were independently, identically, and normally distributed. Since the number of apartments in the dataset (50,355), were far larger than the number of parameters in the model (651), the parameter estimation by Eqs. (20) and (21) would be almost equivalent to the estimation by = arg min where l is the log-likelihood function. Then, the model selection by AIC would be fair, although the parameters are biased toward zero, especially in the estimation by the fused lasso.

Estimation results by two methods and comparison
This section introduces the estimation results by the fused lasso and the fused-MCP.

Estimation by fused lasso
In the estimation by the fused lasso, three settings of λ 2 /λ 1 = 0.1, 1, and 10 were tested. The calculation was done by the R package 'genlasso' that outputs a solution path for each setting. When λ 1 = 0.023 and λ 2 = 0.23 (λ 2 /λ 1 = 10), the model with 469 unique parameters out of 651 initially set parameters was adopted. Table 4 summarizes the numbers of parameters, and Table 5 shows the estimated values of municipality-level parameters. In this estimation, no municipality-level coefficients were estimated to be the same to the adjacent municipalities. Note that the estimated parameters of apartment age in Chiyoda and Shibuya were the same, but they are not contiguous. For reference, the adjusted coefficient of determination was 0.583 and the AIC was − 64647, although it underestimated the model since the estimated parameters were biased toward zero.

Estimation by fused-MCP
The estimation of parameters by the fused-MCP was achieved by solving Eq. (21) by the MM algorithm, thereby, setting the estimated parameters of fused lasso as initial values. The model has four hyperparameters: λ 1 , λ 2 , γ 1 , and γ 2 ; λ 1 and λ 2 mainly act as the weights on the penalties, and their relative size effects the balance between the penalties on parameters and the penalties on the difference of adjacent parameters. Furthermore, γ 1 and γ 2 mainly act as the shape of the MCP. Since there are no positive reasons for using the penalty functions with different shapes for the parameters and the differences of adjacent parameters, this study fixed γ 1 and γ 2 to the same value γ. Then, it took the following two-step procedure to decide the hyperparameters. At the first step, by fixing λ 1 and λ 2 to the same values λ, the balance between λ and γ was searched by the grid search. At the second step, by fixing γ, the balance between λ 1 and λ 2 was tested. Table 6 shows the results of the grid search of the hyperparameters λ and γ that minimize the AIC. After the broad grid search with the search range of λ from 0.1 to 100 and γ from 0.0001 to 0.1 was tested (Table 6a), the narrow grid search with the search range of λ from 1 to 10 at the interval of 1 and γ from 0.001 to 0.01 at the interval of 0.001 was executed (Table 6b). Consequently, the AIC was minimized when λ (= λ 1 = λ 2 ) = 2 and γ (= γ 1 = γ 2 ) = 0.007. Table 7 shows the results of the grid search of the hyperparameters λ 1 and λ 2 by fixing γ 1 = γ 2 = 0.007. Both hyperparameters were tested from 1.0 to 3.0 at the interval of 0.5, and it was confirmed that the hyperparameter setting of λ 1 = λ 2 = 2.0 minimized the AIC. For reference, the adjusted coefficient of determination was 0.585. Hereafter, the estimated results in this setting are introduced. Tables 8 and 9 show the summary of the estimated parameters and the estimation results of municipality-level parameters, respectively. The estimated parameters for dummy variables are shown in Figs. 3, 4 and 5. Figure 3 indicates the estimation results of railway related dummy variables (railway lines and stations), and Figs. 4 and 5 indicate the estimation results of neighborhood dummy variables.

Comparison of estimations by the fused lasso and the fused-MCP
It is confirmed that the estimation by the fused-MCP with the adjusted coefficient of determination of 0.585 fit the data better than that by the fused lasso with the adjusted coefficient of determination of 0.583, although the total number of unique parameters (Table 8) is smaller than the estimation by fused lasso (Table 4).
Comparing the estimated results of municipality-level parameters by the fused lasso (Table 5) and the fused-MCP (Table 9), it is confirmed that no parameters coincided with those of adjacent municipalities in the estimation by the fused lasso. However, parameters of three explanatory variables coincided in the estimation by the fused-MCP; the parameters for apartment age and walking time to the nearest station in Chiyoda and Chuo wards were estimated to be the same. Likewise, the parameters for the area of the property in Shinjuku and Shibuya wards were estimated to be the same. The results indicated that the same submarkets might be established across the borders of the municipalities.
It is worth mentioning that most of the absolute values of the estimators by the fused-MCP are larger than those by the fused lasso, especially when the absolute values of estimators are large. This result was caused by the characteristics of the MCP, which had no gradient when parameter values were larger than λγ, 0.014 in this estimation. It is expected that the estimated results that exceed this value (parameters of apartment ages, area of the property, and floor number) were unbiased. However, some results for parameters of walking time to the nearest station and the number of rooms did not exceed λγ; thus, the estimated parameters might be biased toward zero. Figure 4 presents the differences in the estimation results of neighborhood dummy variables by fused-MCP and fused lasso. It is evident that the number of non-zero parameters by fused-MCP is smaller than that by fused lasso. The estimation result by fused-MCP reveals the neighborhoods where apartment rents are distinguishable from their neighbors. The number of parameters that are common to the adjacent neighborhoods by fused-MCP is larger than that by fused lasso; the estimation by fused-MCP clearly indicates the structure of market segmentation by detecting the areas where the valuation of apartments is the same. The absolute values of estimated non-zero parameters tend to be larger by fused-MCP than those by fused lasso; this result corresponds to the discussion and results of Jing et al.'s (2018) simulation study.
From the comparison of estimated results by the two methods, the estimation by the fused-MCP has higher interpretability with smaller biases. It enables the selection of areas with local differences strictly by the regularization of parameters and the differences of adjacent parameters by the MCP.

Comparison of results by SCAD with fused conditions
This section briefly introduces SCAD (Fan and Li 2001), one of the other regularization methods that can output the non-biased estimates, applies it to the analysis of the apartment rent dataset, and compares the estimation result with the fused-MCP.
The gradient of penalty function of SCAD, p(|β|), is defined by When the parameter is zero, the gradient is λ, which is the same as that of the l 1 penalty with the weight parameter λ. When the absolute value of the parameter is larger than aλ, the gradient is zero; this means that no penalty is imposed when the parameter is large. This characteristic of the penalty function leads to the unbiased estimation for parameters that are not close to zero.
Although the SCAD is not applied to the fused conditions to the best of the author's knowledge, this section applies it to the apartment rent analysis. The estimation by SCAD is based on the MM algorithm by Hunter and Li (2005), which we utilize for the fused-MCP estimation. This study sets the hyperparameter a to 3.7 according to the discussion by Fan and Li (2001). Equation (25) represents the estimation of parameters by the SCAD, for some a > 2 and | | > 0.   Similar to the fused lasso, the model has two hyperparameters, λ 1 and λ 2 , and their relative size effects the balance between the penalties on parameters and the penalties on the difference of adjacent parameters. The balance between λ 1 and λ 2 was tested to decide the hyperparameters. Table 10 presents the results of the grid search of the hyperparameters λ 1 and λ 2 . The hyperparameter λ 1 was tested from 0.05 to 0.4 by the power of two, and λ 2 was tested from 0.0005 to 0.004 by the power of two, and it was confirmed that the hyperparameter setting of λ 1 = 0.1 and λ 2 = 0.001 minimized the AIC. For reference, the adjusted coefficient of determination was 0.587. Hereafter, the estimated results in this setting are introduced.
Tables 11 and 12 show the summary of the estimated parameters and the estimation results of municipality-level parameters by SCAD, respectively. The estimated parameters for the dummy variables are shown in Fig. 6. Figure 6 indicates the estimation results of the neighborhood dummy variables by SCAD.  The result by the SCAD with fused conditions shows that none of adjacent parameters are estimated to be the same. The AIC by the SCAD is not improved from that by fused lasso in this estimation. The apartment rent model in this study has many parameters and pairs of adjacency parameters that penalties are imposed on, and the minimization problem with non-convex penalties must have many local minima. Since the prevention of the problem caused by the non-convexity of the penalty function is not considered when proposing the SCAD, the possibility that the estimated result by the SCAD is one of local minima that is larger than the case by the fused-MCP. The difference in the estimated results by these two regularization methods may be what can output the non-biased estimates.

Discussions on results by fused-MCP
This section focuses on the parameters of location-related dummy variables by the fused-MCP estimation and discusses the revealed local differences in the apartment rental market in central Tokyo. Figure 3a shows the sum of the parameters of the station and railway lines that operate at the station; it presents the evaluation of accessibility and the neighborhood environment of the stations. It is observed that the stations located near the border of Chiyoda and Chuo and in the northern area of Minato are evaluated as high. The former stations are located close to the major central business districts (CBD) of Marunouchi and Ginza, and the latter stations are located close to the other CBD of Akasaka and the urban residential area of Aoyama. However, the stations in the eastern half of Chuo and the northern area of Shinjuku are evaluated as low. The former is an area populated with small buildings and is located in an environment not conducive to live in. The latter is the area densely populated with small detached houses and is located near the Kanda river that used to overflow often; it is far from the CBD and has a less conducive environment to live in within the target area. It seems that the estimated results demonstrated in Fig. 3a is quite reasonable. Figure 3b indicates the estimated parameters of railway line dummy variables. It features the extraction of railway lines in the target area and enables the analysis of whether railway lines might affect the popularity in the apartment market. The railway lines that run through the CBD area of Chiyoda and Chuo and the northern part of Minato, Ginza, and Chiyoda subway lines are evaluated as high, and Seibu Shinjuku line, which runs the northern part of Shinjuku, and Yurikamome line, which runs harbor districts, are evaluated as low. Figure 4c is the estimated parameters of railway station dummy variables. Since this is the rest of the evaluation of railway stations without the evaluation of railway lines, it is hard to interpret. Figure 4 indicates the estimated parameters of neighborhood dummy variables. A neighborhood is the minimum spatial unit in the analysis. The broad gray lines are the borders of the system of railway stations, and the neighborhood parameter represents the local differences in each system. Almost half of the parameters are estimated to be zero, and local differences are judged to be non-existent (Table 8). The analysis succeeds in selecting neighborhoods with substantial parameters from many among the candidates and reveals that apartment rents are locally homogeneous in most areas. However, many non-zero parameters are estimated, especially in Minato and Shibuya, the southwestern region of the target area. The results confirm that neighborhood-level local geographical segmentation occurs in these areas. It is confirmed that the fused-MCP can extract the local differences.
The shaded areas in the figure indicate the neighborhoods that have the same parameter values to the adjacent neighborhoods. By the regularization of the Fig. 6 Estimated parameters of neighborhood dummy variables by SCAD with fused conditions difference of parameters of adjacent neighborhoods, the combinations of neighborhoods where the valuation is the same is extracted by the fused-MCP estimation. It is also confirmed that the fused-MCP is capable of extracting areas with local differences without relying excessively on the preset spatial divisions by combining adjacent spatial units. Figure 5 is the enlarged view of Fig. 4. Figure 5a shows the area around Azabujuban station in Minato. It consists of 14 neighborhoods. Five neighborhoods in the northwestern part have the same positive value; five neighborhoods, mainly in the middle, do not have non-zero parameters; and the rest of the four neighborhoods have different negative values. The large redevelopment site of Roppingi Hills, where offices, shopping mall, hotels, and high-rise residential towers were provided, was northwest of the system of stations; the redevelopment has correspondingly affected neighboring apartment rents along the path from the station. Figure 5b shows the area around Shibuya station in Shibuya. It consists of 15 neighborhoods. Four neighborhoods on the north side of station form a group with positive parameter values, and three and two neighborhoods on the south side form two groups with negative parameter values. This figure visualizes the fact that the rental apartment market near Shibuya station is divided by north and south. The neighborhood with the highest parameter values of neighborhood dummy in this area is located the northwest of the station. It is the famous residential area of Shoto. These results confirmed that the local differences of apartment rent are clearly extracted in this analysis.

Summary
As mentioned above, the case study extracted local differences in the apartment market in central Tokyo. Although many local parameters were set in the model, the regularization of the parameters enables the detection of local areas where the pricing is different. Moreover, although the spatial division for the analysis is preset, the regularization of the differences of adjacent parameters enables the extraction of areas where the pricing is the same. In this case study, the model with hierarchical spatial divisions of municipalities, system of railway stations, and neighborhoods is estimated. The fused-MCP based analysis can set the model with flexibility. Thus, it is expected to be useful in other types of spatial analyses.

Conclusion
This study focused on the applicability of fused-MCP to the spatial analyses that detect local differences from spatial data. After introducing the variable selection methods (lasso and MCP), this study presented a case study to identify the geographical segmentation of the apartment rental market in central Tokyo.
In the case study, the rent model with many regional parameters to represent the difference in price formation by the municipality, the system of railway stations, and the neighborhood is constructed. Parameter estimation was performed by the fused-MCP to extract substantial parameters and to search for common parameters in adjacent regions. It detected the hierarchical structure of the apartment rental market in which pricing is different by several levels of spatial divisions. The case study confirmed the applicability of the fused-MCP to spatial analysis that aims to search local differences.
There are several challenges to be solved to utilize fused-MCP. The most important problem is the absence of an efficient estimation algorithm. The PLUS algorithm can output a solution path of the MCP. However, it does not apply to the fused-MCP. This case study utilized the MM algorithm, but it only outputs a solution for each hyperparameter setting. Since the model used in the case study had four hyperparameters, it was impossible to set the optimum values. Another challenge is the criterion of model selection. This case study utilized AIC under the assumption that disturbances are independently and identically normally distributed, considering that the number of observations is much larger than that of parameters. However, it would be inappropriate when the number of observations is small. Information criterion, such as EBIC (Chen and Chen 2008), should be tested. From the perspective of a spatial analyst, it would be better if the more flexible setting for the adjacent parameters is applied. If the spatial units are not contiguous but are located nearby, the parameters might be equal. The question of how to set the adjacency of parameters is difficult to answer.
Despite the challenges, sparse modeling, which includes lasso and MCP, is a promising analysis method for spatial data. As is introduced earlier, the fused-lasso is being utilized in spatial analysis, especially in the analyses to detect spatial clusters of data on point events that focus on local differences. Sparse modeling is also useful in the regression analysis that focuses on local differences; however, few studies have utilized it in spatial analysis. We hope that this study would be helpful in the introduction of the methods.