1 Introduction

In this study, we focus on the heterogeneity latent in data. Spatial data, for example, are heterogeneous in the sense that each individual is affected by its location in space (e.g., area or position). For such data, it is important how this heterogeneity is expressed, and it is also of interest to identify homogeneous clusters. We therefore consider not only modeling heterogeneity but also clustering for homogeneity.

In the case of modeling heterogeneity, we can apply a varying coefficient model and estimate its coefficients by weighted local estimation. Geographically weighted regression (GWR; Brunsdon et al. 1996) is one popular method for spatial data analysis (e.g., Geng et al. 2011; Lu et al. 2011; Wang and Wang 2020). GWR estimates the coefficients for each sample point with weights based on distances between sample points, and provides a flexible estimation. However, applying GWR to large-sample data is difficult in terms of computational cost, because obtaining the weights requires calculating the distances between all pairs of sample points and the method also requires optimization of the location and bandwidth of a kernel function. Furthermore, estimation must be performed again to obtain predictive values for future observations, and GWR cannot perform clustering. The general varying coefficient model can be considered to have similar disadvantages.

In this paper, we consider a model with discrete varying coefficients. Specifically, we consider a problem involving m groups that are homogeneous group-wise, for which we have a dataset \(({\varvec{y}}_j, {\varvec{X}}_j)\) for each group \(j\ (\in \{ 1, \ldots , m \})\), where \({\varvec{y}}_j\) is an \(n_j\)-dimensional vector of a response variable and \({\varvec{X}}_j\) is an \(n_j \times k\) matrix of explanatory variables. For spatial data, we can obtain the m groups by splitting the space under analysis into m subspaces. (However, note that we do not consider only spatial data herein.) Then, a linear regression model for group j may be expressed in the form

$$\begin{aligned} {\varvec{y}}_j = {\varvec{X}}_j {\varvec{\beta }}_j + {\varvec{\varepsilon }}_j, \end{aligned}$$
(1)

where \({\varvec{\beta }}_j\) is a k-dimensional vector of regression coefficients and \({\varvec{\varepsilon }}_j\) is an \(n_j\)-dimensional vector of an error variable. We assume that \(\mathop {\textrm{rank}}({\varvec{X}}_j) = k < n_j\) and that the first column of \({\varvec{X}}_j\) is \({\varvec{1}}_{n_j}\), where \({\varvec{1}}_n\) is the n-dimensional vector of ones. The simplest strategy for estimating \({\varvec{\beta }}_j\) is the ordinary least squares (OLS) method; applying this method, the OLS estimator takes the form

$$\begin{aligned} \hat{{\varvec{\beta }}}_j = {\varvec{M}}_j^{-1} {\varvec{c}}_j; \quad {\varvec{M}}_j = {\varvec{X}}_j' {\varvec{X}}_j, \quad {\varvec{c}}_j = {\varvec{X}}_j' {\varvec{y}}_j \quad (j = 1, \ldots , m). \end{aligned}$$
(2)
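For concreteness, the group-wise OLS estimators in (2) can be computed in R along the following lines; this is a minimal sketch assuming the data are stored as lists `X` and `y` of length m (the object and function names are ours, not those of any package).

```r
# Minimal sketch of the group-wise OLS estimators in (2).
# X: list of length m; X[[j]] is the n_j x k design matrix of group j
# y: list of length m; y[[j]] is the n_j-dimensional response vector of group j
ols_by_group <- function(X, y) {
  lapply(seq_along(X), function(j) {
    M_j <- crossprod(X[[j]])          # M_j = X_j' X_j
    c_j <- crossprod(X[[j]], y[[j]])  # c_j = X_j' y_j
    drop(solve(M_j, c_j))             # OLS estimator M_j^{-1} c_j
  })
}
```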

Then, it is of interest to determine which groups have (or do not have) homogeneity. For example, groups 1 and 2 may be homogeneous, in which case we would like the estimates of \({\varvec{\beta }}_1\) and \({\varvec{\beta }}_2\) to be equal. Fused Lasso (Tibshirani et al. 2005) is a technique for establishing this sort of relationship between pairs of unknown parameters with adjacent subscripts. For model (1), we may apply group fused Lasso (e.g., see Alaíz et al. 2013; Bleakley and Vert 2011; Qian and Su 2016), an extension of fused Lasso in which the solution is obtained as the minimizer of a penalized residual sum of squares (PRSS) of the form

$$\begin{aligned} \sum _{j=1}^m \Vert {\varvec{y}}_j - {\varvec{X}}_j {\varvec{\beta }}_j \Vert ^2 + \lambda \sum _{j=1}^{m-1} \Vert {\varvec{\beta }}_j - {\varvec{\beta }}_{j+1} \Vert , \end{aligned}$$

where \(\lambda\) is a non-negative tuning parameter. The optimization problem for group fused Lasso may be reduced to that for group Lasso (Yuan and Lin 2006), which is a group selection version of Lasso (Tibshirani 1996), and can then be solved (Bleakley and Vert 2011). However, although group fused Lasso can address one-to-one relationships between groups (such as those between groups 1 and 2, or between groups 2 and 3), it has no provision for one-to-many relationships, such as a relationship between group 1 and groups 2, 3, and 4. Here, we fill this gap with generalized group fused Lasso (GGFL). Specifically, we estimate \({\varvec{\beta }}_1, \ldots , {\varvec{\beta }}_m\) by minimizing the following PRSS:

$$\begin{aligned} \sum _{j=1}^m \Vert {\varvec{y}}_j - {\varvec{X}}_j {\varvec{\beta }}_j \Vert ^2 + \lambda \sum _{j=1}^m \sum _{\ell \in D_j} w_{j \ell } \Vert {\varvec{\beta }}_j - {\varvec{\beta }}_\ell \Vert , \end{aligned}$$
(3)

where \(D_j \subseteq \{ 1, \ldots , m \} \backslash \{j\}\) is the index set expressing adjacency relationships among the m groups and \(w_{j \ell }\) is a positive weight based on adaptive Lasso (Zou 2006). Note that GGFL is also known as network Lasso (Hallac et al. 2015). When \(D_j = \{ j+1 \}\ (j = 1, \ldots , m-1)\), \(D_m = \emptyset\), and \(w_{j \ell } = 1\), GGFL coincides with group fused Lasso. Here, we assume that \(\ell \in D_j \Leftrightarrow j \in D_\ell\) and \(w_{j \ell } = w_{\ell j}\). GGFL may also be viewed as a group version of generalized fused Lasso (e.g., Ohishi et al. 2021; Xin et al. 2014). There is also generalized fused group Lasso (Cao et al. 2018), but it treats the equality of parameters element-wise, unlike GGFL.
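As an illustration, the PRSS (3) can be evaluated in R as sketched below, assuming the coefficients, adjacency sets \(D_j\), and weights \(w_{j\ell}\) are stored as lists `beta`, `D`, and `W` (the names are our own illustrative choices).

```r
# Sketch of the GGFL penalized residual sum of squares (3).
# beta: list of k-dimensional coefficient vectors beta_1, ..., beta_m
# D:    list of integer vectors; D[[j]] is the adjacency set D_j
# W:    list of numeric vectors; W[[j]][i] is the weight w_{j, D[[j]][i]}
prss <- function(y, X, beta, D, W, lambda) {
  rss <- sum(sapply(seq_along(y), function(j) {
    sum((y[[j]] - X[[j]] %*% beta[[j]])^2)
  }))
  pen <- sum(sapply(seq_along(beta), function(j) {
    sum(vapply(seq_along(D[[j]]), function(i) {
      W[[j]][i] * sqrt(sum((beta[[j]] - beta[[D[[j]][i]]])^2))
    }, numeric(1)))
  }))
  rss + lambda * pen
}
```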

GGFL can be seen as a discrete version of GWR. GWR handles spatial information by using distances between sample points, while GGFL handles it by using adjacency relations among subspaces. Moreover, GGFL requires only the optimization of a single tuning parameter. Hence, we can expect GGFL to perform well even for large-sample data. GGFL also has the advantage that re-estimation for future observations is not needed. On the other hand, there is a concern that GGFL sacrifices some of the flexibility of estimation that is an advantage of GWR. We can maintain flexibility by splitting the space under analysis as finely as possible. Although this may reduce the estimation accuracy, we can expect GGFL to avoid such a reduction by joining several parameter vectors together.

The GGFL optimization problem, like that of group fused Lasso, may be reduced to the group Lasso optimization problem; however, a unique solution cannot be obtained in this way. Although we can also apply ADMM (Boyd et al. 2011) to solve the optimization problem, as in Hallac et al. (2015), ADMM is not fast and, owing to numerical error, the estimates it produces cannot be exactly equal. The latter problem, in particular, prevents clustering. Thus, in this work, we propose a GGFL optimization strategy based on a coordinate descent algorithm (CDA). A CDA updates the solution along each coordinate direction; it is not known for speed, but it can be applied even to large-sample data. CDAs have been applied to many optimization problems related to Lasso, e.g., Lasso itself (Fu 1998), group Lasso (Yuan and Lin 2006), fused Lasso (Friedman et al. 2007), and generalized fused Lasso (Ohishi et al. 2021). Indeed, the CDA of Ohishi et al. (2021) performed well for large-sample data for which the R package genlasso (e.g., Arnold and Tibshirani 2019), which is based on Tibshirani and Taylor (2011), runs out of memory. Therefore, we expect a CDA to perform well for the GGFL optimization problem. Because the GGFL objective involves multiple non-differentiable points, we first derive a condition for the objective function to attain the minimum at a non-differentiable point. When this condition is not satisfied, we numerically search for the minimizer using its gradient. If a non-differentiable point minimizes the objective function, the corresponding parameter vectors are equal; hence, no numerical error arises and a CDA can provide exactly equal estimates. Furthermore, we show that the solution for each coordinate direction converges to the optimal solution.

The remainder of this paper is organized as follows. In Sect. 2, we propose a CDA for solving the GGFL optimization problem. We also discuss the convergence of our CDA. In Sect. 3, we conduct simulations to characterize the performance of GGFL, and then present an application case study involving actual data. Section 4 is the conclusion of this paper. Technical details are provided in the Appendix.

2 Main results

In this section, we discuss a CDA for minimizing the PRSS (3). The CDA computes the minimizer by repeatedly minimizing the following objective function for \({\varvec{\beta }}_j\ (j \in \{ 1, \ldots , m \})\), in which all terms that do not depend on \({\varvec{\beta }}_j\) are neglected:

$$\begin{aligned} f_j ({\varvec{\beta }}_j) = {\varvec{\beta }}_j' {\varvec{M}}_j {\varvec{\beta }}_j - 2 {\varvec{c}}_j' {\varvec{\beta }}_j + \sum _{\ell \in D_j} \lambda _{j \ell } \Vert {\varvec{\beta }}_j - \hat{{\varvec{\beta }}}_\ell \Vert , \end{aligned}$$
(4)

where \(\lambda _{j \ell } = 2 \lambda w_{j \ell }\) and \(\hat{{\varvec{\beta }}}_\ell\) indicates that \({\varvec{\beta }}_\ell\) is fixed. We note that \(f_j ({\varvec{\beta }}_j)\) fails to be differentiable at \({\varvec{\beta }}_j \in \mathcal {R}_j = \{ \hat{{\varvec{\beta }}}_\ell \mid \ell \in D_j \}\). Here, we assume that \(\lambda _{j \ell } > 0\) and that \(\forall \ell _1, \ell _2 \in D_j, \ell _1 \ne \ell _2 \Rightarrow \hat{{\varvec{\beta }}}_{\ell _1} \ne \hat{{\varvec{\beta }}}_{\ell _2}\). The latter assumption involves no loss of generality: if there exist \(\ell _1, \ell _2 \in D_j\ (\ell _1 \ne \ell _2)\) such that \(\hat{{\varvec{\beta }}}_{\ell _1} = \hat{{\varvec{\beta }}}_{\ell _2}\), we can make the replacements \(\lambda _{j \ell _1} \leftarrow \lambda _{j \ell _1} + \lambda _{j \ell _2}\) and \(D_j \leftarrow D_j \backslash \{ \ell _2 \}\). In this work, we first determine whether the objective function \(f_j ({\varvec{\beta }}_j)\) attains the minimum at some \({\varvec{\beta }}_j \in \mathcal {R}_j\). If not, we proceed to seek the minimizer numerically.

2.1 Conditions indicating the presence of groups to be joined

Let \(f ({\varvec{z}})\ ({\varvec{z}}\in \mathbb {R}^q)\) be a convex function that is differentiable for \({\varvec{z}}\ne {\varvec{z}}_0\) and define a set \(\mathcal {A}_q\) by

$$\begin{aligned} \mathcal {A}_q = \{{\varvec{\alpha }}\in \mathbb {R}^q \mid \Vert {\varvec{\alpha }}\Vert = 1\}. \end{aligned}$$
(5)

Then, a subdifferential of f at \(\tilde{{\varvec{z}}}\) and a one-sided directional derivative of f at \(\tilde{{\varvec{z}}}\) with respect to \({\varvec{\alpha }}\in \mathcal {A}_q\) are respectively defined by

$$\begin{aligned} \partial f (\tilde{{\varvec{z}}})&= \left\{ {\varvec{u}}\in \mathbb {R}^q \mid f ({\varvec{z}}) \ge f (\tilde{{\varvec{z}}}) + {\varvec{u}}' ({\varvec{z}}- \tilde{{\varvec{z}}})\ (\forall {\varvec{z}}\in \mathbb {R}^q) \right\} , \\ Df (\tilde{{\varvec{z}}}, {\varvec{\alpha }})&= \lim _{\delta \rightarrow +0} \dfrac{ f (\tilde{{\varvec{z}}} + \delta {\varvec{\alpha }}) - f (\tilde{{\varvec{z}}}) }{ \delta }. \end{aligned}$$

They have the following relations (e.g., see Rockafellar 1970, Parts V and VI):

$$\begin{aligned}&{\varvec{u}}\in \partial f (\tilde{{\varvec{z}}}) \Longleftrightarrow \forall {\varvec{\alpha }}\in \mathcal {A}_q,\ Df (\tilde{{\varvec{z}}}, {\varvec{\alpha }}) \ge {\varvec{u}}' {\varvec{\alpha }}, \\&\forall {\varvec{z}}\in \mathbb {R}^q,\ f ({\varvec{z}}) \ge f (\tilde{{\varvec{z}}}) \Longleftrightarrow {\varvec{0}}_q \in \partial f (\tilde{{\varvec{z}}}). \end{aligned}$$

Hence, a necessary and sufficient condition for \(f ({\varvec{z}})\) to attain the minimum at \({\varvec{z}}={\varvec{z}}_0\) is

$$\begin{aligned} \forall {\varvec{\alpha }}\in \mathcal {A}_q,\ Df ({\varvec{z}}_0, {\varvec{\alpha }}) \ge 0. \end{aligned}$$
(6)

We now use this condition to derive a condition for the objective function (4) to achieve the minimum at a non-differentiable point.

Equation (4) is essentially equal to the following function:

$$\begin{aligned} f ({\varvec{\beta }}) = {\varvec{\beta }}' {\varvec{M}}{\varvec{\beta }}- 2 {\varvec{c}}' {\varvec{\beta }}+ \sum _{j=1}^r \lambda _j \Vert {\varvec{\beta }}- {\varvec{b}}_j \Vert , \end{aligned}$$
(7)

where \({\varvec{M}}\) is a \(k \times k\) positive definite matrix, \({\varvec{b}}_1, \ldots , {\varvec{b}}_r \in \mathbb {R}^k\) are distinct vectors, and \(\lambda _1, \ldots , \lambda _r\) are positive values. By substituting \({\varvec{\theta }}+ {\varvec{M}}^{-1} {\varvec{c}}\) for \({\varvec{\beta }}\) and \({\varvec{a}}_j + {\varvec{M}}^{-1} {\varvec{c}}\) for \({\varvec{b}}_j\) in (7), and dividing by 2, we obtain

$$\begin{aligned} \tilde{f} ({\varvec{\theta }}) = \dfrac{1}{2} {\varvec{\theta }}' {\varvec{M}}{\varvec{\theta }}+ \sum _{j=1}^r \psi _j \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert - \dfrac{1}{2} {\varvec{c}}' {\varvec{M}}^{-1} {\varvec{c}}, \end{aligned}$$
(8)

where \(\psi _j = \lambda _j / 2\ (> 0)\). Since \({\varvec{b}}_1, \ldots , {\varvec{b}}_r\) are distinct vectors, \({\varvec{a}}_1, \ldots , {\varvec{a}}_r\) are also distinct vectors. Let \({\varvec{\beta }}^\star\) and \({\varvec{\theta }}^\star\) be the minimizers of \(f ({\varvec{\beta }})\) and \(\tilde{f} ({\varvec{\theta }})\), respectively, i.e.,

$$\begin{aligned} {\varvec{\beta }}^\star = \arg \min _{{\varvec{\beta }}\in \mathbb {R}^k} f ({\varvec{\beta }}), \quad {\varvec{\theta }}^\star = \arg \min _{{\varvec{\theta }}\in \mathbb {R}^k} \tilde{f} ({\varvec{\theta }}). \end{aligned}$$
(9)

Notice that \({\varvec{\beta }}^\star = {\varvec{\theta }}^\star + {\varvec{M}}^{-1} {\varvec{c}}\). Hence, we consider minimizing \(\tilde{f} ({\varvec{\theta }})\) instead of \(f ({\varvec{\beta }})\).

A one-sided directional derivative of \(\tilde{f}\) at \({\varvec{a}}_s\ (s \in D = \{ 1, \ldots , r \})\) with respect to \({\varvec{\alpha }}\in \mathcal {A}_k\) is given by

$$\begin{aligned} D \tilde{f} ({\varvec{a}}_s, {\varvec{\alpha }}) = {\varvec{a}}_s' {\varvec{M}}{\varvec{\alpha }}+ \psi _s + \sum _{j \ne s}^r \psi _j \dfrac{ ({\varvec{a}}_s - {\varvec{a}}_j)' {\varvec{\alpha }}}{ \Vert {\varvec{a}}_s - {\varvec{a}}_j \Vert }. \end{aligned}$$

Hence, from (6), \(\tilde{f} ({\varvec{\theta }})\) attains the minimum at \({\varvec{\theta }}= {\varvec{a}}_s\) if and only if for all \({\varvec{\alpha }}\in \mathcal {A}_k\), \(\psi _s \ge - {\varvec{v}}_s' {\varvec{\alpha }}\), where \({\varvec{v}}_s\) is given by

$$\begin{aligned} {\varvec{v}}_s = {\varvec{M}}{\varvec{a}}_s + \sum _{j \ne s}^r \dfrac{ \psi _j }{ \Vert {\varvec{a}}_s - {\varvec{a}}_j \Vert } ({\varvec{a}}_s - {\varvec{a}}_j) = {\varvec{M}}{\varvec{b}}_s - {\varvec{c}}+ \dfrac{1}{2} \sum _{j \ne s}^r \dfrac{ \lambda _j }{ \Vert {\varvec{b}}_s - {\varvec{b}}_j \Vert } ({\varvec{b}}_s - {\varvec{b}}_j). \end{aligned}$$
(10)

Also, noting that \(\Vert {\varvec{\alpha }}\Vert = 1\), from the Cauchy–Schwarz inequality, we find that \(- {\varvec{v}}_s' {\varvec{\alpha }}\le \Vert {\varvec{v}}_s \Vert\), with equality holding only for \({\varvec{\alpha }}= -{\varvec{v}}_s / \Vert {\varvec{v}}_s \Vert\). From these results, we obtain the following theorem.

Theorem 1

The function \(\tilde{f} ({\varvec{\theta }})\) in (8) attains the minimum at the non-differentiable point \({\varvec{\theta }}= {\varvec{a}}_s\) if there exists \(s \in D\) such that \(\psi _s \ge \Vert {\varvec{v}}_s \Vert\).

From the theorem, the condition for \(f_j ({\varvec{\beta }}_j)\ (j \in \{ 1, \ldots , m \})\) in (4) to attain the minimum at \({\varvec{\beta }}_j = \hat{{\varvec{\beta }}}_s\ (s \in D_j)\) is

$$\begin{aligned} \lambda _{j s} \ge \Vert {\varvec{v}}_{j, s} \Vert , \quad {\varvec{v}}_{j, s} = 2 ({\varvec{M}}_j \hat{{\varvec{\beta }}}_s - {\varvec{c}}_j) + \sum _{\ell \in D_j \backslash \{s\}} \dfrac{ \lambda _{j \ell } }{ \Vert \hat{{\varvec{\beta }}}_s - \hat{{\varvec{\beta }}}_\ell \Vert } (\hat{{\varvec{\beta }}}_s - \hat{{\varvec{\beta }}}_\ell ). \end{aligned}$$
(11)

If there exists \(s \in D_j\) satisfying the above condition, then the estimates for \({\varvec{\beta }}_j\) and \({\varvec{\beta }}_s\) will be exactly equal, indicating that groups j and s have homogeneity and these are joined together.
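A minimal R sketch of the check of condition (11) for a single group j is given below; the current neighbor estimates are assumed to be stored in the list `beta_hat`, and all names are our own illustrative choices.

```r
# Sketch: check condition (11) for group j.
# M_j, c_j:  X_j' X_j and X_j' y_j for group j
# beta_hat:  list of current estimates; beta_hat[[l]] is the estimate of group l
# Dj:        integer vector of neighbor indices D_j
# lam:       vector of lambda_{j,l} = 2 * lambda * w_{j,l}, aligned with Dj
# Returns an s in Dj for which (11) holds (so beta_j is joined to beta_s),
# or NA if no such neighbor exists.
check_fusion <- function(M_j, c_j, beta_hat, Dj, lam) {
  for (i in seq_along(Dj)) {
    s  <- Dj[i]
    bs <- beta_hat[[s]]
    v  <- 2 * (M_j %*% bs - c_j)                 # first term of v_{j,s} in (11)
    for (i2 in seq_along(Dj)[-i]) {
      d <- bs - beta_hat[[Dj[i2]]]
      v <- v + lam[i2] * d / sqrt(sum(d^2))      # remaining terms of v_{j,s}
    }
    if (lam[i] >= sqrt(sum(v^2))) return(s)      # condition (11)
  }
  NA_integer_
}
```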

2.2 Searching for the minimizer in the absence of groups to be joined

Assume that the objective function \(\tilde{f} ({\varvec{\theta }})\) in (8) does not attain the minimum at any non-differentiable point \({\varvec{\theta }}\in \mathcal {R}= \{ {\varvec{a}}_1, \ldots , {\varvec{a}}_r \}\), that is, the condition of Theorem 1 is not satisfied for any \(s \in D\). Then we can search for the minimizer by using the gradient of \(\tilde{f} ({\varvec{\theta }})\), which takes the form

$$\begin{aligned} {\varvec{g}}({\varvec{\theta }}) = \dfrac{ \partial }{ \partial {\varvec{\theta }}} \tilde{f} ({\varvec{\theta }}) = {\varvec{M}}{\varvec{\theta }}+ \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } ({\varvec{\theta }}- {\varvec{a}}_j)\quad ({\varvec{\theta }}\notin \mathcal {R}). \end{aligned}$$
(12)

By solving the equation \({\varvec{g}}({\varvec{\theta }}) = {\varvec{0}}_k\), we can derive the following update equation:

$$\begin{aligned} {\varvec{\theta }}^\text {new} = {\varvec{\eta }}({\varvec{\theta }}^\text {old}), \quad {\varvec{\eta }}({\varvec{\theta }}) = {\left\{ \begin{array}{ll} {\varvec{Q}}({\varvec{\theta }})^{-1} \displaystyle \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } {\varvec{a}}_j &{}({\varvec{\theta }}\notin \mathcal {R}) \\ {\varvec{\theta }}&{}({\varvec{\theta }}\in \mathcal {R}) \end{array}\right. }, \end{aligned}$$
(13)

where \({\varvec{Q}}({\varvec{\theta }})\) is defined for \({\varvec{\theta }}\notin \mathcal {R}\) by

$$\begin{aligned} {\varvec{Q}}({\varvec{\theta }})&= {\varvec{M}}+ s ({\varvec{\theta }}) {\varvec{I}}_k, \quad s ({\varvec{\theta }}) = \sum _{j = 1}^r \dfrac{\psi _j}{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert }\ ({\varvec{\theta }}\notin \mathcal {R}). \end{aligned}$$
(14)

Note that we define \({\varvec{\eta }}({\varvec{\theta }}) = {\varvec{\theta }}\) for \({\varvec{\theta }}\in \mathcal {R}\) since \({\varvec{g}}({\varvec{\theta }})\) is not defined for \({\varvec{\theta }}\in \mathcal {R}\). Although we would like to use the update given by (13) to obtain the minimizer, the update stops whenever \({\varvec{\theta }}^\text {new} \in \mathcal {R}\). Therefore, points \({\varvec{\theta }}\) such that \({\varvec{\eta }}({\varvec{\theta }}) \in \mathcal {R}\) must be removed from the search area. Of course, it is very rare that there exists \(j \in D\) such that \({\varvec{\eta }}({\varvec{\theta }}) = {\varvec{a}}_j\) when \({\varvec{\theta }}\ne {\varvec{a}}_j\), so in practice there is no need to consider removing such points. However, to guarantee convergence to the minimizer theoretically, we consider these situations as well. Specifically, if \({\varvec{\eta }}({\varvec{\theta }}) = {\varvec{a}}_j\) does occur, we shift the updated point slightly away from \({\varvec{a}}_j\). Let \({\varvec{\theta }}_0\) be some initial vector. Then, we define the update equation minimizing \(\tilde{f} ({\varvec{\theta }})\) as follows:

$$\begin{aligned} {\varvec{\theta }}_{i+1} = {\left\{ \begin{array}{ll} {\varvec{\eta }}({\varvec{\theta }}_i) &{}({\varvec{\eta }}({\varvec{\theta }}_i) \notin \mathcal {R}) \\ {\varvec{\eta }}\left( {\varvec{a}}_j - d_j {\varvec{\alpha }}({\varvec{v}}_j) / 5 \right) &{}({\varvec{\eta }}({\varvec{\theta }}_i) = {\varvec{a}}_j) \end{array}\right. } \quad (i = 0, 1, \ldots ), \end{aligned}$$
(15)

where \(d_j\) and \({\varvec{\alpha }}({\varvec{\theta }})\) are respectively given by

$$\begin{aligned} d_j = \dfrac{\psi _j}{\sum _{\ell =1}^r \psi _\ell } \min _{\ell \in D \backslash \{ j \}} \Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert , \quad {\varvec{\alpha }}({\varvec{\theta }}) = {\varvec{\theta }}/ \Vert {\varvec{\theta }}\Vert . \end{aligned}$$
(16)

By the following theorem, the update given by (15) guarantees that the solution converges to the minimizer (the proof is given in Appendix A).

Theorem 2

Suppose that \(\tilde{f} ({\varvec{\theta }})\) in (8) does not attain the minimum at any non-differentiable points \({\varvec{\theta }}\in \mathcal {R}\), i.e., \({\varvec{\theta }}^\star \notin \mathcal {R}\). For any initial vector \({\varvec{\theta }}_0\), \({\varvec{\theta }}_i\) in (15) converges to \({\varvec{\theta }}^\star\) as \(i \rightarrow \infty\).

Although the update equation (15) specifies one particular way of shifting \({\varvec{\eta }}({\varvec{\theta }}_i)\) when \({\varvec{\eta }}({\varvec{\theta }}_i) = {\varvec{a}}_j\), this is only one choice for which Theorem 2 holds.

Now, we consider minimizing \(f_j ({\varvec{\beta }}_j)\ (j \in \{ 1, \ldots , m \})\) in (4) when \(f_j ({\varvec{\beta }}_j)\) does not attain the minimum at any non-differentiable points \({\varvec{\beta }}_j \in \mathcal {R}_j = \{ \hat{{\varvec{\beta }}}_\ell \mid \ell \in D_j \}\). We define \({\varvec{\eta }}_j ({\varvec{\beta }}_j)\) and \(d_{j, \ell }\ (\ell \in D_j)\) by

$$\begin{aligned} {\varvec{\eta }}_j ({\varvec{\beta }}_j) = {\left\{ \begin{array}{ll} {\varvec{Q}}_j ({\varvec{\beta }}_j)^{-1} {\varvec{h}}_j ({\varvec{\beta }}_j) &{}({\varvec{\beta }}_j \notin \mathcal {R}_j) \\ {\varvec{\beta }}_j &{}({\varvec{\beta }}_j \in \mathcal {R}_j) \end{array}\right. }, \quad d_{j, \ell } = \dfrac{\lambda _{j \ell }}{\sum _{s \in D_j} \lambda _{j s}} \left( \min _{s \in D_j \backslash \{ \ell \}} \Vert \hat{{\varvec{\beta }}}_\ell - \hat{{\varvec{\beta }}}_s \Vert \right) , \end{aligned}$$

where \({\varvec{Q}}_j ({\varvec{\beta }}_j)\) and \({\varvec{h}}_j ({\varvec{\beta }}_j)\) are respectively defined for \({\varvec{\beta }}_j \notin \mathcal {R}_j\) by

$$\begin{aligned} {\varvec{Q}}_j ({\varvec{\beta }}_j)&= 2 {\varvec{M}}_j + \sum _{\ell \in D_j} \dfrac{\lambda _{j \ell }}{ \Vert {\varvec{\beta }}_j - \hat{{\varvec{\beta }}}_\ell \Vert } {\varvec{I}}_k, \quad {\varvec{h}}_j ({\varvec{\beta }}_j) = 2 {\varvec{c}}_j + \sum _{\ell \in D_j} \dfrac{ \lambda _{j \ell } }{ \Vert {\varvec{\beta }}_j - \hat{{\varvec{\beta }}}_\ell \Vert } \hat{{\varvec{\beta }}}_\ell . \end{aligned}$$

Then, the update equation minimizing \(f_j ({\varvec{\beta }}_j)\) is given by

$$\begin{aligned} {\varvec{\beta }}_j^\text {new} = {\left\{ \begin{array}{ll} {\varvec{\eta }}_j ({\varvec{\beta }}_j^\text {old}) &{}\left( {\varvec{\eta }}_j ({\varvec{\beta }}_j^\text {old}) \notin \mathcal {R}_j \right) \\ {\varvec{\eta }}_j \left( \hat{{\varvec{\beta }}}_\ell - d_{j, \ell } {\varvec{\alpha }}({\varvec{v}}_{j, \ell }) / 5 \right) &{}\left( {\varvec{\eta }}_j ({\varvec{\beta }}_j^\text {old}) = \hat{{\varvec{\beta }}}_\ell \right) \end{array}\right. }. \end{aligned}$$
(17)

From Theorem 2, the solution obtained by the above converges to the minimizer of \(f_j ({\varvec{\beta }}_j)\).
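The fixed-point update \({\varvec{\eta }}_j\) can be sketched in R as follows. For brevity, the sketch omits the rare shift in (17) for the case \({\varvec{\eta }}_j ({\varvec{\beta }}_j^\text {old}) \in \mathcal {R}_j\) and assumes the current \({\varvec{\beta }}_j\) does not coincide with any neighbor estimate; the argument names match the earlier sketch.

```r
# Sketch of the fixed-point update eta_j in (17) for one group j, used when
# no neighbor satisfies condition (11). Arguments are as in check_fusion().
update_beta_j <- function(beta_j, M_j, c_j, beta_hat, Dj, lam,
                          tol = 1e-8, maxit = 1000L) {
  k <- length(beta_j)
  for (it in seq_len(maxit)) {
    # weights lambda_{j,l} / ||beta_j - beta_hat_l|| for each neighbor l in D_j
    w <- vapply(seq_along(Dj), function(i) {
      lam[i] / sqrt(sum((beta_j - beta_hat[[Dj[i]]])^2))
    }, numeric(1))
    Q <- 2 * M_j + sum(w) * diag(k)                       # Q_j(beta_j)
    h <- 2 * c_j                                          # h_j(beta_j)
    for (i in seq_along(Dj)) h <- h + w[i] * beta_hat[[Dj[i]]]
    b_new <- drop(solve(Q, h))
    if (sqrt(sum((b_new - beta_j)^2)) < tol) return(b_new)
    beta_j <- b_new
  }
  beta_j
}
```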

2.3 Coordinate descent algorithm for GGFL

For fused Lasso and generalized fused Lasso, Friedman et al. (2007) and Ohishi et al. (2021), respectively, proposed CDAs. These papers note the following phenomenon: when several groups are joined together at intermediate stages of the algorithm, the optimization process can get stuck, with the corresponding objective-function values stagnating on subsequent iterations, so that the algorithm fails to achieve minimization. To avoid this difficulty, following those two papers, our CDA incorporates two cycles: a descent cycle and a fusion cycle.

The descent cycle repeatedly minimizes \(f_j ({\varvec{\beta }}_j)\) in (4) for \(j \in \{ 1, \ldots , m \}\). More specifically, for GGFL, the descent cycle proceeds according to the following algorithm.

Algorithm 1

(Descent cycle for GGFL)

Step 1. For \(f_j ({\varvec{\beta }}_j)\), check the condition (11). If there exists \(s \in D_j\) such that the condition holds, update the solution as \(\hat{{\varvec{\beta }}}_j = \hat{{\varvec{\beta }}}_s\); otherwise repeat the update given by (17) until it converges.

Step 2. Repeat Step 1 for all \(j \in \{1, \ldots , m\}\).

The fusion cycle is designed to avoid the phenomenon in which the joining of groups during the descent cycle causes the optimization process to get stuck, obstructing the progress of the algorithm. Suppose that, following a descent cycle, we have solutions \(\hat{{\varvec{\beta }}}_1, \ldots , \hat{{\varvec{\beta }}}_m\), among which the distinct vectors are \(\hat{{\varvec{\xi }}}_1, \ldots , \hat{{\varvec{\xi }}}_t\ (t < m)\). Then, we can obtain index sets \(E_\ell = \{ j \in \{ 1, \ldots , m \} \mid \hat{{\varvec{\xi }}}_\ell = \hat{{\varvec{\beta }}}_j \}\ (\ell = 1, \ldots , t)\) satisfying \(\cup _{\ell =1}^t E_\ell = \{ 1, \ldots , m \}\), \(E_\ell \ne \emptyset\), and \(E_\ell \cap E_j = \emptyset \ (\ell \ne j)\). Moreover, define \(D_\ell ^*\subseteq \{1, \ldots , t\} \backslash \{\ell \}\ (\ell \in \{ 1, \ldots , t \})\) and \(w_{\ell i}^*\ (\ell \in \{ 1, \ldots , t\};\ i \in D_\ell ^*)\) as follows:

$$\begin{aligned} D_\ell ^*&= \left\{ s \in \{ 1, \ldots , t \} \backslash \{ \ell \} \mid E_s \cap F_\ell \ne \emptyset \right\} , \quad F_\ell = \bigcup _{j \in E_\ell } D_j \backslash E_\ell , \\ w_{\ell i}^*&= \sum _{(j,s) \in \mathcal {J}_{\ell i}} w_{j s}, \quad \mathcal {J}_{\ell i} = \bigcup _{j \in E_\ell } \{ j \} \times (E_i \cap D_j), \end{aligned}$$

where \(D_\ell ^*\) satisfies \(D_\ell ^*\ne \emptyset\) and \(s \in D_\ell ^*\Leftrightarrow \ell \in D_s^*\). In words, \(D_\ell ^*\) is the set of fused-group indexes that have an adjacency relationship with fused group \(\ell\). The two terms in the PRSS (3) may be rewritten as follows (see Ohishi et al. 2021):

$$\begin{aligned} \sum _{j=1}^m \Vert {\varvec{y}}_j - {\varvec{X}}_j {\varvec{\beta }}_j \Vert ^2&= \sum _{\ell =1}^t \sum _{j \in E_\ell } \Vert {\varvec{y}}_j - {\varvec{X}}_j {\varvec{\xi }}_\ell \Vert ^2, \\ \sum _{j=1}^m \sum _{\ell \in D_j} w_{j \ell } \Vert {\varvec{\beta }}_j - {\varvec{\beta }}_\ell \Vert&= 2 \sum _{i \in D_\ell ^*} w_{\ell i}^*\Vert {\varvec{\xi }}_\ell - {\varvec{\xi }}_i \Vert + \sum _{j \notin E_\ell } \sum _{i \in D_j \backslash E_\ell } w_{j i} \Vert {\varvec{\beta }}_j - {\varvec{\beta }}_i \Vert . \end{aligned}$$

Thus, excluding from the PRSS (3) all terms that do not depend on \({\varvec{\xi }}_\ell\), we may express the objective function for the fusion cycle as

$$\begin{aligned} f_\ell ^*({\varvec{\xi }}_\ell )&= {\varvec{\xi }}_\ell ' \left( \sum _{j \in E_\ell } {\varvec{M}}_j \right) {\varvec{\xi }}_\ell - 2 \left( \sum _{j \in E_\ell } {\varvec{c}}_j \right) ' {\varvec{\xi }}_\ell + \sum _{i \in D_\ell ^*} \lambda _{\ell i}^*\Vert {\varvec{\xi }}_\ell - \hat{{\varvec{\xi }}}_i \Vert , \end{aligned}$$

where \(\lambda _{\ell i}^*= 2 \lambda w_{\ell i}^*\) and \(\hat{{\varvec{\xi }}}_i\) expresses that \({\varvec{\xi }}_i\) is fixed. In the fusion cycle, the descent cycle for \(f_\ell ^*({\varvec{\xi }}_\ell )\) is executed. Notice that \(f_\ell ^*({\varvec{\xi }}_\ell )\) is essentially equal to (7). Hence, we can minimize \(f_\ell ^*({\varvec{\xi }}_\ell )\) similarly to the minimization of (4).
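As an illustration of the bookkeeping needed for the fusion cycle, the sets \(E_\ell\), \(D_\ell^*\), and the weights \(w_{\ell i}^*\) can be constructed from the current estimates as sketched below (a minimal sketch; under the CDA, joined estimates are exactly equal, so grouping by value is well defined, and all names are ours).

```r
# Sketch: bookkeeping for the fusion cycle.
# beta_hat: list of current estimates; D, W: adjacency sets and weights as before.
fusion_sets <- function(beta_hat, D, W) {
  key <- sapply(beta_hat, paste, collapse = ",")
  lab <- as.integer(factor(key, levels = unique(key)))     # cluster label of each group
  t_  <- max(lab)
  E   <- lapply(seq_len(t_), function(l) which(lab == l))  # index sets E_l
  Dstar <- vector("list", t_)
  Wstar <- vector("list", t_)
  for (l in seq_len(t_)) {
    nb <- setdiff(unique(lab[unlist(D[E[[l]]])]), l)       # adjacent fused groups D_l^*
    Dstar[[l]] <- nb
    # w*_{l,i}: sum of the weights w_{j,s} over j in E_l and s in E_i intersect D_j
    Wstar[[l]] <- vapply(nb, function(i) {
      sum(unlist(Map(function(dj, wj) sum(wj[lab[dj] == i]), D[E[[l]]], W[E[[l]]])))
    }, numeric(1))
  }
  list(label = lab, E = E, Dstar = Dstar, Wstar = Wstar)
}
```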

Assembling the steps discussed above yields the following CDA for GGFL.

Algorithm 2

(Coordinate descent algorithm for GGFL)

Step 1. Choose initial vectors for \({\varvec{\beta }}_1, \ldots , {\varvec{\beta }}_m\) and tuning parameter \(\lambda\).

Step 2. Execute the descent cycle.

Step 3. If any groups were joined, execute the fusion cycle. If not, proceed to Step 4.

Step 4. Repeat Steps 2 and 3 until the solution converges.
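Putting the earlier sketches together, a high-level skeleton of Algorithm 2 might look as follows. It is only an illustration built on the hypothetical helpers `ols_by_group`, `check_fusion`, `update_beta_j`, and `fusion_sets` defined above; for simplicity it runs the fusion cycle unconditionally and omits some safeguards (such as the shift in (17)), and it is not the code of the GGFL package.

```r
# High-level sketch of Algorithm 2 (coordinate descent for GGFL).
ggfl_cda <- function(y, X, D, W, lambda, tol = 1e-6, maxit = 100L) {
  M  <- lapply(X, crossprod)                  # M_j = X_j' X_j
  cc <- Map(crossprod, X, y)                  # c_j = X_j' y_j
  beta_hat <- ols_by_group(X, y)              # Step 1: initial vectors
  for (it in seq_len(maxit)) {
    beta_old <- beta_hat
    for (j in seq_along(beta_hat)) {          # Step 2: descent cycle
      lam_j <- 2 * lambda * W[[j]]
      s <- check_fusion(M[[j]], cc[[j]], beta_hat, D[[j]], lam_j)
      beta_hat[[j]] <- if (!is.na(s)) beta_hat[[s]] else
        update_beta_j(beta_hat[[j]], M[[j]], cc[[j]], beta_hat, D[[j]], lam_j)
    }
    fs <- fusion_sets(beta_hat, D, W)         # Step 3: fusion cycle on f_l^*
    xi <- lapply(fs$E, function(e) beta_hat[[e[1]]])
    for (l in seq_along(xi)) {
      Ml <- Reduce(`+`, M[fs$E[[l]]])
      cl <- Reduce(`+`, cc[fs$E[[l]]])
      lam_l <- 2 * lambda * fs$Wstar[[l]]
      s <- check_fusion(Ml, cl, xi, fs$Dstar[[l]], lam_l)
      xi[[l]] <- if (!is.na(s)) xi[[s]] else
        update_beta_j(xi[[l]], Ml, cl, xi, fs$Dstar[[l]], lam_l)
    }
    beta_hat <- lapply(fs$label, function(l) xi[[l]])
    conv <- max(mapply(function(a, b) sqrt(sum((a - b)^2)), beta_hat, beta_old))
    if (conv < tol) break                     # Step 4: convergence check
  }
  beta_hat
}
```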

Because GGFL depends on \(\lambda\), for practical applications, it is important to optimize the value of \(\lambda\). Let \(\hat{{\varvec{\beta }}}_\text {max}\) denote the solution with all groups joined; that is, \(\hat{{\varvec{\beta }}}_\text {max}\) is the OLS estimator for \({\varvec{\beta }}_j\) obtained by setting \({\varvec{\beta }}_1 = \cdots = {\varvec{\beta }}_m\). Then we define \(\lambda _{\max }\) as follows:

$$\begin{aligned} \lambda _{\max }= \max _{j \in \{ 1, \ldots , m \}} \dfrac{ \Vert {\varvec{M}}_j \hat{{\varvec{\beta }}}_\text {max} - {\varvec{c}}_j \Vert }{ \sum _{\ell \in D_j} w_{j \ell } }. \end{aligned}$$
(18)

The significance of \(\lambda _{\max }\) may be understood by noting that the solution obtained for \(\lambda = \lambda _{\max }\) satisfies \(\hat{{\varvec{\beta }}}_1 = \cdots = \hat{{\varvec{\beta }}}_m = \hat{{\varvec{\beta }}}_\text {max}\). Thus, we execute Algorithm 2 for each candidate \(\lambda\) value in the range \((0,\lambda _{\max }]\) and select the optimal \(\lambda\) value. For example, the optimal \(\lambda\) value can be selected based on minimizing a model selection criterion.
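For reference, \(\lambda_{\max}\) in (18) and a geometric candidate grid such as the one used in Sect. 3 can be computed as sketched below (again using the illustrative list-based data layout; the function names are ours).

```r
# Sketch: lambda_max of (18) and a candidate grid on (0, lambda_max].
lambda_max_ggfl <- function(X, y, W) {
  Xall <- do.call(rbind, X)
  yall <- unlist(y)
  beta_max <- drop(solve(crossprod(Xall), crossprod(Xall, yall)))  # OLS with all groups joined
  vals <- sapply(seq_along(X), function(j) {
    num <- sqrt(sum((crossprod(X[[j]]) %*% beta_max - crossprod(X[[j]], y[[j]]))^2))
    num / sum(W[[j]])
  })
  max(vals)
}

# e.g. the geometric grid used in Sect. 3: lambda_max * (3/4)^j, j = 0, ..., 99
lambda_grid <- function(lam_max, n_cand = 100) lam_max * (3/4)^(0:(n_cand - 1))
```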

3 Numerical studies

In this section, we investigate our CDA via simulation data and real data. The numerical calculations are executed in R (ver. 4.1.1) on a computer with the Windows 10 Pro operating system, an Intel (R) Core (TM) i9-9900 processor, and 128 GB of RAM. The R code is available in the R package GGFL (https://github.com/ohishim/GGFL).

3.1 Simulation

3.1.1 Performance of GGFL

For this simulation, we construct simulation spaces similar to those used in Ohishi et al. (2021). We consider two problem sizes, with group counts \(m=10\) and \(m=20\) and the adjacency relationships depicted in Figs. 1 or 2 (for \(m=10\)) and Figs. 3 or 4 (for \(m=20\)).

Fig. 1 True join configurations in case 1 for \(m = 10\) (\(m^*= 3\))

Fig. 2 True join configurations in case 2 for \(m = 10\) (\(m^*= 6\))

Fig. 3 True join configurations in case 1 for \(m = 20\) (\(m^*= 6\))

Fig. 4 True join configurations in case 2 for \(m = 20\) (\(m^*= 12\))

For example, for the \(m=10\) problem, group 1 is adjacent to groups 2, 3, 4, 5, and 6, i.e., \(D_1 = \{ 2, 3, 4, 5, 6 \}\). For each m value, we consider two possible cases of group-join configurations; each configuration is specified by enumerating the true sets \(E_\ell \ (\ell = 1, \ldots , m^*)\) of joined groups, where \(m^*\in \{ 3, 6 \}\) for \(m = 10\) and \(m^*\in \{ 6, 12 \}\) for \(m = 20\) (here \(m^*\) is the true number of sets of joined groups), as follows:

  • \(m = 10\), case 1:

    $$\begin{aligned} E_1 = \{ 1, 2, 3\},\quad E_2 = \{ 4, 5, 6, 9, 10 \},\quad E_3 = \{ 7, 8 \}. \end{aligned}$$
  • \(m = 10\), case 2:

    $$\begin{aligned} E_1&= \{ 1, 3\},&E_2&= \{ 2 \},&E_3&= \{ 4, 6, 10 \}, \\ E_4&= \{ 5 \},&E_5&= \{ 7, 8 \},&E_6&= \{ 9 \}. \end{aligned}$$
  • \(m = 20\), case 1:

    $$\begin{aligned} E_1&= \{ 1, 2, 3 \},&E_2&= \{ 4, 5, 6 \},&E_3&= \{ 7, 8, 19, 20 \}, \\ E_4&= \{ 9, 10, 12, 13 \},&E_5&= \{ 11, 14, 15, 16 \},&E_6&= \{ 17, 18 \}. \end{aligned}$$
  • \(m = 20\), case 2:

    $$\begin{aligned} E_1&= \{ 1 \},&E_2&= \{ 2, 3 \},&E_3&= \{ 4 \},&E_4&= \{ 5, 6 \}, \\ E_5&= \{ 7, 8 \},&E_6&= \{ 9, 10 \},&E_7&= \{ 11 \},&E_8&= \{ 12, 13 \}, \\ E_9&= \{ 14, 15, 16 \},&E_{10}&= \{ 17, 18 \},&E_{11}&= \{ 19 \},&E_{12}&= \{ 20 \}. \end{aligned}$$

These group-join configurations are illustrated in Figs. 1, 2, 3 and 4. In this section, we use simulation data to assess whether the GGFL technique successfully determines the true join configuration in each case.

Table 1 MSE results for case 1
Table 2 MSE results for case 2

Let \({\varvec{X}}\) be an \(n \times k\) matrix defined by \({\varvec{X}}= ({\varvec{X}}_1', \ldots , {\varvec{X}}_m')' = ({\varvec{1}}_n, {\varvec{X}}_0 {\varvec{\Psi }}(0.5)^{1/2})\), where \({\varvec{X}}_0\) is an \(n \times (k-1)\) matrix whose elements are independently and identically distributed according to \(U (-1, 1)\), and \({\varvec{\Psi }}(\rho )\) is the symmetric matrix of order \(k-1\) whose (i, j)th element is \(\rho ^{|i-j|}\). Then, our simulation data are generated from \({\varvec{y}}_j \sim N_{n_j} ({\varvec{X}}_j {\varvec{\beta }}_j, {\varvec{I}}_{n_j})\ (j = 1, \ldots , m)\), where \({\varvec{\beta }}_j\) is defined by \({\varvec{\beta }}_j = \ell {\varvec{1}}_k \ (j \in E_\ell ;\ \ell = 1, \ldots , m^*)\). Here, we set \(n_1 = \cdots = n_m = n_0 \in \{ 50, 100, 200 \}\), i.e., \(n = m n_0\). In our simulation, we evaluate not only the selection probability (SP) of the true join configuration but also the mean square error (MSE). For \(\hat{{\varvec{\beta }}} = (\hat{{\varvec{\beta }}}_1', \ldots , \hat{{\varvec{\beta }}}_m')'\) and \(\hat{{\varvec{y}}} = (\hat{{\varvec{y}}}_1', \ldots , \hat{{\varvec{y}}}_m')'\), we compute the following two MSEs:

$$\begin{aligned} \mathop {\textrm{MSE}}\nolimits _{\beta } [\hat{{\varvec{\beta }}}]&= \mathop {\textrm{E}}\nolimits \left[ \sum _{j = 1}^m \Vert {\varvec{\beta }}_j - \hat{{\varvec{\beta }}}_j \Vert ^2 \right] / (k m), \quad \mathop {\textrm{MSE}}\nolimits _{y} [\hat{{\varvec{y}}}] = \mathop {\textrm{E}}\nolimits \left[ \sum _{j = 1}^m \Vert {\varvec{y}}_j - \hat{{\varvec{y}}}_j \Vert ^2 \right] / n, \end{aligned}$$

where \(\hat{{\varvec{y}}}_j\) is given by \(\hat{{\varvec{y}}}_j = {\varvec{X}}_j \hat{{\varvec{\beta }}}_j\). The expectation in each MSE is evaluated by 10,000 Monte Carlo repetitions. We consider three estimators for \({\varvec{\beta }}_j\):

  • GGFL: The estimator produced by the GGFL method proposed in this paper.

  • OLS 1: The OLS estimators defined by (2), i.e., \(\hat{{\varvec{\beta }}}_j = {\varvec{M}}_j^{-1} {\varvec{c}}_j\).

  • OLS 2: The OLS estimator for a common set \({\varvec{\beta }}_1 = \cdots = {\varvec{\beta }}_m\), i.e., \(\hat{{\varvec{\beta }}}_j = \hat{{\varvec{\beta }}}_{\max }\).
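Returning to the data-generating process described above, a minimal R sketch is given below; the Cholesky factor is used in place of the symmetric square root \({\varvec{\Psi }}(0.5)^{1/2}\), and all names are our own.

```r
# Sketch of the simulation design described above.
# E: list of true index sets E_1, ..., E_{m*}; n0: common group size; k: number of coefficients
simulate_ggfl_data <- function(E, n0, k, rho = 0.5) {
  m    <- length(unlist(E))
  n    <- m * n0
  Psi  <- rho^abs(outer(seq_len(k - 1), seq_len(k - 1), "-"))
  X0   <- matrix(runif(n * (k - 1), -1, 1), n, k - 1)
  Xall <- cbind(1, X0 %*% chol(Psi))               # X = (1_n, X0 Psi(0.5)^{1/2})
  X    <- lapply(seq_len(m), function(j) Xall[((j - 1) * n0 + 1):(j * n0), , drop = FALSE])
  beta <- vector("list", m)
  for (l in seq_along(E)) for (j in E[[l]]) beta[[j]] <- rep(l, k)  # beta_j = l * 1_k
  y    <- lapply(seq_len(m), function(j) drop(X[[j]] %*% beta[[j]] + rnorm(n0)))
  list(X = X, y = y, beta = beta)
}
```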

Table 3 Selection probability (%)

Denoting the OLS 1 estimators by \(\tilde{{\varvec{\beta }}}_1, \ldots , \tilde{{\varvec{\beta }}}_m\), we use \(w_{j \ell } = \Vert \tilde{{\varvec{\beta }}}_j - \tilde{{\varvec{\beta }}}_\ell \Vert ^{-1}\ (j = 1, \ldots , m,\ \ell \in D_j)\) as the GGFL penalty weights. We select the optimal tuning parameter using the EGCV criterion (Ohishi et al. 2020) with penalty strength \(\log n\), where the candidates are the 100 points \(\lambda _{\max }(3/4)^j\ (j = 0, \ldots , 99)\) and \(\lambda _{\max }\) is given by (18). The MSE values for cases 1 and 2 are tabulated in Tables 1 and 2, respectively, in which bold font indicates the smallest value for each setting. From the tables, we can see that the MSE values for GGFL and OLS 1 decrease as n increases, whereas those for OLS 2 do not necessarily decrease. In addition, for most settings, GGFL achieves the smallest MSE values of all the estimation methods considered, for both estimated and predicted values. There are, however, three settings in which OLS 1 achieves the smallest MSE values. The reason GGFL is inferior to OLS 1 in these settings is that the true join configuration could not be selected because n is not large enough; this can be confirmed in Table 3, which summarizes the SP values. The results suggest that the GGFL method may be consistent in selecting the true join configuration. Moreover, there are three settings in which the SP values are less than 100%, and in these settings GGFL is inferior to OLS 1 in terms of MSE.

3.1.2 Comparison with ADMM

Fig. 5 Difference of the objective function for case 1

Fig. 6 Difference of the objective function for case 2

In this section, we compare our CDA with ADMM (Hallac et al. 2015; the details are given in Appendix C) in the same situations as in the previous section. The two algorithms are evaluated with respect to runtime and the minimum value of the objective function at a fixed \(\lambda\). Here, we set \(\lambda = \zeta \lambda _{\max }\ (\zeta \in \{ 1/1000, 1/100, 1/10 \})\). Let \(\hat{f}_\text {CDA}\) and \(\hat{f}_\text {ADMM}\) be the minimum values of the objective function obtained by the CDA and ADMM, respectively. Figures 5 and 6 show the differences \(\hat{f}_\text {ADMM} - \hat{f}_\text {CDA}\) over 100 repetitions for each \(\zeta\). As seen from the figures, most differences are positive and they become larger as \(\zeta\) increases. This means that our CDA minimizes the objective function better than ADMM and that the gap widens as \(\lambda\) increases. We conjecture that this is because the estimates obtained by ADMM are not exactly equal. Indeed, when \(\lambda\) is very small, the differences are very small because most estimates are not equal; when \(\lambda\) is large, many estimates should be equal, and the differences are correspondingly large. Furthermore, the degrees of freedom of the estimates obtained by ADMM were always equal to the number of parameters. Therefore, when the optimal tuning parameter is selected by minimizing a model selection criterion such as the EGCV criterion, the selected value is always 0 and the GGFL penalty becomes meaningless.

Table 4 Runtime comparison
Table 5 The number of iterations until convergence
Fig. 7 Tracking convergence process

Table 4 shows the runtimes of the CDA and ADMM. From the table, we can see that our CDA optimizes the objective function much faster than ADMM. Moreover, the table shows that the runtime of the CDA becomes shorter as n increases, while that of ADMM does not. This fact is related to the number of iterations until convergence, which is given in Table 5. From Tables 4 and 5, we can see that the runtime and the number of iterations are strongly related. Recall that the objective function (3) has penalty weights based on the OLS estimator. Since the OLS estimator becomes stable as n increases and the penalty weights then perform well, the number of iterations for the CDA decreases as n increases. Although the same argument applies to ADMM, ADMM involves three sets of parameter vectors: not only \({\varvec{\beta }}\) but also \({\varvec{\gamma }}\) and \({\varvec{\xi }}\). For \({\varvec{\beta }}\) and \({\varvec{\gamma }}\), the same reasoning as for the CDA applies. However, the convergence of \({\varvec{\beta }}\) and \({\varvec{\gamma }}\) depends on \({\varvec{\xi }}\), and we do not know whether the convergence of \({\varvec{\xi }}\) becomes stable as n increases. Figure 7 shows convergence paths for \(m=10\), \(k = 20\), and \(\lambda = \lambda _{\max } / 100\), in which the y-axis shows the distance between the solution at each iteration and the optimal solution and the x-axis shows the number of iterations. The distance is defined as follows: for any parameter vector \({\varvec{\mu }}\), write the solution at the ith iteration as \({\varvec{\mu }}_i\) and the optimal solution as \(\hat{{\varvec{\mu }}}\). Then, we define the distance at the ith iteration as

$$\begin{aligned} \Vert {\varvec{\mu }}_i - \hat{{\varvec{\mu }}} \Vert / \Vert {\varvec{\mu }}_0 - \hat{{\varvec{\mu }}} \Vert \quad (i = 0, 1, \ldots ), \end{aligned}$$

where \({\varvec{\mu }}_0\) is the initial vector. From the figure, we can see that \({\varvec{\beta }}\) heads directly toward convergence in both ADMM and the CDA, while the paths of \({\varvec{\gamma }}\) and \({\varvec{\xi }}\) in ADMM are bumpy. Here, the lines for \({\varvec{\beta }}\) in the CDA are truncated at 4 because the CDA converged at the fourth iteration for all values of n. The \({\varvec{\gamma }}\) moves away from the optimal solution only once, at the first iteration, and this distance becomes larger as n increases. We conjecture that the reason relates to the ADMM constraint \({\varvec{\beta }}_j = {\varvec{\gamma }}_{j \ell }\): although the initial vectors satisfy the constraint, it breaks down as soon as \({\varvec{\beta }}\) or \({\varvec{\gamma }}\) is updated even once. Afterwards, \({\varvec{\gamma }}\) heads directly toward convergence like \({\varvec{\beta }}\), but its speed is slow. The \({\varvec{\xi }}\) also moves away from the optimal solution in the middle of the process. Hence, we conclude that \({\varvec{\gamma }}\) and \({\varvec{\xi }}\) prevent steady convergence.

3.2 A real data example

In this section, we present an application of the method proposed in this paper to actual data. The dataset we use is similar to that used in Ohishi et al. (2021); it consists of rental prices—and additional data describing environmental conditions—for studio apartments in Tokyo’s 23 wards as observed between April 2014 and April 2015. The dataset, compiled by Tokyo Kantei Co., Ltd., has a sample size of \(n = 61{,}999\) and contains \(m = 852\) groups; the specific data items it covers are listed in Table 6, where Y and A1 through A4 are continuous variables, and B1 through B6 are dummy variables that take the value of 1 or 0.

Table 6 Data items

For this dataset, the territory covered by Tokyo’s 23 wards was divided into 852 geographical subregions, corresponding to the 852 groups in the dataset. In our analysis, we take the monthly apartment rent as the response variable and use all other data items as explanatory variables; however, problems arise when the dummy variables are modeled group-wise. For example, if all apartments in group j have a parking lot, a rank deficiency occurs in the design matrix for group j. For this reason, the dummy variables are common to all groups. More specifically, our modeling proceeds as follows. For group j, let \({\varvec{y}}_j\) be an \(n_j\)-dimensional vector of the response variable, let \({\varvec{X}}_j\) be an \(n_j \times 5\) matrix of explanatory variables whose first column is \({\varvec{1}}_{n_j}\) and whose remaining four columns correspond to items A1 through A4, and let \({\varvec{Z}}_j\) be an additional \(n_j \times 6\) matrix of explanatory variables whose columns correspond to items B1 through B6. Then we consider the model \({\varvec{y}}_j = {\varvec{X}}_j {\varvec{\beta }}_j + {\varvec{Z}}_j {\varvec{\gamma }}+ {\varvec{\varepsilon }}_j\ (j = 1, \ldots , m)\), where \({\varvec{\beta }}_j\) and \({\varvec{\gamma }}\) are five- and six-dimensional vectors of regression coefficients, respectively. This means that Tokyo’s 23 wards are represented by \(m\ (= 852)\) submodels. We estimate \({\varvec{\beta }}_j\) and \({\varvec{\gamma }}\) as follows:

$$\begin{aligned} \hat{{\varvec{\beta }}}_\lambda&= \arg \min _{{\varvec{\beta }}} \left\{ \sum _{j=1}^m \Vert {\varvec{y}}_j - {\varvec{X}}_j {\varvec{\beta }}_j - {\varvec{Z}}_j \hat{{\varvec{\gamma }}}_\lambda \Vert ^2 + \lambda \sum _{j=1}^m \sum _{\ell \in D_j} w_{j \ell } \Vert {\varvec{\beta }}_j - {\varvec{\beta }}_\ell \Vert \right\} ,\\ \hat{{\varvec{\gamma }}}_\lambda&= \arg \min _{{\varvec{\gamma }}} \sum _{j=1}^m \Vert {\varvec{y}}_j - {\varvec{X}}_j \hat{{\varvec{\beta }}}_{\lambda , j} - {\varvec{Z}}_j {\varvec{\gamma }}\Vert ^2, \end{aligned}$$

where \({\varvec{\beta }}= ({\varvec{\beta }}_1', \ldots , {\varvec{\beta }}_m')'\) and \(\hat{{\varvec{\beta }}}_\lambda = (\hat{{\varvec{\beta }}}_{\lambda , 1}', \ldots , \hat{{\varvec{\beta }}}_{\lambda , m}')'\). Here, the update of \({\varvec{\beta }}\) by our CDA and the update of \({\varvec{\gamma }}\) must be repeated alternately until convergence. The estimates of \({\varvec{\beta }}\) and \({\varvec{\gamma }}\) are obtained for each \(\lambda\) value, and the optimal \(\lambda\) value is then selected by minimizing the EGCV criterion with penalty strength \(\log n\), where the candidate \(\lambda\) values are \(\lambda _{\max } (3/4)^j\ (j = 0, \ldots , 99)\).
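A sketch of this alternating estimation, using the illustrative `ggfl_cda` from Sect. 2.3 and treating \({\varvec{Z}}_j \hat{{\varvec{\gamma }}}_\lambda\) as an offset, might look as follows (names and convergence settings are our own choices, not the paper's code).

```r
# Sketch of the alternating estimation of beta and gamma described above.
# ggfl_cda() is the illustrative sketch from Sect. 2.3; Z is a list like X.
fit_partial_ggfl <- function(y, X, Z, D, W, lambda, tol = 1e-6, maxit = 50L) {
  Zall  <- do.call(rbind, Z)
  gamma <- rep(0, ncol(Zall))
  beta  <- NULL
  for (it in seq_len(maxit)) {
    gamma_old <- gamma
    # update beta with Z_j gamma treated as an offset
    y_b  <- Map(function(yj, Zj) yj - drop(Zj %*% gamma), y, Z)
    beta <- ggfl_cda(y_b, X, D, W, lambda)
    # update gamma by least squares given the current beta
    y_g   <- unlist(Map(function(yj, Xj, bj) yj - drop(Xj %*% bj), y, X, beta))
    gamma <- drop(solve(crossprod(Zall), crossprod(Zall, y_g)))
    if (sqrt(sum((gamma - gamma_old)^2)) < tol) break
  }
  list(beta = beta, gamma = gamma)
}
```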

By GGFL estimation, the 852 subregions in Tokyo’s 23 wards were joined into 166 clusters (Fig. 8).

Fig. 8 The 852 subregions (left) and 166 clusters (right) of Tokyo’s 23 wards

Here, we show the results in comparison with GWR. Table 7 summarizes the estimates for the common regression coefficients. There is not much difference between the estimates by GGFL and GWR.

Table 7 Estimates for common regression coefficients

Figures 9, 10, 11, 12 and 13 are choropleth maps (for GGFL) and scatter plots (for GWR) of the estimates of the varying coefficients. The GGFL results are shown as choropleth maps because GGFL gives an estimate for each subregion, which can be assigned a color; regions where apartments cannot exist are shown in grey. The GWR results, on the other hand, are displayed as scatter plots because GWR gives an estimate for each observation point. From these figures, we can see that the GGFL and GWR results follow similar trends. Nevertheless, clear differences between the two methods are also visible.

Fig. 9 Intercept

Fig. 10 Floor area (A1)

Fig. 11 Building age (A2)

Fig. 12 Interaction of the top floor and a room floor (A3)

Fig. 13 Walking time (A4)

Table 8 summarizes the comparison. Although GWR achieves a good model fit, the fit of GGFL is also reasonable; on the other hand, GGFL was much faster than GWR. Furthermore, PE in the table denotes a prediction error: we compute prediction squared errors over 100 repetitions of the holdout method, in each of which the test data consist of 100 randomly chosen observations, and PE is then defined as the square root of the mean of these prediction squared errors. We found that although GWR has better prediction accuracy than GGFL, the difference is not very large.
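As an illustration, the holdout PE described above could be computed along these lines; `fit_fun` and `predict_fun` are user-supplied placeholders for the GGFL or GWR fitting and prediction steps, the response column is assumed to be named `Y`, and taking the per-repetition error as a mean of squared errors is our own assumption.

```r
# Sketch of the holdout prediction error (PE) described above.
holdout_pe <- function(dat, fit_fun, predict_fun, n_test = 100, n_rep = 100) {
  pse <- replicate(n_rep, {
    test <- sample(nrow(dat), n_test)
    fit  <- fit_fun(dat[-test, ])
    pred <- predict_fun(fit, dat[test, ])
    mean((dat$Y[test] - pred)^2)   # per-repetition prediction squared error
  })
  sqrt(mean(pse))                  # PE: square root of the mean of the errors
}
```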

Table 8 Summary of the comparison

4 Conclusion

In this paper, we dealt with the GGFL optimization problem. Since the problem cannot be solved by existing algorithms, we proposed a CDA. To handle the multiple non-differentiable points, we used two steps: judging whether the objective function attains the minimum at a non-differentiable point, and searching for the solution numerically when the objective function is not minimized at any non-differentiable point. We also used a fusion cycle to prevent the solution from getting stuck, similar to Friedman et al. (2007) and Ohishi et al. (2021). Furthermore, our CDA is guaranteed to converge to the optimal solution for each coordinate direction. In a simulation study, we found that our CDA performs well and has good properties, such as consistency, with respect to the selection of true join configurations. Moreover, our CDA performed better than ADMM (Hallac et al. 2015). One of our aims was the clustering of the m submodels, and hence we adopted a GGFL penalty based on the \(\ell _1\) norm. However, numerical error in the estimates obtained by ADMM prevents clustering; hence, we needed to propose the CDA. If clustering is not needed, i.e., if only shrinkage is required, this is not a problem. Our CDA also performed well for an actual large-sample dataset. We have stated only the strengths of our CDA, but we should also note that the algorithm is somewhat complicated. In addition, although the ADMM of Hallac et al. (2015) can be extended to nonconvex penalties, such an extension is difficult for our method. In the real data example, we compared GGFL with GWR. Although GWR was slightly superior to GGFL, GWR necessarily requires the coordinates of all sample points for its implementation, and it may be practically difficult to obtain all of these coordinates. In such a case, GGFL is useful since it does not require coordinates.