Coordinate descent algorithm for generalized group fused Lasso

Ohishi, Mineaki; Okamura, Kensuke; Itoh, Yoshimichi; Wakaki, Hirofumi; Yanagihara, Hirokazu

doi:10.1007/s41237-024-00233-6

Coordinate descent algorithm for generalized group fused Lasso

Invited Paper
Open access
Published: 28 May 2024

(2024)
Cite this article

Download PDF

You have full access to this open access article

Behaviormetrika Aims and scope Submit manuscript

Coordinate descent algorithm for generalized group fused Lasso

Download PDF

Mineaki Ohishi ORCID: orcid.org/0000-0001-8727-3631¹,
Kensuke Okamura²,
Yoshimichi Itoh²,
Hirofumi Wakaki³ &
…
Hirokazu Yanagihara³

335 Accesses
3 Altmetric
Explore all metrics

Abstract

We deal with a model with discrete varying coefficients to consider modeling for heterogeneity and clustering for homogeneity, and estimate the varying coefficients by generalized group fused Lasso (GGFL). GGFL allows homogeneous groups to be joined together based on one-to-many relationships among groups. This makes GGFL a powerful technique, but to date there has been no effective algorithm for obtaining the solutions. Here we propose an algorithm for obtaining a GGFL solution based on the coordinate descent method, and show that a solution for each coordinate direction converges to the optimal solution. In a simulation, we show our algorithm is superior to ADMM, which is one of the popular algorithms. We also present an application to a spatial data analysis.

Generalized fused Lasso for grouped data in generalized linear models

Article Open access 25 May 2024

Coordinate Descent Algorithm for Normal-Likelihood-Based Group Lasso in Multivariate Linear Regression

Some properties of generalized fused lasso and its applications to high dimensional data

Article 29 October 2014

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

In this study, we focus on the heterogeneity latent in data. For example, spatial data have heterogeneity, i.e., each individual is affected by space (e.g., area and location). For such data, it is important how their heterogeneity is expressed. Furthermore, it is of interest to determine homogeneous clusters. Therefore, we consider not only modeling for heterogeneity but also clustering for homogeneity.

In the case of modeling for heterogeneity, we can apply a varying coefficient model and estimate its coefficients by a weighted local estimation. Geographically weighted regression (GWR; Brunsdon et al. 1996) is one popular method for spatial data analysis (e.g., Geng et al. 2011; Lu et al. 2011; Wang and Wang 2020). GWR estimates the coefficients for each sample point with weights based on distances between sample points, and provides a flexible estimation. However, applying GWR to a large sample data is difficult in terms of calculation cost. This is because calculating distances for all pairs of sample points is necessary to obtain the weights and the method also requires optimization of the location and bandwidth of a kernel function. Furthermore, one needs to perform estimation again to obtain predictive values for future observations and GWR cannot perform clustering. The general varying coefficient model can be considered to have similar disadvantages.

In this paper, we consider a model with discrete varying coefficients. Specifically, we consider a problem involving m groups with homogeneity group-wise for which we have a dataset $({\varvec{y}}_j, {\varvec{X}}_j)$ for group each $j\ (\in \{ 1, \ldots , m \})$, where ${\varvec{y}}_j$ is an $n_j$-dimensional vector of a response variable and ${\varvec{X}}_j$ is an $n_j \times k$ matrix of explanatory variables. For spatial data, we can obtain the m groups by splitting the space subjected to analysis into m subspaces. (However, note that we do not consider only spatial data herein.) Then, a linear regression model for group j may be expressed in the form

$$\begin{aligned} {\varvec{y}}_j = {\varvec{X}}_j {\varvec{\beta }}_j + {\varvec{\varepsilon }}_j, \end{aligned}$$

(1)

where ${\varvec{\beta }}_j$ is a k-dimensional vector of regression coefficients and ${\varvec{\varepsilon }}_j$ is an $n_j$-dimensional vector of an error variable. We assume that $\mathop {\textrm{rank}}({\varvec{X}}_j) = k < n_j$ and that the first column of ${\varvec{X}}_j$ is ${\varvec{1}}_{n_j}$, where ${\varvec{1}}_n$ is the n-dimensional vector of ones. The simplest strategy for estimating ${\varvec{\beta }}_j$ is the ordinary least squares (OLS) method; applying this method, the OLS estimator takes the form

$$\begin{aligned} \hat{{\varvec{\beta }}}_j = {\varvec{M}}_j^{-1} {\varvec{c}}_j; \quad {\varvec{M}}_j = {\varvec{X}}_j' {\varvec{X}}_j, \quad {\varvec{c}}_j = {\varvec{X}}_j' {\varvec{y}}_j \quad (j = 1, \ldots , m). \end{aligned}$$

(2)

Then, it is of interest to determine which groups have (or do not have) homogeneity. For example, groups 1 and 2 may have homogeneity, in which case we would like that the estimates for ${\varvec{\beta }}_1$ and ${\varvec{\beta }}_2$ be equal. Fused Lasso (Tibshirani et al. 2005) is a technique for establishing this sort of relationship between pairs of unknown parameters with adjacent subscripts. For model (1), we may apply group fused Lasso (e.g., see Alaíz et al. 2013; Bleakley and Vert 2011; Qian and Su 2016), an extension of fused Lasso in which the solution is obtained as the minimizer of a penalized residual sum of squares (PRSS) of the form

$$\begin{aligned} \sum _{j=1}^m \Vert {\varvec{y}}_j - {\varvec{X}}_j {\varvec{\beta }}_j \Vert ^2 + \lambda \sum _{j=1}^{m-1} \Vert {\varvec{\beta }}_j - {\varvec{\beta }}_{j+1} \Vert , \end{aligned}$$

where $\lambda$ is a non-negative tuning parameter. The optimization problem for group fused Lasso may be reduced to that for group Lasso (Yuan and Lin 2006), which is a group selection version of Lasso (Tibshirani 1996), and can then be solved (Bleakley and Vert 2011). However, although group fused Lasso is capable of addressing one-to-one relationships between groups—such as relationships between groups 1 and 2, or between groups 2 and 3—it has no provision for taking into account one-to-many relationships, such as a relationship existing between group 1 and groups 2, 3, and 4. Here we fill this gap by generalized group fused Lasso (GGFL). Specifically, we estimate ${\varvec{\beta }}_1, \ldots , {\varvec{\beta }}_m$ based on minimizing the following PRSS:

$$\begin{aligned} \sum _{j=1}^m \Vert {\varvec{y}}_j - {\varvec{X}}_j {\varvec{\beta }}_j \Vert ^2 + \lambda \sum _{j=1}^m \sum _{\ell \in D_j} w_{j \ell } \Vert {\varvec{\beta }}_j - {\varvec{\beta }}_\ell \Vert , \end{aligned}$$

(3)

where $D_j \subseteq \{ 1, \ldots , m \} \backslash \{j\}$ is the index set expressing adjacency relationships among m groups and $w_{j \ell }$ is a positive weight based on adaptive Lasso (Zou 2006). Note that GGFL is also known as network Lasso (Hallac et al. 2015). When $D_j = \{ j+1 \}\ (j = 1, \ldots , m-1)$, $D_m = \emptyset$, and $w_{j \ell } = 1$, GGFL coincides with group fused Lasso. Here, we assume that $\ell \in D_j \Leftrightarrow j \in D_\ell$ and $w_{j \ell } = w_{\ell j}$. GGFL may also be viewed as a group version of generalized fused Lasso (e.g., Ohishi et al. 2021; Xin et al. 2014). There is also generalized fused group Lasso (Cao et al. 2018), but it considers parameters to be equal or not element-wise, i.e., differently from GGFL.

GGFL can be seen as a discrete version of GWR. GWR deals with spatial information by using distances between sample points, while GGFL deals with it by using adjacency relations of subspaces. Moreover, GGFL requires only an optimization of a tuning parameter. Hence, we can expect that GGFL can perform well for even a large sample data. GGFL also has the advantage that re-estimation for future observations is not need. On the other hand, there is a concern that the flexibility of estimation which is an advantage of GWR declines using GGFL. We can maintain flexibility by splitting the space subjected to analysis as finely as possible. Although this may reduce the estimation accuracy, we can expect that GGFL avoids such a reduction by joining several parameter vectors together.

The GGFL optimization problem, like that of group fused Lasso, may be reduced to the group Lasso optimization problem; however, it is not possible to obtain a unique solution. Although we can also apply ADMM (Boyd et al. 2011) to solve the optimization problem, like Hallac et al. (2015), ADMM is not fast and estimates by ADMM cannot be exactly equal (caused by numerical error). Particularly, the latter problem prevents clustering. Thus, in this work, we propose a GGFL optimization strategy based on a coordinate descent algorithm (CDA). A CDA updates a solution along each coordinate direction, and hence is known not to be very fast but it can be applied to even a large sample data. CDAs have been applied to many optimization problems related to Lasso, e.g., Lasso itself (Fu 1998), group Lasso (Yuan and Lin 2006), fused Lasso (Friedman et al. 2007), and generalized fused Lasso (Ohishi et al. 2021). Actually, the CDA by Ohishi et al. (2021) performed well for a large sample data for which R package genlasso (e.g., Arnold and Tibshirani 2019), which is based on Tibshirani and Taylor (2011), causes memory shortage. Therefore, we expect a CDA to perform well for the GGFL optimization problem. For our GGFL optimization problem, because GGFL involves multiple non-differentiable points, we first derive a condition for the objective function to attain the minimum at a non-differentiable point. When this condition is not satisfied, we numerically search for the minimizer using its gradient. If a non-differentiable point minimizes the objective function, the corresponding parameter vectors are equal. Hence, numerical error does not exist and a CDA can provide exactly equal estimates. Furthermore, we show a solution for each coordinate direction converges to the optimal solution.

The remainder of this paper is organized as follows. In Sect. 2, we propose a CDA for solving the GGFL optimization problem. We also discuss the convergence of our CDA. In Sect. 3, we conduct simulations to characterize the performance of GGFL, and then present an application case study involving actual data. Section 4 is the conclusion of this paper. Technical details are provided in the Appendix.

2 Main results

In this section, we discuss a CDA for minimizing the PRSS (3). The CDA computes the minimizer by repeatedly minimizing the following objective function for ${\varvec{\beta }}_j\ (j \in \{ 1, \ldots , m \})$, in which all terms that do not depend on ${\varvec{\beta }}_j$ are neglected:

$$\begin{aligned} f_j ({\varvec{\beta }}_j) = {\varvec{\beta }}_j' {\varvec{M}}_j {\varvec{\beta }}_j - 2 {\varvec{c}}_j' {\varvec{\beta }}_j + \sum _{\ell \in D_j} \lambda _{j \ell } \Vert {\varvec{\beta }}_j - \hat{{\varvec{\beta }}}_\ell \Vert , \end{aligned}$$

(4)

where $\lambda _{j \ell } = 2 \lambda w_{j \ell }$ and $\hat{{\varvec{\beta }}}_\ell$ expresses that ${\varvec{\beta }}_\ell$ is fixed. We note that $f_j ({\varvec{\beta }}_j)$ fails to be differentiable at ${\varvec{\beta }}_j \in \mathcal {R}_j = \{ \hat{{\varvec{\beta }}}_\ell \mid \ell \in D_j \}$. Here, assume that $\lambda _{j \ell } > 0$ and that $\forall \ell _1, \ell _2 \in D_j, \ell _1 \ne \ell _2 \Rightarrow \hat{{\varvec{\beta }}}_{\ell _1} \ne \hat{{\varvec{\beta }}}_{\ell _2}$. If there exist $\ell _1, \ell _2 \in D_j\ (\ell _1 \ne \ell _2)$ such that $\hat{{\varvec{\beta }}}_{\ell _1} = \hat{{\varvec{\beta }}}_{\ell _2}$, we can make the replacements $\lambda _{j \ell _1} \leftarrow \lambda _{j \ell _1} + \lambda _{j \ell _2}$ and $D_j \leftarrow D_j \backslash \{ \ell _2 \}$. In this work, we first determine whether the objective function $f_j ({\varvec{\beta }}_j)$ attains the minimum at ${\varvec{\beta }}_j \in \mathcal {R}_j$. If not, we will proceed to seek the minimizer numerically.

2.1 Conditions indicating the presence of groups to be joined

Let $f ({\varvec{z}})\ ({\varvec{z}}\in \mathbb {R}^q)$ be a convex function that is differentiable for ${\varvec{z}}\ne {\varvec{z}}_0$ and define a set $\mathcal {A}_q$ by

$$\begin{aligned} \mathcal {A}_q = \{{\varvec{\alpha }}\in \mathbb {R}^q \mid \Vert {\varvec{\alpha }}\Vert = 1\}. \end{aligned}$$

(5)

Then, a subdifferential of f at $\tilde{{\varvec{z}}}$ and a one-sided directional derivative of f at $\tilde{{\varvec{z}}}$ with respect to ${\varvec{\alpha }}\in \mathcal {A}_q$ are respectively defined by

$$\begin{aligned} \partial f (\tilde{{\varvec{z}}})&= \left\{ {\varvec{u}}\in \mathbb {R}^q \mid f ({\varvec{z}}) \ge f (\tilde{{\varvec{z}}}) + {\varvec{u}}' ({\varvec{z}}- \tilde{{\varvec{z}}})\ (\forall {\varvec{z}}\in \mathbb {R}^q) \right\} , \\ Df (\tilde{{\varvec{z}}}, {\varvec{\alpha }})&= \lim _{\delta \rightarrow +0} \dfrac{ f (\tilde{{\varvec{z}}} + \delta {\varvec{\alpha }}) - f (\tilde{{\varvec{z}}}) }{ \delta }. \end{aligned}$$

They have the following relations (e.g., see Rockafellar 1970, Parts V and VI):

$$\begin{aligned}&{\varvec{u}}\in \partial f (\tilde{{\varvec{z}}}) \Longleftrightarrow \forall {\varvec{\alpha }}\in \mathcal {A}_q,\ Df (\tilde{{\varvec{z}}}, {\varvec{\alpha }}) \ge {\varvec{u}}' {\varvec{\alpha }}, \\&\forall {\varvec{z}}\in \mathbb {R}^q,\ f ({\varvec{z}}) \ge f (\tilde{{\varvec{z}}}) \Longleftrightarrow {\varvec{0}}_q \in \partial f (\tilde{{\varvec{z}}}). \end{aligned}$$

Hence, a necessary and sufficient condition for $f ({\varvec{z}})$ to attain the minimum at ${\varvec{z}}={\varvec{z}}_0$ is

$$\begin{aligned} \forall {\varvec{\alpha }}\in \mathcal {A}_q,\ Df ({\varvec{z}}_0, {\varvec{\alpha }}) \ge 0. \end{aligned}$$

(6)

We now use this condition to derive a condition for the objective function (4) to achieve the minimum at a non-differentiable point.

Equation (4) is essentially equal to the following function:

$$\begin{aligned} f ({\varvec{\beta }}) = {\varvec{\beta }}' {\varvec{M}}{\varvec{\beta }}- 2 {\varvec{c}}' {\varvec{\beta }}+ \sum _{j=1}^r \lambda _j \Vert {\varvec{\beta }}- {\varvec{b}}_j \Vert , \end{aligned}$$

(7)

where ${\varvec{M}}$ is a $k \times k$ positive definite matrix, ${\varvec{b}}_1, \ldots , {\varvec{b}}_r \in \mathbb {R}^k$ are distinct vectors, and $\lambda _1, \ldots , \lambda _r$ are positive values. By substituting ${\varvec{\theta }}+ {\varvec{M}}^{-1} {\varvec{c}}$ for ${\varvec{\beta }}$ and ${\varvec{a}}_j + {\varvec{M}}^{-1} {\varvec{c}}$ for ${\varvec{b}}_j$, in (7), and dividing it by 2, we can obtain

$$\begin{aligned} \tilde{f} ({\varvec{\theta }}) = \dfrac{1}{2} {\varvec{\theta }}' {\varvec{M}}{\varvec{\theta }}+ \sum _{j=1}^r \psi _j \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert - \dfrac{1}{2} {\varvec{c}}' {\varvec{M}}^{-1} {\varvec{c}}, \end{aligned}$$

(8)

where $\psi _j = \lambda _j / 2\ (> 0)$. Since ${\varvec{b}}_1, \ldots , {\varvec{b}}_r$ are distinct vectors, ${\varvec{a}}_1, \ldots , {\varvec{a}}_r$ are also distinct vectors. Let ${\varvec{\beta }}^\star$ and ${\varvec{\theta }}^\star$ be the minimizers of $f ({\varvec{\beta }})$ and $\tilde{f} ({\varvec{\theta }})$, respectively, i.e.,

$$\begin{aligned} {\varvec{\beta }}^\star = \arg \min _{{\varvec{\beta }}\in \mathbb {R}^k} f ({\varvec{\beta }}), \quad {\varvec{\theta }}^\star = \arg \min _{{\varvec{\theta }}\in \mathbb {R}^k} \tilde{f} ({\varvec{\theta }}). \end{aligned}$$

(9)

Notice that ${\varvec{\beta }}^\star = {\varvec{\theta }}^\star + {\varvec{M}}^{-1} {\varvec{c}}$. Hence, we consider minimizing $\tilde{f} ({\varvec{\theta }})$ instead of $f ({\varvec{\beta }})$.

A one-sided directional derivative of $\tilde{f}$ at ${\varvec{a}}_s\ (s \in D = \{ 1, \ldots , r \})$ with respect to ${\varvec{\alpha }}\in \mathcal {A}_k$ is given by

$$\begin{aligned} D \tilde{f} ({\varvec{a}}_s, {\varvec{\alpha }}) = {\varvec{a}}_s' {\varvec{M}}{\varvec{\alpha }}+ \psi _s + \sum _{j \ne s}^r \psi _j \dfrac{ ({\varvec{a}}_s - {\varvec{a}}_j)' {\varvec{\alpha }}}{ \Vert {\varvec{a}}_s - {\varvec{a}}_j \Vert }. \end{aligned}$$

Hence, from (6), $\tilde{f} ({\varvec{\theta }})$ attains the minimum at ${\varvec{\theta }}= {\varvec{a}}_s$ if and only if for all ${\varvec{\alpha }}\in \mathcal {A}_k$, $\psi _s \ge - {\varvec{v}}_s' {\varvec{\alpha }}$, where ${\varvec{v}}_s$ is given by

$$\begin{aligned} {\varvec{v}}_s = {\varvec{M}}{\varvec{a}}_s + \sum _{j \ne s}^r \dfrac{ \psi _j }{ \Vert {\varvec{a}}_s - {\varvec{a}}_j \Vert } ({\varvec{a}}_s - {\varvec{a}}_j) = {\varvec{M}}{\varvec{b}}_s - {\varvec{c}}+ \dfrac{1}{2} \sum _{j \ne s}^r \dfrac{ \lambda _j }{ \Vert {\varvec{b}}_s - {\varvec{b}}_j \Vert } ({\varvec{b}}_s - {\varvec{b}}_j). \end{aligned}$$

(10)

Also, noting that $\Vert {\varvec{\alpha }}\Vert = 1$, from the Cauchy–Schwarz inequality, we find that $- {\varvec{v}}_s' {\varvec{\alpha }}\le \Vert {\varvec{v}}_s \Vert$, with equality holding only for ${\varvec{\alpha }}= -{\varvec{v}}_s / \Vert {\varvec{v}}_s \Vert$. From these results, we obtain the following theorem.

Theorem 1

The function $\tilde{f} ({\varvec{\theta }})$ in (8) attains the minimum at the non-differentiable point ${\varvec{\theta }}= {\varvec{a}}_s$ if there exists $s \in D$ such that $\psi _s \ge \Vert {\varvec{v}}_s \Vert$.

From the theorem, the condition for $f_j ({\varvec{\beta }}_j)\ (j \in \{ 1, \ldots , m \})$ in (4) to attain the minimum at ${\varvec{\beta }}_j = \hat{{\varvec{\beta }}}_s\ (s \in D_j)$ is

$$\begin{aligned} \lambda _{j s} \ge \Vert {\varvec{v}}_{j, s} \Vert , \quad {\varvec{v}}_{j, s} = 2 ({\varvec{M}}_j \hat{{\varvec{\beta }}}_s - {\varvec{c}}_j) + \sum _{\ell \in D_j \backslash \{s\}} \dfrac{ \lambda _{j \ell } }{ \Vert \hat{{\varvec{\beta }}}_s - \hat{{\varvec{\beta }}}_\ell \Vert } (\hat{{\varvec{\beta }}}_s - \hat{{\varvec{\beta }}}_\ell ). \end{aligned}$$

(11)

If there exists $s \in D_j$ satisfying the above condition, then the estimates for ${\varvec{\beta }}_j$ and ${\varvec{\beta }}_s$ will be exactly equal, indicating that groups j and s have homogeneity and these are joined together.

2.2 Searching minimizer in the absence of groups to be joined

Assume that the objective function $\tilde{f} ({\varvec{\theta }})$ in (8) does not attain the minimum at any non-differentiable points ${\varvec{\theta }}\in \mathcal {R}= \{ {\varvec{a}}_1, \ldots , {\varvec{a}}_r \}$—that is, that the condition of Theorem 1 is not satisfied for any $j \in D$. Then we can search for the minimizer by using the gradient of $\tilde{f} ({\varvec{\theta }})$ that takes the form

$$\begin{aligned} {\varvec{g}}({\varvec{\theta }}) = \dfrac{ \partial }{ \partial {\varvec{\theta }}} \tilde{f} ({\varvec{\theta }}) = {\varvec{M}}{\varvec{\theta }}+ \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } ({\varvec{\theta }}- {\varvec{a}}_j)\quad ({\varvec{\theta }}\notin \mathcal {R}). \end{aligned}$$

(12)

By solving the equation ${\varvec{g}}({\varvec{\theta }}) = {\varvec{0}}_k$, we can derive the following update equation:

$$\begin{aligned} {\varvec{\theta }}^\text {new} = {\varvec{\eta }}({\varvec{\theta }}^\text {old}), \quad {\varvec{\eta }}({\varvec{\theta }}) = {\left\{ \begin{array}{ll} {\varvec{Q}}({\varvec{\theta }})^{-1} \displaystyle \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } {\varvec{a}}_j &{}({\varvec{\theta }}\notin \mathcal {R}) \\ {\varvec{\theta }}&{}({\varvec{\theta }}\in \mathcal {R}) \end{array}\right. }, \end{aligned}$$

(13)

where ${\varvec{Q}}({\varvec{\theta }})$ is defined for ${\varvec{\theta }}\notin \mathcal {R}$ by

$$\begin{aligned} {\varvec{Q}}({\varvec{\theta }})&= {\varvec{M}}+ s ({\varvec{\theta }}) {\varvec{I}}_k, \quad s ({\varvec{\theta }}) = \sum _{j = 1}^r \dfrac{\psi _j}{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert }\ ({\varvec{\theta }}\notin \mathcal {R}). \end{aligned}$$

(14)

Note that we define ${\varvec{\eta }}({\varvec{\theta }}) = {\varvec{\theta }}$ for ${\varvec{\theta }}\in \mathcal {R}$ since ${\varvec{g}}({\varvec{\theta }})$ cannot be defined for ${\varvec{\theta }}\in \mathcal {R}$. Although we would like to propose the update given by (13) for obtaining the minimizer, the update stops when ${\varvec{\theta }}^\text {new} \in \mathcal {R}$. Therefore, ${\varvec{\theta }}$ such that ${\varvec{\eta }}({\varvec{\theta }}) \in \mathcal {R}$ must be removed from the search area. Of course, it is very rare that there exists $j \in D$ such that ${\varvec{\eta }}({\varvec{\theta }}) = {\varvec{a}}_j$ when ${\varvec{\theta }}\ne {\varvec{a}}_j$, so for practical use, there is no need to consider removing such points. However, to guarantee convergence to the minimizer theoretically, we will consider these situations as well. To be precise, if you are unfortunate enough to have ${\varvec{\eta }}({\varvec{\theta }}) = {\varvec{a}}_j$, we will have to update it by shifting it slightly from ${\varvec{a}}_j$. Let ${\varvec{\theta }}_0$ be some initial vector. Then, we define the update equation minimizing $\tilde{f} ({\varvec{\theta }})$ as follows:

$$\begin{aligned} {\varvec{\theta }}_{i+1} = {\left\{ \begin{array}{ll} {\varvec{\eta }}({\varvec{\theta }}_i) &{}({\varvec{\eta }}({\varvec{\theta }}_i) \notin \mathcal {R}) \\ {\varvec{\eta }}\left( {\varvec{a}}_j - d_j {\varvec{\alpha }}({\varvec{v}}_j) / 5 \right) &{}({\varvec{\eta }}({\varvec{\theta }}_i) = {\varvec{a}}_j) \end{array}\right. } \quad (i = 0, 1, \ldots ), \end{aligned}$$

(15)

where $d_j$ and ${\varvec{\alpha }}({\varvec{\theta }})$ are respectively given by

$$\begin{aligned} d_j = \dfrac{\psi _j}{\sum _{\ell =1}^r \psi _\ell } \min _{\ell \in D \backslash \{ j \}} \Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert , \quad {\varvec{\alpha }}({\varvec{\theta }}) = {\varvec{\theta }}/ \Vert {\varvec{\theta }}\Vert . \end{aligned}$$

(16)

The update given by (15) guarantees the solution attains the minimizer by the following theorem (the proof is given in Appendix A).

Theorem 2

Suppose that $\tilde{f} ({\varvec{\theta }})$ in (8) does not attain the minimum at any non-differentiable points ${\varvec{\theta }}\in \mathcal {R}$, i.e., ${\varvec{\theta }}^\star \notin \mathcal {R}$. For any initial vector ${\varvec{\theta }}_0$, ${\varvec{\theta }}_i$ in (15) converges to ${\varvec{\theta }}^\star$ as $i \rightarrow \infty$.

Although the update Eq. (15) specifically indicates how to shift ${\varvec{\eta }}({\varvec{\theta }}_i)$ when ${\varvec{\eta }}({\varvec{\theta }}_i) = {\varvec{a}}_j$, this is one instance of satisfying Theorem 2.

Now, we consider minimizing $f_j ({\varvec{\beta }}_j)\ (j \in \{ 1, \ldots , m \})$ in (4) when $f_j ({\varvec{\beta }}_j)$ does not attain the minimum at any non-differentiable points ${\varvec{\beta }}_j \in \mathcal {R}_j = \{ \hat{{\varvec{\beta }}}_\ell \mid \ell \in D_j \}$. We define ${\varvec{\eta }}_j ({\varvec{\beta }}_j)$ and $d_{j, \ell }\ (\ell \in D_j)$ by

$$\begin{aligned} {\varvec{\eta }}_j ({\varvec{\beta }}_j) = {\left\{ \begin{array}{ll} {\varvec{Q}}_j ({\varvec{\beta }}_j)^{-1} {\varvec{h}}_j ({\varvec{\beta }}_j) &{}({\varvec{\beta }}_j \notin \mathcal {R}_j) \\ {\varvec{\beta }}_j &{}({\varvec{\beta }}_j \in \mathcal {R}_j) \end{array}\right. }, \quad d_{j, \ell } = \dfrac{\lambda _{j \ell }}{\sum _{s \in D_j} \lambda _{j s}} \left( \min _{s \in D_j \backslash \{ \ell \}} \Vert \hat{{\varvec{\beta }}}_\ell - \hat{{\varvec{\beta }}}_s \Vert \right) , \end{aligned}$$

where ${\varvec{Q}}_j ({\varvec{\beta }}_j)$ and ${\varvec{h}}_j ({\varvec{\beta }}_j)$ are respectively defined for ${\varvec{\beta }}_j \notin \mathcal {R}_j$ by

$$\begin{aligned} {\varvec{Q}}_j ({\varvec{\beta }}_j)&= 2 {\varvec{M}}_j + \sum _{\ell \in D_j} \dfrac{\lambda _{j \ell }}{ \Vert {\varvec{\beta }}_j - \hat{{\varvec{\beta }}}_\ell \Vert } {\varvec{I}}_k, \quad {\varvec{h}}_j ({\varvec{\beta }}_j) = 2 {\varvec{c}}_j + \sum _{\ell \in D_j} \dfrac{ \lambda _{j \ell } }{ \Vert {\varvec{\beta }}_j - \hat{{\varvec{\beta }}}_\ell \Vert } \hat{{\varvec{\beta }}}_\ell . \end{aligned}$$

Then, the update equation minimizing $f_j ({\varvec{\beta }}_j)$ is given by

$$\begin{aligned} {\varvec{\beta }}_j^\text {new} = {\left\{ \begin{array}{ll} {\varvec{\eta }}_j ({\varvec{\beta }}_j^\text {old}) &{}\left( {\varvec{\eta }}_j ({\varvec{\beta }}_j^\text {old}) \notin \mathcal {R}_j \right) \\ {\varvec{\eta }}_j \left( \hat{{\varvec{\beta }}}_\ell - d_{j, \ell } {\varvec{\alpha }}({\varvec{v}}_{j, \ell }) / 5 \right) &{}\left( {\varvec{\eta }}_j ({\varvec{\beta }}_j^\text {old}) = \hat{{\varvec{\beta }}}_\ell \right) \end{array}\right. }. \end{aligned}$$

(17)

From Theorem 2, the solution obtained by the above converges to the minimizer of $f_j ({\varvec{\beta }}_j)$.

2.3 Coordinate descent algorithm for GGFL

For fused Lasso and generalized fused Lasso, Friedman et al. (2007) and Ohishi et al. (2021) respectively proposed CDAs. These papers note the following phenomenon. When several groups are joined together at intermediate stages of the algorithm, the optimization process gets stuck: the corresponding objective-function values stagnate—ceasing to improve on subsequent iterations—and the algorithm fails to achieve minimization. To avoid this difficulty, similar to as reported in the two papers, our CDA incorporates two cycles: a descent cycle and a fusion cycle.

The descent cycle repeatedly minimizes $f_j ({\varvec{\beta }}_j)$ in (4) for $j \in \{ 1, \ldots , m \}$. More specifically, for GGFL, the descent cycle proceeds according to the following algorithm.

Algorithm 1

(Descent cycle for GGFL)

Step 1.
For $f_j ({\varvec{\beta }}_j)$, check the condition (11). If there exists $s \in D_j$ such that the condition holds, update the solution as $\hat{{\varvec{\beta }}}_j = \hat{{\varvec{\beta }}}_s$; otherwise repeat the update given by (17) until it converges.
Step 2.
Repeat Step 1 for all $j \in \{1, \ldots , m\}$.

The fusion cycle is designed to avoid the phenomenon in which the joining of groups during the descent cycle causes the optimization process to flounder, obstructing the progress of the algorithm. Suppose that, following a descent cycle, we have solutions $\hat{{\varvec{\beta }}}_1, \ldots , \hat{{\varvec{\beta }}}_m$ with distinct vectors of the solutions $\hat{{\varvec{\xi }}}_1, \ldots , \hat{{\varvec{\xi }}}_t\ (t < m)$. Then, we can obtain index sets $E_\ell = \{ j \in \{ 1, \ldots , m \} \mid \hat{{\varvec{\xi }}}_\ell = \hat{{\varvec{\beta }}}_j \}\ (\ell = 1, \ldots , t)$ satisfying $\cup _{\ell =1}^t E_\ell = \{ 1, \ldots , m \}$, $E_\ell \ne \emptyset$ and $E_\ell \cap E_j = \emptyset \ (\ell \ne j)$. Moreover, define $D_\ell ^*\subseteq \{1, \ldots , t\} \backslash \{\ell \}\ (\ell \in \{ 1, \ldots , t \})$ and $w_{\ell i}^*\ (\ell \in \{ 1, \ldots , t\};\ i \in D_\ell ^*)$ as follows:

$$\begin{aligned} D_\ell ^*&= \left\{ s \in \{ 1, \ldots , t \} \backslash \{ \ell \} \mid E_s \cap F_\ell \ne \emptyset \right\} , \quad F_\ell = \bigcup _{j \in E_\ell } D_j \backslash E_\ell , \\ w_{\ell i}^*&= \sum _{(j,s) \in \mathcal {J}_{\ell i}} w_{j s}, \quad \mathcal {J}_{\ell i} = \bigcup _{j \in E_\ell } \{ j \} \times (E_i \cap D_j), \end{aligned}$$

where $D_\ell ^*$ satisfies $D_\ell ^*\ne \emptyset$ and $s \in D_\ell ^*\Leftrightarrow \ell \in D_s^*$. In words, $D_\ell ^*$ is the set of fused-group indexes that have an adjacency relationship with fused group $\ell$. The two terms in the PRSS (3) may be rewritten as follows (see Ohishi et al. 2021):

$$\begin{aligned} \sum _{j=1}^m \Vert {\varvec{y}}_j - {\varvec{X}}_j {\varvec{\beta }}_j \Vert ^2&= \sum _{\ell =1}^t \sum _{j \in E_\ell } \Vert {\varvec{y}}_j - {\varvec{X}}_j {\varvec{\xi }}_\ell \Vert ^2, \\ \sum _{j=1}^m \sum _{\ell \in D_j} w_{j \ell } \Vert {\varvec{\beta }}_j - {\varvec{\beta }}_\ell \Vert&= 2 \sum _{i \in D_\ell ^*} w_{\ell i}^*\Vert {\varvec{\xi }}_\ell - {\varvec{\xi }}_i \Vert + \sum _{j \notin E_\ell } \sum _{i \in D_j \backslash E_\ell } w_{j i} \Vert {\varvec{\beta }}_j - {\varvec{\beta }}_i \Vert . \end{aligned}$$

Thus, excluding from the PRSS (3) all terms that do not depend on ${\varvec{\xi }}_\ell$, we may express the objective function for the fusion cycle as

$$\begin{aligned} f_\ell ^*({\varvec{\xi }}_\ell )&= {\varvec{\xi }}_\ell ' \left( \sum _{j \in E_\ell } {\varvec{M}}_j \right) {\varvec{\xi }}_\ell - 2 \left( \sum _{j \in E_\ell } {\varvec{c}}_j \right) ' {\varvec{\xi }}_\ell + \sum _{i \in D_\ell ^*} \lambda _{\ell i}^*\Vert {\varvec{\xi }}_\ell - \hat{{\varvec{\xi }}}_i \Vert , \end{aligned}$$

where $\lambda _{\ell i}^*= 2 \lambda w_{\ell i}^*$ and $\hat{{\varvec{\xi }}}_i$ expresses that ${\varvec{\xi }}_i$ is fixed. In the fusion cycle, the descent cycle for $f_\ell ^*({\varvec{\xi }}_\ell )$ is executed. Notice that $f_\ell ^*({\varvec{\xi }}_\ell )$ is essentially equal to (7). Hence, we can minimize $f_\ell ^*({\varvec{\xi }}_\ell )$ similarly to the minimization of (4).

Assembling the steps discussed above yields the following CDA for GGFL.

Algorithm 2

(Coordinate descent algorithm for GGFL)

Step 1.
Choose initial vectors for ${\varvec{\beta }}_1, \ldots , {\varvec{\beta }}_m$ and tuning parameter $\lambda$.
Step 2.
Execute the descent cycle.
Step 3.
If any groups were joined, execute the fusion cycle. If not, proceed to Step 4.
Step 4.
Repeat Steps 2 and 3 until the solution converges.

Because GGFL depends on $\lambda$, for practical applications, it is important to optimize the value of $\lambda$. Let $\hat{{\varvec{\beta }}}_\text {max}$ denote the solution with all groups joined; that is, $\hat{{\varvec{\beta }}}_\text {max}$ is the OLS estimator for ${\varvec{\beta }}_j$ obtained by setting ${\varvec{\beta }}_1 = \cdots = {\varvec{\beta }}_m$. Then we define $\lambda _{\max }$ as follows:

$$\begin{aligned} \lambda _{\max }= \max _{j \in \{ 1, \ldots , m \}} \dfrac{ \Vert {\varvec{M}}_j \hat{{\varvec{\beta }}}_\text {max} - {\varvec{c}}_j \Vert }{ \sum _{\ell \in D_j} w_{j \ell } }. \end{aligned}$$

(18)

The significance of $\lambda _{\max }$ may be understood by noting that the solution obtained for $\lambda = \lambda _{\max }$ satisfies $\hat{{\varvec{\beta }}}_1 = \cdots = \hat{{\varvec{\beta }}}_m = \hat{{\varvec{\beta }}}_\text {max}$. Thus, we execute Algorithm 2 for each candidate $\lambda$ value in the range $(0,\lambda _{\max }]$ and select the optimal $\lambda$ value. For example, the optimal $\lambda$ value can be selected based on minimizing a model selection criterion.

3 Numerical studies

In this section, we investigate our CDA via simulation data and real data. The numerical calculation programs are executed in R (ver. 4.1.1) on a computer with the Windows 10 Pro operating system, an Intel (R) Core (TM) i9-9900 processor, and 128 GB of RAM. The R codes are available in the R package GGFL (https://github.com/ohishim/GGFL)

3.1 Simulation

3.1.1 Performance of GGFL

For this simulation, we construct simulation spaces similar to those used in Ohishi et al. (2021). We consider two problem sizes, with group counts $m=10$ and $m=20$ and the adjacency relationships depicted in Figs. 1 or 2 (for $m=10$) and Figs. 3 or 4 (for $m=20$).

For example, for the $m=10$ problem, group 1 is adjacent to groups 2, 3, 4, 5, and 6, i.e., $D_1 = \{ 2, 3, 4, 5, 6 \}$. For each m value, we consider two possible cases of group-join configurations; each configuration is specified by enumerating the true sets $E_\ell \ (\ell = 1, \ldots , m^*)$ of joined groups, where $m^*\in \{ 3, 6 \}$ for $m = 10$ and $m^*\in \{ 6, 12 \}$ for $m = 20$ (here $m^*$ is the true number of sets of joined groups), as follows:

$m = 10$, case 1:
$$\begin{aligned} E_1 = \{ 1, 2, 3\},\quad E_2 = \{ 4, 5, 6, 9, 10 \},\quad E_3 = \{ 7, 8 \}. \end{aligned}$$
$m = 10$, case 2:
$$\begin{aligned} E_1&= \{ 1, 3\},&E_2&= \{ 2 \},&E_3&= \{ 4, 6, 10 \}, \\ E_4&= \{ 5 \},&E_5&= \{ 7, 8 \},&E_6&= \{ 9 \}. \end{aligned}$$
$m = 20$, case 1:
$$\begin{aligned} E_1&= \{ 1, 2, 3 \},&E_2&= \{ 4, 5, 6 \},&E_3&= \{ 7, 8, 19, 20 \}, \\ E_4&= \{ 9, 10, 12, 13 \},&E_5&= \{ 11, 14, 15, 16 \},&E_6&= \{ 17, 18 \}. \end{aligned}$$
$m = 20$, case 2:
$$\begin{aligned} E_1&= \{ 1 \},&E_2&= \{ 2, 3 \},&E_3&= \{ 4 \},&E_4&= \{ 5, 6 \}, \\ E_5&= \{ 7, 8 \},&E_6&= \{ 9, 10 \},&E_7&= \{ 11 \},&E_8&= \{ 12, 13 \}, \\ E_9&= \{ 14, 15, 16 \},&E_{10}&= \{ 17, 18 \},&E_{11}&= \{ 19 \},&E_{12}&= \{ 20 \}. \end{aligned}$$

These group-join configurations are illustrated in Figs. 1, 2, 3 and 4. In this section, we use simulation data to assess whether the GGFL technique successfully determines the true join configuration in each case.

Table 1 MSE results when case 1

Full size table

Table 2 MSE results when case 2

Full size table

Let ${\varvec{X}}$ be an $n \times k$ matrix defined by ${\varvec{X}}= ({\varvec{X}}_1', \ldots , {\varvec{X}}_m')' = ({\varvec{1}}_n, {\varvec{X}}_0 {\varvec{\Psi }}(0.5)^{1/2})$, where ${\varvec{X}}_0$ is an $n \times (k-1)$ matrix of which all elements are identically and independently distributed according to $U (-1, 1)$, and ${\varvec{\Psi }}(\rho )$ is the symmetric matrix of order $k-1$ of which the (i, j)th element is $\rho ^{|i-j|}$. Then, our simulation data are generated from ${\varvec{y}}_j \sim N_{n_j} ({\varvec{X}}_j {\varvec{\beta }}_j, {\varvec{I}}_{n_j})\ (j = 1, \ldots , m)$, where ${\varvec{\beta }}_j$ is defined by ${\varvec{\beta }}_j = \ell {\varvec{1}}_k \ (j \in E_\ell ;\ \ell = 1, \ldots , m^*)$. Here, we set $n_1 = \cdots = n_m = n_0 \in \{ 50, 100, 200 \}$, i.e., $n = m n_0$. In our simulation, we characterize not only the selection probability (SP) of the true join configuration but also the mean square error (MSE). For $\hat{{\varvec{\beta }}} = (\hat{{\varvec{\beta }}}_1', \ldots , \hat{{\varvec{\beta }}}_m')'$ and $\hat{{\varvec{y}}} = (\hat{{\varvec{y}}}_1', \ldots , \hat{{\varvec{y}}}_m')'$, we compute the following two MSEs:

$$\begin{aligned} \mathop {\textrm{MSE}}\nolimits _{\beta } [\hat{{\varvec{\beta }}}]&= \mathop {\textrm{E}}\nolimits \left[ \sum _{j = 1}^m \Vert {\varvec{\beta }}_j - \hat{{\varvec{\beta }}}_j \Vert ^2 \right] / (k m), \quad \mathop {\textrm{MSE}}\nolimits _{y} [\hat{{\varvec{y}}}] = \mathop {\textrm{E}}\nolimits \left[ \sum _{j = 1}^m \Vert {\varvec{y}}_j - \hat{{\varvec{y}}}_j \Vert ^2 \right] / n, \end{aligned}$$

where $\hat{{\varvec{y}}}_j$ is given by $\hat{{\varvec{y}}}_j = {\varvec{X}}_j \hat{{\varvec{\beta }}}_j$. The expected value of the MSE is characterized by 10,000 repetitions of the Monte Carlo simulation. We consider three estimators for ${\varvec{\beta }}_j$:

GGFL: The estimator produced by the GGFL method proposed in this paper.
OLS 1: The OLS estimators defined by (2), i.e., $\hat{{\varvec{\beta }}}_j = {\varvec{M}}_j^{-1} {\varvec{c}}_j$.
OLS 2: The OLS estimator for a common set ${\varvec{\beta }}_1 = \cdots = {\varvec{\beta }}_m$, i.e., $\hat{{\varvec{\beta }}}_j = \hat{{\varvec{\beta }}}_{\max }$.

Table 3 Selection probability (%)

Full size table

Denoting estimators for OLS 1 by $\tilde{{\varvec{\beta }}}_1, \ldots , \tilde{{\varvec{\beta }}}_m$, we use $w_{j \ell } = \Vert \tilde{{\varvec{\beta }}}_j - \tilde{{\varvec{\beta }}}_\ell \Vert ^{-1}\ (j = 1, \ldots , m,\ \ell \in D_j)$ for GGFL penalties. Using the EGCV criterion (Ohishi et al. 2020), we determine the optimal tuning parameter and set the penalty strength to $\log n$, where the candidates for the optimal tuning parameter are the 100 points defined by $\lambda _{\max }(3/4)^j\ (j = 0, \ldots , 99)$ and $\lambda _{\max }$ is given by (18). The MSE values for cases 1 and 2 are tabulated in Tables 1 and 2, respectively, in which bold font indicates the smallest value for each setting. From the tables, we can see that MSE values for GGFL and OLS 1 are decreasing as n increases, but MSE values for OLS 2 are not necessarily decreasing as n increases. In addition, for most settings, GGFL achieves the smallest MSE values of all the estimation methods considered for both predicted and estimated values. Moreover, there are three settings in which OLS 1 achieves the smallest MSE values. The reason why GGFL is inferior than OLS 1 is that the true join configuration could not selected since n is not large enough. We can find it in Table 3. Table 3 summarizes the SP values. The results indicate the possibility that the GGFL method has consistency for the selection of true join configurations. Moreover, there are three settings in which SP values are less than 100. For the three settings, GGFL is inferior than OLS 1 in terms of MSE.

3.1.2 Comparison with ADMM

In this section, we compare our CDA with ADMM (Hallac et al. 2015; the details are given in Appendix C) for the same situation as in the previous section. The two algorithms are evaluated with respect to runtime and the minimum value of the objective function at fixed $\lambda$. Here, we set the value of $\lambda$ as $\lambda = \zeta \lambda _{\max }\ (\zeta \in \{ 1/1000, 1/100, 1/10 \})$. Let $\hat{f}_\text {CDA}$ and $\hat{f}_\text {ADMM}$ be the minimum values of the objective function by CDA and ADMM, respectively. Figures 5 and 6 show differences $\hat{f}_\text {ADMM} - \hat{f}_\text {CDA}$ over 100 repetitions for each $\zeta$. As seen from the figures, most differences are positive and difference becomes larger as $\zeta$ increases. This means that our CDA can better minimize the objective function than ADMM and that their difference becomes larger as $\lambda$ increases. We guess the reason to be that the estimates obtained by ADMM are not exactly equal. Actually, when $\lambda$ is very small, the differences are very small because most estimates are not equal. On the other hand, when $\lambda$ is large, many estimates are equal and so the differences are large. Furthermore, the degrees of freedoms of the estimates obtained by ADMM were always equal to the number of parameters. Therefore, when the optimal tuning parameter is selected based on minimizing a model selection criterion like the EGCV criterion, it is always 0 and a GGFL penalty is meaningless.

Table 4 Runtime comparison

Full size table

Table 5 The number of iterations until convergence

Full size table

Table 4 shows runtime for CDA and ADMM. From the table, we can see that our CDA can optimize the objective function much faster than ADMM. Moreover, the table shows runtime for CDA becomes shorter as n increases, while that for ADMM does not. This fact relates the number of iterations until convergence, which is provided in Table 5. From Tables 4 and 5, we can see that the runtime and the number of iterations are strongly dependent. Recall that the objective function (3) has penalty weights based on the OLS estimator. Since the OLS estimator becomes stable as n increases and the penalty weights perform well, we can say that the number of iterations for CDA decreases as n increases. Although it is the same in ADMM, ADMM has the three parameter vectors, not only ${\varvec{\beta }}$, but also ${\varvec{\gamma }}$ and $\xi$. In terms of ${\varvec{\beta }}$ and ${\varvec{\gamma }}$, we may say the same to CDA. However, the convergences of ${\varvec{\beta }}$ and ${\varvec{\gamma }}$ depend on ${\varvec{\xi }}$, and we do not know whether the convergence of ${\varvec{\xi }}$ becomes stable as n increases. Figure 7 shows paths for convergence when $m=10$, $k = 20$, and $\lambda = \lambda _{\max } / 100$ in which the y-axis shows a distance between a solution for each iteration and the optimal solution and the x-axis shows the number of iterations, where the distance is defined as follows: for any parameter vector ${\varvec{\mu }}$, we write a solution in the ith iteration as ${\varvec{\mu }}_i$, and write the optimal solution as $\hat{{\varvec{\mu }}}$. Then, we define the distance for the ith iteration as

$$\begin{aligned} \Vert {\varvec{\mu }}_i - \hat{{\varvec{\mu }}} \Vert / \Vert {\varvec{\mu }}_0 - \hat{{\varvec{\mu }}} \Vert \quad (i = 0, 1, \ldots ), \end{aligned}$$

where ${\varvec{\mu }}_0$ is an initial vector. From the figure, we can see that both of ${\varvec{\beta }}$s in ADMM and CDA go for convergence directly, while paths for ${\varvec{\gamma }}$ and ${\varvec{\xi }}$ in ADMM are bumpy. Here, the lines for ${\varvec{\beta }}$ in CDA are truncated at 4 because CDA converged at the fourth iteration for all ns. The ${\varvec{\gamma }}$ is only once away from the optimal solution at the first iteration and its distance becomes larger as n increases. We can guess that the reason for this relates a constraint of ADMM which is ${\varvec{\beta }}_j = {\varvec{\gamma }}_{j \ell }$. Although the initial vectors hold the constraint, it breaks down if ${\varvec{\beta }}$ or ${\varvec{\gamma }}$ is updated even once. Afterwards, ${\varvec{\gamma }}$ goes for convergence directly like ${\varvec{\beta }}$, but its speed is slow. The ${\varvec{\xi }}$ is also away from the optimal solution in the middle of the process. Hence, we can conclude that ${\varvec{\gamma }}$ and ${\varvec{\xi }}$ prevent steady convergence.

3.2 A real data example

In this section, we present an application of the method proposed in this paper to actual data. The dataset we use is similar to that used in Ohishi et al. (2021); it consists of rental prices—and additional data describing environmental conditions—for studio apartments in Tokyo’s 23 wards as observed between April 2014 and April 2015. The dataset, compiled by Tokyo Kantei Co., Ltd., has a sample size of $n = 61{,}999$ and contains $m = 852$ groups; the specific data items it covers are listed in Table 6, where Y and A1 through A4 are continuous variables, and B1 through B6 are dummy variables that take the value of 1 or 0.

Table 6 Data items

Full size table

For this dataset, the territory covered by Tokyo’s 23 wards was divided into 852 geographical subregions, corresponding to the 852 groups in the dataset. In our analysis, we take the monthly apartment rent to be the response variable, using all other data items as explanatory variables; however, problems arise when dummy variables are modeled group-wise. For example, if all apartments in group j have a parking lot, a rank deficient occurs in ${\varvec{X}}_j$. For this reason, our dummy variables are common to all groups. More specifically, our modeling proceeds as follows. For group j, let ${\varvec{y}}_j$ be an $n_j$-dimensional vector of the response variable for group j, let ${\varvec{X}}_j$ be an $n_j \times 5$ matrix of explanatory variables, whose first column is ${\varvec{1}}_{n_j}$ and remaining 4 columns correspond to items A1 through A4, and let ${\varvec{Z}}_j$ be an additional $n_j \times 6$ matrix of explanatory variables, whose columns correspond to items B1 through B6. Then we consider the model ${\varvec{y}}_j = {\varvec{X}}_j {\varvec{\beta }}_j + {\varvec{Z}}_j {\varvec{\gamma }}+ {\varvec{\varepsilon }}_j\ (j = 1, \ldots , m)$, where ${\varvec{\beta }}_j$ and ${\varvec{\gamma }}$ are five- and six-dimensional vectors of regression coefficients, respectively. This means that Tokyo’s 23 wards are represented by $m\ (= 852)$ submodels. We estimate ${\varvec{\beta }}_j$ and ${\varvec{\gamma }}$ as follows:

$$\begin{aligned} \hat{{\varvec{\beta }}}_\lambda&= \arg \min _{{\varvec{\beta }}} \left\{ \sum _{j=1}^m \Vert {\varvec{y}}_j - {\varvec{X}}_j {\varvec{\beta }}_j - {\varvec{Z}}_j \hat{{\varvec{\gamma }}}_\lambda \Vert ^2 + \lambda \sum _{j=1}^m \sum _{\ell \in D_j} w_{j \ell } \Vert {\varvec{\beta }}_j - {\varvec{\beta }}_\ell \Vert \right\} ,\\ \hat{{\varvec{\gamma }}}_\lambda&= \arg \min _{{\varvec{\gamma }}} \sum _{j=1}^m \Vert {\varvec{y}}_j - {\varvec{X}}_j \hat{{\varvec{\beta }}}_{\lambda , j} - {\varvec{Z}}_j {\varvec{\gamma }}\Vert ^2, \end{aligned}$$

where ${\varvec{\beta }}= ({\varvec{\beta }}_1', \ldots , {\varvec{\beta }}_m')'$ and $\hat{{\varvec{\beta }}}_\lambda = (\hat{{\varvec{\beta }}}_{\lambda , 1}', \ldots , \hat{{\varvec{\beta }}}_{\lambda , m}')'$. Here, the updates of ${\varvec{\beta }}$ by our CDA and ${\varvec{\gamma }}$ must be alternately repeated until they converge. The estimates for ${\varvec{\beta }}$ and ${\varvec{\gamma }}$ are obtained for each $\lambda$ value, and then the optimal $\lambda$ value is selected by minimizing the EGCV criterion with the penalty strength $\log n$, where candidate $\lambda$ values are given by $\lambda _{\max } (3/4)^j\ (j = 0, \ldots , 99)$.

By GGFL estimation, the 852 subregions in Tokyo’s 23 wards were joined into 166 clusters (Fig. 8).

Here, we show the results in comparison with GWR. Table 7 summarizes the estimates for the common regression coefficients. There is not much difference between the estimates by GGFL and GWR.

Table 7 Estimates for common regression coefficients

Full size table

Figures 9, 10, 11, 12 and 13 are choropleth maps (for GGFL) and scatter plots (for GWR) of estimates for varying coefficients. The GGFL results are given as choropleth maps because GGFL gives an estimate for each subregion, which can be assigned to a color, except regions where apartments cannot exist, which are indicated by grey. On the other hand, the GWR results are displayed as scatter plots because GWR gives an estimate for each observation point. From these figures, we can see that the obtained GGFL and GWR results follow similar trends. Nevertheless, clear differences between the two methods are also visible.

Table 8 shows the summary of the comparison. Although GWR has a good model fit, GGFL fitting is also not bad. On the other hand, GGFL was much faster than GWR. Furthermore, PE of the table expresses a prediction error. We calculate prediction squared errors from 100 repetitions of the holdout method, where test data are 100 random observations. PE is then defined as the square root of the mean value of the prediction squared errors. We found that although GWR has better prediction accuracy than GGFL, the difference is not very large.

Table 8 Summary of the comparison

Full size table

4 Conclusion

In this paper, we dealt with the GGFL optimization problem. Since the problem cannot be solved by existing algorithms, we proposed a CDA. To address multiple non-differentiable points well, we invoked two steps: judgment as to whether the objective function attains the minimum at a non-differentiable point; searching for the solution when the objective function is not minimized at any non-differentiable points. We also invoked a fusion cycle to avoid a solution getting stuck, similar to as in Friedman et al. (2007) and Ohishi et al. (2021). Furthermore, our CDA is guaranteed to converge and give the optimal solution, for each coordinate direction. In a simulation study, it was found that our CDA performs well and has good properties, such as consistency, with respect to the selection of true joins. Moreover, our CDA performed better than ADMM (Hallac et al. 2015). One of our aims was the clustering of m submodels, and hence we adopted a GGFL penalty which is based on the $\ell _1$ norm. However, numerical error in estimates obtained by ADMM prevents clustering. Hence, we needed to propose CDA. If clustering is not needed, i.e., shrinkage is a purpose, this is not a problem. Our CDA also performed well for an actual dataset of a large sample. We have stated only the good points of our CDA, but we should also note here that the algorithm is slightly complicated. In addition, although ADMM by Hallac et al. (2015) can be extended to using nonconvex penalties, it is difficult for our method to that. In a real data example, we compared GGFL with GWR. Although GWR was slightly superior than GGFL, GWR necessarily requires coordinates for all sample points for its implementation. It may be practically difficult to get all coordinates. In such a case, GGFL is useful since it does not require coordinates.

Data availability

Not applicable.

References

Alaíz CM, Barbero Á, Dorronsoro JR (2013) Group fused Lasso. In: Artificial neural networks and machine learning—ICANN 2013. Springer, Berlin, pp 66–73. https://doi.org/10.1007/978-3-642-40728-4_9
Arnold T, Tibshirani R (2019) genlasso: path algorithm for generalized Lasso problems. R package version 1.4. https://CRAN.R-project.org/package=genlasso
Bleakley K, Vert JP (2011) The group fused Lasso for multiple change-point detection. arXiv:1106.4199v1
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 3:1–122. https://doi.org/10.1561/2200000016
Article Google Scholar
Brunsdon C, Fotheringham S, Charlton M (1996) Geographically weighted regression: a method for exploring spatial nonstationarity. Geogr Anal 28:281–298. https://doi.org/10.1111/j.1538-4632.1996.tb00936.x
Article Google Scholar
Cao P, Liu X, Liu H, Yang J, Zhao D, Huang M, Zaiane O (2018) Generalized fused group Lasso regularized multi-task feature learning for predicting cognitive outcomes in alzheimer’s disease. Comput Methods Progr Biomed 162:19–45. https://doi.org/10.1016/j.cmpb.2018.04.028
Article Google Scholar
Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1:302–332. https://doi.org/10.1214/07-AOAS131
Article MathSciNet Google Scholar
Fu WJ (1998) Penalized regressions: the bridge versus the Lasso. J Comput Graph Stat 7:397–416. https://doi.org/10.1080/10618600.1998.10474784
Article MathSciNet Google Scholar
Geng J, Cao K, Yu L, Tang Y (2011) Geographically weighted regression model (GWR) based spatial analysis of house price in Shenzhen. In: 2011 19th International Conference on Geoinformatics, pp 1–5. https://doi.org/10.1109/GeoInformatics.2011.5981032
Hallac D, Leskovec J, Boyd S (2015) Network Lasso: clustering and optimization in large graphs. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15. Association for Computing Machinery, New York, pp 387–396. https://doi.org/10.1145/2783258.2783313
Lu B, Charlton M, Fotheringhama AS (2011) Geographically weighted regression using a non-Euclidean distance metric with a study on London House price data. Proc Environ Sci 7:92–97. https://doi.org/10.1016/j.proenv.2011.07.017
Article Google Scholar
Ohishi M, Yanagihara H, Fujikoshi Y (2020) A fast algorithm for optimizing ridge parameters in a generalized ridge regression by minimizing a model selection criterion. J Stat Plan Inference 204:187–205. https://doi.org/10.1016/j.jspi.2019.04.010
Article MathSciNet Google Scholar
Ohishi M, Fukui K, Okamura K, Itoh Y, Yanagihara H (2021) Coordinate optimization for generalized fused Lasso. Commun Stat Theory Methods 50(24):5955–5973. https://doi.org/10.1080/03610926.2021.1931888
Article MathSciNet Google Scholar
Qian J, Su L (2016) Shrinkage estimation of regression models with multiple structural changes. Econom Theor 32:1376–1433. https://doi.org/10.1017/S0266466615000237
Article MathSciNet Google Scholar
Rockafellar RT (1970) Convex analysis. Princeton University Press, New Jersey
Book Google Scholar
Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B Stat Methodol 58:267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
Article MathSciNet Google Scholar
Tibshirani R, Taylor J (2011) The solution path of the generalized Lasso. Ann Stat 39:1335–1371. https://doi.org/10.1214/11-AOS878
Article MathSciNet Google Scholar
Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused Lasso. J R Stat Soc Ser B Stat Methodol 67:91–108. https://doi.org/10.1111/j.1467-9868.2005.00490.x
Article MathSciNet Google Scholar
Wang M, Wang H (2020) Spatial distribution patterns and influencing factors of pm 2.5 pollution in the Yangtze River Delta: empirical analysis based on a GWR model. Aisa-Pac J Atmos Sci 57:63–75. https://doi.org/10.1007/s13143-019-00153-6
Article Google Scholar
Xin B, Kawahara Y, Wang Y, Gao W (2014) Efficient generalized fused Lasso and its application to the diagnosis of Alzheimer’s disease. In: Proceedings of the 28th AAAI conference on artificial intelligence. AAAI Press, California, pp 2163–2169
Yanagihara H, Oda R (2021) Coordinate descent algorithm for normal-likelihood-based group Lasso in multivariate linear regression. In: Czarnowski I, Howlett RJ, Jain LC (eds) Intelligent decision technologies. Springer Singapore, Singapore, pp 429–439, https://doi.org/10.1007/978-981-16-2765-1_36
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B Stat Methodol 68:49–67. https://doi.org/10.1111/j.1467-9868.2005.00532.x
Article MathSciNet Google Scholar
Zhu Y (2017) An augmented ADMM algorithm with application to the generalized Lasso problem. J Comput Graph Stat 26:195–204. https://doi.org/10.1080/10618600.2015.1114491
Article MathSciNet Google Scholar
Zou H (2006) The adaptive Lasso and its oracle properties. J Am Stat Assoc 101:1418–1429. https://doi.org/10.1198/016214506000000735
Article MathSciNet Google Scholar

Download references

Acknowledgements

The authors thank the associate editor and the two reviewers for their valuable comments. Moreover, the authors thank FORTE Science Communications (https://www.forte-science.co.jp/) for English language editing of the first draft.

Funding

This work was partially supported by JSPS KAKENHI Grant Number JP18K03415, JP20H04151, JP21K13834, JSPS Bilateral Program Grant Number JPJSBP120219927, and ISM Cooperative Research Program (2023-ISMCRP-4105).

Author information

Authors and Affiliations

Center for Data-driven Science and Artificial Intelligence, Tohoku University, Sendai, Japan
Mineaki Ohishi
Tokyo Kantei Co., Ltd., Shinagawa, Japan
Kensuke Okamura & Yoshimichi Itoh
Graduate School of Advanced Science and Engineering, Hiroshima University, Higashi-Hiroshima, Japan
Hirofumi Wakaki & Hirokazu Yanagihara

Authors

Mineaki Ohishi
View author publications
You can also search for this author in PubMed Google Scholar
Kensuke Okamura
View author publications
You can also search for this author in PubMed Google Scholar
Yoshimichi Itoh
View author publications
You can also search for this author in PubMed Google Scholar
Hirofumi Wakaki
View author publications
You can also search for this author in PubMed Google Scholar
Hirokazu Yanagihara
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

All authors contributed to the study conception and design.

Corresponding author

Correspondence to Mineaki Ohishi.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Communicated by Joe Suzuki.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: The proof of Theorem 2

When $r = 1$, the algorithm proposed in this paper is essentially the same as that proposed in Yanagihara and Oda (2021), which gives the convergence to the minimizer. Therefore, we now consider the case when $r \ge 2$. First, we present some lemmas that will be used to prove Theorem 2 in Appendix A.1 and then prove the theorem in Appendix A.1. The proofs of the lemmas are given in Appendix B.

1.1 A.1 Lemmas for proving Theorem 2

To prove Theorem 2, the behavior of ${\varvec{\eta }}({\varvec{\theta }})$, which is a function updating a solution given by (13), is important. We first prepare some expressions for describing properties of ${\varvec{\eta }}({\varvec{\theta }})$. For ${\varvec{\theta }}\notin \mathcal {R}\ (= \{ {\varvec{a}}_1, \ldots , {\varvec{a}}_r \})$, we define $w_j ({\varvec{\theta }})\ (j \in D = \{ 1, \ldots , r \})$ and ${\varvec{D}}({\varvec{\theta }})$ as

$$\begin{aligned} w_j ({\varvec{\theta }}) = \dfrac{ \psi _j }{ s ({\varvec{\theta }}) \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert }, \quad {\varvec{D}}({\varvec{\theta }}) = s ({\varvec{\theta }})^{-1} {\varvec{Q}}({\varvec{\theta }}) = s ({\varvec{\theta }})^{-1} {\varvec{M}}+ {\varvec{I}}_k, \end{aligned}$$

(A.1)

where ${\varvec{M}}$ and $\psi _j$ are given by (7) and (8), respectively, and $s ({\varvec{\theta }})$ and ${\varvec{Q}}({\varvec{\theta }})$ are given by (14). We also define $\zeta _j ({\varvec{\theta }}, {\varvec{\theta }}^\star )\ (j \in D)$ and $t ({\varvec{\theta }})$ as

$$\begin{aligned} \zeta _j ({\varvec{\theta }}, {\varvec{\theta }}^\star ) = \dfrac{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert - \Vert {\varvec{\theta }}^\star - {\varvec{a}}_j \Vert }{ \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert }, \quad t ({\varvec{\theta }}) = \dfrac{ \Vert {\varvec{\eta }}({\varvec{\theta }}) - {\varvec{\theta }}^\star \Vert }{ \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert }, \end{aligned}$$

(A.2)

where ${\varvec{\theta }}^\star$ is given by (9). Moreover, for a positive value $\delta$ and a k-dimensional vector ${\varvec{a}}$, let $\mathcal {B}_k ({\varvec{a}}, \delta )$ be the set defined by

$$\begin{aligned} \mathcal {B}_k ({\varvec{a}}, \delta ) = \left\{ {\varvec{\theta }}\in \mathbb {R}^k \mid 0< \Vert {\varvec{\theta }}- {\varvec{a}}\Vert < \delta \right\} . \end{aligned}$$

(A.3)

The following lemma gives properties of ${\varvec{\eta }}({\varvec{\theta }})$ (the proof is given in Appendix B.1).

Lemma A.1

Suppose that ${\varvec{\theta }}^\star \notin \mathcal {R}$. Then, ${\varvec{\eta }}({\varvec{\theta }})$ in (13) has the following properties:

(1)
The ${\varvec{\eta }}({\varvec{\theta }})$ is a continuous function.
(2)
The $\Vert {\varvec{\eta }}({\varvec{\theta }}) \Vert \le \displaystyle \max _{j \in D} \Vert {\varvec{a}}_j \Vert$ holds. Particularly, $\Vert {\varvec{\eta }}({\varvec{\theta }}) \Vert < \displaystyle \max _{j \in D} \Vert {\varvec{a}}_j \Vert$ holds for ${\varvec{\theta }}\notin \mathcal {R}.$
(3)
For ${\varvec{\theta }}\notin \mathcal {R}$, ${\varvec{\eta }}({\varvec{\theta }})$ can be expressed ${\varvec{\eta }}({\varvec{\theta }}) = {\varvec{\theta }}- {\varvec{Q}}({\varvec{\theta }})^{-1} {\varvec{g}}({\varvec{\theta }})$, where ${\varvec{g}}({\varvec{\theta }})$ is given by (12).
(4)
The ${\varvec{\theta }}^\star$ and ${\varvec{a}}_1, \ldots , {\varvec{a}}_r$ are the fixed points of ${\varvec{\eta }}({\varvec{\theta }})$.
(5)
For ${\varvec{\theta }}\notin \mathcal {R}$, ${\varvec{\eta }}({\varvec{\theta }}) - {\varvec{\theta }}^\star$ can be expressed as
$$\begin{aligned} {\varvec{\eta }}({\varvec{\theta }}) - {\varvec{\theta }}^\star = {\varvec{D}}({\varvec{\theta }})^{-1} \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert \sum _{j=1}^r w_j ({\varvec{\theta }}) \zeta _j ({\varvec{\theta }}, {\varvec{\theta }}^\star ) {\varvec{\alpha }}({\varvec{\theta }}^\star - {\varvec{a}}_j), \end{aligned}$$
where ${\varvec{\alpha }}({\varvec{\theta }})$ is given by (16).
(6)
Suppose that ${\varvec{\theta }}\notin \mathcal {R}$. Then, $t ({\varvec{\theta }}) \le s ({\varvec{\theta }}) / \{ s ({\varvec{\theta }}) + \phi _{\min } ({\varvec{M}}) \}$ and $\Vert {\varvec{\eta }}({\varvec{\theta }}) - {\varvec{\theta }}^\star \Vert < \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert$ hold, where $\phi _{\min } ({\varvec{A}})$ is the minimum eigenvalue of a matrix ${\varvec{A}}$.
(7)
$\forall {\varvec{\theta }}\in \mathcal {B}_k ({\varvec{a}}_j, d_j), {\varvec{\eta }}({\varvec{\theta }}) \ne {\varvec{a}}_j\ (j \in D)$.
(8)
$\forall {\varvec{\theta }}\in \mathcal {B}_k \left( {\varvec{a}}_j, \min \left\{ d_j, \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert \right\} \right) ,\ \Vert {\varvec{\eta }}({\varvec{\theta }}) - {\varvec{\theta }}^\star \Vert < \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert \ (j \in D)$.

In the update given by (15), if ${\varvec{\eta }}({\varvec{\theta }}_i) = {\varvec{a}}_j\ (i \in \mathbb {N}\cup \{ 0 \}; j \in D)$, the solution is updated by shifting it slightly from ${\varvec{a}}_j$. The shift in (15) is based on the following lemma (the proof is given in Appendix B.2).

Lemma A.2

Suppose that $d_j / 4 > \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert \ (j \in D)$, where $d_j$ is given by (16). Then, we have

$$\begin{aligned} \forall {\varvec{\theta }}\in \mathcal {B}_k ({\varvec{a}}_j, d_j / 4) \backslash \mathcal {B}_k ({\varvec{a}}_j, \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert ),\ \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert < \min _{\ell \in D \backslash \{ j \}} \Vert {\varvec{a}}_\ell - {\varvec{\theta }}^\star \Vert . \end{aligned}$$

Strictly speaking, in the case of ${\varvec{\eta }}({\varvec{\theta }}_i) = {\varvec{a}}_j$, we divide $d_j$ by 5, but any number greater than or equal to 4 can be used in this division. Also, ${\varvec{v}}_j$ is just the maximum directional derivative of $\tilde{f} ({\varvec{\theta }})$ at ${\varvec{\theta }}= {\varvec{a}}_j$; it does not need to be the same ${\varvec{v}}_j$ as in the update equation, where $\tilde{f} ({\varvec{\theta }})$ and ${\varvec{v}}_j$ are given by (8) and (10), respectively. Furthermore, for the update given by (15), the following lemma holds (the proof is given in Appendix B.3).

Lemma A.3

Suppose that there exist $i_0 \in \mathbb {N}\cup \{ 0 \}$ and $j_0 \in D$ such that ${\varvec{\eta }}({\varvec{\theta }}_{i_0}) = {\varvec{a}}_{j_0}$ and $d_{j_0} / 5 \ge \Vert {\varvec{a}}_{j_0} - {\varvec{\theta }}^\star \Vert$. Then, for all $i > i_0$, the following relationships hold:

$$\begin{aligned} \Vert {\varvec{\theta }}_i - {\varvec{\theta }}^\star \Vert < \min \left\{ \min _{j \in D \backslash \{ j_0 \}} \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert , \dfrac{2}{5} d_{j_0} \right\} , \quad {\varvec{\theta }}_i \in \mathcal {B}_k \left( {\varvec{a}}_{j_0}, \dfrac{3}{5} d_{j_0} \right) . \end{aligned}$$

(A.4)

1.2 A.2 The proof

Let $\tau _i = \Vert {\varvec{\theta }}_i - {\varvec{\theta }}^\star \Vert \ (i = 0, 1, \ldots )$, where ${\varvec{\theta }}_i$ is given by (15). To prove Theorem 2, we need to show the following two statements:

(S1)
The $\tau _i$ converges.
(S2)
The value $\tau _i$ converges to is 0 if $\tau _i$ converges.

We show the above by the following steps:

Step 1.
Proof of (S1): Show that there exists $i_\star \in \mathbb {N}\cup \{ 0 \}$ such that $\tau _{i+1} < \tau _i$ holds for $i > i_\star$. The discussion here consists based on the following statement obtained by Lemma A.1 (6) and the assumption of reductio ad absurdum:
$$\begin{aligned} \exists i_1 \in \mathbb {N}\cup \{ 0 \}, i_2 \in \mathbb {N}, j_0 \in D\ s.t.\ {\varvec{\eta }}({\varvec{\theta }}_{i_1}) = {\varvec{\eta }}({\varvec{\theta }}_{i_2}) = {\varvec{a}}_{j_0}, \end{aligned}$$
where $i_1 < i_2$ and $i \in (i_1, i_2) \Rightarrow {\varvec{\eta }}({\varvec{\theta }}_i) \ne {\varvec{a}}_{j_0}$.
1. Step 1-1.
  The case that $d_{j_0} / 5 \ge \Vert {\varvec{a}}_{j_0} - {\varvec{\theta }}^\star \Vert$.
2. Step 1-2.
  The case that $d_{j_0} / 5 < \Vert {\varvec{a}}_{j_0} - {\varvec{\theta }}^\star \Vert$.
Step 2.
Proof of (S2): Let $\mathcal {B}_\star = \mathcal {B}_k ({\varvec{\theta }}^\star , \tau _\infty ) \cup \{ {\varvec{\theta }}^\star \}$.
1. Step 2-1.
  The case that for all $j \in D$, ${\varvec{a}}_j \notin \partial \mathcal {B}_\star$.
2. Step 2-2.
  The case that there exists $j_0 \in D$ such that ${\varvec{a}}_{j_0} \in \partial \mathcal {B}_\star$.

Here, $d_j$ is given by (16).

For Step 1, it is sufficient from Lemma A.1 (6) to show that there exists $i_\star \in \mathbb {N}\cup \{ 0 \}$ such that

$$\begin{aligned} \forall i \in \mathbb {N},\ i > i_\star \Rightarrow {\varvec{\theta }}_{i+1} = {\varvec{\eta }}({\varvec{\theta }}_i), \end{aligned}$$

(A.5)

where ${\varvec{\eta }}({\varvec{\theta }})$ is given by (13). We show it by reductio ad absurdum, i.e., assume that $\forall i \in \mathbb {N}\cup \{ 0 \}, \exists i_0 > i\ s.t.\ {\varvec{\eta }}({\varvec{\theta }}_{i_0}) = {\varvec{a}}_j$. Then, we have

$$\begin{aligned} \exists i_1 \in \mathbb {N}\cup \{ 0 \}, i_2 \in \mathbb {N}, j_0 \in D\ s.t.\ {\varvec{\eta }}({\varvec{\theta }}_{i_1}) = {\varvec{\eta }}({\varvec{\theta }}_{i_2}) = {\varvec{a}}_{j_0}, \end{aligned}$$

where $i_1 < i_2$ and $i \in (i_1, i_2) \Rightarrow {\varvec{\eta }}({\varvec{\theta }}_i) \ne {\varvec{a}}_{j_0}$.

Regarding Step 1-1, for $i > i_1$, from Lemma A.3, ${\varvec{\theta }}_i \in \mathcal {B}_k ({\varvec{a}}_{j_0}, d_{j_0})$ holds, where $\mathcal {B}_k ({\varvec{a}}, \delta )$ is given by (A.3). It follows from this result and Lemma A.1 (7) that ${\varvec{\eta }}({\varvec{\theta }}_i) \ne {\varvec{a}}_{j_0}$. Hence, we have the contradiction ${\varvec{a}}_{j_0} = {\varvec{\eta }}({\varvec{\theta }}_{i_2}) \ne {\varvec{a}}_{j_0}$. Next, we consider Step 1-2. If there exists $i_*\in (i_1, i_2)$ and $j_*\in D \backslash \{ j_0 \}$ such that ${\varvec{\eta }}({\varvec{\theta }}_{i_*}) = {\varvec{a}}_{j_*}$ and $d_{j_*} / 5 \ge \Vert {\varvec{a}}_{j_*} - {\varvec{\theta }}^\star \Vert$, ${\varvec{\eta }}({\varvec{\theta }}_i) \ne {\varvec{a}}_{j_*}$ holds for $i > i_*$ and it follows from ${\varvec{\theta }}_i \notin \mathcal {R}\ (i > i_*)$ and Lemmas A.1 (6) and A.3 that

$$\begin{aligned} \Vert {\varvec{\eta }}({\varvec{\theta }}_i) - {\varvec{\theta }}^\star \Vert< \Vert {\varvec{\theta }}_i - {\varvec{\theta }}^\star \Vert < \min _{j \in D \backslash \{ j_*\}} \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert . \end{aligned}$$

Hence, ${\varvec{\eta }}({\varvec{\theta }}_i) \notin \mathcal {R}$ holds for $i > i_*$ and we again have the contradiction ${\varvec{a}}_{j_0} = {\varvec{\eta }}({\varvec{\theta }}_{i_2}) \ne {\varvec{a}}_{j_0}$. Then, we assume that for all $i \in (i_1, i_2)$, if there exists $j_*\in D \backslash \{ j_0 \}$ such that ${\varvec{\eta }}({\varvec{\theta }}_i) = {\varvec{a}}_{j_*}$, $d_{j_*} / 5 < \Vert {\varvec{a}}_{j_*} - {\varvec{\theta }}^\star \Vert$ holds. Notice that ${\varvec{a}}_{j_0} - d_{j_0} {\varvec{\alpha }}({\varvec{v}}_{j_0}) / 5 \in \mathcal {B}_k ({\varvec{a}}_{j_0}, \min \{ d_{j_0}, \Vert {\varvec{a}}_{j_0} - {\varvec{\theta }}^\star \Vert \})$ and ${\varvec{\theta }}_{i_1 + 1} = {\varvec{\eta }}({\varvec{a}}_{j_0} - d_{j_0} {\varvec{\alpha }}({\varvec{v}}_{j_0}) / 5)$. It follows from these results and Lemma A.1 (6) and (8) that

$$\begin{aligned} \Vert {\varvec{\theta }}_{i_1 + 1} - {\varvec{\theta }}^\star \Vert < \Vert {\varvec{a}}_{j_0} - {\varvec{\theta }}^\star \Vert = \Vert {\varvec{\eta }}({\varvec{\theta }}_{i_1}) - {\varvec{\theta }}^\star \Vert \le \Vert {\varvec{\theta }}_{i_1} - {\varvec{\theta }}^\star \Vert . \end{aligned}$$

In a similar way, for $i \in (i_1, i_2)$ such that $j_*$ exists, we have $\Vert {\varvec{\theta }}_{i+1} - {\varvec{\theta }}^\star \Vert < \Vert {\varvec{a}}_{j_*} - {\varvec{\theta }}^\star \Vert \le \Vert {\varvec{\theta }}_i - {\varvec{\theta }}^\star \Vert$. On the other hand, for $i \in (i_1, i_2)$ such that $j_*$ does not exist, it follows from Lemma A.1 (6) that $\Vert {\varvec{\theta }}_{i+1} - {\varvec{\theta }}^\star \Vert < \Vert {\varvec{\theta }}_i - {\varvec{\theta }}^\star \Vert$. Thus, we have that $\Vert {\varvec{\theta }}_{i+1} - {\varvec{\theta }}^\star \Vert < \Vert {\varvec{\theta }}_i - {\varvec{\theta }}^\star \Vert$ for all $i \in [i_1, i_2)$ and hence $\Vert {\varvec{\eta }}({\varvec{\theta }}_{i_2}) - {\varvec{\theta }}^\star \Vert \le \Vert {\varvec{\theta }}_{i_2} - {\varvec{\theta }}^\star \Vert < \Vert {\varvec{a}}_{j_0} - {\varvec{\theta }}^\star \Vert$ holds from Lemma A.1 (6). This contradicts that ${\varvec{\eta }}({\varvec{\theta }}_{i_2}) = {\varvec{a}}_{j_0}$. Therefore, (A.5) is satisfied and hence $\tau _i$ converges.

Next, we show Step 2 by reductio ad absurdum. We assume that $\lim _{i \rightarrow \infty } \tau _i = \tau _\infty > 0$. From Lemma A.1 (6), we have

$$\begin{aligned} \tau _\infty > 0 \Longrightarrow \lim _{i \rightarrow \infty } t ({\varvec{\theta }}_i) = 1 \Longrightarrow \lim _{i \rightarrow \infty } s ({\varvec{\theta }}_i) = \infty \Longrightarrow \lim _{i \rightarrow \infty } \min _{j \in D} \Vert {\varvec{\theta }}_i - {\varvec{a}}_j \Vert = 0, \end{aligned}$$

where $s ({\varvec{\theta }})$ and $t ({\varvec{\theta }})$ are given by (14) and (A.2), respectively. Then, we have $\min _{{\varvec{x}}\in \partial \mathcal {B}_\star } \Vert {\varvec{x}}- {\varvec{\theta }}_i \Vert \rightarrow 0$ as $i \rightarrow \infty$, where $\partial \mathcal {B}_\star$ is the boundary of $\mathcal {B}_\star$.

Regarding Step 2-1, let $\delta = \min _{j \in D} \min _{{\varvec{x}}\in \partial \mathcal {B}_\star } \Vert {\varvec{x}}- {\varvec{a}}_j \Vert$ ($> 0$). Then, for $\epsilon = \delta / 2$, there exists $i_0 \in \mathbb {N}\cup \{ 0 \}$ such that $i > i_0 \Rightarrow \min _{{\varvec{x}}\in \partial \mathcal {B}_\star } \Vert {\varvec{x}}- {\varvec{\theta }}_i \Vert < \epsilon$. Let ${\varvec{x}}_i^\star = \arg \min _{{\varvec{x}}\in \partial \mathcal {B}_\star } \Vert {\varvec{x}}- {\varvec{\theta }}_i \Vert$. Notice that $\Vert {\varvec{x}}_i^\star - {\varvec{a}}_j \Vert \ge \delta> \delta / 2 > \Vert {\varvec{x}}_i^\star - {\varvec{\theta }}_i \Vert$ holds for $i > i_0$. Hence, we have

$$\begin{aligned} \Vert {\varvec{\theta }}_i - {\varvec{a}}_j \Vert> \Vert {\varvec{x}}_i^\star - {\varvec{a}}_j \Vert - \Vert {\varvec{\theta }}_i - {\varvec{x}}^\star \Vert > \delta / 2 = \epsilon . \end{aligned}$$

This contradicts that $\lim _{i \rightarrow \infty } \min _{j \in D} \Vert {\varvec{\theta }}_i - {\varvec{a}}_j \Vert = 0$. Next, we consider Step 2-2. We assume that $\liminf _{i \rightarrow \infty } \Vert {\varvec{\theta }}_i - {\varvec{a}}_{j_0} \Vert = 0$. Then, there exists $i_0 > i_\star$ such that ${\varvec{\theta }}_{i_0} \in \mathcal {B}_k ({\varvec{a}}_{j_0}, \min \{ d_{j_0}, \Vert {\varvec{a}}_{j_0} - {\varvec{\theta }}^\star \Vert \})$, where $i_\star$ is given by (A.5). It follows from Lemma A.1 (8) that

$$\begin{aligned} \tau _{i_0 + 1} = \Vert {\varvec{\theta }}_{i_0 + 1} - {\varvec{\theta }}^\star \Vert = \Vert {\varvec{\eta }}({\varvec{\theta }}_{i_0}) - {\varvec{\theta }}^\star \Vert < \Vert {\varvec{a}}_{j_0} - {\varvec{\theta }}^\star \Vert = \tau _\infty . \end{aligned}$$

From this result and $\tau _i$ being a strictly monotonically decreasing sequence for $i > i_\star$ we have the contradiction $\tau _\infty > \tau _{i_0 + 1} \ge \lim _{i \rightarrow \infty } \tau _i = \tau _\infty$. Hence, $\liminf _{i \rightarrow \infty } \Vert {\varvec{\theta }}_i - {\varvec{a}}_{j_0} \Vert > 0$ holds. Furthermore, $\lim _{i \rightarrow \infty } \min _j \Vert {\varvec{\theta }}_i - {\varvec{a}}_j \Vert> \min _j \min _{{\varvec{x}}\in \partial \mathcal {B}_\star } \Vert {\varvec{x}}- {\varvec{a}}_j \Vert / 2 > 0$ holds for $j \in D$ such that ${\varvec{a}}_j \notin \partial \mathcal {B}_\star$. Therefore, we have a contradiction to $\lim _{i \rightarrow \infty } \min _{j \in D} \Vert {\varvec{\theta }}_i - {\varvec{a}}_j \Vert = 0$.

The above results imply that $\lim _{i \rightarrow \infty } \tau _i = 0$. Consequently, we have proved Theorem 2.

Appendix B: The proofs of lemmas in Appendix A.1

1.1 B.1 The proof of Lemma A.1

We prove Lemma A.1 using the following lemma (the proof is given in Appendix B.4).

Lemma B.2

The weight function $w_j ({\varvec{\theta }})\ (j \in D = \{ 1, \ldots , r \})$ in (A.1) satisfies the following properties:

(1)
$0< w_j ({\varvec{\theta }}) < 1$ and $\displaystyle \sum _{j=1}^r w_j ({\varvec{\theta }}) = 1$.
(2)
$\displaystyle \lim _{{\varvec{\theta }}\rightarrow {\varvec{a}}_j} w_j ({\varvec{\theta }}) = 1$ and $\displaystyle \lim _{{\varvec{\theta }}\rightarrow {\varvec{a}}_\ell } w_j ({\varvec{\theta }}) = 0\ (\ell \in D \backslash \{j\})$.
(3)
For all ${\varvec{\theta }}\in \mathcal {B}_k ({\varvec{a}}_j, d_j)$, $w_j ({\varvec{\theta }}) > \sum _{\ell \ne j}^r w_\ell ({\varvec{\theta }})$ holds, where $d_j$ and $\mathcal {B}_k ({\varvec{a}}, \delta )$ are given by (16) and (A.3), respectively.

First, we prove property (1). For ${\varvec{\theta }}\notin \mathcal {R}= \{ {\varvec{a}}_1, \ldots , {\varvec{a}}_r \}$, we rewrite ${\varvec{\eta }}({\varvec{\theta }})$ in (13) as

$$\begin{aligned} {\varvec{\eta }}({\varvec{\theta }}) = {\varvec{Q}}({\varvec{\theta }})^{-1} \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } {\varvec{a}}_j = {\varvec{D}}({\varvec{\theta }})^{-1} \sum _{j=1}^r w_j ({\varvec{\theta }}) {\varvec{a}}_j, \end{aligned}$$

where ${\varvec{Q}}({\varvec{\theta }})$ and ${\varvec{D}}({\varvec{\theta }})$ are given by (14) and (A.1), respectively. Notice that $\lim _{{\varvec{\theta }}\rightarrow {\varvec{a}}_j}$ ${\varvec{D}}({\varvec{\theta }})^{-1} = {\varvec{I}}_k$. From this result and Lemma B.2 (2), we have $\lim _{{\varvec{\theta }}\rightarrow {\varvec{a}}_j} {\varvec{\eta }}({\varvec{\theta }}) = {\varvec{a}}_j = {\varvec{\eta }}({\varvec{a}}_j)$. Thus, we have proved property (1).

Next, we prove property (2). If ${\varvec{\theta }}\in \mathcal {R}$, it is obvious that $\Vert {\varvec{\eta }}({\varvec{\theta }}) \Vert \le \max _{j \in D} \Vert {\varvec{a}}_j \Vert$. Now, we consider when ${\varvec{\theta }}\notin \mathcal {R}$. Let $\phi _{\min }({\varvec{A}})$ and $\phi _{\max }({\varvec{A}})$ be the minimum and maximum eigenvalues of a matrix ${\varvec{A}}$. Recall that $s ({\varvec{\theta }}) > 0$ and that ${\varvec{M}}$ is a positive definite matrix, where $s ({\varvec{\theta }})$ is given by (14). Therefore, we have

$$\begin{aligned} \phi _{\min }({\varvec{D}}({\varvec{\theta }})) = 1 + \dfrac{ \phi _{\min }({\varvec{M}}) }{ s ({\varvec{\theta }}) } > 1. \end{aligned}$$

(B.1)

Notice that $\Vert {\varvec{A}}{\varvec{u}}\Vert = \Vert {\varvec{u}}\Vert \ \Vert {\varvec{A}}{\varvec{\alpha }}({\varvec{u}}) \Vert \le \phi _{\max }({\varvec{A}}) \Vert {\varvec{u}}\Vert \ (\forall {\varvec{u}}\in \mathbb {R}^k)$ and $\phi _{\max }({\varvec{D}}({\varvec{\theta }})^{-1}) = \phi _{\min }({\varvec{D}}({\varvec{\theta }}))^{-1} < 1$. Then, it follows from these results and Lemma B.2 (1) that

$$\begin{aligned} \Vert {\varvec{\eta }}({\varvec{\theta }}) \Vert \le \dfrac{ 1 }{ \phi _{\min }({\varvec{D}}({\varvec{\theta }})) } \left\| \sum _{j=1}^r w_j ({\varvec{\theta }}) {\varvec{a}}_j \right\| < \sum _{j=1}^r w_j ({\varvec{\theta }}) \Vert {\varvec{a}}_j \Vert \le \max _{j \in D} \Vert {\varvec{a}}_j \Vert . \end{aligned}$$

Thus, we have proved property (2).

Next, we prove property (3). Notice that for ${\varvec{\theta }}\notin \mathcal {R}$,

$$\begin{aligned} {\varvec{\eta }}({\varvec{\theta }}) - {\varvec{\theta }}&= {\varvec{Q}}({\varvec{\theta }})^{-1} \left\{ \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } {\varvec{a}}_j - {\varvec{Q}}({\varvec{\theta }}) {\varvec{\theta }}\right\} \\&= {\varvec{Q}}({\varvec{\theta }})^{-1} \left\{ \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } {\varvec{a}}_j - {\varvec{M}}{\varvec{\theta }}- s ({\varvec{\theta }}) {\varvec{\theta }}\right\} = - {\varvec{Q}}({\varvec{\theta }})^{-1} {\varvec{g}}({\varvec{\theta }}), \end{aligned}$$

where ${\varvec{g}}({\varvec{\theta }})$ is given by (12). Thus, we have proved property (3).

Next, we prove property (4). It is obvious that ${\varvec{a}}_1, \ldots , {\varvec{a}}_r$ are fixed points of ${\varvec{\eta }}({\varvec{\theta }})$. Since ${\varvec{\theta }}^\star \notin \mathcal {R}$, $g ({\varvec{\theta }}^\star ) = {\varvec{0}}_k$ holds. By using property (3), we obtain ${\varvec{\eta }}({\varvec{\theta }}^\star ) = {\varvec{\theta }}^\star - {\varvec{Q}}({\varvec{\theta }}^\star )^{-1} {\varvec{g}}({\varvec{\theta }}^\star ) = {\varvec{\theta }}^\star$. Hence, ${\varvec{\theta }}^\star$ is a fixed point of ${\varvec{\eta }}({\varvec{\theta }})$. Thus, we have proved property (4).

Next, we prove property (5). Notice that ${\varvec{g}}({\varvec{\theta }}^\star ) = {\varvec{0}}_k$ and that for ${\varvec{\theta }}\notin \mathcal {R}$,

$$\begin{aligned} {\varvec{g}}({\varvec{\theta }})&= {\varvec{g}}({\varvec{\theta }}) - {\varvec{g}}({\varvec{\theta }}^\star ) \\&= {\varvec{M}}({\varvec{\theta }}- {\varvec{\theta }}^\star ) + \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } ({\varvec{\theta }}- {\varvec{a}}_j) - \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}^\star - {\varvec{a}}_j \Vert } ({\varvec{\theta }}^\star - {\varvec{a}}_j) \\&= {\varvec{M}}({\varvec{\theta }}- {\varvec{\theta }}^\star ) + \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } ({\varvec{\theta }}- {\varvec{a}}_j) \\&\quad + s ({\varvec{\theta }}) ({\varvec{\theta }}^\star - {\varvec{\theta }}^\star ) - \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } \dfrac{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert }{ \Vert {\varvec{\theta }}^\star - {\varvec{a}}_j \Vert } ({\varvec{\theta }}^\star - {\varvec{a}}_j) \\&= {\varvec{M}}({\varvec{\theta }}- {\varvec{\theta }}^\star ) + s ({\varvec{\theta }}) ({\varvec{\theta }}- {\varvec{\theta }}^\star ) \\&\quad + \sum _{j=1}^r \dfrac{ \psi _j \Vert {\varvec{\theta }}^\star - {\varvec{a}}_j \Vert }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } {\varvec{\alpha }}({\varvec{\theta }}^\star - {\varvec{a}}_j) - \sum _{j=1}^r \dfrac{ \psi _j \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } {\varvec{\alpha }}({\varvec{\theta }}^\star - {\varvec{a}}_j) \\&= {\varvec{M}}({\varvec{\theta }}- {\varvec{\theta }}^\star ) + s ({\varvec{\theta }}) ({\varvec{\theta }}- {\varvec{\theta }}^\star ) + \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } ( \Vert {\varvec{\theta }}^\star - {\varvec{a}}_j \Vert - \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert ) {\varvec{\alpha }}({\varvec{\theta }}^\star - {\varvec{a}}_j) \\&= {\varvec{Q}}({\varvec{\theta }}) ({\varvec{\theta }}- {\varvec{\theta }}^\star ) - \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } \zeta _j ({\varvec{\theta }}, {\varvec{\theta }}^\star ) {\varvec{\alpha }}({\varvec{\theta }}^\star - {\varvec{a}}_j), \end{aligned}$$

where $\alpha ({\varvec{\theta }})$ and $\zeta _j ({\varvec{\theta }}, {\varvec{\theta }}^\star )$ are given by (16) and (A.2), respectively. It follows from this result and Lemma A.1 (3) that

$$\begin{aligned} {\varvec{\eta }}({\varvec{\theta }}) - {\varvec{\theta }}^\star&= {\varvec{\theta }}- {\varvec{\theta }}^\star - {\varvec{Q}}({\varvec{\theta }})^{-1} {\varvec{g}}({\varvec{\theta }}) \\&= {\varvec{Q}}({\varvec{\theta }})^{-1} \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert \sum _{j=1}^r \dfrac{ \psi _j }{ \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert } \zeta _j ({\varvec{\theta }}, {\varvec{\theta }}^\star ) {\varvec{\alpha }}({\varvec{\theta }}^\star - {\varvec{a}}_j) \\&= {\varvec{D}}({\varvec{\theta }})^{-1} \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert \sum _{j=1}^r w_j ({\varvec{\theta }}) \zeta _j ({\varvec{\theta }}, {\varvec{\theta }}^\star ) {\varvec{\alpha }}({\varvec{\theta }}^\star - {\varvec{a}}_j). \end{aligned}$$

Thus, we have proved property (5).

Next, we prove property (6). Notice that $| \zeta _j ({\varvec{\theta }}, {\varvec{\theta }}^\star ) | \le 1$ from the triangle inequality, as

$$\begin{aligned} \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert \ge \left| \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert - \Vert {\varvec{\theta }}^\star - {\varvec{a}}_j \Vert \right| . \end{aligned}$$

Hence, for ${\varvec{\theta }}\notin \mathcal {R}$, property (5), Lemma B.2 (1), and (B.1) imply that

$$\begin{aligned} \Vert {\varvec{\eta }}({\varvec{\theta }}) - {\varvec{\theta }}^\star \Vert&\le \phi _{\min }({\varvec{D}}({\varvec{\theta }}))^{-1} \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert \sum _{j=1}^r w_j ({\varvec{\theta }})\ |\zeta _j ({\varvec{\theta }}, {\varvec{\theta }}^\star )|\ \Vert {\varvec{\alpha }}({\varvec{\theta }}^\star - {\varvec{a}}_j) \Vert \\&\le \dfrac{ s ({\varvec{\theta }}) }{ s ({\varvec{\theta }}) + \phi _{\min }({\varvec{M}}) } \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert < \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert \end{aligned}$$

Thus, we have proved property (6).

Next, we prove property (7). Notice that for ${\varvec{\theta }}\notin \mathcal {R}$,

$$\begin{aligned} {\varvec{\eta }}({\varvec{\theta }}) - {\varvec{a}}_j&= {\varvec{Q}}({\varvec{\theta }})^{-1} \left\{ \sum _{\ell =1}^r \dfrac{ \psi _\ell }{ \Vert {\varvec{\theta }}- {\varvec{a}}_\ell \Vert } {\varvec{a}}_\ell - \{ {\varvec{M}}+ s ({\varvec{\theta }}) {\varvec{I}}_k \} {\varvec{a}}_j \right\} \\&= {\varvec{Q}}({\varvec{\theta }})^{-1} \left\{ - {\varvec{M}}{\varvec{a}}_j - \sum _{\ell \ne j}^r \dfrac{ \psi _\ell }{ \Vert {\varvec{\theta }}- {\varvec{a}}_\ell \Vert } ({\varvec{a}}_j - {\varvec{a}}_\ell ) \right. \\&\quad \left. - \sum _{\ell \ne j}^r \dfrac{\psi _\ell }{\Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert } ({\varvec{a}}_j - {\varvec{a}}_\ell ) + \sum _{\ell \ne j}^r \dfrac{\psi _\ell }{\Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert } ({\varvec{a}}_j - {\varvec{a}}_\ell ) \right\} \\&= {\varvec{Q}}({\varvec{\theta }})^{-1} \left\{ - {\varvec{v}}_j + \sum _{\ell \ne j}^r \psi _\ell \left( \dfrac{ 1 }{ \Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert } - \dfrac{ 1 }{ \Vert {\varvec{\theta }}- {\varvec{a}}_\ell \Vert } \right) ({\varvec{a}}_j - {\varvec{a}}_\ell ) \right\} \\&= {\varvec{Q}}({\varvec{\theta }})^{-1} \left\{ - {\varvec{v}}_j + \sum _{\ell \ne j}^r \dfrac{\psi _\ell }{\Vert {\varvec{\theta }}- {\varvec{a}}_\ell \Vert } ( \Vert {\varvec{\theta }}- {\varvec{a}}_\ell \Vert - \Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert ) {\varvec{\alpha }}({\varvec{a}}_j - {\varvec{a}}_\ell ) \right\} \\&= {\varvec{Q}}({\varvec{\theta }})^{-1} \left\{ - {\varvec{v}}_j + {\varvec{q}}_j ({\varvec{\theta }}) \right\} , \end{aligned}$$

where ${\varvec{v}}_j$ is given by (10) and ${\varvec{q}}_j ({\varvec{\theta }})$ is the k-dimensional vector defined by

$$\begin{aligned} {\varvec{q}}_j ({\varvec{\theta }}) = s ({\varvec{\theta }}) \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert \ \sum _{\ell \ne j}^r w_\ell ({\varvec{\theta }}) \zeta _\ell ({\varvec{\theta }}, {\varvec{a}}_j)\ {\varvec{\alpha }}({\varvec{a}}_j - {\varvec{a}}_\ell ). \end{aligned}$$

From property (4) and the result that ${\varvec{Q}}({\varvec{\theta }})$ is a positive definite matrix, we have

$$\begin{aligned} {\varvec{\eta }}({\varvec{\theta }}) = {\varvec{a}}_j \Longleftrightarrow {\varvec{\theta }}\in \mathcal {S}_j = \left\{ {\varvec{\theta }}\in \mathbb {R}^k \mid {\varvec{q}}_j ({\varvec{\theta }}) = {\varvec{v}}_j \right\} . \end{aligned}$$

This implies that

$$\begin{aligned} {\varvec{\theta }}\in \mathcal {S}_j^c \backslash \{ {\varvec{a}}_j \} \Longrightarrow {\varvec{\eta }}({\varvec{\theta }}) \ne {\varvec{a}}_j. \end{aligned}$$

Since ${\varvec{\theta }}^\star \notin \mathcal {R}$, $\Vert {\varvec{v}}_j \Vert > \psi _j$ holds from Theorem 1. This implies that $\Vert {\varvec{q}}_j ({\varvec{\theta }}) \Vert \le \psi _j \Rightarrow {\varvec{\theta }}\in \mathcal {S}_j^c$. For any ${\varvec{\theta }}\in \mathcal {B}_k ({\varvec{a}}_j, d_j)$, it follows from Lemma B.2 (1) and (3) that

$$\begin{aligned} \Vert {\varvec{q}}_j ({\varvec{\theta }}) \Vert \le s ({\varvec{\theta }}) \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert \sum _{\ell \ne j}^r w_\ell ({\varvec{\theta }}) < s ({\varvec{\theta }}) \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert w_j ({\varvec{\theta }}) = \psi _j, \end{aligned}$$

where $d_j$ and $\mathcal {B}_k ({\varvec{a}}, \delta )$ are given by (16) and (A.3), respectively. Notice that ${\varvec{\theta }}\in \mathcal {B}_k ({\varvec{a}}_j, d_j) \Rightarrow {\varvec{\theta }}\notin \mathcal {R}$. Thus, we have proved property (7).

Finally, we prove property (8). Notice that ${\varvec{\theta }}\in \mathcal {B}_k ({\varvec{a}}_j, d_j^\star )$ has representation ${\varvec{\theta }}= {\varvec{a}}_j + \delta {\varvec{\alpha }}$ for some $\delta \in (0, d_j^\star )$ and ${\varvec{\alpha }}\in \mathcal {A}_k$, where $d_j^\star = \min \{ d_j, \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert \}$ and $\mathcal {A}_k$ is given by (5). Then, for ${\varvec{\theta }}\in \mathcal {B}_k ({\varvec{a}}_j, d_j^\star )$, from property (5), we have

$$\begin{aligned} \begin{aligned} {\varvec{\eta }}({\varvec{\theta }}) - {\varvec{\theta }}^\star&= {\varvec{D}}({\varvec{\theta }})^{-1} \left\{ w_j ({\varvec{\theta }}) (\delta - \Vert {\varvec{\theta }}^\star - {\varvec{a}}_j \Vert ) {\varvec{\alpha }}({\varvec{\theta }}^\star - {\varvec{a}}_j) \right. \\&\quad \left. + \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert \sum _{\ell \ne j}^r w_\ell ({\varvec{\theta }}) \zeta _\ell ({\varvec{\theta }}, {\varvec{\theta }}^\star ) {\varvec{\alpha }}({\varvec{\theta }}^\star - {\varvec{a}}_\ell ) \right\} . \end{aligned} \end{aligned}$$

(B.2)

Moreover, we have $\Vert {\varvec{a}}_j + \delta {\varvec{\alpha }}- {\varvec{\theta }}^\star \Vert \le \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert + \delta$. By using these results, (B.2), and Lemma B.2 (1) and (3), we can derive

$$\begin{aligned}&\Vert {\varvec{\eta }}({\varvec{\theta }}) - {\varvec{\theta }}^\star \Vert \\&\hspace{0.5cm}< \left\| w_j ({\varvec{\theta }}) (\delta - \Vert {\varvec{\theta }}^\star - {\varvec{a}}_j \Vert ) {\varvec{\alpha }}({\varvec{\theta }}^\star - {\varvec{a}}_j) + \sum _{\ell \ne j}^r w_\ell ({\varvec{\theta }}) \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert \zeta _\ell ({\varvec{\theta }}, {\varvec{\theta }}^\star ) {\varvec{\alpha }}({\varvec{\theta }}^\star - {\varvec{a}}_\ell ) \right\| \\&\hspace{0.5cm} \le w_j ({\varvec{\theta }}) \left| \delta - \Vert {\varvec{\theta }}^\star - {\varvec{a}}_j \Vert \right| + \sum _{\ell \ne j}^r w_\ell ({\varvec{\theta }}) \Vert {\varvec{a}}_j + \delta {\varvec{\alpha }}- {\varvec{\theta }}^\star \Vert \\&\hspace{0.5cm} \le w_j ({\varvec{\theta }}) \left( \Vert {\varvec{\theta }}^\star - {\varvec{a}}_j \Vert - \delta \right) + \sum _{\ell \ne j}^r w_\ell ({\varvec{\theta }}) \left( \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert + \delta \right) \\&\hspace{0.5cm} = \Vert {\varvec{\theta }}^\star - {\varvec{a}}_j \Vert \sum _{\ell =1}^r w_\ell ({\varvec{\theta }}) - \delta \left\{ w_j ({\varvec{\theta }}) - \sum _{\ell \ne j}^r w_\ell ({\varvec{\theta }}) \right\} < \Vert {\varvec{\theta }}^\star - {\varvec{a}}_j \Vert . \end{aligned}$$

Thus, we have proved property (8).

Consequently, we have proved Lemma A.1.

1.2 B.2 The proof of Lemma A.2

The ${\varvec{\theta }}\in \mathcal {B}_k ({\varvec{a}}_j, d_j / 4) \backslash \mathcal {B}_k ({\varvec{a}}_j, \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert )$ has representation ${\varvec{\theta }}= {\varvec{a}}_j + \delta {\varvec{\alpha }}$ for some $\delta \in [ \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert , d_j / 4 )$ and ${\varvec{\alpha }}\in \mathcal {A}_k$, where $\mathcal {A}_k$, $d_j$, and $\mathcal {B}_k ({\varvec{a}}, \delta )$ are given by (5), (16), and (A.3), respectively. Notice that for any $\ell \in D \backslash \{j\}$, $\Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert \le \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert + \Vert {\varvec{a}}_\ell - {\varvec{\theta }}^\star \Vert$, and

$$\begin{aligned} \Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert \ge \min _{\ell \in D \backslash \{ j \}} \Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert > \dfrac{ \psi _j }{ \sum _{\ell =1}^r \psi _\ell } \min _{\ell \in D \backslash \{ j \}} \Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert = d_j. \end{aligned}$$

Hence, we have

$$\begin{aligned} \Vert {\varvec{a}}_\ell - {\varvec{\theta }}^\star \Vert \ge \Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert - \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert> d_j - \dfrac{1}{4} d_j = \dfrac{3}{4} d_j > 3 \delta . \end{aligned}$$

From this result, for ${\varvec{\theta }}\in \mathcal {B}_k ({\varvec{a}}_j, d_j / 4) \backslash \mathcal {B}_k ({\varvec{a}}_j, \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert )$, we have

$$\begin{aligned} \Vert {\varvec{\theta }}- {\varvec{\theta }}^\star \Vert = \Vert {\varvec{a}}_j + \delta {\varvec{\alpha }}- {\varvec{\theta }}^\star \Vert \le \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert + \delta \le 2 \delta< 3 \delta < \Vert {\varvec{a}}_\ell - {\varvec{\theta }}^\star \Vert . \end{aligned}$$

Consequently, we have proved Lemma A.2.

1.3 B.3 The proof of Lemma A.3

First, we consider the case that $i = i_0 + 1$. Since ${\varvec{\theta }}_i = {\varvec{\eta }}({\varvec{a}}_{j_0} - d_{j_0} {\varvec{\alpha }}({\varvec{v}}_{j_0}) / 5)$, we have ${\varvec{\theta }}_i \ne {\varvec{a}}_{j_0}$ from Lemma A.1 (7), where ${\varvec{\eta }}({\varvec{\theta }})$, $d_j$, and ${\varvec{\alpha }}({\varvec{\theta }})$ are given by (13) and (16). Notice that ${\varvec{a}}_{j_0} - d_{j_0} {\varvec{\alpha }}({\varvec{v}}_{j_0}) / 5 \in \mathcal {B}_k ({\varvec{a}}_{j_0}, d_{j_0} / 4) \backslash \mathcal {B}_k ({\varvec{a}}_{j_0}, \Vert {\varvec{a}}_{j_0} - {\varvec{\theta }}^\star \Vert )$. It follows from Lemma A.1 (6) and Lemma A.2 that

$$\begin{aligned} \Vert {\varvec{\theta }}_i - {\varvec{\theta }}^\star \Vert< \Vert {\varvec{a}}_{j_0} - d_{j_0} {\varvec{\alpha }}({\varvec{v}}_{j_0}) / 5 - {\varvec{\theta }}^\star \Vert < \min _{j \in D \backslash \{ j_0 \}} \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert . \end{aligned}$$

Moreover, we have

$$\begin{aligned} \Vert {\varvec{\theta }}_i - {\varvec{\theta }}^\star \Vert&< \Vert {\varvec{a}}_{j_0} - d_{j_0} {\varvec{\alpha }}({\varvec{v}}_{j_0}) / 5 - {\varvec{\theta }}^\star \Vert \\&\le \Vert {\varvec{a}}_{j_0} - d_{j_0} {\varvec{\alpha }}({\varvec{v}}_{j_0}) / 5 - {\varvec{a}}_{j_0} \Vert + \Vert {\varvec{a}}_{j_0} - {\varvec{\theta }}^\star \Vert \le \dfrac{2}{5} d_{j_0}, \\ \Vert {\varvec{\theta }}_i - {\varvec{a}}_{j_0} \Vert&\le \Vert {\varvec{\theta }}_i - {\varvec{\theta }}^\star \Vert + \Vert {\varvec{\theta }}^\star - {\varvec{a}}_{j_0} \Vert < \dfrac{3}{5} d_{j_0}. \end{aligned}$$

Hence, (A.4) holds for $i = i_0 + 1$.

Next, we assume that (A.4) holds for $i = i_*\ (\ge i_0 + 1)$. Notice that ${\varvec{\theta }}_i \notin \mathcal {R}$. It follows from Lemma A.1 (6) that

$$\begin{aligned} \Vert {\varvec{\eta }}({\varvec{\theta }}_i) - {\varvec{\theta }}^\star \Vert< \Vert {\varvec{\theta }}_i - {\varvec{\theta }}^\star \Vert < \min _{j \in D \backslash \{ j_0 \}} \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert . \end{aligned}$$

This implies ${\varvec{\eta }}({\varvec{\theta }}_i) \notin \mathcal {R}\backslash \{ {\varvec{a}}_{j_0} \}$. Moreover, we can see that ${\varvec{\eta }}({\varvec{\theta }}_i) \ne {\varvec{a}}_{j_0}$ from Lemma A.1 (7). Hence, ${\varvec{\theta }}_{i+1}$ is updated by ${\varvec{\theta }}_{i+1} = {\varvec{\eta }}({\varvec{\theta }}_i)\ (\notin \mathcal {R})$. From the above, we have

$$\begin{aligned} \Vert {\varvec{\theta }}_{i+1} - {\varvec{\theta }}^\star \Vert&< \Vert {\varvec{\theta }}_i - {\varvec{\theta }}^\star \Vert< \min \left\{ \min _{j \in D \backslash \{ j_0 \}} \Vert {\varvec{a}}_j - {\varvec{\theta }}^\star \Vert , \dfrac{2}{5} d_{j_0} \right\} , \\ \Vert {\varvec{\theta }}_{i+1} - {\varvec{a}}_{j_0} \Vert&\le \Vert {\varvec{\theta }}_{i+1} - {\varvec{\theta }}^\star \Vert + \Vert {\varvec{\theta }}^\star - {\varvec{a}}_{j_0} \Vert < \dfrac{3}{5} d_{j_0}. \end{aligned}$$

Hence, (A.4) holds for $i = i_*+ 1$ if (A.4) holds for $i = i_*$.

Consequently, we have proved Lemma A.3 by mathematical induction.

1.4 B.4 The proof of Lemma B.2

Since property (1) is obvious, we start by proving property (2). Notice that,

$$\begin{aligned} w_j ({\varvec{\theta }}) = \dfrac{ \psi _j }{ \psi _j + \Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert \sum _{\ell \ne j}^r \psi _\ell / \Vert {\varvec{\theta }}- {\varvec{a}}_\ell \Vert }. \end{aligned}$$

This implies that $\lim _{{\varvec{\theta }}\rightarrow {\varvec{a}}_j} w_j ({\varvec{\theta }}) = 1$ and $\lim _{{\varvec{\theta }}\rightarrow {\varvec{a}}_\ell } w_j ({\varvec{\theta }}) = 0$.

Next, we prove property (3). The ${\varvec{\theta }}\in \mathcal {B}_k ({\varvec{a}}_j, d_j)$ has representation ${\varvec{\theta }}= {\varvec{a}}_j + \delta {\varvec{\alpha }}$ for some $\delta \in (0, d_j)$ and ${\varvec{\alpha }}\in \mathcal {A}_k$, where $\mathcal {A}_k$, $d_j$, and $\mathcal {B}_k$ are given by (5), (16), and (A.3), respectively. Since $d_j < \min _{\ell \in D \backslash \{j\}} \Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert$, we can see that $\Vert {\varvec{a}}_j - {\varvec{a}}_\ell + \delta {\varvec{\alpha }}\Vert > \Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert - \delta$. This implies that

$$\begin{aligned} \sum _{\ell \ne j}^r \dfrac{ \psi _\ell }{ \Vert {\varvec{a}}_j - {\varvec{a}}_\ell + \delta {\varvec{\alpha }}\Vert }< \sum _{\ell \ne j}^r \dfrac{ \psi _\ell }{ \Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert - \delta } < \dfrac{ \sum _{\ell \ne j}^r \psi _\ell }{ \min _{\ell \in D \backslash \{ j \}} \Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert - \delta }. \end{aligned}$$

Notice that $\Vert {\varvec{\theta }}- {\varvec{a}}_j \Vert = \delta$. From these results and property (1), for ${\varvec{\theta }}\in \mathcal {B}_k ({\varvec{a}}_j, d_j)$, we have

$$\begin{aligned} w_j ({\varvec{\theta }})> \sum _{\ell \ne j}^r w_\ell ({\varvec{\theta }})&\Longleftrightarrow \dfrac{\psi _j}{\delta }> \sum _{\ell \ne j}^r \dfrac{\psi _\ell }{\Vert {\varvec{a}}_j - {\varvec{a}}_\ell + \delta {\varvec{\alpha }}\Vert } \\&\Longleftarrow \dfrac{\psi _j}{\delta } > \dfrac{ \sum _{\ell \ne j}^r \psi _\ell }{ \min _{\ell \in D \backslash \{ j \}} \Vert {\varvec{a}}_j - {\varvec{a}}_\ell \Vert - \delta } \Longleftrightarrow \delta < d_j. \end{aligned}$$

Thus, we have proved property (3).

Consequently, we have proved Lemma B.2.

Appendix C: ADMM implementation

ADMM solves the following minimization problem:

$$\begin{aligned} \text{ minimize }&\quad \sum _{j=1}^m \Vert {\varvec{y}}_j - {\varvec{X}}_j {\varvec{\beta }}_j \Vert ^2 + \lambda \sum _{j=1}^m \sum _{\ell \in D_j} w_{j \ell } \Vert {\varvec{\gamma }}_{j \ell } - {\varvec{\gamma }}_{\ell j} \Vert \\ \text{ subject } \text{ to }&\quad {\varvec{\beta }}_j = {\varvec{\gamma }}_{j \ell }\ (j = 1, \ldots , m;\ \ell \in D_j). \end{aligned}$$

Note that the above minimization problem is equivalent to that of (3). To solve the above problem, we minimize the following scaled form of its augmented Lagrangian:

$$\begin{aligned} L_\rho ({\varvec{\beta }}, {\varvec{\gamma }}, {\varvec{\xi }})&= \sum _{j=1}^m \Vert {\varvec{y}}_j - {\varvec{X}}_j {\varvec{\beta }}_j \Vert ^2 \\&\quad + \sum _{j=1}^m \sum _{\ell \in D_j} \left\{ \lambda w_{j \ell } \Vert {\varvec{\gamma }}_{j \ell } - {\varvec{\gamma }}_{\ell j} \Vert - \dfrac{\rho }{2} \Vert {\varvec{\xi }}_{j \ell } \Vert ^2 + \dfrac{\rho }{2} \Vert {\varvec{\beta }}_j - {\varvec{\gamma }}_{j \ell } + \xi _{j \ell } \Vert ^2 \right\} , \end{aligned}$$

where $\rho > 0$ is the penalty parameter and ${\varvec{\xi }}$ is the scaled dual variable. Then, ADMM consists of repeating the following updates (Hallac et al. 2015):

$$\begin{aligned} {\varvec{\beta }}_j^\text {new}&= \arg \min _{{\varvec{\beta }}_j} L_\rho ({\varvec{\beta }}, {\varvec{\gamma }}^\text {old}, {\varvec{\xi }}^\text {old}) \\&= \left( {\varvec{X}}_j' {\varvec{X}}_j + \dfrac{ \rho r_j }{ 2 } {\varvec{I}}_k \right) ^{-1} \left( {\varvec{X}}_j' {\varvec{y}}_j + \dfrac{\rho }{2} \sum _{\ell \in D_j} {\varvec{b}}_{j \ell } \right) \quad (j = 1, \ldots , m), \\ {\varvec{\gamma }}_{j \ell }^\text {new}&= \arg \min _{{\varvec{\gamma }}_{j \ell }} L_\rho ({\varvec{\beta }}^\text {new}, {\varvec{\gamma }}, {\varvec{\xi }}^\text {old}) \quad (j = 1, \ldots , m;\ \ell \in D_j) \\&= \theta _{j \ell } ({\varvec{\beta }}_j^\text {new} + {\varvec{\xi }}_{j \ell }^\text {old}) + (1 - \theta _{j \ell }) ({\varvec{\beta }}_\ell ^\text {new} + {\varvec{\xi }}_{\ell j}^\text {old}), \\ {\varvec{\xi }}_{j \ell }^\text {new}&= {\varvec{\xi }}_{j \ell }^\text {old} + {\varvec{\beta }}_j^\text {new} - {\varvec{\gamma }}_{j \ell }^\text {new} \quad (j = 1, \ldots , m;\ \ell \in D_j), \end{aligned}$$

where $r_j = \# (D_j)$, ${\varvec{b}}_{j \ell } = {\varvec{\gamma }}_{j \ell }^\text {old} - {\varvec{\xi }}_{j \ell }^\text {old}$, and $\theta _{j \ell }$ is given by

$$\begin{aligned} \theta _{j \ell } = \max \left\{ 1 - \dfrac{ 2 \lambda w_{j \ell } }{ \rho \Vert {\varvec{\beta }}_j^\text {new} + {\varvec{\xi }}_{j \ell }^\text {old} - ({\varvec{\beta }}_\ell ^\text {new} + {\varvec{\xi }}_{\ell j}^\text {old}) \Vert }, \dfrac{1}{2} \right\} . \end{aligned}$$

In Sect. 3.1.2, we adopted the updating strategy of $\rho$ proposed by Zhu (2017). Moreover, in the above update, ${\varvec{\beta }}_j$ and ${\varvec{\beta }}_\ell$ are separately estimated without any constraints. Hence, numerical error occurs between ${\varvec{\beta }}_j^\text {new}$ and ${\varvec{\beta }}_\ell ^\text {new}$ and they are not to be exact equal.

Rights and permissions

This article is published under an open access license. Please check the 'Copyright Information' section either on this page or in the PDF for details of this license and what re-use is permitted. If your intended use exceeds what is permitted by the license or if you are unable to locate the licence and re-use information, please contact the Rights and Permissions team.

About this article

Cite this article

Ohishi, M., Okamura, K., Itoh, Y. et al. Coordinate descent algorithm for generalized group fused Lasso. Behaviormetrika (2024). https://doi.org/10.1007/s41237-024-00233-6

Download citation

Received: 23 October 2023
Accepted: 12 May 2024
Published: 28 May 2024
DOI: https://doi.org/10.1007/s41237-024-00233-6

Keywords

Mathematics Subject Classification

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Coordinate descent algorithm for generalized group fused Lasso

Abstract

Similar content being viewed by others

Generalized fused Lasso for grouped data in generalized linear models

Coordinate Descent Algorithm for Normal-Likelihood-Based Group Lasso in Multivariate Linear Regression

Some properties of generalized fused lasso and its applications to high dimensional data

1 Introduction

2 Main results

2.1 Conditions indicating the presence of groups to be joined

Theorem 1

2.2 Searching minimizer in the absence of groups to be joined

Theorem 2

2.3 Coordinate descent algorithm for GGFL

Algorithm 1

Algorithm 2

3 Numerical studies

3.1 Simulation

3.1.1 Performance of GGFL

3.1.2 Comparison with ADMM

3.2 A real data example

4 Conclusion

Data availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendices

Appendices

Appendix A: The proof of Theorem 2

1.1 A.1 Lemmas for proving Theorem 2

Lemma A.1

Lemma A.2

Lemma A.3

1.2 A.2 The proof

Appendix B: The proofs of lemmas in Appendix A.1

1.1 B.1 The proof of Lemma A.1

Lemma B.2

1.2 B.2 The proof of Lemma A.2

1.3 B.3 The proof of Lemma A.3

1.4 B.4 The proof of Lemma B.2

Appendix C: ADMM implementation

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation