Generalized Fused Lasso for Grouped Data in Generalized Linear Models

Generalized fused Lasso (GFL) is a powerful method based on adjacent relationships or the network structure of data. It is used in a number of research areas, including clustering, discrete smoothing, and spatio-temporal analysis. When applying GFL, the specific optimization method used is an important issue. In generalized linear models, efficient algorithms based on the coordinate descent method have been developed for trend filtering under the binomial and Poisson distributions. However, to apply GFL to other distributions, such as the negative binomial distribution, which is used to deal with overdispersion in the Poisson distribution, or the gamma and inverse Gaussian distributions, which are used for positive continuous data, an algorithm for each individual distribution must be developed. To unify GFL for distributions in the exponential family, this paper proposes a coordinate descent algorithm for generalized linear models. To illustrate the method, a real data example of spatio-temporal analysis is provided.


Introduction
Assume we have grouped data such that y_j1, . . ., y_jn_j are observations of the jth group (j ∈ {1, . . ., m}) for m groups. Further, assume the following generalized linear models (GLMs; Nelder & Wedderburn, 1972) with canonical parameter θ_ji and dispersion parameter ϕ > 0:

    y_ji ~ exp[{y_ji θ_ji − b(θ_ji)} a_ji/ϕ + c(y_ji, ϕ)],  θ_ji = h(β_j + q_ji),    (1.1)

where y_ji is independent with respect to j and i, a_ji is a known constant, h(•) is a known differentiable function, β_j is an unknown parameter, and q_ji is a known term called the offset, which is zero in many cases. Although θ_ji depends not only on the group but also on the individual, the jth group is characterized by a common parameter β_j. We are thus interested in describing the relationship among the m groups. Here, the expectation of y_ji is given by

    E[y_ji] = μ(β_j + q_ji) = ḃ(θ_ji),

where μ(•) is a known function and ḃ(•) is the derivative of b(•), i.e., ḃ(θ) = db(θ)/dθ. Furthermore, μ^{−1}(•) is a link function, and h(•) is the identity function, i.e., h(η) = η, when μ^{−1}(•) is the canonical link. Tables 1, 2, and 3 summarize the relationships between model (1.1) and each individual distribution. In this paper, we consider clustering for m groups or discrete smoothing via generalized fused Lasso (GFL; e.g., Höfling et al., 2010; Ohishi et al., 2021).

Table 1. Distributions covered by model (1.1): e.g., Gamma Ga(1/ϕ, ζϕ) (shape: 1/ϕ; scale: ζϕ) and inverse Gaussian IG(ζ, 1/ϕ) (shape: 1/ϕ).
Table 2. GLM components for the canonical link (h(η) = η). (1) V(•) is a variance function.
Table 3. GLM components for the log-link.

GFL is an extension of fused Lasso (Tibshirani et al., 2005) that can incorporate relationships among multiple variables, such as adjacency relationships and network structure, into parameter estimation. For example, Xin et al. (2014) applied GFL to the diagnosis of Alzheimer's disease by expressing the structure of structural magnetic resonance images of human brains as a 3D grid graph; Ohishi et al. (2021) applied GFL to model spatial data based on geographical adjacency. Although the GFL in these particular instances is based on one factor (brain structure or a geographical relationship), it can deal with relationships based on multiple factors. For example, we can define an adjacency relationship for spatio-temporal cases based on two factors by combining geographical adjacency and the order of time. Yamamura et al. (2021), Ohishi et al. (2022), and Yamamura et al. (2023) dealt with multivariate trend filtering (e.g., Tibshirani, 2014) based on multiple factors via GFL and applied it to the estimation of spatio-temporal trends. Yamamura et al. (2021) and Ohishi et al. (2022) used a logistic regression model, which coincides with model (1.1) with n_j = 1, q_ji = 0 (∀j ∈ {1, . . ., m}; ∀i ∈ {1, . . ., n_j}) under a binomial distribution. Since this relationship holds by the reproductive property of the binomial distribution, their methods can also be applied to grouped data. Although Yamamura et al. (2021) minimized the coordinate-wise objective functions using a linear approximation, Ohishi et al. (2022) showed numerically that direct minimization can provide the solution faster and more accurately than minimization using a linear approximation. Ohishi et al. (2021) also derived an explicit update equation for the coordinate descent algorithm, which corresponds to model (1.1) under the Gaussian distribution. As described, coordinate descent algorithms have been developed to produce GFL estimators for three specific distributions; however, none have been proposed for other distributions. For example, we have the option of using the negative binomial distribution to deal with overdispersion in the Poisson distribution (e.g., Gardner et al., 1995; Ver Hoef & Boveng, 2007), or the gamma or inverse Gaussian distribution for positive continuous data. To apply GFL to these distributions, it is necessary to derive update equations for each distribution individually.
In this paper, we propose a coordinate descent algorithm to obtain GFL estimators for model (1.1) in order to unify the GFL approach for distributions in the exponential family. We estimate the parameter vector β = (β_1, . . ., β_m)′ by minimizing the following function, defined by removing terms of the negative log-likelihood that do not depend on β and adding a GFL penalty:

    L(β) = Σ_{j=1}^{m} Σ_{i=1}^{n_j} a_ji {b(θ_ji) − y_ji θ_ji} + λ Σ_{j=1}^{m} Σ_{ℓ∈D_j, ℓ>j} w_jℓ |β_j − β_ℓ|,    (1.2)

where λ is a non-negative tuning parameter, D_j ⊆ {1, . . ., m}\{j} is an index set expressing the adjacency relationships among groups and satisfying ℓ ∈ D_j ⇔ j ∈ D_ℓ, and w_jℓ is a positive weight satisfying w_jℓ = w_ℓj. The GFL penalty shrinks the difference |β_j − β_ℓ| between two adjacent groups and often gives a solution satisfying |β_j − β_ℓ| = 0 (⇔ β_j = β_ℓ). That is, GFL can estimate some parameters to be exactly equal, thus enabling the clustering of the m groups or the accomplishment of discrete smoothing. To obtain the GFL estimator of β, we minimize the objective function (1.2) via a coordinate descent algorithm. Following Ohishi et al. (2022) and Yamamura et al.
(2023), we directly minimize coordinate-wise objective functions without the use of approximations. In ordinary situations, where a canonical link (h(η) = η) is used and there is no offset (q_ji = 0), and in several other situations, the update equation for the solution can be derived in closed form. Even when a closed form is unavailable, the proposed algorithm can specify an interval that includes the solution, which means we can easily obtain the solution by a simple numerical search. The remainder of the paper is organized as follows: In Section 2, we give an overview of the coordinate descent algorithm and derive the objective functions for each step. In Section 3, we discuss the coordinate-wise minimization of the coordinate descent algorithm and derive update equations in closed form in many cases. In Section 4, we evaluate the performance of the proposed method via numerical simulation. In Section 5, we provide a real data example. Section 6 concludes the paper. Technical details are given in the Appendix.
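As a concrete illustration of an objective of the form (1.2), the following sketch evaluates it for the Gaussian canonical-link case (b(θ) = θ²/2, a_ji = 1, no offset); the container names (`groups`, `adjacency`, `weights`) are illustrative choices, not from the paper.

```python
def gfl_objective(beta, groups, adjacency, weights, lam):
    """Objective of the form (1.2), sketched for the Gaussian canonical case
    (b(theta) = theta^2 / 2, a_ji = 1, q_ji = 0); other exponential-family
    members would swap in their own b(.) and link h(.)."""
    # negative log-likelihood part (terms constant in beta dropped)
    nll = 0.0
    for j, obs in groups.items():      # obs: observations y_j1, ..., y_jn_j
        theta = beta[j]                # canonical link, no offset
        nll += sum(0.5 * theta ** 2 - y * theta for y in obs)
    # generalized fused-lasso penalty over adjacent pairs, each edge once
    pen = 0.0
    for j, neigh in adjacency.items():
        for l in neigh:
            if l > j:                  # D_j is symmetric, so count j < l only
                pen += weights[(j, l)] * abs(beta[j] - beta[l])
    return nll + lam * pen
```

For a pair of fused groups the penalty term vanishes, which is exactly the mechanism that produces clustering.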

Preliminaries
As in Ohishi et al. (2022) and Yamamura et al. (2023), we minimize the objective function (1.2) using a coordinate descent algorithm. Algorithm 1 gives an overview of the algorithm.

Algorithm 1 Overview of the coordinate descent algorithm
Require: λ and an initial vector for β
repeat
    execute descent cycle
    if some parameters are exactly equal then
        execute fusion cycle
    end if
until the solution converges

The descent cycle updates the parameters separately, and several parameters are often updated to be exactly equal. If several parameters are exactly equal, their updates can become stuck. To avoid this, the fusion cycle simultaneously updates equal parameters (Friedman et al., 2007). In each cycle of the coordinate descent algorithm, the following function is essentially minimized:

    f(x) = Σ_{i=1}^{d} a_i {b(h(x + q_i)) − y_i h(x + q_i)} + λ Σ_{ℓ=1}^{r} w_ℓ |x − z_ℓ|,    (2.1)

where a_i and w_ℓ are positive constants, y_i and q_i are constants determined by the data, and z_ℓ (ℓ = 1, . . ., r) are constants satisfying z_1 < · · · < z_r. The minimization of f(x) is described in Section 3, and the following subsections show that the objective function in each cycle is essentially equal to f(x).
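Algorithm 1 can be sketched as the following skeleton; the two callbacks stand in for the coordinate-wise minimizations derived below, and all names are illustrative.

```python
def coordinate_descent(beta, update_coordinate, update_fused,
                       tol=1e-8, max_iter=1000):
    """Skeleton of Algorithm 1.  update_coordinate(j, beta) performs one
    descent-cycle update of beta_j; update_fused(members, beta) performs one
    fusion-cycle update of the common parameter shared by `members`."""
    for _ in range(max_iter):
        old = dict(beta)
        # descent cycle: update each beta_j separately
        for j in beta:
            beta[j] = update_coordinate(j, beta)
        # fusion cycle: jointly update parameters that became exactly equal
        if len(set(beta.values())) < len(beta):
            by_value = {}
            for j, b in beta.items():
                by_value.setdefault(b, []).append(j)
            for members in by_value.values():
                if len(members) > 1:
                    xi = update_fused(members, beta)
                    for j in members:
                        beta[j] = xi
        if max(abs(beta[j] - old[j]) for j in beta) < tol:
            break
    return beta
```

The fusion step is what prevents exactly tied coordinates from blocking each other's descent updates.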

Descent cycle
The descent cycle repeats coordinate-wise minimizations of the objective function L(β) in (1.2). To obtain a coordinate-wise objective function, we extract terms that depend on β_j (j ∈ {1, . . ., m}) from L(β). As described in Ohishi et al. (2021), the penalty term can be decomposed so that only its first term depends on β_j. By regarding terms that do not depend on β_j as constants and removing them from L(β), the coordinate-wise objective function is obtained as

    L_j(β) = Σ_{i=1}^{n_j} a_ji {b(h(β_j + q_ji)) − y_ji h(β_j + q_ji)} + λ Σ_{ℓ∈D_j} w_jℓ |β_j − β̃_ℓ|,    (2.2)

where β̃_ℓ indicates that β_ℓ is given. By sorting the elements of D_j in increasing order of β̃_ℓ (∀ℓ ∈ D_j), we can see that L_j(β) essentially equals f(x) in (2.1). If there exist ℓ_1, ℓ_2 ∈ D_j (ℓ_1 ≠ ℓ_2) such that β̃_ℓ1 = β̃_ℓ2, we can temporarily redefine D_j and w_jℓ by merging the tied neighbors and summing their weights. Since GFL estimates several parameters as being equal, this redefinition is required in most updates.

Fusion cycle
In the fusion cycle, equal parameters are replaced by a common parameter and L(β) is minimized with respect to the common parameter. Let β̃_1, . . ., β̃_m be the current solutions for β_1, . . ., β_m, and let ξ̃_1, . . ., ξ̃_t (t < m) be their distinct values. The relationship among the current solutions and their distinct values is specified by index sets E_1, . . ., E_t, where E_k = {j ∈ {1, . . ., m} : β̃_j = ξ̃_k}.

Then, the β_j (∀j ∈ E_k) are replaced by a common parameter ξ_k and L(β) is minimized with respect to ξ_k. Hence, to obtain a coordinate-wise objective function, we extract terms that depend on ξ_k (k = 1, . . ., t) from L(β).
We can decompose the first term of L(β) group-wise over the sets E_1, . . ., E_t. Furthermore, as in Ohishi et al. (2021), the penalty term of L(β) can be decomposed analogously. By regarding terms that do not depend on ξ_k as constants and removing them from L(β), the coordinate-wise objective function is obtained as

    L*_k(ξ) = Σ_{j∈E_k} Σ_{i=1}^{n_j} a_ji {b(h(ξ_k + q_ji)) − y_ji h(ξ_k + q_ji)} + λ Σ_{j∈E_k} Σ_{ℓ∈D_j\E_k} w_jℓ |ξ_k − β̃_ℓ|.    (2.3)

As in the descent cycle, we can see that L*_k(ξ) essentially equals f(x) in (2.1).
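The bookkeeping behind the fusion cycle — collecting, for a fused index set E_k, the penalty terms that depend on the common parameter ξ_k — might look as follows; this is an illustrative helper whose weight aggregation mirrors the redefinition of D_j and w_jℓ described above.

```python
def fused_penalty_terms(E_k, beta, adjacency, weights):
    """For the fused index set E_k, collect the penalty terms depending on
    the common parameter xi_k: for each distinct neighboring value z, sum
    the weights of all edges from E_k to groups currently at z."""
    terms = {}                 # neighbor value z -> aggregated weight
    members = set(E_k)
    for j in E_k:
        for l in adjacency[j]:
            if l in members:
                continue       # internal edges vanish: |xi_k - xi_k| = 0
            z = beta[l]
            key = (min(j, l), max(j, l))
            terms[z] = terms.get(z, 0.0) + weights[key]
    return terms               # penalty: lam * sum_z terms[z] * |xi_k - z|
```

Note that edges inside E_k drop out of the penalty entirely, which is why fused parameters can keep moving together.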

Main results
In this section, to obtain update equations for the descent and fusion cycles of the coordinate descent algorithm, we describe the minimization of f(x) in (2.1). Following Ohishi et al. (2022) and Yamamura et al. (2023), we directly minimize f(x). One difficulty of this minimization is that f(x) has multiple non-differentiable points z_1, . . ., z_r. We cope with this difficulty by using the subdifferential. The subdifferential of f(x) at x ∈ ℝ is given by

    ∂f(x) = [g−(x), g+(x)],

where g−(x) and g+(x) are the left and right derivatives of f at x. Then, x̂ is a stationary point of f(x) if 0 ∈ ∂f(x̂). For details on subdifferentials, see, e.g., Rockafellar (1970), Parts V and VI. In the following subsections, we separately describe the minimization of f(x) in the cases where a canonical link and a general link are used.

Canonical link
We first describe the minimization of f(x) in (2.1) with a canonical link, i.e., h(η) = η. In this case, the update equation of the coordinate descent algorithm is given by minimizing the following strictly convex function:

    f(x) = Σ_{i=1}^{d} a_i {b(x + q_i) − y_i (x + q_i)} + λ Σ_{ℓ=1}^{r} w_ℓ |x − z_ℓ|.    (3.1)

First, we derive the condition under which f(x) attains the minimum at a non-differentiable point z_ℓ. The subdifferential of f(x) at z_ℓ is given by ∂f(z_ℓ) = [g−(z_ℓ), g+(z_ℓ)], where

    g±(z_ℓ) = Σ_{i=1}^{d} a_i {μ(z_ℓ + q_i) − y_i} + λ(Σ_{k<ℓ} w_k − Σ_{k>ℓ} w_k) ± λw_ℓ.

Hence, if there exists ℓ⋆ ∈ {1, . . ., r} such that 0 ∈ ∂f(z_ℓ⋆), then f(x) attains the minimum at x = z_ℓ⋆, and ℓ⋆ is unique because of the strict convexity of f(x).
On the other hand, when ℓ⋆ does not exist, we can specify an interval that includes the minimizer by checking the signs of the left and right derivatives at each non-differentiable point. Let s(x) = (sign(g−(x)), sign(g+(x))). From z_1 < · · · < z_r and the strict convexity of f(x), s(z_ℓ) changes from (−, −) to (+, +) at most once as ℓ increases. Then, the minimizer of f(x) exists in the interval

    R* = (z_ℓ*, z_ℓ*+1),  ℓ* = max{ℓ ∈ {0, 1, . . ., r} : s(z_ℓ) = (−, −)},

where z_0 = −∞, z_{r+1} = ∞, and s(z_0) = (−, −) by convention. Hence, it is sufficient to search for the minimizer in R*. For all x ∈ R*, sign(x − z_ℓ) is constant for each ℓ. This allows us to rewrite the penalty term in f(x) without absolute values, so that f(x) is differentiable on R* with derivative

    df(x)/dx = Σ_{i=1}^{d} a_i {μ(x + q_i) − y_i} + λ(Σ_{ℓ≤ℓ*} w_ℓ − Σ_{ℓ>ℓ*} w_ℓ).

Then, the solution x* of df(x)/dx = 0 is the minimizer of f(x). Hence, we have the following theorem.
Theorem 1. Let x̂ be the minimizer of f(x) in (3.1). Then, x̂ is given by x̂ = z_ℓ⋆ (when ℓ⋆ exists) and x̂ = x* (when ℓ* exists), where ℓ* exists if and only if ℓ⋆ does not exist.
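Theorem 1 can be implemented directly for the Gaussian canonical-link instance of (3.1), where the smooth part reduces to (A/2)x² − Bx with A = Σ_i a_i and B = Σ_i a_i y_i (taking q_i = 0): test each knot via the subdifferential (the ℓ⋆ case), and otherwise solve the stationary equation on the bracketing interval (the x* case). The scalar A, B parametrization is an illustrative simplification.

```python
def fused_min_gaussian(A, B, z, w, lam):
    """Minimize f(x) = (A/2) x^2 - B x + lam * sum_l w_l |x - z_l|
    (Gaussian canonical-link instance of (3.1); z sorted ascending, A > 0)."""
    r, W = len(z), sum(w)
    # step 1: does the subdifferential at some knot contain 0?  (ell-star)
    prefix = 0.0                               # sum of w_k for z_k < z_l
    for k in range(r):
        smooth = A * z[k] - B                  # derivative of the smooth part
        shift = lam * (prefix - (W - prefix - w[k]))
        if smooth + shift - lam * w[k] <= 0.0 <= smooth + shift + lam * w[k]:
            return z[k]                        # minimum at the knot z_k
        prefix += w[k]
    # step 2: otherwise solve df/dx = 0 on each open interval (x-star case);
    # there the penalty is linear with slope lam*(prefix - (W - prefix))
    prefix = 0.0
    for k in range(r + 1):
        s = lam * (prefix - (W - prefix))
        x = (B - s) / A                        # stationary point of the piece
        lo = z[k - 1] if k > 0 else float("-inf")
        hi = z[k] if k < r else float("inf")
        if lo < x < hi:
            return x
        if k < r:
            prefix += w[k]
    raise RuntimeError("no minimizer found (should not happen for A > 0)")
```

The two return paths correspond exactly to the ℓ⋆ and ℓ* cases of Theorem 1.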
Algorithm 2 The coordinate descent algorithm for a canonical link
Require: λ and an initial vector for β
repeat
    (descent cycle)
    for j ∈ {1, . . ., m} do
        update β_j by applying Theorem 1 to (2.2)
    end for
    (fusion cycle)
    update each common parameter by applying Theorem 1 to (2.3)
until the solution converges

We can execute Algorithm 1 by applying Theorem 1 to (2.2) and (2.3) in the descent and fusion cycles, respectively. Thus, a detailed implementation of Algorithm 1 when using a canonical link is provided in Algorithm 2. To apply Theorem 1, we need to obtain x*. In many cases, x* can be obtained in closed form according to the following proposition.
Proposition 1. Let x* be the solution of df(x)/dx = 0, let q_0 be a value such that q_1 = · · · = q_d = q_0, and write s_{ℓ*} = Σ_{ℓ≤ℓ*} w_ℓ − Σ_{ℓ>ℓ*} w_ℓ. Then, x* is given as follows:
• When q_0 exists, x* is given in a general form as

    x* = μ^{−1}( (Σ_{i=1}^{d} a_i y_i − λs_{ℓ*}) / Σ_{i=1}^{d} a_i ) − q_0.

• Even when q_0 does not exist, x* for the Gaussian and Poisson distributions is given by

    x* = (Σ_{i=1}^{d} a_i (y_i − q_i) − λs_{ℓ*}) / Σ_{i=1}^{d} a_i  (Gaussian),
    x* = log{ (Σ_{i=1}^{d} a_i y_i − λs_{ℓ*}) / Σ_{i=1}^{d} a_i e^{q_i} }  (Poisson).

For example, q_0 exists and q_0 = 0 holds for GLMs without an offset. When q_0 does not exist, x* can still be obtained for some individual distributions. For the Gaussian and Poisson distributions, since μ(x + q) separates with respect to x and q, x* can be obtained in closed form. Note that x* for the Gaussian distribution when q_0 exists and equals 0 coincides with the result in Ohishi et al. (2021). For distributions for which such a decomposition is impossible, such as the binomial distribution, a numerical search is required to obtain x*. However, we can easily obtain x* by a simple algorithm, such as a line search, because f(x) is strictly convex and has its minimizer in the interval R*.
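When x* has no closed form (e.g., the binomial distribution with an offset), the line search mentioned above can be as simple as bisection on the derivative over a bracketing interval, since f(x) is strictly convex there. A minimal sketch, where `fprime` is assumed to be the derivative of the non-absolute reformulation of f on R*:

```python
def bisect_convex_min(fprime, lo, hi, tol=1e-10, max_iter=200):
    """Bisection on the derivative of a strictly convex differentiable
    function over a bracketing interval (lo, hi).
    Assumes fprime(lo) < 0 < fprime(hi), as guaranteed on R*."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if fprime(mid) < 0.0:
            lo = mid               # minimizer lies to the right of mid
        else:
            hi = mid               # minimizer lies to the left of mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Because the interval halves at each step, the cost is logarithmic in the required precision.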

General link
Here, we consider the minimization of f(x) in (2.1) with a general link, i.e., h(•) is a general differentiable function. In this case, although the strict convexity of f(x) is not guaranteed, its continuity is maintained. This means that the uniqueness of the minimizer of f(x) is not guaranteed, but we can obtain minimizer candidates by using the same procedure as in the previous subsection.
The subdifferential of f(x) at z_ℓ is obtained in the same way as in the canonical case, now involving ḣ(x) = dh(x)/dx through the chain rule. Since any z_ℓ satisfying 0 ∈ ∂f(z_ℓ) is a stationary point of f(x), such points are minimizer candidates of f(x). Next, we define intervals R_ℓ = (z_ℓ, z_ℓ+1) (ℓ = 0, 1, . . ., r), where z_0 = −∞ and z_{r+1} = ∞. For x ∈ R_ℓ, f(x) can be written in a non-absolute form f_ℓ(x). We can then search for minimizer candidates of f(x) by piecewise minimization. That is, any x ∈ R_ℓ minimizing f_ℓ(x) is a minimizer candidate. Hence, we have the following theorem.
Theorem 2. Let x̂ be the minimizer of f(x) in (2.1) and define a set S by

    S = {z_ℓ : 0 ∈ ∂f(z_ℓ)} ∪ {x ∈ R_ℓ : ḟ_ℓ(x) = 0 (ℓ = 0, 1, . . ., r)},

where ḟ_ℓ(x) = df_ℓ(x)/dx. Now, suppose that

    lim_{x→−∞} ḟ_0(x) < 0 and lim_{x→∞} ḟ_r(x) > 0.    (3.2)

Then, S is the set of minimizer candidates of f(x) and x̂ is given by

    x̂ = arg min_{x∈S} f(x).

The assumption (3.2) excludes the case in which f(x) attains the minimum at x = ±∞. Moreover, we have the following corollary (the proof is given in Appendix A.1).
Corollary 1. Suppose that f̈_ℓ(x) > 0 (x ∈ R_ℓ) for all ℓ ∈ {0, 1, . . ., r}, and that (3.2) holds. Then, f(x) is strictly convex and #(S) = 1, where S is given in Theorem 2. Moreover, the unique element of S is the minimizer of f(x) and is given as in Theorem 1.
To execute Algorithm 1 for GLMs with a general link, we can replace Theorem 1 with Theorem 2 or Corollary 1 in Algorithm 2. The next subsection gives specific examples of using a general link.
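Theorem 2's candidate search can be sketched as piecewise minimization over the intervals R_ℓ: keep the knots as candidates, add an interior stationary point of each piece (found here by bisection, under the simplifying assumption that each piece's derivative crosses zero at most once), and return the candidate with the smallest f value. A large constant stands in for ±∞, which is safe under assumption (3.2); all names are illustrative.

```python
def piecewise_candidates_min(f, fprime_piece, z, eps=1e-9):
    """Minimize a continuous piecewise-smooth f with knots z[0] < ... < z[r-1].
    fprime_piece(k, x) is the derivative of the k-th smooth piece f_k on the
    open interval between consecutive knots (k = 0, ..., r)."""
    r = len(z)
    big = 1e6                                  # proxy for +-infinity
    edges = [-big] + list(z) + [big]
    cands = list(z)                            # knots are always candidates
    for k in range(r + 1):
        lo, hi = edges[k] + eps, edges[k + 1] - eps
        if fprime_piece(k, lo) < 0.0 < fprime_piece(k, hi):
            # sign change: locate the interior stationary point by bisection
            for _ in range(200):
                mid = 0.5 * (lo + hi)
                if fprime_piece(k, mid) < 0.0:
                    lo = mid
                else:
                    hi = mid
            cands.append(0.5 * (lo + hi))
    return min(cands, key=f)                   # best candidate wins
```

When the convexity condition of Corollary 1 holds, the candidate set collapses to a single point and this search reduces to Theorem 1.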

Examples
This subsection focuses on the negative binomial, gamma, and inverse Gaussian distributions with a log-link as examples of using a general link. In the framework of regression, the negative binomial distribution is often used to deal with overdispersion in Poisson regression, making it natural to use a log-link. Note that NB-C and NB2 indicate negative binomial regression with canonical and log-links, respectively (for details, see, e.g., Hilbe, 2011). The gamma and inverse Gaussian distributions are used to model positive continuous data, so their expectations must be positive. However, their canonical links do not guarantee that the expectations will, in fact, be positive. Hence, a log-link rather than a canonical link is often used for these distributions (e.g., Algamal, 2018; Dunn & Smyth, 2018, Chap. 11). Here, we consider coordinate-wise minimizations for the three distributions with a log-link.
For x ∈ R_ℓ, f(x) in (2.1) reduces to the non-absolute form f_ℓ(x), and the first- and second-order derivatives ḟ_ℓ(x) and f̈_ℓ(x) can be obtained for each of the three distributions. We can see that f̈_ℓ(x) > 0 holds for all ℓ ∈ {0, 1, . . ., r} for NB2 and the gamma distribution. Hence, by Corollary 1, the minimizer of f(x) can be uniquely obtained for these two distributions. On the other hand, the uniqueness of the minimizer for the inverse Gaussian distribution is not guaranteed; however, we have v_0 < 0 and v_r > 0. Hence, the minimizer for the inverse Gaussian distribution can be obtained by Theorem 2. We now give specific solutions. From the above, we have the following proposition.
Proposition 2. Let x̂_ℓ be a stationary point of f_ℓ(x). If x̂_ℓ exists, it is given in closed form for each of the three distributions. Moreover, the relationship between x̂_ℓ and the minimizer of f(x) is as follows: for NB2 and the gamma distribution, if x̂_ℓ ∈ R_ℓ, then x̂_ℓ is the minimizer of f(x); for the inverse Gaussian distribution, x̂_ℓ is a minimizer candidate of f(x).

Dispersion parameter estimation
In the previous subsections, we discussed the estimation of β_j, which corresponds to the estimation of the canonical parameter θ_ji. The GLMs in (1.1) also have the dispersion parameter ϕ. Although ϕ is fixed at one for the binomial and Poisson distributions, it is unknown for other distributions, and, hence, we need to estimate its value. The Pearson estimator is often used as a suitable estimator (e.g., Dunn & Smyth, 2018, Chap. 6). Let β̂_1, . . ., β̂_m be estimators of β_1, . . ., β_m, let t be the number of distinct values among them, and let ζ̂_ji = μ(β̂_j + q_ji). Then, the Pearson estimator of ϕ is given by

    φ̂ = {1/(n − t)} Σ_{j=1}^{m} Σ_{i=1}^{n_j} a_ji (y_ji − ζ̂_ji)² / V(ζ̂_ji),

where n = Σ_{j=1}^{m} n_j and V(•) is a variance function (see Table 2). For distributions other than the negative binomial distribution, the estimator of ϕ can be obtained after β is estimated, since the estimation of β does not depend on ϕ. For the negative binomial distribution, the estimation of β depends on ϕ because μ(•) and b(•) depend on ϕ. Hence, we need to add a step updating ϕ and repeat the updates of β and ϕ alternately. Moreover, this Pearson estimator can be used to diagnose overdispersion in the binomial and Poisson distributions: if φ̂ > 1, it is doubtful that the model is appropriate.
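The Pearson estimator above can be sketched as follows, with the Poisson variance function V(μ) = μ as an illustrative default; `t` is the number of distinct estimated values, and the names are not from the paper.

```python
def pearson_phi(y, zeta, t, a=None, V=lambda mu: mu):
    """Pearson estimator of the dispersion parameter:
    phi_hat = (1/(n - t)) * sum a_i (y_i - zeta_i)^2 / V(zeta_i),
    with the Poisson variance function V(mu) = mu by default."""
    n = len(y)
    if a is None:
        a = [1.0] * n              # a_ji = 1 unless the model says otherwise
    s = sum(ai * (yi - zi) ** 2 / V(zi) for ai, yi, zi in zip(a, y, zeta))
    return s / (n - t)             # t distinct values consume t degrees of freedom
```

A value of φ̂ well above 1 under a fitted Poisson model is the overdispersion diagnostic mentioned above.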

Penalty weights
The objective function L(β) in (1.2) includes penalty weights, and the GFL estimation proceeds with the given weights. Although setting w_jℓ = 1 is usual, this may cause a problem of over-shrinkage because all pairs of parameters are shrunk uniformly by the common λ. As one option to avoid this problem, we can use the following weight based on the adaptive Lasso (Zou, 2006):

    w_jℓ = |β̃_j − β̃_ℓ|^{−1},

where β̃_j is an estimator of β_j, and the maximum likelihood estimator (MLE) may be a reasonable choice for it. If there exists q_j0 such that q_j1 = · · · = q_jn_j = q_j0, the MLE is given in the following closed form:

    β̃_j = μ^{−1}( Σ_{i=1}^{n_j} a_ji y_ji / Σ_{i=1}^{n_j} a_ji ) − q_j0.

For other cases, see Appendix A.2.
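The adaptive weights might be computed from a pilot estimator as below; the `eps` guard against tied pilot estimates and the fixed exponent of 1 are illustrative choices (Zou, 2006, allows a general exponent γ).

```python
def adaptive_weights(beta_tilde, adjacency, eps=1e-8):
    """Adaptive-Lasso-style penalty weights w_jl = 1 / |beta_j - beta_l|,
    computed from a pilot estimator beta_tilde (e.g., the MLE).
    eps avoids division by zero when two pilot estimates coincide."""
    w = {}
    for j, neigh in adjacency.items():
        for l in neigh:
            if l > j:              # store each symmetric edge once
                w[(j, l)] = 1.0 / max(abs(beta_tilde[j] - beta_tilde[l]), eps)
    return w
```

Pairs with similar pilot estimates receive large weights and are fused easily, while clearly different pairs are barely shrunk.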

Tuning parameter selection
It is important for a penalized estimation method, such as GFL, to select the tuning parameter, which, in this paper, is represented as λ in (1.2). Because λ adjusts the strength of the penalty against the model fit, we need to select a good value of λ in order to obtain a good estimator. The optimal value of λ is commonly selected from candidates based on the minimization of, e.g., a cross-validation error or a model selection criterion. For a given λ_max, candidates for λ are selected from the interval [0, λ_max]. Following Ohishi et al. (2021), λ_max is defined as a value such that all β_j (j ∈ {1, . . ., m}) are updated as β̃_max when the current solution of β is β̃_max 1_m, where β̃_max is the MLE under β = β 1_m (see Appendix A.2) and 1_m is the m-dimensional vector of ones. When the current solution of β is β̃_max 1_m, the discussion in Subsection 3.2 gives the condition that β_j is updated as β̃_max as

    λ ≥ | Σ_{i=1}^{n_j} a_ji {μ(β̃_max + q_ji) − y_ji} | / Σ_{ℓ∈D_j} w_jℓ.

Hence, λ_max is given by

    λ_max = max_{j∈{1, . . ., m}} | Σ_{i=1}^{n_j} a_ji {μ(β̃_max + q_ji) − y_ji} | / Σ_{ℓ∈D_j} w_jℓ.
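Candidate values for λ are often taken log-spaced below λ_max; the spacing and the floor ratio in the sketch below are illustrative choices, not prescribed by the paper.

```python
import math

def lambda_grid(lam_max, num=100, ratio=1e-4):
    """Log-spaced candidate grid on (0, lam_max], descending from lam_max
    down to ratio * lam_max (both the log spacing and the floor ratio are
    conventional, tunable choices)."""
    lo, hi = math.log(lam_max * ratio), math.log(lam_max)
    return [math.exp(hi + (lo - hi) * k / (num - 1)) for k in range(num)]
```

Starting from λ_max, where all groups fuse into one cluster, and descending gives a natural path of increasingly fine clusterings to score with BIC or cross-validation.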

Simulation
In this section, we focus on modeling count data and establish, through simulation, whether our proposed method can select the true cluster when clustering groups. For count data, Poisson regression and NB2 are often used. Hence, we compare the performance of the two approaches for various settings of the dispersion parameter. Note that GFL for Poisson regression has already been proposed by Yamamura et al. (2023) and that our contribution is to apply GFL to NB2. Note, too, that simulation studies were not conducted in Yamamura et al. (2023).
The sample sizes for each group are common, i.e., n_1 = · · · = n_m = n_0. Furthermore, the estimation of ϕ, the definition of the penalty weights, and the candidates for λ follow Subsection 3.3, and the optimal value of λ is selected from 100 candidates based on the minimization of BIC (Schwarz, 1978). The performance of the two methods is evaluated by Monte Carlo simulation with 1,000 iterations. Tables 5 and 6 summarize the results for m = 10 and 20, respectively, in which SP is the selection probability (%) of the true cluster, φ̂ is the Pearson estimator of ϕ, and time is the runtime (in seconds). First, focusing on ϕ = 0, i.e., the case in which the true model follows the Poisson distribution, the value of SP using Poisson regression approaches 100% as n_0 increases. Furthermore, we can say that Poisson regression provides good estimation, since φ̂ is approximately 1. On the other hand, NB2 is unable to select the true cluster. The reason for this may be that the dispersion parameter in the negative binomial distribution is positive. Next, we focus on ϕ > 0. Here, Poisson regression produced overdispersion, since φ̂ is larger than 1, and, hence, it is unable to select the true cluster. On the other hand, the SP value for NB2 approaches 100% as n_0 increases. Furthermore, φ̂ is roughly equal to the true value, indicating that NB2 can provide good estimation. Finally, it is apparent that Poisson regression is always faster than NB2. The reason for this may be that Poisson regression requires only the estimation of β, whereas NB2 must repeatedly update β and ϕ alternately.

Real data example
In this section, we apply our method to the estimation of a spatio-temporal trend using real crime data. The data consist of the number of recognized crimes committed in the Tokyo area, as collected by the Metropolitan Police Department and available at TOKYO OPEN DATA (https://portal.data.metro.tokyo.lg.jp/). Although these data were aggregated for each chou-chou (level 4), the finest regional division, we integrate the data for each chou-oaza (level 3) and apply our method by regarding level 3 as individuals and the city (level 2) as the group (see Figure 1). There are 53 groups as a division of space, and spatial adjacency is defined by the regional relationships of level 2.

Figure 1. Divisions of Tokyo (level 2, level 3, level 4).

We use six years of data, from 2017 to 2022. The sample size is n = 9,570. Temporal adjacency is defined using a chain graph for the six time points. Following Yamamura et al. (2021), we can define adjacent spatio-temporal relationships for m = 318 (= 53 × 6) groups by combining the spatial and temporal adjacencies. Furthermore, following Yamamura et al. (2023), we use population as a variable for the offset. The population data were obtained from the results of the population census, as provided in e-Stat (https://www.e-stat.go.jp/en). Since the population census is conducted every five years, we use the population in 2015 for the crimes in 2017 to 2019 and the population in 2020 for the crimes in 2020 to 2022.
In this analysis, we apply our method to the above crime data, with n = 9,570 individuals aggregated into m = 318 groups, and estimate the spatio-temporal trends in the data. Specifically, y_ji, the number of crimes in the ith region of the jth group, is modeled based on the Poisson and negative binomial distributions, respectively, where q_ji is a logarithmic transformation of the population and the canonical and log-links are used, respectively. Estimation of the dispersion parameter, the setting of the penalty weights, and the candidates for the tuning parameter follow Subsection 3.3. The optimal tuning parameter is selected from 100 candidates based on the minimization of BIC. Table 7 summarizes the estimation results, where φ̂ denotes the Pearson estimator of the dispersion parameter. Since the value of φ̂ in the Poisson regression is far larger than 1, there is overdispersion, and we can say that using Poisson regression is inappropriate. To cope with this overdispersion, we adopted NB2. The cluster value in the table indicates the number of clusters obtained using GFL. Poisson regression and NB2 clustered the m = 318 groups into 160 and 109 groups, respectively. Figure 2 is a yearly choropleth map of the GFL estimates of β using NB2. The map shows that the larger the value, the easier it is for crime to occur, and the smaller the value, the harder it is. As shown in this figure, we can visualize the variation of the trend with respect to time and space.

Conclusion
To unify models based on a variety of distributions, we proposed a coordinate descent algorithm to obtain GFL estimators for GLMs. Although Yamamura et al. (2021), Ohishi et al. (2022), and Yamamura et al. (2023) dealt with GFL for the binomial and Poisson distributions, our method is more general, covering both these distributions and others. The proposed algorithm repeats the partial update of parameters and directly solves sub-problems without any approximation of the objective function. In many cases, the solution can be updated in closed form. Indeed, in the ordinary situation where a canonical link is used and there is no offset, we can always update the solution in closed form. Moreover, even when an explicit update is impossible, we can easily update the solution using a simple numerical search, since the interval containing the solution can be specified. Hence, our algorithm can efficiently search for the solution.
Other distributions, including the inverse Gaussian distribution with log-link, require a numerical search.Furthermore, the negative binomial distribution requires the repeated updating of β and ϕ alternately.
Next, we describe the derivation of β̃_max. The β̃_max is the MLE of β under β = β 1_m and, for distributions with a convex likelihood function, its value is obtained by solving the score equation of L(β 1_m) with λ = 0. Notice that this is essentially equal to the derivation of the MLE of β_j. Hence, β̃_max is given in closed form in several cases, including the Poisson and gamma distributions with a log-link.

Table 4. Whether the update equation can be obtained in closed form

Table 4 summarizes the relationships between an individual distribution and an update equation. Here, ⃝ indicates that the update equation can be obtained in closed form, and × indicates that it cannot. Even when the update equation cannot be obtained in closed form, the proposed algorithm can specify an interval that includes the solution, which enables a simple numerical search.

Table 5. Simulation results when m = 10

Table 7. Summary of real data example