Two-parameter link functions, with applications to negative binomial, Weibull and quantile regression

One-parameter link functions play a fundamental role in regression via generalized linear modelling. This paper develops the general theory for two-parameter links in the very large class of vector generalized linear models by using total derivatives applied to a composite log-likelihood within the Fisher scoring/iteratively reweighted least squares algorithm. We solve a four-decade old problem with an interesting history as our first example: the canonical link for negative binomial regression. The remaining examples are fitting Weibull regression using both the mean and quantile directly compared to GAMLSS, and performing quantile regression based on the Gaussian distribution. Numerical examples based on real and simulated data are given. The methods described here are implemented by the VGAM and VGAMextraR packages, available on CRAN. Supplementary materials for this article are available online.


Introduction
For 50 years now generalized linear models (GLMs; Nelder and Wedderburn 1972) have been the foundational building block of generalized regression. Its central formula is 1 3 Two-parameter link functions, with applications to negative… such as the NBD with its canonical link. Indeed, the methodology can be applied to any function of = 1 , 2 T with a tractable inverse, as defined in Sect. 3.1. In this paper we present three diverse applications, e.g., the mean W and the quantile functions of the two-parameter Weibull distribution, and the normal distribution's quantile function.
The new link functions have the form = G W ( 1 , 2 ) or = G th ( 1 , 2 ) quantile . However, our approach is broader still. In theory, one can directly model, say, the variance as an alternative to study a model's underlying homoscedasticity assumptions. A third benefit is that the methodology can be readily extended to three-parameter VGLMs, and theoretically to M-parameter link functions, and so introducing more flexibility by handling statistical measures of distributions having location, scale and shape parameters (or LSS-links). The basic motivation for using LSS-links rather than simple one-parameter links is that the stochastic relationship between predictors and response can be modelled much better and with greater accuracy. Thus this work points towards j = G j ( ; x), j = 1, … , M , with = 1 , … , M T , as being the ultimate set of link functions. An outline of this paper is as follows. Sect. 2 summarizes skeletal details of VGLMs needed here. Sect. 3 gives the general theory for two-parameter links. Sect. 4 solves the NBD canonical link problem and proposes two-parameter Weibull mean and quantile regression as well as Gaussian quantile regression. Numerical examples involving real and simulated data including a comparison to generalized additive models for location, scale and shape (GAMLSS) appear in Sect. 5. The paper ends with a discussion. An R script file and online appendices are available as supplementary material.
Notationally, we use ' ' for partial derivatives and 'd' for total derivatives (see, e.g., Loomis and Sternberg 1990). The digamma and trigamma functions are denoted by and ′ . The Hadamard (element-by-element) and Kronecker products of two matrices A and B are A•B = [(a js ⋅ b js )] and A ⊗ B = [(a js ⋅ B)] respectively.

VGLM review
VGLMs operate on data (x i , y i ) , i = 1, … , n , independently with y i a Q-dimensional response and covariates x i = x i1 , x i2 , … , x id T with x i1 = 1 if there is an intercept. A VGLM is defined in terms of M linear predictors as a model where the conditional distribution of y given x has the form (dropping the subscript i for simplicity) In the usual case the jth linear predictor is j (x) = T j x and can be tied in to the parameters j of any distribution as  j ( j ) = j = T j x , j = 1, … , M.

3
Fitting VGLMs produce full maximum likelihood estimates (MLEs). The 'general' VGLM log-likelihood is for known fixed positive prior weights w i . A Newton-like algorithm for maximizing (2.2) has the form (a) = (a−1) + I (a−1) −1 U (a−1) at iteration a . For VGLMs, the vector of coefficients (a) is obtained by iteratively regressing the working responses z (a−1) on the 'big' model matrix In (2.3), U = u 1 , … , u n T comprises the individual score vector u i whose jth . Using the observed information matrix (OIM) corresponds to the Newton-Raphson algorithm. Fisher scoring is primarily used within the VGLM framework over Newton-Raphson because the expected information matrices (EIMs) are positive-definite over a larger portion of the parameter space.
With multiple j one can enforce linear relationships between them via for known constraint matrices H k of full column-rank (i.e., rank H k = R k = (H k ) ), and * (k) is a possibly reduced set of regression coefficients to be estimated. The matrices H k can constrain the effect of a covariate over some j and to have no effect for others. Trivial constraints are attained with H k = I M for all k, where X VLM = X LM ⊗ I M is preserved. Other common examples include parallelism ( H k = M ), exchangeability, and intercept-only parameters j = * (j)1 . For a general review on VGAMs see Yee (2015, Sec. 1.3.2).

3
Two-parameter link functions, with applications to negative… 3 General theory for two-parameter links

Problem formulation
Given the preceding background, our attention focusses on M = dim( ) = 2 trans- We assume that G j can be solved at least iteratively for j (cf. (3.9)), allowing (3.2)-(3.4) to be expressed as The interdependency between the j enforces a reparametrized log-likelihood. For example, for the general case (3.7) above we have We call (3.8) the modified VGLM log-likelihood as it reflects the composite structure of two-parameter linear predictors. Fisher scoring must be consequently adjusted.

Assumptions
(i) For each G j in (3.1) the derivatives G j ∕ k , 2 G j ∕ 2 k and 2 G j ∕ l k exist over either A j or accordingly. (ii) For each equation in (3.1), G j can be solved for j , in the form When G * j , j = 1, 2, in (3.1) are not tractable then G * j and G in (3.10) can be computed via iterative methods. Optionally, implicit differentiation applied to (3.1) can be used to obtain (3.11) and (3.12) if (3.1) cannot be explicitly solved for j . In particular, 2 j ∕( G 1 G 2 ) = 0.

Solution
Expressions (3.5)-(3.7). shed light on the interdependence between j and k and its pivotal role when computing from the modified log-likelihood (3.8). Note that (3.9) 1 = 1 G 1 , 2 = G * 1 ( G 1 , 2 ) and 2 = 2 1 , G 2 = G * 2 1 , G 2 , , 2 ) and 2 G = 2 1 , 0 The solution to incorporating two-parameter linear predictors to the VGLM framework relies on computing (3.13) by appropriately embedding (3.8) and (3.14). Total derivatives are necessary since 1 and 2 now vary simultaneously, and the solution produces a new set of expressions for the score vector and working weight matrices. For the score vector, each component is a total derivative d ∕d j given by which apply to VGLMs with the two linear predictors as in case (3.4). For the special cases (3.2) and (3.3), at least one linear predictor is univariate, that is, a function of either 1 or 2 (but not both), hence 2 ∕ 1 = 0 or 1 ∕ 2 = 0 . Table 9 in Appendix A shows the expressions for the three cases (3.2)-(3.4). Likewise, the EIMs for VGLMs with M = 2 linear predictors is given by Applying the chain rule with the inclusion of total derivatives, (3.14) Under regularity conditions, the terms in braces vanish after taking the expectation, giving place to the new expressions for the EIMs: Again, new complementary expressions for d 2 ∕d 2 j and d 2 ∕d j d k , j, k = 1, 2 , are required. For the the general case (3.4), where each linear predictor depends on 1 and 2 , these are respectively. Tables 10 and 11 outline the expressions for the three cases (3.2)-(3.4).

Some applications
We present direct applications of the previous section, tying it in with software implementations both new and old. The first shows that the NB-C can be straightforwardly estimated as a VGLM by adjusting the score vector and working weight matrices of the composite log-likelihood. The second considers two variants of Weibull regression: the mean parameterization coincides with WEI3() in gamlss. dist whereas the quantile parameterization is novel. The third sketches the details for directly fitting the two-parameter quantile function of the normal distribution, as implemented by a new link function normalQlink() from VGAMextra 0.0-5.

3
Two-parameter link functions, with applications to negative…

Negative binomial canonical link
The estimation of the NB-C has a somewhat interesting history. If we take the problem as beginning with the admission of McCullagh and Nelder (1983, p.195) [see also McCullagh and Nelder (1989, p.374)] that the NB is little used in applications and has a "problematical" canonical link then to our knowledge only one other publicized attempt has been made since to solve the problem seriously. It eluded McCullagh and Nelder because the model makes a function of a parameter of the variance. Hilbe (2011, pp. 210,309,315-6) sheds more light on the problem as well as proposing an adhoc method that tends to work in most cases. He writes "In discussing this statement with Nelder, I found that the foremost problem they had in mind was the difficulty they experienced in attempting to estimate the model. They do not state it in their text, but the problem largely disappears when k is entered into the GLM estimating algorithm as a constant" [italics added]. He notes that having k in the link and variance can result in estimation difficulties in a full MLE algorithm, with it being sensitive to initial values and having tedious convergence with Newton-Raphson-type algorithms. To our knowledge his R package COUNT function ml.nbc() is the only other NB-C MLE-implementation. It treats k as constant at each iteration but it is iteratively estimated in the process. It also treats k as an additional scalar parameter to be estimated, hence is constrained to be intercept-only. A general optimizer stats::optim() is invoked and it has by default some prechosen constants for initial values ( k (0) = 2 and (0) = e −1 ≈ 0.368 ) so that it will be unreliable for large and k. In contrast, we believe our solution to be nondefective. Prior to this work, the NB-C had also been unsatisfactorily implemented in VGAM 1.0-3 or earlier (see, e.g., Yee 2015, Sec.11.3.4) but has since been corrected in functions negbinomial() and nbcanlink()-more details are given in Miranda-Soberanis (2018). Ordinarily, NB regression operates with and adopting the R parametrization with 0 < and 0 < k , if the ancillary parameter k is intercept-only then this is referred to as the NB-2 model. The NB-C is of the first case of Table 9, and is The ith contribution to the log-likelihood is given by With the ordinary j of (4.1), one has However, with the j as in (4.3), the relationship = k∕[e − 1 − 1] implies (the subscript i is dropped for simplicity) The partial derivatives here are directly computed from (4.4), resulting in and ∕ k as in (4.5). Substituting these into (4.6) gives which is (4.5) without its last term. Also, d ∕d = ∕ is given in (4.7), completing the score vector.
The new EIMs are fully specified by (3.17) (see Table 11). Unlike other NB variants the NB-C EIM is nondiagonal. The adjusted nondiagonal component is given by Moreover, while d 2 ∕d 2 remains as usual, the other diagonal element needs adjustment: where W kk is the usual 2-2 element. Combining everything, the working weight matrix is

3
Two-parameter link functions, with applications to negative… when G 2 = log k . Because the EIM of all NB variants except for the NB-C is diagonal the alternating algorithm adopted by Hilbe and MASS::glm.nb() is less prone to failure. But with NB-C, and k are asymptotically dependent therefore the alternating algorithm is more likely to fail.

Weibull regression
The VGAM family function weibullR() follows from base R's Table 12.3). For this, 1 = log b and 2 = log s. Adapted for the Weibull, the methodology of this paper results in the addition of two new 2-parameter links for mean and quantile modelling, called weibullMlink() and weibullQlink(). To our knowledge the latter is first implementation to exist in such general form. They are used in conjunction with the newly written weibullRff() in VGAMextra 0.0-5. These links apply only to 1 . Table 1 It is noted that Noufaily and Jones (2013) concerns parametric quantile regression like ours but based on a generalized gamma distribution.

Distribution-specific quantile regression: the normal distribution
Another example is a two-parameter link for the quantiles of the normal distribution defined as where is the standard normal CDF and pre-specified quantiles of interest = 1 , … , s T . Its implementation is uninormalQlink(), available in Our work on quantile regression have the following advantages over Koenker and Bassett (1978): (i) Parametric quantile regression provides more accurate inference when the data comes from the stipulated distribution. In contrast, theirs is a nonparametric L 1 -norm method based on linear programming which is less familiar (4.10) G( ; , x) = + −1 ( ), (4.11) 1 ( ; , x) = G ( ; , x) = + −1 ( ), 2 ( ;x) = log ,

3
Two-parameter link functions, with applications to negative… to statisticians than L 2 methods like IRLS. Moreover, VGAM computes confidence intervals based on the well-known Wald and score tests. In theirs the confidence intervals are based on piecewise linear approximations and when using large number of predictors the algorithm may become unstable, see, e.g., Kneib (2013). (ii) The VGLM/VGAM framework can circumvent the quantile crossing problem by choosing appropriate constraint matrices H k (Eq. (2.4); see also Yee (2015, Sec. 3.3)). Indeed, for some VGLMs such as the exponential and Maxwell the j are naturally parallel and so constraints need not to be enforced specifically (Miranda-Soberanis and Yee 2019). (iii) While their methodology is less sensitive to extreme values than least squares, ours is more amenable to skewed distributions such as the lognormal (theoretically, we can now implement quantile link functions for as often as needed), thus providing a more suitable framework to handle various asymmetric data such as income or wealth. (iv) VGAM has infrastructure to accommodate spline modelling, so matching Koenker and et al. (2020) which includes linear as well as nonlinear effects.

NB canonical link
In this section we present two numerical examples of NB-C regression. The first aims to demonstrate the instability of Fisher scoring without total derivative adjustment. We also compare our results to quantile regression using quantreg (Koenker and et al. 2020). The second uses the data set medpar from COUNT to compare the Hilbe's and our software. We aim to demonstrate the advantages conferred by negbinomial() and its ability to handle the size parameter as a covariate-specific linear predictor. All the R code is available as a supplement.

Simulated data
To illustrate the unreliability of ml.nbc() we adapted code from the COUNT online help and generated n = 1000 random variates for 1 = (−3, 1.25, 0.1) T , k = e 7 ≈ 1100 , and x 2 and x 3 ∼ Unif(0, 1) independently. Upon fitting the model, several warnings were issued by ml.nbc() from attempts to evaluate (4.2) outside the parameter space. When k = e 8 ≈ 2980 was attempted ml.nbc() issued an error message. A brief comparison of the fits is as follows. The log-likelihood of the ml. nbc() fit, −3917.8 , is grossly suboptimal compared to VGAM ( −3880.2 ). Also, its estimated regression coefficients differ much from the true values (Table 2). In comparison, the results of VGAM fare well in Table 2 as well as the Wald 95% confidence intervals (Table 3) which cover the true values. For the latter, ml.nbc() fails on all counts. From experience, about 5-8 iterations is typical for well-fitting VGLMs and the VGAM fit took 5 iterations to converge, and logk ≈ 6.66082 with SE 0.22791. In VGAM 1.0-6 or earlier, 'convergence' was not achieved within 30 iterations. Dedual et al. (2000) describe a quantitative study of the ecology of brown (Salmo trutta) and rainbow trout (Oncorhynchus mykiss) in the central North Island of New  Zealand. We fit four NB-variants to closely related data, which may be found in the trapO data frame in VGAMdata. Comprising 1226 rows of daily captures of the two species by gender at the Te Whaiau Trap at Lake Otamangakau, the data were collected during the main spawning period over 8 consecutive years by the Department of Conservation. The primary aim of this analysis is to model Y = the number of male brown trout captured as a function of the day of the year (e.g., 1 = Jan 1, 244 = Sep 1), i.e., doy is the sole (primary) explanatory variable. It is well known that spawning peaks around the second half of May.

Lake Otamangakau trout data
The variance-to-mean ratio of 12.12 indicates overdispersion relative to the Poisson. Following the same nomenclature as Hilbe (2007), the four NB-variants fitted here are abbreviated NB-1, NB-2, NB-H and NB-C. The former has var(Y) ∝ , and the NB-H has 2 = log k = T 2 x while NB-2 has an intercept-only 2 . All four models have 1 = T 1 x . Because is clearly unimodal, we fit a NB-H VGAM to explore the data; the component functions are overlaid in Fig. 1a. Interestingly, both have an approximate quadratic shape with different peaks and curvature. It was considered safest to model the nonlinearity in both j nonparametrically.
To compare the variants more rigourously we replace the cubic (vector) smoothing splines (Yee and Wild 1996) by regression splines because inference is more standard (Figure 1). In R this was the term bs(doy) used by vglm() rather than s(doy) within vgam(). The term offers 3 degrees of freedom excluding an intercept. Table 4 summarizes the results ranked by AIC. The NB-C was superior followed by the NB-H, and this suggests that the day of the year strongly affects both and k-the other models are too simple. That the NB-C performed best suggests that the relationship between the mean and variance is not as simple as what a basic loglinear relationship can allow. Indeed, appears to be coupled with k in the more complex nonlinear manner provided by the canonical link. Other NB analyses, including the NB-C, on real data are presented in Yee (2020).

Distribution-specific quantile regression: the Weibull distribution
We use the Munich rental guide data (Fahrmeir et al. 2013, Page 5) available in the gamlss R package (data set rent99) which has been analysed using a Box-Cox Cole and Green (BCCG) regression with GAMLSS by Rigby et al. (2013) (Table 5).
As an alternative, we re-analyse this data using Weibull quantile regression with weibullQlink() for two reasons: as a demonstration of the new link function, and to show that the Weibull distribution appears to work well for the target variable rent, even though it is traditionally used more in survival analysis. The proposed GAMLSS model is     where G is either the BCCG or the Weibull distribution. Intercept-only models for and are set for comparison purposes.
To check on the distributional assumptions we fit BCCG and Weibull regressions to the data as in (5.1). The estimated BCCG and Weibull QQ-plots shown in Figure 2, and AICs and BICs from both models in Table 6 show the Weibull distribution performs as well as or better than the BCCG method.
We now compare Weibull quantile regression with = c(0.5, 0.25, 0.50, 0.75, 0.95) T and weibullQlink() from VGAMextra to the quantile of rent under GAMLSS given by y = ⋅qBCCG( , 1, , ) , where qBCCG ( , 1, , ) is the quantile of the BCCG distribution. Figure 3 shows the estimated quantile curves and Table 7 gives the empirical quantiles. Both methods are effectively comparable and perform equally well.

Quantile modelling with the normal distribution
This example is based on Koenker and Hallock (2001) who performed conditional quantile regression of covariates associated to birth weight of live babies to demonstrate its better performance over ordinary least squares (OLS) when estimating the effects on the lower tail of the skewed birthweight distribution. This relationship was initially explored by Abreveya (2001). Koenker and Hallock (2001) used a sample ( n = 198, 377 ) of the June 1997 Detailed Natality Data published by the National Centre for Health Statistics (NCHS) that contains information from live, singleton births, mothers aged 18-45 and residing in USA.

An example using uninormalQlink()
To test the quantile regression framework introduced in Sect. 4.3 we carry out VGLM quantile modelling with uninormaQlink() and uninormalff(), and compare our results to Koenker and Hallock (2001) via rq() from quantreg. Due to availability constraints, we restrict ourselves to a n = 50, 000 subsample of the the 1997 NCHS data (Koenker and Hallock 2001) stored in the file BWeights.csv and incidentally obtained from the SAS ® file Sashelp.BWeights. Section D of the Online Supplements gives a short description of BWeights.csv and SAS ® code used to generate it, as well as supplementary code pertaining this section. BWeights.csv is available in the Supplementary Materials.
Following Koenker and Hallock (2001) and Abreveya (2001), the response is birthweight (recorded in grames) and the remaining factors are added into the model as covariates including a quadratic term for the mother's age and the mother's reported weight gain during pregnancy: We set = (0. 05, 0.25, 0.50, 0.75, 0.95).
The trace output from vglm() looks natural and correct. We have set zero = NULL to allow both linear predictors to be regressed on all the covariates. Note, G 1 in (4.11) is managed by the new link function uninormalQlink(). The function Q.reg() is required to create a multiple responses matrix spanning r = length(mytau) columns. Figure 4 shows the estimated parameters by quantile level from VGAMextra and quantreg for a few relevant covariates. Results conform with Koenker and Hallock (2001, p.150) to a great extent showing similar trends, except by Mom weight gain perhaps due the large range covered by this covariate (largest range among all covariates), as shown in Table 8.
Infants born to black mothers appear to weigh less, by between 100 and 300 grams, than newborns from white mothers across 5% to 95% quantiles. Likewise, although to a smaller extent, 'smoking' (this is Cigarettes per day  1 3 Two-parameter link functions, with applications to negative… (CigsPerDay) coded as '1' if the mother smokes more than five cigarettes p/day and '0' otherwise) is associated with newborns' weight loss. With smaller effects, maternal age appears to top birth weight up incrementally, by 4-6 grams, from the 5% quantile to the 95% quantile, while every kilogram gained during pregnancy is associated with gradual reductions in the infant's weight (4-5 grams). The model summary is in Section D of the Online Supplements.

Discussion
This paper has developed the general methodology for two-parameter link functions and used the NB-C and quantile modelling as the primary examples. The approach taken here is the first of two general options: (i) offer one family function that handles several different link parameterizations; (ii) offer several family functions corresponding to several different parameterizations of the same distribution. Each option has its advantages and disadvantages. Writing a new link function arguably involves less work and also the EIM may be too difficult to derive directly. However, given a choice, it is likely that many practitioners would choose a particular family function and use that solely. Over many years the NB-C has been widely referred to in mathematical statistics because of its connections with core concepts such as sufficient statistics,GLMs and variance functions. Despite this, Hilbe (p. 315, 2011) writes concerning its practical use: "Little work as been done with NB-C models. No research has been published using an NB-C model on data." This might be partially explained by the observation that almost all practitioners use existing implementations written by others and that the NB-C is a model with practical shortcomings such as G 1 < 0 in (4.3) rather than being unbounded. It is disappointing that, after several decades, our implementation is the first to fit the NB-C by a 'proper' algorithm because some packages simply call a general optimizer such as optim(). Our solution is naturally flexible too, as seen by the NB-C-H VGAM fitted in Sect. 5.1.2.
Our results on quantile regression showing that it is loosely 'regression estimated on multiple quantiles' are obviously dependent on a strong distributional assumption, however there are real practical benefits and realistic applications, as discussed in Sects. 4.3 and 5. The Weibull distribution is used invariably to model observed failures in survival analysis and reliability, and the normal distribution is almost foundational for many natural phenomena. Covariates can also be included and their effects on the distributions examined. We believe that drawbacks from a distribution-specific framework are ameliorated by smoothing-based infrastructure capable of identifying nonlinearity automatically and graphically, as well as the handling of a very broad range of response types such as categorical and survival data.
It is not surprising that this work is seemingly related to that of others. For example, Efron (1986) and Smyth and Verbyla (1999) model the mean and dispersion simultaneously in what are called 'double exponential families' and 'double GLMs' respectively. However, both apply separate ordinary links to each of the first two moments rather than including a single link function of both parameters. Likewise, Cepeda-Cuervo et al. (2014) develop 'double GLMs' with random-effects utilizing the same type of ordinary link functions.
There is scope for future work. Slightly more interpretable than the NB-C link might be G = log − log{ ∕( + k)} , and practically, other alternatives include logit{ ∕( + k)} = log{ ∕k} and −1 ( ∕( + k) ; they could easily be estimated with the present methodology. More generally, NB regression has the limitation that it does not handle underdispersion relative to the Poisson, hence other alternatives such as the Conway-Maxwell-Poisson distribution have gained popularity (see, e.g., Sellers and Shmueli 2010). Of course, the methodology could be generalized to M > 2 parameters and in particular, the M = 3 case would match distributions having a location, scale and shape parameter.
Quantile and mean modelling is another area to be further exploited in the short-term. We are applying this methodology to several two-parameter distributions which will soon have, as the Weibull example, links of the form distribution Qlink() or distribution Mlink() alluding the 'quantile' and 'mean' link respectively. We have already commenced work in this direction, e.g., with the 'mean link' for the 2-parameter gamma distribution, viz. VGAMextra::gammaRMlink(). However, the VGLM framework is broader, with further options such as additive models (Yee and Wild 1996) and reduced-rank regression (Yee and Hastie 2003) over the same . We hope to investigate these options too, including nlrq() from quantreg for nonlinear quantile regression.

3
Two-parameter link functions, with applications to negative…