A generalized Hosmer–Lemeshow goodness-of-fit test for a family of generalized linear models

Generalized linear models (GLMs) are very widely used, but formal goodness-of-fit (GOF) tests for the overall fit of the model seem to be in wide use only for certain classes of GLMs. We develop and apply a new goodness-of-fit test, similar to the well-known and commonly used Hosmer–Lemeshow (HL) test, that can be used with a wide variety of GLMs. The test statistic is a variant of the HL statistic, but we rigorously derive an asymptotically correct sampling distribution using methods of Stute and Zhu (Scand J Stat 29(3):535–545, 2002) and demonstrate its consistency. We compare the performance of our new test with other GOF tests for GLMs, including a naive direct application of the HL test to the Poisson problem. Our test provides competitive or comparable power in various simulation settings and we identify a situation where a naive version of the test fails to hold its size. Our generalized HL test is straightforward to implement and interpret and an R package is publicly available. Supplementary Information The online version contains supplementary material available at 10.1007/s11749-023-00912-8.


Introduction
Since their original formulation by Nelder and Wedderburn (1972), generalized linear models (GLMs), which include the linear, logistic, and Poisson regression models, among others, have been used within a vast number of application domains, and are extremely popular in medical and biological applications.Examples can be found, for example, in Myers et al. (2012) and Agresti (2015).One of the most common GLMs, the logistic regression model, is often used when a binary response is to be regressed upon one or more explanatory variables.For example, this may occur when the response represents the survival or death of a patient, and the explanatory variables might be characteristics of the individual or various treatment methods.Other GLMs, such as the Poisson regression model, can be used to model count data.This can occur, for example, when the response is a count of different species of hummingbirds captured during mist-netting sessions in a cloud forest (see Becker et al., 2020).
Naturally, it is desirable to have a model that fits the observed data well.Insight as to whether a model is a poor fit for a given dataset can be provided by a variety of measures, including goodness-of-fit (GOF) tests, as described by Bilder and Loughin (2014).In particular, global (omnibus) goodness-of-fit tests inspect the validity of the GLM model as a whole, with the drawback of not providing much insight into the source of any potential deviation from the null model.Hosmer and Lemeshow (1980) constructed a global GOF test for the logistic regression model, the Hosmer-Lemeshow (HL) test, which applies a Pearson test statistic to differences of observed and expected event counts from data grouped based on the ordered fitted values from the model.As a result, their test is very easy to interpret and is extremely popular, particularly in medical applications.Due to the simplicity of the test, it is tempting to apply it to other GLMs.However, as simulation results in Section 5 suggest, direct application of the HL test to GLMs with many parameters and a non-binomial response results in a distribution of the test statistic that is not adequately represented by a chi-squared distribution in finite samples, thereby adversely affecting the type 1 error rate and power of the test.
A considerable number of the global GOF tests available for GLMs with a non-binomial or non-normal response (or other regression models beyond GLMs) require kernel bandwidth selection-for example, the tests of Cheng and Wu (1994) and Stute and Zhu (2002).However, González-Manteiga and Crujeiras (2013) mention that the selection of the bandwidth is "a broadly studied problem in regression estimation but with serious gaps for testing problems".Also, the introduction of continuous covariates in a GLM model can render certain basic tests, such as the Pearson chi-squared test, invalid, due to theoretical issues that arise (see Pulkstenis and Robinson, 2002).In this paper, we address these issues and develop an omnibus GOF test for many GLMs that is both easy to compute and adaptable to many types of covariates.Specifically, we derive a modification to the HL test statistic that allows it to be generalized to a wide variety of GLMs.The modification is based on application of theory developed by Stute and Zhu (2002), who work in a very general framework.We show that the test statistic has an asymptotic chi-squared distribution for many GLMs in the exponential dispersion family, adding to the appealing simplicity of the test.
We begin Section 2 with an overview of previously developed global GOF tests.Section 3 provides a more detailed description of certain existing global GOF tests, along with our new test.The design for our simulation study comparing these tests is laid out in Section 4, and the results are provided in Section 5.An application to a dataset is also performed in Section 5. We find that our test provides competitive or comparable power in various simulation settings, is computationally efficient, and avoids the use of kernelbased estimators.Finally, we discuss the results and potential future work in Section 6.The proofs of several results are provided in the supplementary material.

Background and Notation
Throughout this paper, we let Y be a response variable that is associated with a covariate vector, X, where X ∈ R d .We write (X i , Y i ), i = 1, . . ., n, to denote a random sample where each (X i , Y i ) has the same distribution as (X, Y ), and provides observed data (x i , y i ).
The Hosmer-Lemeshow GOF test assesses departures between observed and expected event counts from data grouped based on the fitted values from the model.For the moment, we restrict our attention to binary response logistic regression models, in which each y i ∈ {0, 1}, although the HL test can be used in a binomial setting by treating each y i as the count of successes resulting from all trials observed at the corresponding x i .In the binary case, we assume that for some β ∈ R d .The likelihood function is then given by from which a maximum likelihood estimate (MLE), β n , of β can be obtained.
Computing the HL test statistic starts with partitioning the data into G groups.This is often done in a way so that the groups are of approximately equal size and fitted values within each group are similar.The partition fixes interval endpoints, −∞ = k 0 < k 1 < • • • < k G−1 < k G = ∞; the k g are often set to be equal to the logit of equally-spaced quantiles of the fitted values, πi = π(β n x i ).Then, we define i , and πg = E g /n g , where 1(A) is the indicator function on a set A. Then, n g represents the number of observations in the gth group, and πg represents the average of the fitted values within the gth group.Finally, we can define the HL test statistic as . (1) The theory behind the regular HL test, based on work by Moore and Spruill (1975), suggests that the asymptotic null distribution of the test statistic follows the distribution of a weighted sum of chi-squared random variables.Hosmer and Lemeshow (1980) approximated this complex distribution with a single chisquared, where the degrees of freedom were determined partly by simulation.That this is indeed only an approximation of the asymptotic null distribution can be clearly seen first in Hosmer and Lemeshow (1980), Lemeshow and Hosmer (1982), and in many of the aforementioned papers.We show in our own simulations that a direct application of similarly constructed tests to other GLMs, such as the Poisson regression model, can perform poorly due to improper theoretical justification, and that our generalization of the HL test addresses this issue.
Another grouped GOF test for logistic regression models was introduced by Tsiatis (1980) around the same time as the original HL test.This test partitions the covariate space, thereby grouping the data, from which a test statistic can be calculated.Expanding the definition of I (g) i to more generically indicate that the ith observation is in the gth group, then The matrix of the quadratic form for the HL test statistic is diagonal, as seen later in Section 3, whereas for the Tsiatis test statistic it is not.Thus, if groups are formed in the same way for both tests, the main difference between the HL test and the Tsiatis test is that the latter accounts for correlations between groups of residuals.The Tsiatis test statistic, with covariate space partitionings defined before observing the data, can be easily transferred to other GLMs with a canonical link, such as the Poisson regression model with a log link.Canary et al. (2016) constructed a generalized Tsiatis test statistic for binary regression models with a noncanonical link, and then made use of an HL-like partitioning scheme, as in the first case of (2).Their approach is very interesting, but the theoretical validity of such a partitioning method for the generalized Tsiatis test should be verified, similar to what was done by Halteman (1980) in the case of logistic regression.The generalized Tsiatis test statistic for binary regression models is equivalent to our proposed test statistic when both tests use the same partitioning scheme, i.e., either of the cases in (2).We formalize the theoretical validity of the asymptotic chi-squared distribution for the generalized Tsiatis test statistic when the first grouping method of (2) is used.We also allow for extensions to a much wider class of GLMs, as was briefly suggested by Canary (2013).
Previous attempts have been made to generalize the HL test to models other than the logistic regression GLM.For example, Blizzard and Hosmer (2006) and Quinn et al. (2015) apply the HL test to binomial regression models with a log link, and Canary et al. (2016) consider several GOF tests, including the HL test, for binomial regression models with various noncanonical links.The HL test has also been transferred to the multinomial regression model, discussed by Fagerland et al. (2008), and the proportional odds and ordinal logistic regression models, as described by Fagerland andHosmer (2013, 2016).
Tests of fit have also been constructed that can be used with broader classes of GLM models.An early contribution came from Su and Wei (1991), who considered the supremum of a multivariate process based on a cumulative sum of residuals.Stute and Zhu (2002) presented a test statistic based on the Cramérvon Mises statistic applied to a transformed residual process.Other GOF tests for GLMs can be found in Cheng and Wu (1994); Lin et al. (2002); Liu et al. (2005);Rodríguez-Campos et al. (1998); Xiang and Wahba (1995).A nice review of GOF tests for regression models is given by González-Manteiga and Crujeiras (2013).While many of these tests appear to have merit, they do not seem to have been widely adopted in practice.Unfortunately, the p-values accompanying some tests need to be obtained through simulation.Christensen and Lin (2015) noted that with several explanatory variables, simulations required by the Su-Wei GOF test can be "prohibitively time consuming".Aside from computational efficiency, the use of smoothing-based methods in some tests-for example, in Rodríguez-Campos et al. (1998)-raises the question of how these tests perform in higher-dimensional situations.Other concerns with some tests are the interpretability and ease of use, since some tests appear in mathematical context and do not have readily available software.As a consequence, such tests might be inaccessible to users in certain fields.We believe that our new test is fairly accessible, and we show that it works in high-dimensional settings and is computationally efficient.
To generalize the HL test to other GLMs, we use notation similar to that used by Stute and Zhu (2002).We assume under the null hypothesis, given by ( 5) below, that E(Y 2 ) < ∞.As a consequence, E(Y ) < ∞, and we may define m * (x) = E(Y |X = x), and σ 2 (x) = Var(Y |X = x).We also assume that σ 2 (x) > 0 for all x.
The GLM that we fit and test assumes that the conditional density of Y given X = x is an exponential family member with inverse link function, m.Specifically, the conditional density of Y given X = x has the form where the vector β 0 , θ, and ), and ν is a suitable dominating measure.In such a model the variance is a function of the mean and we may write σ 2 (x) = v(m(β x)), for a smooth function v determined by the function b.In some cases it is of interest to add an (unknown) dispersion parameter, φ, to the model and assume further that there is a scalar φ 0 such that the density of Y given X = x has the form with respect to a (possibly different) dominating measure ν, where β 0 , φ 0 , θ, and x are related by similar equalities as before.For simplicity, we focus on the form given by (3), i.e., the dispersion parameter is known.
We later offer a discussion on how to apply our GOF test to models with an unknown dispersion parameter.We test the null hypothesis that m * (x) = m(β x) for some β ∈ R d , and for all values of x that are of interest, for a specified inverse link function m(•), where • denotes the Euclidean norm.We denote the maximum likelihood estimate of β by β n .We specify in Section 3.3 conditions on m(•) and f Y |X , and lay out more precisely to which GLMs our new GOF test may be applied.Also, note that our test is truly a global or "omnibus" test, since rejecting the null hypothesis (5) could suggest that an incorrect link was specified, a variable is missing from the model, or that an entirely different model should be used.Such a test is in line with the global HL test, which does not test against any specific alternative.

Methods and Test Statistics
Tests for GOF require statistics that measure departures from the null hypothesis.To this end, define the true error, a random quantity, as = Y − m(β 0 X).Also relevant to our test is the following process, for u ∈ R, as defined in Stute and Zhu (2002): The HL test can be rewritten as a quadratic form, in terms of the residual process, R 1 n (u).We first define the G-vector Then, the HL test statistic can be rewritten as Provided that G > d-a requirement not often cited in references to the HL test-and upon verifying certain conditions of Theorem 5.1 in Moore and Spruill (1975), Hosmer and Lemeshow (1980) conclude that their test statistic is asymptotically distributed under the null hypothesis as a weighted sum of chi-squared random variables.That is, where each χ 2 1j is a chi-squared random variable with 1 degree of freedom, and each λ j is an eigenvalue of a particular matrix that depends on β 0 and the distribution of X.Then, through simulations, they conclude that the term

∼ χ 2
G−2 .However, in certain settings with a finite sample size this does not serve as a good approximation, as we discuss in Section 5.

A Naive Generalization of the Hosmer-Lemeshow Test
The HL test statistic depends on the binomial assumption only through D in (8), with the gth diagonal element representing an estimate of the variance of the counts in the gth group, divided by n.To extend this test to other GLMs, it is tempting to define a "naive" HL test statistic where similar to the estimates of the variances of group counts given in the original HL test through D. For example, for Poisson regression models one would use since the conditional variance of the response is equal to the conditional mean.This idea is very briefly suggested in Agresti (1996) on p. 90, and in Bilder and Loughin (2014).Implementing this test for Poisson regression models, our simulation results suggest that as the number of estimated parameters in the model increases, the mean and variance of the C * G test statistic tend to decrease for a fixed sample size.Further properties of this test are discussed in Section 5.1.

The Generalized HL Test Statistic
We now provide the test statistic for our generalized HL test.First, we define some important matrices.Let , for i = 1, . . ., n, and g = 1, . . ., G. Also, let X * be the n × d matrix whose ith row is given by x i , for i = 1, . . ., n. Define W 1/2 n and V * 1/2 n the same as W 1/2 and V * 1/2 , respectively, but evaluated at β n instead of β 0 .Note that if there is a dispersion parameter in the model then W n will involve φ 0 and W * n will involve a consistent estimate, φ n .Similarly, φ n will be included in an estimate of V * 1/2 n . It is also possible to replace k g with random interval endpoints, k n,g , provided that these are consistent estimators of k g and P (β 0 X = k g ) = 0 for all g.This is discussed in more detail in Section 3.3.In our simulation study, described in Section 4, we use random interval endpoints so that is approximately equal across groups.Details of this implementation are given in the supplementary material. Define where I n is the n × n identity matrix, and n , also referred to as the generalized hat matrix.We denote the Moore-Penrose pseudoinverse of a matrix A by A + .Our "generalized HL" (GHL) test statistic is then given as Under some conditions, described in detail in the Theorem of the Appendix, we have that with ν = rank(Σ), where Σ is specified in the Theorem and supplementary material.

GLMs for which the GHL Test is Valid
Formally, conditions (A), (B), (B'), (C), and (D) of the Appendix should be verified before using the generalized HL test statistic, X 2 GHL .From the Theorem, the conditions below are sufficient for the validity of conditions (B), (B'), and (C): (i) One of the distribution / link function combinations from Table 1 is used.(The conditions may hold for other distributions or link functions, but separate verification would be required.) (ii) The joint probability distribution of the explanatory variables, X, has compact support, and the distribution of β 0 X is absolutely continuous.Also, the support of β X is an interval for all β in a neighbourhood of β 0 .
We reiterate that these are sufficient conditions; our test is not necessarily limited to such GLMs.The supplementary material outlines how to verify conditions (A), (B), (B'), (C), and (D), how to weaken the assumption of compact support from (ii), and how to extend our test to other GLMs.
It is also generally possible to apply the GHL test to GLMs in which the dispersion parameter associated with the response is unknown by adding appropriate conditions, such as consistency of the estimator of the dispersion parameter.In this case, we have found that a larger sample size is usually needed to ensure that the sampling distribution of the test statistic in finite samples is well approximated by its limiting chi-squared distribution.

Related GOF Tests
For the simulation study described in Section 4, we compare our test to tests given by Su and Wei (1991) and Stute and Zhu (2002).
Using our notation, the Su-Wei (SW) test statistic is defined as where and 1(x ≤ u) is an indicator for the event that each component of x is less than or equal to each respective component of u.With continuous covariates, finding the supremum can require R n (u) to be evaluated at approximately n d−1 values of u.Adding to the complexity, obtaining p-values relies on a simulation procedure, described in more detail in Su and Wei (1991).Even for a relatively small sample size, such as n = 100, computing the SW test statistic and p-value can be computationally intensive with several predictors.Stute and Zhu (2002) present a test statistic based on the Cramér-von Mises statistic applied to a transformed version of the R 1 n process, T n R 1 n .Setting x 0 to be, say, the 99th percentile of the observed linear predictors, β n x i , i = 1, . . ., n, the Stute-Zhu (SZ) test statistic can be defined as where and σ 2 n (u) is a consistent estimator of Var(Y |β 0 x = u), satisfying properties mentioned in their paper.The limiting sampling distribution of the test statistic is described in Stute and Zhu (2002).
Although Stute and Zhu (2002) pose the null hypothesis in terms of the mean only and do not require a distributional assumption in the null hypothesis, we will generally add such an assumption.A distributional assumption places restrictions on the forms of nuisance parameters in the SZ test statistic.For example, in the case of Poisson regression, σ 2 n (u) = e u for all n.We also modify the SZ test statistic in the Poisson model by having This modification allows the SZ test to detect violations of our narrower null hypothesis, such as overdispersion, that its original formulation would not permit.For the selection of a kernel bandwidth in T n R 1 n , we use a bandwidth of 0.5/ √ n, as used by Stute et al. (1998).
We denote the percentile of the linear predictors used to define x 0 by p x0 .We find that calculating T n R 1 n involves inverting matrices that might not be invertible even when p x0 is much lower than 0.99, particularly when there are binary covariates or many variables present.We therefore try values of p x0 in {0.99, 0.98, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7} in a decreasing order, and use the same value of p x0 across all realizations in a given simulation setting.It is our opinion that having p x0 < 0.7 does not include enough of the data for the test statistic to be truly meaningful, and in such cases we omit the calculation of the statistic unless otherwise stated.

Simulation Study Design
In order to assess the performance of our proposed global GOF test, we perform a simulation study.We focus on applications to data generated from Poisson distributions, since Poisson appears to be a common non-binomial GLM.We compare rejection rates under different null and alternative hypothesis settings for four different global GOF tests: the naive generalized HL test, our new generalized HL test (GHL), the SZ test, and the SW test.The naive generalized HL test is included to demonstrate that a test without proper theoretical justification may fail.We believe that the SZ test has an appealing construction, but is perhaps not as well known as the SW test, which can be found in other simulation studies.These two tests are included because they do not rely heavily on kernel-based density estimation and are among the tests that are relatively straightforward to implement.
Under the null hypothesis we consider six settings with a log link and three with a square root link, varying the distribution of explanatory variables and the true parameter values in each case.To examine the power of each of the GOF tests we consider four different settings representing different model flaws, each with various sub-settings.Finally, we compare the empirical size of tests under the null hypothesis as the number of parameters is increased, keeping the sample size fixed, to illustrate that the naive generalized HL test does not perform well in this setting.Details of all of these cases are given in the next subsections and the supplementary material.
We use sample sizes of 100 and 500 throughout this simulation study, representing moderate and large sample sizes in many studies in medical and other disciplines where GLMs are used.For each simulation setting we produce 2500 realizations.Only the results for the sample size of 100 are reported for the null and power simulation settings, and important differences between the two settings are summarized in Section 5. Approximate 95% confidence intervals for the type 1 error rates in the null simulations can be obtained from the observed rejection rates by adding and subtracting 0.009, whereas conservative 95% confidence intervals for power can be obtained from the observed rejection rates by adding and subtracting 0.02, accounting for the widest interval, when a proportion is equal to 0.5.For power comparisons among tests, pairwise McNemar tests are used, along with Bonferroni corrections.For power settings 1, 2, 3, and 4, we account for the 24, 24, 24, and 12 possible comparisons, respectively.We therefore check for p-values less than 2.08 × 10 −3 or 4.16 × 10 −3 , depending on the power setting.The simulations are performed using R.

Null Distribution
In the first three null settings with a log link, a model with a single covariate and an intercept term is used.These settings serve to examine the effect of small and large fitted values on the null distribution of the test statistics.For these settings, the true regression model is E(Y |X = x) = exp(β 0 + β 1 x).The conditional distribution of Y given X is Poisson, and realizations of X are drawn from U (−3, 3), where U (a, b) denotes the uniform distribution on [a, b].The values of β 0 and β 1 are chosen so that the fitted values take on a wide range of values (approximately 0.1 to 100) in the first setting, are moderate in size (approximately 1 to 10) for the second setting, and are very small (approximately 0.1 to 1) for the third setting.
For settings 4, 5, and 6, coefficients are chosen so that the fitted values are moderate in size (rarely less than 1, with an average of approximately 4 or 5), so that other sources of potential problems for the GOF tests can be identified.The fourth setting examines a model including both continuous and dichotomous covariates (the "Normal-Bernoulli" model), and is similar to the one used in Hosmer and Hjort (2002).For this setting the true regression model is E and (X 1 , X 2 |B = 1) ∼ N ((1, 1), Σ), with Var(X 1 ) =  Var(X 2 ) = 1, and Corr(X 1 , X 2 ) = 0.5.The fifth and sixth simulation settings examine the effects of correlated and right-skewed covariates, respectively.It is well known that in the presence of multicollinearity the variance of regression parameter estimates can become inflated.The correlated covariates setting is included to assess the impact of multicollinearity on the GOF tests, since the estimated covariance matrix of the regression coefficients is used in the calculation of the GHL test statistic.For this setting, the true and Corr(X 1 , X 2 ) = 0.7.The right-skewed covariate in setting 6 is included in order to assess its potential impact on the SZ test, since this test makes use of kernel-based density estimation as a part of the calculation of the test statistic, albeit in a one-dimensional case.This setting considers X ∼ Exp(1), where Exp(λ) is the exponential distribution with rate λ.The true regression model is as in settings 1 to 3.
The three settings for the square root link Poisson regression simulations are the same as settings 1, 2, and 3 with the log link, except that the true regression model is E(Y |X = x) = (β 0 + β 1 x) 2 , and the coefficients β 0 and β 1 are modified so that fitted values are approximately within the same three intervals as for settings 1 to 3. All model coefficients and further details of the simulation study are given in the online supplemental materials.

Power
To examine the power of the GOF tests, we consider four types of deviations from the null model: a missing quadratic term, overdispersion, a missing interaction term, and an incorrectly specified link function.These settings are similar to those used in Hosmer and Hjort (2002) and are realistic model misspecifications.In all four settings we choose appropriate regression coefficients so that the fitted values are moderate in size, rarely less than 1 and often smaller than 10 (in setting 3, some fitted values are larger than 10), to ensure that a large rejection rate is not simply due to small fitted values in the Pearson-like test statistics.The size of the deviation from the null model in settings 1 to 3 is indexed by J.All four power simulation settings are described in detail in the supplementary material.

Performance with Large Models
To assess the performance of each of the tests with large models, for a model with d parameters, we generate data so that E(Y Realizations of Y are again drawn from a Poisson distribution with the given mean, and we use n = 100 and 500 for our sample sizes.In order for the distribution of the fitted values to remain approximately constant as the number of parameters, d, is varied, we set β 0 = 1.67, and β 1 = . . .= β d−1 = 0.0717/(d − 1), resulting in a distribution of fitted values mostly within the interval [1, 10], ensuring that expected counts within each group used in the calculation of the Pearson statistic are sufficiently large.We use d = 2, 10, 20, 30, 40, 50.The SW test is omitted due to computational challenges that arise with this test with large models.Also, the SZ test is omitted since we observe that the choice of p x0 sometimes has to be less than 0.7 when d is large and n = 100.This is because at least d − 1 of the matrices that need to be inverted in the calculation of the SZ test statistic have less than full rank.

Null Distribution
From the null simulation results in Table 2, we see that the estimated type 1 error rate for our GHL test does not significantly differ from the nominal level, since all values are in the interval (0.041, 0.059).On the other hand, the naive generalized HL test falls out of this interval in three settings, whereas the SW and SZ tests fall out of this interval in two and four settings, respectively.Interestingly, even with d = 4 in setting 4, we begin to see a decreased type 1 error rate for the naive generalized HL test, a phenomenon discussed in more detail later in this section.However, with a sample size of 500, the naive generalized HL test and the SZ test generally have better empirical rejection rates, whereas the SW test has similar poor performance even with a larger sample size.

Power
From the power simulation results displayed in Figure 1, we see that our new test does, indeed, have power to detect each of the violated model assumptions that we tested.However, for those model flaws that are detectable by the SZ and SW tests, these two tests generally have better power than the tests based on grouped residuals.Using pairwise McNemar tests with a Bonferroni correction, described at the beginning of Section 4, we see that the SZ test performs significantly better than all of the other tests in the first setting (p < 4.9 × 10 −6 for 11 out of 3 × 4 = 12 comparisons), with only the comparison between the SZ and the SW tests at J = 4 being not significant (p = 8.3 × 10 −3 ).In the second setting, we see that our test has the greatest power for detecting overdispersion.The difference is significant in comparison to the naive generalized HL test for J = 1/8, 1/4, 1/2 (p < 9.0 × 10 −5 for all three comparisons).The difference in power between the GHL test and the SW and SZ tests is significant for all J in this setting (p < 3.9 × 10 −34 for all 8 comparisons).We note that with the original ψ n (x 0 ) from (15) the SZ test is not able to detect overdispersion.Interestingly, in the setting with a missing interaction term, our test performs better than the SW test for J = 16, 20, with the smaller sample size of 100 (p < 4.4 × 10 −11 for both comparisons).However, with a sample size of 500, not displayed in Figure 1, the SW test seems to outperform our test.The SZ test has greater power to detect a missing interaction than the GHL test for J = 12, 16, 20 (p < 7.3 × 10 −14 for all three comparisons), and the difference between the GHL test and the naive generalized HL test is not significant with this sample size (p > 0.04 for all four comparisons).The results from the fourth setting are provided in the supplementary material in Table 9, but are not displayed here.The SZ test seems to detect an incorrectly specified link function with the most power, although all tests have relatively low power to detect the incorrectly specified link function, possibly due to our choices of alternative link functions, which did not deviate drastically from the fitted model over the range of means chosen.We speculate that global GOF tests should generally be able to detect a misspecified link very well, perhaps better than other model misspecifications, due to trends that can arise in the residuals in such a scenario, provided that the range of means is sufficiently large.

Large Models
The null distribution of the naive generalized HL test statistic, C * G , is not well-approximated by the usual χ 2 G−2 distribution in this particular setting with a finite sample size.The impact of the number of parameters on the estimated mean of the naive generalized HL test statistic can be seen in Figure 2, where the G − 2 degrees of freedom approximation for this test deteriorates as the model size grows, relative to the sample size.This adverse effect is less pronounced with a larger sample size.The estimated type 1 error rates are omitted, but they steadily decrease for the naive generalized HL test from about 0.050 to 0.001 as d grows from 1 to 50 with a sample size of 100, and down to 0.030 with a sample size of 500.Our proposed test does not seem to be affected by the number of parameters present in the model.Similar results were obtained for both tests with G = 50, used to ensure that G > d.

Application
We study the alcohol consumption dataset used in the study of DeHart et al. (2008), as described in Bilder and Loughin (2014).The goal of the study was to assess how the number of alcoholic drinks consumed is associated with factors such as self-esteem and negative romantic-relationship events.Bilder and Loughin (2014) performed a Poisson regression analysis on a subset of the data.The variable NUMALL, representing the number of alcoholic drinks consumed by a subject on their first Saturday, was regressed against several variables.After a fairly extensive analysis of the data, the authors arrived at a final model containing all of the main effects of the variables in Table 3, including the following interactions: ROSN × PREL, AGE × ROSN, DESIRED × GENDER, DESIRED × AGE, and STATE × NEGEVENT.
Understanding the questionable validity of performing GOF tests following model selection on the same data, we examine the fit of three different models using the GOF tests discussed in this paper.Additionally, the fit of several other models is examined in the supplementary material.We have n = 89, and we use G = 10 and G = 18 groups.The larger number of groups is included to ensure that G > d, as was mentioned in Section 3, while still maintaining an average of about 5 observations per group.Because of the relatively high dimension of the data, the SW test is excluded in the evaluations of all three models.
We first examine the overall fit of the model mentioned at the beginning of this subsection (case 1).As seen in Table 4, none of the tests considered reject the null hypothesis at the α = 0.05 significance level.The naive HL gives somewhat larger p-values than the generalized version, matching expectations due to the moderately large number of variables relative to the sample size.For case 2 we fit a square root link instead of the original log link, retaining all of the variables from the model in case 1.The GHL results with both values of G and the naive HL with G = 18 suggest that the square root link may provide a poorer fit than the log link.However, the naive HL with G = 10 and the SZ test do not reject the null hypothesis.Finally, in case 3, we omit all interactions from the the model in case 1.Here, we expect that the tests should detect a poor fit due to missing variables.In this case we see that the GHL test with 18 groups, the naive generalized HL test with 10 and 18 groups, and the SZ test reject the null hypothesis.While the GHL test with 10 groups fails to reject the null hypothesis that the model is correct, the p-value is still very close to 0.05.

Discussion
We have introduced several global GOF tests for GLMs and assessed some of their properties.With the simulation results of Section 5, we have seen that our test provides competitive or comparable power in various simulation settings.Our test is also computationally efficient, straightforward to implement, and functions in a variety of scenarios.There is no need for a choice of a kernel bandwidth (although the choice of number of groups, G, can play a role in determining the outcome of the test), and the output can be interpreted in a meaningful way by assessing differences between observed and expected counts in each of the groups.
In many of the null and power simulation settings, d was relatively small and the covariates were often continuous.This choice was deliberate, so that power comparisons would be possible, because it is sometimes not possible to calculate the SW and SZ test statistics in the presence of many variables or binary covariates.Also, fewer alternatives are tested by the SW test than by our proposed GHL test.Thus, while SZ and SW were more powerful than the GHL test in certain settings friendly to these tests, there is room for all of these tests in an analyst's toolkit.
The naive generalization of the HL test does not work well under certain settings, but the use of the GHL test seems to resolve these issues.However, the choice of the number of groups, G, can have an impact on the corresponding p-value in group-based tests.As a suggestion, one can choose a value of G so that the values of are larger than, say, 5 in each group, if possible.This is analogous to the common recommendation for the finite-sample validity of the Pearson chi-squared test.Large values of G might be able to increase the power of the GHL test when they can be supported, although this should be studied in more detail.
Letting µ denote the distribution of X, there exists a µ-integrable function M (x) such that, for all β in a neighbourhood of β 0 , and for i = 1, . . ., d, |q i (x, β)| ≤ M (x).That is, for some δ > 0 we have E sup
H is uniformly continuous in u at β 0 .This condition requires that β 0 X have a continuous distribution; in particular, β 0 = 0.
under the null hypothesis.The matrix Σ is specified in the supplementary material.
We now state the main theorem of our paper.The proof of this theorem is given in the supplementary material.
Theorem.Suppose that E(Y 2 ) < ∞ and conditions (A), (B), and (C) hold.Then, under the null hypothesis given by ( 5), with f Y |X as in (3), for non-random k g , g = 0, . . ., G (or random k n,g , provided that k n,g is consistent for k g for each g), we have that where S 1 n is as defined in (7), and Σ is given by ( 17) and ( 19) in the supplementary material.Furthermore, if there exists a matrix Σ n that satisfies condition (D), then with ν = rank(Σ).
If conditions (i) and (ii) of Section 3.3 are satisfied, then conditions (B), (B'), and (C) hold.If (A) additionally holds (consistency and asymptotic normality of the MLE), then the matrix Σ n as given by (10) satisfies condition (D)(i) (consistency for Σ) under the null hypothesis, for random or fixed interval endpoints (as described above).In this section we describe the four power simulation settings in greater detail.For the missing quadratic term in setting 1, the true model is but we omit the quadratic term when fitting the model.In this setting, we use X ∼ U (−3, 3).Four subsettings are considered, in order to investigate the impact of varying degrees of deviation from the null model.The coefficients β 0 , β 1 , β 2 are chosen so that the mean of Y when X = −3, 0, 3 is J, 5, and 8, respectively.We use J = 4, 6, 8, 10.
In order to simulate overdispersion for setting 2, we draw realizations of Y given X from a negative binomial distribution with mean and variance for J = 1/16, 1/8, 1/4, 1/2.Larger values of J represent larger deviations from the Poisson distribution and greater overdispersion.In this setting we also use X ∼ U (−3, 3).For setting 3, the true model is with X ∼ U (−3, 3) and B ∼ Bernoulli(0.5).The model that is fit excludes the final interaction term.Four sub-settings are considered for setting 3, similar to setting 1.The coefficients are chosen so that the mean of Y when (X, B) = (−3, 0), (−3, 1), (3, 0), (3, 1) is equal to 5, 5, 7, and J, respectively.We use J = 8, 12, 16, 20, representing varying amounts of deviation from the fitted model.The SZ test is not omitted in setting 3, even though we were unable to perform some of the simulations even with p x0 = 0.7.This can sometimes occur when binary covariates are present in the model because some matrices that should be inverted in the calculation of the test statistic become singular.Instead, data that resulted in an inability to compute the SZ test statistic was omitted, resulting in fewer than 2500 simulation realizations in each sub-setting.Finally, in order to assess the effect of an incorrectly specified link function in setting 4, we consider the square root and identity link functions as two sub-settings, fitting a log-linear Poisson model for both.We have X ∼ U (−3, 3), and the coefficients are chosen so that the conditional mean of Y given X = 0, 3 is 5 and 8, respectively.

Implementation of the Interval Endpoint Selection Method
For the GHL test, a special interval endpoint selection method is used in order to keep n i=1 σ2 (x i )I (g) i roughly constant across groups.The implementation is based on the "weighted.quantile()"function from the "spatstat" package in R. In order to avoid having groups with no observations-this can occur, for example, when a very large fitted value is present-first the weighted (G − 1)/G × 100th percentile is obtained.Then, observations that fall into this group are discarded, and the weighted (G − 2)/(G − 1) × 100th percentile is obtained from the remaining data.This process is repeated until G groups are formed.

Degrees of Freedom for the GHL Test
The GHL test statistic is compared to a chi-squared distribution with a certain number of degrees of freedom in order to obtain a p-value.Since the rank of Σ is generally unknown, we use the rank of Σ n in its place.The rank of Σ n is very often found to be G − 1 in the simulation study, with rare exceptions occurring in some simulation realizations (e.g., on some realizations when a categorical covariate is present in the model).

GLM Convergence with Noncanonical Links
When fitting a Poisson GLM with identity or square-root link in R, certain issues can arise.In general, we aid convergence of the parameter estimates for non-canonical link models by either providing fitted values from a previous model, or by providing rounded versions of the true parameter values to the "glm()" function call.On occasion, fitting the GLM with a noncanonical link results in a warning, in which case the particular simulation realization is omitted, resulting in fewer than 2500 simulation realizations.

Alcohol Consumption Study Additional Results
Interestingly, when an outlier is removed from the alcohol consumption dataset and the same model is used as in Case 1, the fit of the model seems to get worse, as suggested by 3/5 of the GOF tests.The fit from the naive GHL test with G = 18 is only slightly better, but the fit from the SZ test increases reasonably.However, when another variable selection procedure is run without the outlier, the fit seems to become better again for each of the tests.Also, when the variable STATE is additionally removed from the model in Case 3, none of the GOF tests reject the null hypothesis.In this scenario, the naive HL and the GHL tests still have p-values relatively close to 0.05.Table 5: Link and inverse link functions considered

Link function name
Link function form Inverse link function form Table 6: Several conditional distributions of Y given X that can be written in the form of (4).We primarily concern ourselves with the latter half of the Theorem.The first half of the Theorem is implicitly proven in Section 8.2 below.We show that conditions (B), (B'), and (C) are satisfied for GLMs that satisfy both conditions (i) and (ii), mentioned in Section 3.3 of the paper.The results of this paper, however, are not necessarily limited to such GLMs.We also offer general discussion for conditions (A), (B), (B'), (C), and (D).Condition (A) generally holds for maximum likelihood estimators in both standard GLMs and in models with a dispersion parameter.We also provide some comments on condition (D), as well as alternative approaches that can be taken should this condition not be met.Later in the supplementary material we show that the matrix Σ n as given by ( 10) in the paper satisfies condition (D)(i) under the null hypothesis and conditions (i) and (ii).
We begin by restating the model under our null hypothesis.The sequence (X 1 , Y 1 ), . . . is iid.For a GLM we assume the conditional density of Y given X is with respect to some dominating measure, say ν.The mean of Y (conditioned on X) is b (θ) and the variance of Y is b (θ).The mean of Y is related to x via b (θ) = m(β 0 x) for some β 0 and a specified function m.Since b (θ) > 0, the function b is injective on the natural domain Θ = {θ : exp(yθ)ν(dy) < ∞}.This means that we may write θ as a function of m(β 0 x).
Let u be the inverse link function that satisfies (This is the composition of the inverse function of b and the mean function m.Note that u is the identity function if m is the canonical link.)For models without a dispersion parameter, the score function from observation i is U i (β 0 ), given by and the Fisher information matrix is 8.1 Verifying Conditions (A), (B), (B'), (C), and (D) for Various GLMs Conditions for asymptotic normality and weak consistency of the MLE β n for β 0 in GLMs can be found in Fahrmeir and Kaufmann (1985).They show that in a GLM without a dispersion parameter and with the canonical link function there is, under very mild conditions, a root β n of the likelihood equations that is consistent for β 0 in the case of iid covariates.We first consider canonical (natural) link functions, since such links make the log-likelihood convex.If X has compact support and E(XX ) is positive definite (i.e., the covariance matrix of X is positive definite), then β n asymptotically exists and is strongly consistent, by Corollary 3 of Fahrmeir and Kaufmann (1985).We now consider the case where the support of X is not compact and introduce condition (R s ) of Fahrmeir and Kaufmann (1985): (R s )(i) : I 1 (β 0 ) exists and is positive definite, and (R s )(ii) : E max β∈N XX b β X exists for some compact neighbourhood, N , of β 0 .
If condition (R s ) holds, then β n asymptotically exists and is strongly consistent, also by their Corollary 3. The expansion in condition (A) follows.
For link functions other than the canonical link, it can be the case that the score function has multiple roots.In this case, results from Fahrmeir andKaufmann (1985, 1986) assert that there is a consistent root and that this root satisfies the expansion in condition (A).Some additional conditions might be required to establish weak consistency of β n with noncanonical links.

Condition (B)
The inverse link functions considered are mentioned in Section 3.3 of the paper.All of these are continuously differentiable on the whole real line, so the first part of condition (B) is met.
The second part of condition (B) imposes moment conditions on the covariates.Consider first the log link, for which m(µ) = e µ .If there is a neighbourhood of β 0 in which X has a moment generating function then condition (B) holds.For the logit, probit, cauchit, cloglog, and identity links it is enough that X have a finite mean; that is, for each 1 ≤ j ≤ d we have E(|X j |) < ∞.For the square root link we require finite variances: for each j we require E(X 2 j ) < ∞.All of these moment conditions are immediate consequences of an overall assumption that the covariates X lie in some bounded set with probability 1.Here are some details.
Consider first the log link, i.e., m(β x) = exp(β x).Suppose that for some > 0 the random vector X has a finite moment generating function E(exp(β X)) < ∞ for all β such that β − β 0 ≤ .Then it can be shown using the convexity of β → exp(β x) that for any 0 < r < and any finite τ we have E sup Condition (B) then holds with Indeed, for i = 1, . . ., d, we have, for all β with β − β 0 ≤ r, For the identity link we have q i (x, β) = x i .For the four bounded links the derivative m is also bounded and we have For the square root link we see m (µ) = 2µ and

Condition (B')
Let Θ be the set of θ for which exp(yθ)ν(dy) < ∞.Then Θ is an interval in the real line.On the interior of that interval the function b has infinitely many derivatives so in particular b, b and b are all continuous on the interior of Θ. Continuity of the map (x, β) → σ 2 (x) holds, since All of the inverse link functions considered from Section 3.3 are continuous on the real line, and continuity is preserved when continuous functions are composed, multiplied, or added.All M (x) considered below are µ-integrable, since we have assumed that the support of the joint probability distribution of the explanatory variables is compact.Weaker moment conditions like those discussed under Condition (B) above are straightforward, but the details depend on the specific variance function and the specific link; some details are provided below in a case by case list of the models.
We can then set M (x) = exp(( β 0 + r) x ).On the other hand, with the square root link, Thus for this model we get the same moment conditions for these two links as we had when discussing Condition (B).Other link functions can be handled similarly.

Condition (C)
This condition requires that β 0 X have a continuous distribution.In particular, it will not be satisfied with only discrete covariates or with β 0 = 0.This is a real restriction; if the covariates are unrelated to the response then in the limit all of the observations will be in a single cell.

Condition (D)
We later show that if conditions (i), (ii), and (A) hold, then for Σ n given by (10), Σ n p − → Σ, under the null hypothesis.Therefore, condition (D)(i) is satisfied.However, condition (D)(ii), that rank(Σ n ) p − → rank(Σ), can be more difficult to verify.Our simulation results from Section 5 suggest that the verification of condition (D)(ii) should not be a major concern.Nevertheless, we provide an alternative approach that can be taken to avoid this potential problem.
Along the lines of Proposition 2 of Lütkepohl and Burda (1997) and the "trimmed" or "winsorized" tests of Davidov et al. (2018), the main idea is to make use of the eigendecompositions of Σ n and Σ, so that Σ n = E n Λ n E n and Σ = EΛE , where the columns of E, E n are orthogonal, and Λ, Λ n are diagonal matrices.Then, we can "trim" Σ n by setting all entries of Λ n that are smaller than some > 0 to zero.This prevents undesirable instabilities when making use of generalized inverses of Σ n in test statistics.We refer readers to Lütkepohl and Burda (1997) and Davidov et al. (2018) for more information.

Obtaining an Estimator of Σ
We obtain a consistent estimator of Σ. Throughout, we assume that conditions (i), (ii), and (A) are satisfied.This implies that conditions (B), (B'), and (C) are also met.In the following discussion the interval endpoints, k g , are fixed, although the reasoning extends to the case where random interval endpoints are used, as described in the Theorem.It is also assumed that E(Y 2 ) < ∞, f Y |X is as in (3), and that the null hypothesis is true.

Further Notation
We borrow and modify some of the notation used in Stute and Zhu (2002).Under our conditions they show that the sequence of processes R 1 n converges weakly to a centered Gaussian process R 1 ∞ with the structure whose terms we now describe.The process R ∞ is a centered Gaussian process and has covariance kernel K(s, t) = ψ(min{s, t}), where and F β0 denotes the distribution of β 0 X.The vector valued function Finally, Γ is a d-dimensional normal vector with zero means and covariance matrix [I 1 (β 0 )] −1 , the inverse of the Fisher information matrix for a single observation.From Stute and Zhu (2002), For fixed interval endpoints, k g , that do not depend on the observed data, we therefore have where S 1 ∞ is multivariate normal, MVN G (0, Σ).We now specify the elements of the matrix Σ.

Elements of the covariance matrix Σ
We first look at the diagonal elements of Σ.
because it can be shown that We now show that the above equation ( 18) holds.From Agresti (2015), for the models under consideration, Next, for g = g , from the covariance kernel of R ∞ .Also, the last three terms are equal in absolute value, because and by making use of equation ( 18) from above.Similarly, Therefore, The proof of the consistency of Σ n , as defined in (10) of the paper, is given below.In particular, if conditions (i) and (ii) of Section 3.3 hold, and if condition (A) additionally holds, then under the null hypothesis.This result holds for fixed interval endpoints, k g , or random interval endpoints, k n,g , provided that the latter are consistent for k g and P (β 0 X = k g ) = 0 for all g.(Condition (C) guarantees this latter condition.)Finally, under condition (D)(ii) and by Theorem 1 of Andrews (1987), where ν = rank(Σ).

Proof of Consistency of Σ n
We show that Σ n , defined in (11), is a consistent estimator of Σ under some conditions.Define w B (u) = m (u), and Then, for δ > 0, define We now introduce another condition that will help prove the consistency of Σ n .

Condition (E)
There is a δ > 0 such that Remark: As a consequence of this result, for the GLMs listed in Section 3.4, it is not necessary for X to have compact support as in condition (ii).Provided that (i), (A), (B), (B'), and (C) hold, along with the moment conditions listed in Table 10, our estimator Σ n is consistent for Σ under the null hypothesis.
Proof: The estimator Σ n can be written in the form These matrices are given by The matrices G n (β), V n (β), and W n (β) are the same as G * n , V * n , and W n defined in the main text, except that the latter matrices are evaluated at β n .Here we have emphasized the dependence of each entry on β; we will show that under our conditions each of A n , B n , and C n converges to its expected value uniformly in β ∈ N where N = {β : β − β 0 ≤ δ} with δ from Condition (E).We will also show those limits are continuous functions of β.This will finish our proof of consistency.
Each of these three matrices can be written in the form Under our assumptions, the matrix valued functions H involved have finite expectations for all β ∈ N .Our proof then uses Glivenko-Cantelli theorems, that is, uniform laws of large numbers; see the Lemmas 1 and 2 below.

Consistency of A n
For the matrix A n the g, g entry in H(x, β) is which vanishes unless g = g , in which case it is simply We apply Lemma 2. For u ∈ R and β ∈ R d define the function f β,u by and the class of functions F A by We apply the Lemma with β * = β 0 , h(β, x) = v(m(β x)), and M = M A .With these choices, conditions i-iii of the Lemma come immediately from Condition (E).We conclude that We have shown that almost surely.Consistency of A n then follows from continuity in β of J(β, k g ) (for all g).The Dominated Convergence Theorem shows that for any deterministic sequence β n converging to β 0 we have . This last follows from Condition (C).

Consistency of B n
For the matrix B n the g, j entry in As for A n we define, for u ∈ R, β ∈ R d , and j ∈ {1, . . ., d}, the function f β,u,j to be the jth component The argument for A n together with the assumption on M B may be followed to prove that B n (β n ) converges almost surely to its expectation.
The condition is the same as the one associated with M A,δ .Finally, so our condition is that X has a finite moment generating function in a neighbourhood of β 0 .We also need E(e 2β 0 X ) < ∞ and E(e 2β 0 X X 2 ) < ∞.These lead to the same condition as the one associated with both M A,δ and M B,δ .That is: for the Poisson family with log link, Condition (E) holds as long as X has a finite moment generating function in a neighbourhood of 2β 0 .
For the square root link we have m(u) = v(m(u)) = u 2 .We find that Writing β in the form β 0 + ax + bv with v any vector perpendicular to x shows that we can regard b = 0, and then The maximum of this piecewise linear function of a that occurs in the braces must occur on the boundary of the interval imposed on a, so it is easily checked that Thus, M A,δ is square integrable if X has 4 finite moments.We see that m (u) = 2, and thus M B,δ (x) = 2 x 2 , which requires 4 finite moments.Finally, it is easily checked that for this link which does not depend on β.For the Poisson family with a square root link, we also require E((β 0 X) 4 ) < ∞ and E(4(β 0 X) 2 X 2 ) < ∞, but these do not add any further moment conditions.That is: for the Poisson family with square root link, Condition (E) holds as long as X has 4 finite moments.
Bernoulli Family: For the Bernoulli(θ) model we have v(m) = m(1 − m).The inverse links we consider all have the form m(u) = F (u), for a smooth cdf F with corresponding smooth density f .We therefore have which is bounded by 1, and m (u) = f (u).
For all 4 links (logit, probit, cauchit, and cloglog) we find that there is a constant, C, such that, for all u, f (u) ≤ C.
For all 4 links there is also another constant, C 2 , such that, for all u, and so which amounts to 2 finite moments.Also, which amounts to 4 finite moments.
The quantity 2 is bounded by some constant.Therefore, our condition on M C,δ (x) amounts to requiring three finite moments for X.
Gamma Family: Here v(m) = m 2 /k, where k > 0 is the shape parameter in the Gamma distribution.We consider the log link, m(u) = e u .The square root link, m(u) = u 2 , is included to highlight how to deal with other link functions.
For the log link we find that v(m(u)) = e 2u /k, and m (u) = e u , so that which means our condition on M A,δ is that X has a finite moment generating function in a neighbourhood of 4β 0 .We also take M B,δ (x) = x 2 sup e β x : β − β 0 ≤ δ , which leads to the strictly weaker condition that X has a finite moment generating function in a neighbourhood of 2β 0 .Finally, whose derivative is 0 so we need only M C,δ (u) = 0.
Thus, C n (β) does not depend on β, so consistency is a consequence of two finite moments for X.The conditions on v(m(β 0 X)) and on m (β 0 X) are the same as those above; they amount to a finite moment generating function of X at 4β 0 and at 2β 0 .
For the square root link, we have v(m) = m 2 /k, m(u) = u 2 , and v(m(u)) = u 4 /k.We find which means we require X to have 8 finite moments.Evidently, m (u) = 2. Thus, we may take which leads to a weaker condition, namely, 4 finite moments for X.Finally, This function diverges at u = 0, so we need to add an assumption: there is an > 0 and a δ > 0 such that, for all β with β − β 0 ≤ δ, we have In this case, d du and M C,δ (x) ≤ 8 x 3 k/ 3 .This amounts to three finite moments for X.The conditions on v(m(β 0 X)) and on m (β 0 X) amount to E((β 0 X) 8 ) < ∞ and E((β 0 X) 2 X 2 ) < ∞, i.e., 8 finite moments which matches the requirement for M A,δ .
Inverse Gaussian Family: Here v(m) = m 3 /λ.For the log link, m(u) = e u , we find that v(m(u)) = e 3u /λ so that M A,δ (x) = 3 x sup{e 3β x : β − β 0 ≤ δ}/λ, which means our condition on M A,δ is that X has a finite moment generating function in a neighbourhood of 6β 0 .We also have m (u) = m (u) = e u , and therefore take M B,δ (x) = x 2 sup{e β x : β − β 0 ≤ δ}, which leads to the strictly weaker condition that X has a finite moment generating function in a neighbourhood of 2β 0 .Finally, (m (u) and so d du Our condition is that X has a finite moment generating function in some neighbourhood of −β 0 .We also need E(e 6β 0 X ) < ∞, and E( X 2 e 2β 0 X ) < ∞.Our overall condition is therefore that X has a finite moment generating function in some neighbourhood of 6β 0 and in some neighbourhood of −β 0 .We remark that the set of β where a moment generating function is finite is convex and will include both these neighbourhoods and a tube containing them.
Negative Binomial Family: We have v(m) = m + m 2 /k, with k > 0. We consider the log link, although the square root and identity links are also sometimes used.For the log link, m(u) = e u , and |w A (u)| = e u + 2e 2u k , so that M A,δ (x) = x sup{e β x + 2e 2β x /k, β − β 0 ≤ δ}.
Our condition then amounts to X having a finite moment generating function in a neighbourhood of 4β 0 .Also, M B,δ (x) = x 2 sup{e β x , β − β 0 ≤ δ}, which leads to the strictly weaker condition of X having a finite moment generating function in a neighbourhood of 2β 0 .For C n , we see that for some constant C > 0. Therefore, our condition associated with M C,δ (x) is that X has three finite moments.We also require E(v 2 (m(β 0 X))) < ∞, E(e 2β 0 X X 2 ) < ∞, and E( X 2 ) < ∞, but these hold, provided that the above conditions on the moment generating function of X are satisfied.
The proofs above used two lemmas, which in turn depend on a third.All of the lemmas synthesize results in Kosorok (2007).
Lemma 1. Suppose X 1 , X 2 , . . .are iid d-dimensional vectors with the same distribution as X, and β is a d-dimensional vector taking values in some open set O. Suppose h is a real valued function of the pair (x, β) which is continuously differentiable in β for each x.Let K be some compact subset of O with diameter denoted by diam(K).Assume: 1. there is some β * ∈ K such that E {|h(X, β * )|} < ∞, 2. there is a function M (x) such that for all x in the support of X and all β ∈ K ∂ ∂β h(x, β) ≤ M (x), and, 3. the random variable M (X) is integrable: Then, 1.The family of functions F h,K = {x → h(β, x); β ∈ K} is pointwise measurable.
2. This family has bounded uniform entropy integral with respect to the square integrable envelope Proof :

The class
To prove the first statement we must find a countable subset G of F h,K such that every f ∈ F h,K is the pointwise limit of a sequence of elements of G.We find G by a route which helps with our proof below of the second and third statements.
A set of closed balls of radius δ covers a set B ⊂ O if B is contained in the union.The covering number of B, denoted by N (δ, B, • ) is the smallest integer N for which there is a set β 1 , . . ., β N of elements of B such that for every β ∈ B there is a j with β − β j ≤ δ.The set of such balls is said to δ-cover B. We now bound, by a standard volume argument, the value of N when B is a ball.
The volume of a ball of radius r in R d is proportional to r d .Thus if there are N disjoint balls of radius in R d all of which lie in some ball of radius R then the volume of those N small balls is N times the volume of a single one and less than the volume of the ball of radius R.So N ≤ (R/ ) d .Consider now a set of such balls of maximal size; this collection is said to pack B and the corresponding value of N is the packing number.The collection of N balls with the same centers but radius 2 contains the ball of radius R for if not we could fit in another ball of radius .So for any ball B of radius R we get For each > 0 we have identified a finite set say B of points in K such that every point in K is within of some member β of B .Take B to be the union over positive integers n of B 1/n and let G be the corresponding elements of F j,K .Evidently B and G are countable and B is dense in K. Every β ∈ K is thus the limit of a sequence of points β n ∈ B and the corresponding f β is the pointwise limit of f βn because h is continuous in β.This proves Statement 1.Now we turn to the second statement.Take B to be a ball of radius R ≤ diam(K) which contains K. Find β 1 , . . ., β N in B so that N = N ( , B, • ) and the N balls centered at the β j having radius cover B. For each such ball which intersects K let β * j be in the intersection.Every point in K is within the union of the balls centered at those β j for which the intersection with K is not empty.So every point in K is within balls centred at β * j but with radius 2 .Thus, N ( , K, • ) ≤ (4diam(K)/ ) 2 .Now suppose that β ∈ K. Fix > 0 and find N = N ( , K, • ) points, say {β 1 , . . ., β N }, in K such that K is contained in union of the N balls of radius centered at the β j .For each f ∈ F h,K we have f = f β for some β ∈ K. Find j so that β − β j ≤ .If Q is a finite discrete measure on the support of X there is a set of points x 1 , . . ., x k and corresponding probabilities q 1 , . . ., q k such that k 1 q i = 1 and Q(X = x i ) = q i for j = 1, . . ., k.Now f β (x i ) − f βj (x i ) ≤ β − β j M (x i ), by Taylor's theorem.Thus, the L 1 (Q) norm of f β − f βj satisfies Increasing the first argument in the covering number cannot increase the covering number itself.Thus, This proves the bound on the uniform covering numbers sup This is Statement 2.
Since pointwise measurability implies P -measurability for every P , we have verified all the conditions of Theorem 8.14, page 145 in Kosorok (2007).Statement 3 follows.
The fourth assertion of the lemma is an application of the Dominated convergence theorem.If the sequence β n converges to some β ∈ K then |f βn (x) − f β (x)| < sup n β n − β M (x) and the right hand side of this inequality is integrable.Uniform continuity is automatic because is compact.(Indeed, the assumptions on M guarantee that this expectation is a differentiable function of β on the interior of K.) • Our second Lemma deals with processes involving indicators.It deduces Glivenko-Cantelli results from Donsker results; it seems likely that this contributes to an increase in the strength of our moment conditions.Lemma 2. Suppose X 1 , X 2 , . . .are iid d-dimensional vectors with the same distribution as X, and β is a d-dimensional vector taking values in some open set O. Suppose h is a real valued function of the pair (x, β) which is continuously differentiable in β for each x.Let K be some compact subset of O with diameter denoted by diam(K).Assume: i) there is some ii) there is a function M (x) such that for all x in the support of X and all β ∈ K ∂ ∂β h(x, β) ≤ M (x), and, iii) the random variable M (X) is square integrable: Then,

d
j=1 λ d χ 2 1j can be approximated in various settings by a χ 2 d−2 distribution.The results from the simulation study then led to the recommended G − 2 degrees of freedom.In other words, C G .

Figure 1 :
Figure 1: Power simulation results for the first three settings.Solid red lines are 95% Wilson CIs.

Figure 2 :
Figure 2: Estimated mean value of the naive HL (circles) and generalized HL (triangles) statistics for Poisson regression models, using sample sizes of 100 (black) and 500 (blue).Horizontal lines represent the respective true means of the chi-squared distributions to which the test statistics are compared.

Table 2 :
Estimated type 1 error rates (null setting simulation results).Numbers in italics represent cases where the rejection rate falls outside of the approximate 95% interval for the nominal level.

Table 3 :
Bilder and Loughin (2014)used in the alcohol consumption study.Based on the subset of the data used in the Poisson regression analysis ofBilder and Loughin (2014), each variable is derived from measurements for the first Saturday of each subject in the study.

Table 7 :
Null simulation settings

Table 9 :
Power simulation results -incorrect link function

Table 10 :
Moment conditions on X for the consistency of Σ n .The conditions in this table are for when the dispersion parameter is known, i.e., for the Normal, gamma, inverse Gaussian, and negative binomial distributions when σ 2 , k, λ, and k are known.