Bootstrap based goodness-of-fit tests for binary multivariate regression models

We consider a binary multivariate regression model in which the conditional expectation of a binary variable given a higher-dimensional input variable belongs to a parametric family. Based on this, we introduce a model-based bootstrap (MBB) test for higher-dimensional input variables. This test can be used to check whether a sequence of independent and identically distributed observations belongs to such a parametric family. The approach is based on the empirical residual process introduced by Stute (Ann Statist 25:613–641, 1997). In contrast to the approach of Stute & Zhu (Scandinavian J Statist 29:535–545, 2002), no transformation is required. Thus, any problems associated with non-parametric regression estimation are avoided, and the MBB method is much easier for users to implement. To illustrate the power of the MBB based tests, a small simulation study is performed. Compared to the approach of Stute & Zhu, the simulations indicate a slightly improved power of the MBB based method. Finally, both methods are applied to a real data set.


Introduction
Binary multivariate regression models are used, for example, to analyze longitudinal data. Such data appear in clinical studies and are used to evaluate the effect of interventions over time. For different individuals, information is collected at several assessment times. To deal with incomplete data in a longitudinal setup, inverse probability weighted generalized estimating equations (WGEE) (Robins et al., 1994) are used. The resulting WGEE provide consistent estimators only if the underlying (binary) process of missing data is properly modeled. Of course, this should be secured in advance.
This paper addresses, in a more general context than just described, the question of how to test the model assumptions of a binary generalized linear regression model.
Mathematically, we describe the data by a sequence (Y_1, X_1), (Y_2, X_2), ... of independent and identically distributed (iid) random variables, where Y is a binary or 0−1 response variable and X ∈ ℝ^d a d-dimensional input with continuous distribution function (df) H. For the binary regression model, m_0(x) = E(Y | X = x) denotes the conditional expectation of Y given X = x. Under the generalized linear model (GLM), one assumes that there exists a link function g, that is, an invertible function with measurable inverse, such that g(m_0(x)) = β_0^⊤ x for H-almost all x ∈ ℝ^d and an appropriate β_0 ∈ ℝ^d. The function g is assumed to be known. Based on this, we set m := g^{-1}. Assuming that the data (Y, X) come from a GLM with link function g now means that m_0 ∈ M := {m(β^⊤ ⋅) | β ∈ ℝ^d}. If one assumes a GLM to analyze a sample (Y_1, X_1), ..., (Y_n, X_n) of iid data, one has to guarantee that the linear part and the assumed link function are correct or, at least, that the data show no obvious departure from the model. Thus, we need a goodness-of-fit test to validate the model, i.e., a universal test of the null hypothesis

H_0 : m_0 ∈ M.

A general approach for model checking in a regression setup was introduced by Stute (1997). Stute & Zhu (2002) specialized this approach to GLM, where the response variable is not necessarily binary. In the binary GLM setup, the underlying probabilistic background is a functional limit result for the marked empirical process with estimated parameters,

R^1_n(u) = n^{-1/2} ∑_{i=1}^n I(β_n^⊤ X_i ≤ u)(Y_i − m(β_n^⊤ X_i)),

where β_n is a proper estimator of β_0 and I denotes the indicator function, see Stute (1997). With R^1_n(−∞) = 0 and R^1_n(∞) = n^{-1/2} ∑_{i=1}^n (Y_i − m(β_n^⊤ X_i)), this process is a random element of the Skorokhod space D([−∞, ∞]). Under appropriate conditions, R^1_n converges in distribution to a centered Gaussian process R^1_∞ which, however, has a rather complicated, model-dependent covariance structure, cf. Theorem 1 in Stute & Zhu (2002).
To make this result usable for applications in statistics, Stute and Zhu introduced a model-based transformation. Applying this transformation, respectively its estimated version, to R^1_n, the composition converges in distribution to a time-transformed Brownian motion, cf. Theorem 2 in Stute & Zhu (2002). This framework is then used to obtain asymptotically distribution-free statistics.
The approach works excellently but has two weak points. For the transformation, one needs an estimate of the conditional expectation of X given β_0^⊤ X. Under general conditions, one must estimate this quantity using a non-parametric procedure. Such a method always requires a smoothing parameter, whose selection is not unproblematic. Since the model as a whole is parametric, the question inevitably arises whether this non-parametric step is really necessary. Moreover, a user who wants to check a chosen GLM with this method must implement the model-dependent transformation in each case. This is of course feasible, but it entails considerable effort, because the transformation is quite complex, especially for non-statisticians. Parts of this procedure could be automated and implemented as software, but it would then hardly be applicable without the appropriate knowledge about the transformation. It would be nice if all this could be avoided.
To estimate β_0, we use the maximum likelihood estimator (MLE) β_n, the maximizer of the normalized log-likelihood function

L_n(β) = n^{-1} ∑_{i=1}^n ( Y_i log m(β^⊤ X_i) + (1 − Y_i) log(1 − m(β^⊤ X_i)) ).
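For the logit link, the MLE has no closed form. The following is a minimal numpy sketch of how β_n can be computed by Newton-Raphson scoring; the function name and the tiny ridge term added for numerical safety are our own choices, not part of the paper.

```python
import numpy as np

def logistic_mle(X, y, n_iter=50, tol=1e-10):
    """Newton-Raphson MLE for a binary GLM with logit link m(v) = 1/(1+exp(-v))."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))        # m(beta^T X_i)
        grad = X.T @ (y - p)                          # score of the log-likelihood
        hess = X.T @ (X * (p * (1.0 - p))[:, None])   # Fisher information
        step = np.linalg.solve(hess + 1e-10 * np.eye(d), grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:                # stop once the update is negligible
            break
    return beta
```

With iid data generated from a logistic model, the returned vector approximates the true coefficient as n grows.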
For the bootstrap data, we propose the following model-based (MB) resampling scheme, similar to the resampling scheme in Dikta et al. (2006). The MBB guarantees that the bootstrap data are always generated according to the null hypothesis.
Definition 1 Let (Y_1, X_1), ..., (Y_n, X_n) be iid observations, where the Y_i are binary and the X_i have a continuous distribution function H. Let β_n be the corresponding MLE. The model-based resampling scheme is then defined as follows: given the data, generate bootstrap responses Y*_1, ..., Y*_n as conditionally independent Bernoulli variables with success probabilities

ℙ*_n(Y*_i = 1) = m(β_n^⊤ X_i), i = 1, ..., n.

Under this resampling scheme, only the Y's are resampled; the corresponding X's are taken from the original sample.
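Definition 1 translates into a few lines of code. The sketch below assumes the logit link for illustration; the function name is ours.

```python
import numpy as np

def mb_resample(X, beta_n, rng):
    """Model-based bootstrap: Y*_i ~ Bernoulli(m(beta_n^T X_i)); the X_i stay fixed."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta_n)))      # fitted success probabilities
    return (rng.random(X.shape[0]) < p).astype(float)
```

Note that the bootstrap responses are generated from the fitted model, so they satisfy the null hypothesis by construction, whatever the original data look like.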
To define R^{1*}_n, the bootstrap analog of R^1_n, we assume a bootstrap sample (Y*_1, X_1), ..., (Y*_n, X_n) and set

R^{1*}_n(u) = n^{-1/2} ∑_{i=1}^n I(β_n^⊤ X_i ≤ u)(Y*_i − m(β*_n^⊤ X_i)),

where β*_n is the MLE corresponding to the log-likelihood function based on the bootstrap sample. Usually, the β_n in the indicator is also replaced by β*_n. We do not replace it here, since both processes can be shown to be asymptotically equivalent. Furthermore, simulations that use β_n instead of β*_n run faster, since β_n is the same for each bootstrap sample.
We will prove that the cumulative residual process R^{1*}_n(t) corresponding to the MB bootstrap data behaves asymptotically like R^1_n if the original data satisfy the null hypothesis. Thus, the distribution of any statistic that depends continuously on R^1_n can be approximated by the corresponding distribution based on R^{1*}_n. This provides the basic asymptotic justification of our method. In addition, an approximation based on R^{1*}_n has the advantage that it accurately reflects the fixed sample size. Even if the original data come from the alternative, the bootstrap data are always generated under the null hypothesis. Thus, a statistic based on R^{1*}_n fits the null hypothesis. This is crucial because p-values are based on the distribution under the null hypothesis. Overall, this should lead to a more accurate approximation of the p-values under finite sample sizes compared to the purely asymptotic one and, hence, to an improvement of the power. Indeed, we observe some improvements in the simulation study.
As in Stute & Zhu (2002), we consider Kolmogorov-Smirnov (KS) and Cramér-von Mises (CvM) test statistics D_n and W_n based on R^1_n,

D_n = sup_u |R^1_n(u)| and W_n = ∫ (R^1_n(u))^2 H_n(du),

where H_n is the empirical distribution function (edf) of the β_n^⊤ X sample. Since, under H_0, R^1_n → R^1_∞ in distribution as n → ∞, the continuous mapping theorem implies that D_n → D_∞ and W_n → W_∞ in distribution, as n → ∞.
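Both statistics can be computed directly from the jump points of R^1_n after sorting by the fitted projections. A minimal numpy sketch (the function name and the argument `m` for the inverse link are our own):

```python
import numpy as np

def residual_process_stats(X, y, beta_n, m):
    """KS statistic D_n and CvM statistic W_n of the marked empirical process R^1_n."""
    v = X @ beta_n                              # projections beta_n^T X_i
    order = np.argsort(v, kind="stable")
    resid = (y - m(v))[order]                   # residuals sorted by projection
    R = np.cumsum(resid) / np.sqrt(len(y))      # R^1_n at its jump points
    D_n = np.max(np.abs(R))                     # Kolmogorov-Smirnov
    W_n = np.mean(R ** 2)                       # Cramer-von Mises w.r.t. the edf H_n
    return D_n, W_n
```

Integrating with respect to H_n reduces W_n to an average of the squared process over the observed jump points, which is what `np.mean(R ** 2)` computes.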
If, under H_0, the process R^{1*}_n tends in distribution to the same limiting process R^1_∞ as R^1_n, the p-values corresponding to the KS and CvM tests can be approximated by the typical Monte-Carlo approach (used in bootstrap applications) based on the distribution of

D*_n = sup_u |R^{1*}_n(u)| and W*_n = ∫ (R^{1*}_n(u))^2 H*_n(du),

where H*_n denotes the edf based on the bootstrap sample.

This article is organized as follows: In Section 2 we state the main results, which guarantee that the MBB can be used to test our null hypothesis. In Section 3 the approach is applied in a simulation study and a real data application; there, our approach is also compared to the approach of Stute & Zhu (2002). The results of Section 3 are discussed in Section 4. The proofs of our main results are provided in Section 5. Additionally, some results used in Section 5 are collected in the Appendix.
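Putting the pieces together, the Monte-Carlo p-value for the CvM test under the MB scheme can be sketched as follows. This is a self-contained illustration for the logit link only; all function names are ours, and the small ridge term in the Newton step is a numerical safeguard, not part of the paper.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def fit_logit(X, y, n_iter=25):
    """Newton-Raphson MLE for the logit link (ridge term for numerical safety)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess + 1e-8 * np.eye(X.shape[1]), X.T @ (y - p))
    return beta

def cvm_stat(X, y, beta_sort, beta_fit):
    """CvM statistic; beta_sort enters the indicator, beta_fit the residuals."""
    order = np.argsort(X @ beta_sort, kind="stable")
    R = np.cumsum((y - sigmoid(X @ beta_fit))[order]) / np.sqrt(len(y))
    return np.mean(R ** 2)

def mbb_pvalue(X, y, B=200, seed=0):
    """Monte-Carlo bootstrap p-value for W_n under the model-based scheme."""
    rng = np.random.default_rng(seed)
    beta_n = fit_logit(X, y)
    w_obs = cvm_stat(X, y, beta_n, beta_n)
    p_model = sigmoid(X @ beta_n)
    count = 0
    for _ in range(B):
        y_star = (rng.random(len(y)) < p_model).astype(float)  # Y* ~ Bern(m(beta_n^T X))
        beta_star = fit_logit(X, y_star)                        # bootstrap MLE
        if cvm_stat(X, y_star, beta_n, beta_star) >= w_obs:     # beta_n kept in indicator
            count += 1
    return count / B
```

Keeping β_n in the indicator, as advocated in the text, means the sort order is computed once per data set rather than once per bootstrap sample.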

Main results
In this section, our main result is given in Theorem 2.
To prove Theorem 2, we first show that, in the space D[−∞, ∞], the process R*_n converges in distribution to a centered Gaussian process R_∞, see Theorem 1. This process is similar to R^1_n(u), but Y is replaced with Y*. Theorem 1 is a stepping stone for proving Theorem 2, in which we also replace β_n with β*_n. To prove both theorems, we show that the finite-dimensional distributions (fidis) of both processes converge and that the processes are tight, see Theorem 13.5 of Billingsley (1999). Lemma 1 (iii) provides a result which is required to prove the convergence of the fidis of the process R*_n. Lemma 1 (i) and Lemma 1 (ii) are required to prove Lemma 1 (iii).
Since we finally replace β_n with β*_n in Theorem 2, we need to ensure that β*_n converges to β_n, which is done in Lemma 2. The proof of Theorem 2 uses a decomposition of the process R^{1*}_n(u) into R*_n(u) and a difference term. To simplify the representation, Lemma 3 is used. With the final decomposition, we then prove the tightness and the convergence of the fidis of the process R^{1*}_n(u). For Theorem 1 we need the following assumptions:

(A1) β_n → β_0, as n → ∞, w.p. 1.
Journal of the Korean Statistical Society (2022) 51:308-335

Assumptions (C1), (D1) and (B1) are similar to assumptions (B) and (C) in Stute & Zhu (2002), but specified to the binary setup. Furthermore, with (A1) we ensure that β_n → β_0, as n → ∞, w.p. 1. As mentioned before, the following lemma is used to prove the convergence of the fidis of the process R*_n(u), which is defined in Theorem 1. Now, in the process R^1_n(u), we replace Y with Y*, where Y* is generated using the MB scheme. As stated in the following theorem, R*_n(u) converges.

centered Gaussian process with covariance function

K(u_1, u_2) = E[ I(β_0^⊤ X ≤ u_1 ∧ u_2) m(β_0^⊤ X)(1 − m(β_0^⊤ X)) ].
After replacing Y with Y*, we need to replace β_n with β*_n. For this, we define l(β^⊤ x, y), the derivative of the summands of the log-likelihood function, and L(β), the corresponding covariance matrix. Check that E*_n l(β_n^⊤ X, Y*) = 0. For the following lemmas and Theorem 2 we need some additional assumptions:

(D2) l_β(β^⊤ x, y) = ∂l(β^⊤ x, y)/∂β^⊤ exists and is continuous with respect to β for every β in a neighborhood of β_0 (not depending on x).

(E2) There exists a square-integrable function M(x) such that, for every x, the maximum of ‖l_β(β^⊤ x, y)‖ over β in a neighborhood of β_0 is bounded by M(x).

Assumptions (D2) and (E2) are again similar to assumption (B) in Stute & Zhu (2002), but specified to the binary setup. Furthermore, assumptions (A2) and (B2) are similar to assumption (A).
Lemma 2 is necessary to ensure that β*_n converges to β_n.

where Z is a multivariate normal random vector with zero mean and covariance matrix L(β_0).
In addition, we need some results for w(x, β) and W(x, β).
Finally, Theorem 2 states that the process R^{1*}_n(u) converges in distribution.

Simulations
To put the results into perspective, the bootstrap approach is compared to the approach introduced by Stute & Zhu (2002). For the application of their method, we make use of an additional assumption to avoid the non-parametric estimation of E(X | β_0^⊤ X = v). As stated in Stute & Zhu (2002), page 541, we assume that X belongs to a family of elliptically contoured distributions. Note that we do not need this assumption for our bootstrap approach. To calculate the p-values for the approach by Stute and Zhu, we use the Karhunen-Loève expansion of Brownian motion (Bass, 2011, formula (6.2)) to approximate the distribution of the integrated squared Brownian motion over the unit interval. In all simulations, the empirical powers and ecdfs of the p-values based on the CvM statistic are calculated from 1000 replications. The sample sizes are set to n = 50 and n = 100. For the bootstrap approach, each p-value is based on 200 bootstrap samples. The ecdfs of the 1000 p-values per simulation and approach are displayed in a graph together with the uniform distribution function (red: bootstrap approach, blue: approach by Stute and Zhu, gray: uniform distribution function). In addition, the percentages of rejecting the null hypothesis (at levels α = 0.05 and α = 0.01) are given explicitly.
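The Karhunen-Loève expansion gives ∫_0^1 B(t)^2 dt = ∑_{k≥1} Z_k^2 / ((k − 1/2)^2 π^2) with Z_k iid standard normal, so the reference distribution can be approximated by truncating the series and simulating. A sketch (function name and truncation/sample-size defaults are our own):

```python
import numpy as np

def integrated_bm_sq_pvalue(w, n_terms=100, n_mc=50_000, seed=0):
    """Approximate P(int_0^1 B(t)^2 dt >= w) via the truncated
    Karhunen-Loeve expansion sum_k Z_k^2 / ((k - 1/2)^2 pi^2)."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, n_terms + 1)
    lam = 1.0 / ((k - 0.5) ** 2 * np.pi ** 2)            # KL eigenvalues
    samples = (rng.standard_normal((n_mc, n_terms)) ** 2) @ lam
    return float(np.mean(samples >= w))
```

Since the eigenvalues decay like k^{-2}, a truncation at 100 terms already captures almost all of the mass (the truncated mean is ≈ 0.4975 against the exact value 1/2).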
In the first simulation, we generate uncorrelated X_i from a 3-dimensional normal distribution with mean values 0 and variance 1. Based on a chosen β (here β = (1, 1, 2)^⊤), we calculate the probability P(Y = 1 | X = x), assuming a logistic regression model. In our test, we assume that the generated data belong to a GLM with a logistic regression function where β is 3-dimensional, which is true. Table 1 shows the results. The two ecdfs of the p-values based on the CvM statistic are very similar to the distribution function of a uniform distribution. Thus, the test holds the level.
In the second simulation, the data are generated in the same way as in the first simulation, but now the third covariate is squared. We assume that the data belong to a GLM with a logistic regression function where the third component is not squared, which is false. Table 2 shows that both approaches yield similar results. Furthermore, in both cases the power increases with the sample size.

In the third simulation, we generate data using a nonparametric mixing of logistic regressions, see Agresti (2002), 13.2.2. The X_i are again generated from a 3-dimensional normal distribution with mean values 0 and variance 1, and β = (1, 1, 2)^⊤. Furthermore, a Bernoulli variable with p = 0.2 is generated. If this variable is 0, we add 1 to β^⊤ X. Again we calculate the probability P(Y = 1 | X = x) assuming a logistic regression model. In our test, we assume that the generated data belong to a GLM with a logistic regression function where β is 3-dimensional, which is false. Table 3 shows results similar to those of the second simulation.
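The two alternatives just described can be generated as follows. This is a sketch under the stated setup (logit link, β = (1, 1, 2)^⊤); the function names are ours.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gen_squared_covariate(n, rng, beta=(1.0, 1.0, 2.0)):
    """Simulation 2: the true model uses the squared third covariate,
    while the test is given the unsquared design matrix."""
    X = rng.standard_normal((n, 3))
    X_true = X.copy()
    X_true[:, 2] = X[:, 2] ** 2                      # third covariate squared
    y = (rng.random(n) < sigmoid(X_true @ np.asarray(beta))).astype(float)
    return X, y

def gen_mixture(n, rng, beta=(1.0, 1.0, 2.0), p_mix=0.2):
    """Simulation 3: nonparametric mixing of logistic regressions
    (Agresti, 2002, 13.2.2): add 1 to beta^T X when a Bernoulli(0.2) is 0."""
    X = rng.standard_normal((n, 3))
    eta = X @ np.asarray(beta)
    eta = eta + (rng.random(n) >= p_mix)             # +1 with probability 0.8
    y = (rng.random(n) < sigmoid(eta)).astype(float)
    return X, y
```

In both cases the goodness-of-fit test is then applied to (X, y) as if a plain 3-dimensional logistic GLM held.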
In the last simulation, we generate the data in the same way as in the first simulation. This time, we assume a probit regression model where β is 3-dimensional, which is false. Table 4 shows that all ecdfs of the p-values based on the CvM statistic are very similar to the distribution function of a uniform distribution. Thus, both tests do not detect this departure from the null hypothesis.

Real data application
We applied the introduced test to the data set reported on by Härdle and Stoker (1989). This data set consists of 58 measurements on simulated side impact collisions. The fatality (a binary 0−1 random variable, 1 meaning the crash resulted in a fatality) and three covariates (age of the driver, velocity of the automobile, maximal acceleration measured on the subject's abdomen) were recorded. Härdle and Stoker estimated β_0, fitted m in a non-parametric way, and concluded that the link function is of "distribution type", i.e., non-decreasing in β_0^⊤ x, as in the logit or probit case. They did not check whether a GLM would fit the data at all. We tested whether (after a standardization) a GLM with a logit or probit link function is appropriate for the data set. Based on the bootstrap approach, the p-value for a logit link function is 0.047, for a probit link function 0.049. Thus, in both cases, the model is rejected. Stute and Zhu (2002) also applied their approach to this data set and came to the same result.

Discussion
Our small simulation study indicates that the bootstrap approach has slightly better empirical power than the Stute & Zhu (2002) approach. This is noteworthy because the Stute and Zhu approach was conducted here under an additional assumption (elliptically contoured distributions) that is unnecessary for the bootstrap approach. If this additional assumption is not fulfilled, then non-parametric regression estimation has to be applied in the Stute and Zhu procedure, which entails further problems (choice of the smoothing parameter) and can have negative effects on the power of the test. None of these problems arise for the bootstrap method.
The resampling procedure guarantees that the bootstrap data are always generated under the null hypothesis, regardless of whether the original data satisfy the null hypothesis or not. Consequently, the distribution of a test statistic based on the bootstrap data fits the null hypothesis. If the test statistic based on the original data lies at the edge of this bootstrap-based distribution, this indicates a violation of the null hypothesis. It is important to note that the bootstrap approach also takes the sample size into account in the approximating distribution; this is not fully the case when approximating with the asymptotic distribution. We assume that the slight improvement with respect to the empirical power stems from this. Singh (1981) proved, for the classical bootstrap and the standardized mean, that taking the sample size into account in the approximating distribution can be advantageous compared to approximation by the limiting distribution. However, this is not studied further in our paper and should be addressed theoretically in future work.
The bootstrap method is easier to implement because it is not as technically demanding as the method of Stute and Zhu. However, it is more demanding in terms of computing time. The latter is always of great importance if the method is to be used on a large scale.

Proof of Lemma 1 Define
For (ii), check the stated decomposition. Due to assumption (C1), the dominated convergence theorem yields that the first term converges to 0 as δ → 0. Denote the second term by sup_u A(δ, u) and choose K > 0; for given ε > 0, we can find a K > 0 such that A_2(δ, β_0, u, K) ≤ ε. Due to assumption (B1), H(⋅, β_0) is uniformly continuous, and therefore we can find a δ > 0 such that A_{1,1} < ε uniformly in u. Furthermore, we can choose δ such that δ < min(δ_0, ε/K), which yields that A_{1,2}(δ, β_0, ε, K) = 0 and, therefore, A(δ, β_0, u) < 2ε. This proves part (ii).

Proof of Theorem 1
To prove the theorem, we will use Theorem 13.5 of Billingsley (1999). We first show that the fidis of R*_n converge in distribution to the fidis of R_∞. Obviously, R*_n has independent zero-mean summands. For the covariance of R*_n we get, for u_1, u_2 ∈ ℝ, an expression in which the centered bootstrap residuals for indices i and j are independent. Thus, if i ≠ j, the expectation in the last equation is 0, and only the diagonal terms remain. Here, the expectation equals the conditional variance of a Bernoulli variable with success probability m(β_n^⊤ X_i). Due to Lemma 1 (iii), this converges to E[I(β_0^⊤ X ≤ u_1 ∧ u_2) m(β_0^⊤ X)(1 − m(β_0^⊤ X))] uniformly in u as n → ∞, w.p. 1. Thus, w.p. 1, the covariance function of the process R*_n(u) converges to K(u_1, u_2). Now let k ∈ ℕ and choose −∞ ≤ u_1 < ... < u_k ≤ ∞. Following Cramér-Wold, see Theorem 7.7 of Billingsley (1999), we have to show that, w.p. 1, for every a ∈ ℝ^k, Z*_n converges in distribution to N(0, a^⊤ Σ a), with Σ = (σ_{s,t})_{1≤s,t≤k}. Here, ε*_{1,n}, ..., ε*_{n,n} are independent and centered, and A_{1,n}, ..., A_{n,n} are deterministic in the bootstrap setup. To show the asymptotic normality of Z*_n, we apply Theorem 1.9.3 of Serfling (1980) and prove that the Lindeberg condition holds w.p. 1, as n → ∞, for each ε > 0.
First, check that, since Σ is positive semi-definite, a^⊤ Σ a ≥ 0. If a^⊤ Σ a = 0, Chebyshev's inequality guarantees that Z*_n = o_{ℙ*_n}(1), and the claim follows for n → ∞. Otherwise, the indicator in the Lindeberg condition equals 0 as n → ∞, the Lindeberg condition is fulfilled, and the finite-dimensional distributions converge to N(0, a^⊤ Σ a). This is part (i) of Theorem 13.5 of Billingsley (1999). For the tightness we use a modification of this theorem, see Corollary 1, where F also depends on n. For this we assume that our process is only defined on the interval [0, 1]. If this is not the case, we can use a transformation to obtain such a process.

Use this and check that
where the last equality follows since the two factors are independent and since either I(u_1 < β_n^⊤ X_i ≤ u) or I(u < β_n^⊤ X_i ≤ u_2) equals 0. Now, recall the definitions of these factors to obtain the bound. Since H_n(u) → K(u, u), w.p. 1, due to assumption (B1), there exists a continuous, non-decreasing function H with sup_{u∈ℝ} |H_n(u) − H(u)| → 0. Therefore, following Corollary 1, the process R*_n is tight. ◻

Proof of Lemma 2
Following Cramér-Wold, see Theorem 7.7 of Billingsley (1999), due to (B2) we have to show that, w.p. 1, for every a ∈ ℝ^d, a ≠ 0, the corresponding linear combination converges in distribution for n → ∞. According to Serfling (1980), Theorem 1.9.3, this follows from the Lindeberg condition. Use (C2) to get the required convergence as n → ∞, w.p. 1. Furthermore, L(β_0) is positive definite and a ≠ 0; thus, a^⊤ L(β_0) a > 0 and it suffices to show that the Lindeberg integral vanishes. Since Var*_n → a^⊤ L(β_0) a, the indicator is eventually 0 for 1 ≤ i ≤ n and n sufficiently large. Due to assumption (E2), Borel-Cantelli yields the required bound w.p. 1. Therefore, the indicator equals 0 as n → ∞, and the Lindeberg condition is fulfilled. ◻

Proof of Lemma 3
Since the half-spaces in ℝ^d are a GC-class and w_j(X, β) is integrable, see assumption (F2), Corollary 9.27 of Kosorok (2008) yields the uniform convergence. Due to assumption (A1), for every ε > 0 we get the stated bound w.p. 1. Furthermore, the last term on the right side tends to 0 as ε → 0. This is part (i).
For part (ii), note that, due to (A1) and Lemma 2, for ε > 0, ℙ*_n(|β*_n − β_0| > ε) → 0 as n → ∞, w.p. 1. Furthermore, as n → ∞, due to assumptions (D2) and (E2), applying the dominated convergence theorem yields that the expectation on the right side tends to 0 as ε → 0. ◻

Thus, we can assume that β*_n and β_n lie in a neighborhood of β_0. Following assumption (D2), we can apply Taylor's expansion, where β̂*_n(x) lies on the line segment connecting β*_n and β_n. Thus we can rewrite S*_n(u) accordingly. Lemma 3, together with (B2) and (E2), now yields the required convergence uniformly in u. Now define the process which is asymptotically equivalent to R^{1*}_n, see Theorem 4.1 of Billingsley (1999). Furthermore, following the proof of Theorem 1, R*_n(u) is tight in D[−∞, ∞] and, due to Lemma 2, n^{-1/2} ∑_{i=1}^n l(β_n^⊤ X_i, Y*_i) converges to a zero-mean multivariate normal distribution with covariance matrix L(β_0), w.p. 1.
Furthermore, assumption (F2) yields that W(⋅) is continuous. Now let k ∈ ℕ and choose −∞ ≤ u_1 < ... < u_k ≤ ∞. Following Cramér-Wold, see Theorem 7.7 of Billingsley (1999), we have to show that, w.p. 1, for every a ∈ ℝ^k, a ≠ 0, Z^{1*}_n converges in distribution to N(0, a^⊤ Σ a), with Σ = (σ_{s,t})_{1≤s,t≤k}. We can rearrange the terms accordingly. Obviously, those variables are centered, and the pairs of summands indexed by 1, ..., n are independent. Additionally, A_{i,n} and B are deterministic with respect to ℙ*_n. Thus, we compute the variance of Z^{1*}_n. In the proof of Theorem 1 we have shown the convergence of the first part. Furthermore, due to assumption (C2), the second part converges as n → ∞, w.p. 1. Now check that ∑ A^2_{i,n} Var*_n(ε*_{i,n}) → ∑_{1≤s,t≤k} a_s K(u_s, u_t) a_t, as n → ∞, w.p. 1.
Thus, for the last term, due to Lemma 2 (i), we get the stated limit as n → ∞, and finally, as n → ∞, w.p. 1, Var*_n(Z^{1*}_n) → a^⊤ Σ a. Assume that a^⊤ Σ a > 0. Then we have to prove the Lindeberg condition as n → ∞, w.p. 1.

The integral equals
Since Var*_n(Z^{1*}_n) → ∑_{1≤s,t≤k} a_s K(u_s, u_t) a_t = a^⊤ Σ a, the indicator eventually vanishes, and the Lindeberg condition is fulfilled.
Apply Lemma 9.13 of Kosorok (2008) to get that G is BUEI. Additionally, 0 ≤ m(1 − m) ≤ 1 and therefore G has the envelope G = 1. Also, m is continuous and G is PM; thus G is also PM, see Lemma 8.10 of Kosorok (2008). ◻

Lemma 7 Assume m is continuous and m′ is bounded. Combine G and H to get F := {m(g)(1 − m(g)) I_H : g ∈ G, I_H ∈ H}. F is BUEI with envelope F = 1, PM, Donsker and GC.
This follows from Glivenko-Cantelli (first two summands) and the continuity of H (last summand). Thus, for given ε we can choose δ such that the right side of the inequality is less than ε. ◻