A global goodness-of-fit test for linear structural mean models

Structural mean models (SMMs) have been proposed for estimating causal treatment effects in the presence of non-ignorable non-compliance in clinical trials. To obtain a valid causal estimate, we must impose several assumptions. One of these is the correct specification of the parametric part of the SMMs. Model checking is an important task for data analysts to detect any departure of an assumed model from the true one. However, little work has been done on the goodness-of-fit (GOF) test for the SMMs. In this article, we propose a global GOF test of SMMs. Numerical studies show the proposed test can control type I errors if the SMM is correctly specified. Furthermore, the proposed test detects non-linear effect modification of continuous covariates powerfully, while an existing test does not. We apply the proposed method to data derived from a randomized trial to evaluate the impact of a primary care-based intervention on depression.


Introduction
In a typical clinical trial, patients are randomly assigned to different groups with specific treatments; each patient is expected to receive that treatment throughout follow-up to assess its effect on some outcome. However, most clinical trials are not Communicated by Joe Suzuki. ideal; patients often fail to adhere to their assigned treatment and switch to another trial treatment. Such non-compliance with assigned treatments is a common feature of clinical trials. Robins (1994) developed structural mean models (SMMs) to cope with noncompliance without having to specify the mechanism of non-ignorable noncompliance (Rubin 1976) using randomization as an instrumental variable. One attractive feature of SMMs is their modeling flexibility, which allows for the expression of the causal effect of received treatments as a function of treatments and covariates through a finite number of unknown causal parameters without specifying the conditional expectation of potential outcomes under the control treatment. SMMs have now been proposed for continuous, discrete, and binary outcomes (Robins 1994;Vansteelandt and Goetghebeur 2003), and related structural distribution models have been developed for survival outcomes (Mark and Robins 1993;Loeys and Goetghebeur 2003).
To obtain a valid causal estimate, we must impose several assumptions. One of these is the correct model specification of the structural model. This can be numerically checked by evaluating the goodness of fit (GOF) of postulated SMMs to the data. For continuous outcomes, Comte et al. (2009) developed a test of the interaction between treatment and a baseline covariate, and Fischer et al. (2011) proposed a local GOF test which can detect a linear effect modification with a covariate but cannot detect a non-linear effect modification. Taguri et al. (2014) recently proposed a model selection criterion as an extension of Akaike's information criterion (Akaike 1973) for evaluating the relative fitting of candidate models using the expected Kullback-Leibler distance as a metric. However, none of them proposed a global GOF test which can detect any misspecifications of the assumed model structure.
In general, the validity of the estimating equations depends on whether the parametric part of the SMM is correctly specified. If the SMM is misspecified, the resultant estimating equations deviates in expected value from zero, thus an inconsistent estimator will be yielded. To get a valid inference, it is desirable to assess the unbiasedness of the estimating equations. Diagnostic tools such as residuals have been widely used to assess the appropriateness of a generalized linear model (Su and Wei 1991;Lin et al. 2002). However, such methods cannot apply to non-compliance data with an instrumental variable. The aim of this article is to develop a global GOF test of linear SMMs. The idea is based on testing for the unbiasedness of g-estimating equations (Robins 1994). The residual processes will be constructed in the same spirit of Su and Wei (1991), Lin et al. (2002), Pan and Lin (2005), and Chen and Qin (2014). Under the null hypothesis that g-estimating equations are unbiased, the residual processes will fluctuate about zero. Thus a large absolute value of the residuals leads to the conclusion of model misspecification. Numerical studies show that the proposed test can control type I errors if the SMM is correctly specified. Furthermore, the proposed test detects non-linear effect modification of continuous covariates with high probability, while Fischer et al.'s test does not.
The reminder of this article is as follows. In Sect. 2, we briefly overview the SMMs and the g-estimation procedure. In Sect. 3, we review the method by Fischer et al. (2011) and propose a GOF test. In Sect. 4, we present a simulation study to investigate the performance of our proposed test. In Sect. 5, we apply the proposed method to data derived from a randomized trial to evaluate the impact of a primary care-based intervention on depression. Finally, in Sect. 6, we conclude with a discussion.

Structural mean models
We consider a randomized two-arm trial, where n patients are randomized to one of the two treatments. Let R be the indicator of treatment assignment, equal to 1 (0) for the test (control) treatment. Let A be the actual treatment whether an individual received test treatment (1: test, 0: control), X is the vector of baseline covariates, and Y is the continuous outcome measured at the end of the trial. We assume the , n are n independent and identically distributed random vectors. Thus, we omit the subscript i unless necessary. In contrast to the observed outcome variable Y, we define Y ra with r, a = 0, 1 as the potential outcome (Rubin 1974) that would be observed if possibly contrary to the fact that R were set to r and A were set to a. We make the following three assumptions to estimate causal treatment effects: The potential outcome for each patient does not depend on the treatment assigned or the treatment actually received by any other patient. SUTVA also implies the consistency assumption, which means that a patient's potential outcome under his/her treatment is precisely his/her observed outcome. In notation, SUTVA implies that Treatment assignment only affects the outcome through its effect on treatment received. This assumption implies that Y ra = Y a with r, a = 0, 1. Under this assumption, Y 11 = Y 01 = Y 1 is the potential outcome under test treatment, while Y 10 = Y 00 = Y 0 is that under control treatment. (A3) Randomization assumption The random assignment R and Y 0 are conditionally independent given baseline covariates X, i.e., Y 0 ' R |X.
Furthermore, we assume that the average causal treatment effects follow linear SMMs (Robins 1994;Goetghebeur and Vansteelandt 2005): where Z(X, R) is a v-dimensional (v C 1) vector that depends on (X, R) and h is the unknown v-dimensional causal parameter vector of interest. Note that from (1), h is the effect of the treatment on the treated conditional on the baseline covariates and the randomization indicator (X, R). For example, when Z(X, R) T = (1, X T ), we allow for the possibility that the average causal effect on the treated is not constant with levels of X and changes linearly with X.
Because the full data (Y 1 , Y 0 , A, R, X T ) is only partially observed for each patient i, no regression methods for the complete data can be used to fit the model (1). However, from (1) and the assumption A3, it follows that E½Y À AZðX; RÞ T hjX; R ¼ E½Y 0 jX; R ¼ E½Y 0 jX: Using this, a consistent estimator of h can be obtained from a class of unbiased g-estimating functions (Robins 1994): is a scalar function. For some w(X) and q(X), a consistent estimator of h (called the g-estimator) is analytically obtained by solving g-estimating equations P n i¼1 w i ðhÞ ¼ 0; where w i ðhÞ is the i-th sample value of wðhÞ: The optimal choices for w(X) and q(X) from the viewpoint of efficiency that lead to a semiparametric efficient estimator of h were derived by Robins (1994). Under the homoscedasticity assumption that the error variance of the regression of U(h) on (R, X) is constant, these choices are given by w opt ðXÞ ¼ d opt ðXÞE½ZðX; RÞjX and q opt ðXÞ ¼ is called the compliance score (Joffe and Brensinger 2003). d opt (X) upweights participants characterized by X for whom the effect of treatment assignment on the treatment received is large, thus contributing information to estimate the effect of the treatment on the outcome. Since the optimal choices are unknown functions of X, it is often assumed parametric models for d opt (X) in w opt (X) and q(X). In our simulation and data analysis, we estimate Pr[A = 1|R, X] in d opt (X) by a logistic regression. We assume q opt (X) is linear in X, which leads to an analytical estimator ofĥ (Fischer et al. 2011). A consistent variance estimator ofĥ is obtained as n À1X ðĥÞ À1K ðĥÞðXðĥÞ À1 Þ T ; where XðhÞ ¼ ÀE½owðhÞ=oh T , KðhÞ ¼ var½wðhÞ: 3 Goodness of fit tests for structural mean models 3.1 Goodness of fit test proposed by Fischer et al. (2011) Before discussing our method for assessing the fit of the SMM (1), we briefly review the GOF test proposed by Fischer et al. (2011). Their methods are essentially based on the fact that if model (1) is correctly specified, then the expected ''treatmentfree'' outcomes UðĥÞ in both arms R = 1 and R = 0 will have the same regression functions on X. The GOF test was conducted using the following linear regression model for UðĥÞ on (X, R): To assess whether the model is a good fit, we conducted the test of the following null hypothesis: H 0 : b 3 ¼ 0: If we find the interaction terms significant (that is, H 0 is rejected), then there is an evidence for the lack of fit. Note that this GOF test would not detect a non-linear effect modification by X because (3) includes only linear terms of X. We additionally note that model misspecification will usually occur in the model (3). To see this, let h * be the true value of h. For an arbitrary h, the following equation holds by the SMM (1): The third term of (4) with h ¼ĥ will be apparently nonlinear in X when A is binary unlessĥ ¼ h Ã holds. In such cases, (3) is a misspecified model. This misspecification could affect the power of the GOF test, although the size of the test should be asymptotically equal to the nominal level because the third term of (4) with h ¼ĥ has asymptotically zero expectation under the correct specification of (1).

Proposed goodness of fit test
Rather than assuming a parametric model for UðĥÞ such as (3), we can construct a GOF test in the spirit of Su and Wei (1991), Lin et al. (2002), Pan and Lin (2005), and Chen and Qin (2014). The idea is based on testing for the unbiasedness of the gestimating equations under the correct model specification. To check the validity of the assumed SMM (1), we consider the following statistics: where x is a real-valued vector of length v. Under the null hypothesis that the SMM (1) is correctly specified, (5) has zero expectation for all values of x. Thus, a large value of the following omnibus test statistic G n ¼ sup x 2 R v jV n ðxÞj leads to the conclusion of model misspecification.
To make a GOF test, we need to specify the distribution of V n (x). The cumulative-sum process V n (x) converges in distribution to a zero-mean Gaussian process under the null hypothesis that the SMM (1) is correctly specified. Using the Taylor expansion of (5) aboutĥ around the true value h Ã ; we have where gðx; hÞ ¼ n À1 P n i¼1 IðX i xÞðR i À pÞd opt ðX i ÞoU i ðhÞ=oh and A % B means A-B = o p (1). Using a similar Taylor expansion, we obtain Behaviormetrika (2017)  Thus, from (5) and (6), we have Although it is hard to specify the explicit distribution form of (7), Su and Wei (1991) proposed a simulation-based method to approximate the null distribution of V n (x). The idea is as follows. Suppose that S 1 ,…, S n are independent and identically distributed variables from F(S), with l = E[S] = 0 and r 2 = E[S 2 ] \ ?, then based on the central limit theorem, we have n À1=2 P n i¼1 S i ! N 0; r 2 ð Þ. Let Z 1 ,…, Z n be independent standard normal random variables. Then conditional on the original data S 1 ,…, S n , n À1=2 P n i¼1 Z i S i $ Nð0; n À1 P n i¼1 S 2 i Þ ! N 0; r 2 ð Þ. That is, n À1=2 P n i¼1 Z i S i has the same asymptotic distribution as that of n À1=2 P n i¼1 S i . Using these results, for large n, the distribution of V n (x) is approximated bỹ where (Z 1 ,…, Z n ) is a random sample from N(0,1). To approximate the null distribution of V n (x), we generate large number of samples (Z 1 ,…, Z n ) from N(0,1) while fixing the data at their observed values.

A simulation study
In this section, the performance of the proposed method is evaluated via a simulation study. The following data (R, X, A, Y) were generated as follows. Let X be distributed as N(0,1) and the treatment assignment R be generated from Bernoulli(0.5). Next, the received treatment A was assigned according to the logistic model logit½PrðA ¼ 1jR; X; cÞ ¼ À1 þ 4R þ X þ c; where c follows N(0,0.25). Then, outcome Y was generated from N(3X ? A(k 0 ? k 1 X ? k 2 X 2 ) ? 0.5c, 0.25). This leads to the true SMM: The shared random effect, c, gave rise to non-ignorable non-compliance. We set (k 0 ,k 1 ,k 2 ) = (3,0,0) for no effect modification by X, (k 0 ,k 1 ,k 2 ) = (3,0.1,0), For the analysis of the simulated data, we assumed the main effect model: We investigated four GOF tests: (i) Fischer: Fisher et al.'s GOF test; (ii) V 1n (x): proposed GOF test with V n ðxÞ ¼ n À1=2 P n i¼1 IðX i xÞ ðR i À pÞU i ðĥÞ; (iii) V 2n (x): proposed GOF test with V n ðxÞ ¼ n À1=2 P n i¼1 IðX i xÞ ðR i À pÞd opt ðX i ÞU i ðĥÞ; (iv) V 3n (x): proposed GOF test with V n ðxÞ ¼ n À1=2 P n i¼1 IðX i xÞðR i À pÞd opt ðX i ÞfU i ðĥÞ Àq opt ðX i Þg: For each test, the two-sided significance level was set at 0.05. Table 1 summarizes the empirical rejection probabilities by four methods. For no effect modification case, all of the four GOF tests kept the nominal significance level. For linear effect modification cases, the power of the all tests were increasing as the true effect was increasing. Among the four methods, Fischer et al.'s test performed the best in terms of the empirical power, although the power of the proposed test with V 3n (x) was only slightly lower than that of Fischer et al.'s test. This is not surprising because our GOF test was an omnibus test using a Kolmogorov-type test statistic. For quadratic effect modification cases, the power of the proposed tests were also increasing as the true effect was increasing. On the other hand, the power of the Fischer et al.'s test was not monotonically increasing with the strength of the true effect. Among the three statistics for the proposed method, V 3n (x) performed by far the best. This indicates that using the optimal nuisance functions d opt (X) and q opt (X) as described in Sect. 3.2 is very important for the good performance of our proposed test.

Application
We now apply the proposed method to data derived from the PROSPECT (Prevention of Suicide in Primary Care Elderly: Collaborative Trial) (Bruce and Pearson 1999;Bruce et al. 2004). Data are available at http://research.bmh. manchester.ac.uk/biostatistics/research/data. PROSPECT was a multi-site prospective, randomized trial designed to evaluate the impact of a primary carebased intervention on reducing major risk factors (including depression) for suicide in later life. Participants were recruited from 20 primary care practices in New York City, Philadelphia and Pittsburgh regions. Ten pairs of practices were matched by region (urban vs suburban/rural), affiliation, size, and population type. Within these 10 pairs, practices were randomly allocated to one of the two conditions. The two conditions were either (a) an intervention based on treatment guidelines tailored for the elderly with care management including antidepressant medication (R = 1) or (b) treatment as usual (R = 0). For illustration purposes, here, we analyzed the data as if interventions were randomly assigned at the individual level. We use these data to assess the effect of antidepressant medication (A = 1: presence; A = 0: absence) on the change of the Hamilton Depression Rating Scale (HDRS) (Hamilton 1960) score at four months after randomization from baseline (Y). We use the baseline score of the HDRS as a baseline covariate (X), and it is centered with the mean value of the entire sample for estimation of SMMs. Table 2 summarizes the analysis results. We started with an Intention-to-treat (ITT) analysis and it indicated that the HDRS score at four months was significantly lower in the intervention group than it was in the control group (ITT effect¼Ê½YjR ¼ 1 ÀÊ½YjR ¼ 0 ¼ À3:62, 95 % confidence interval: -5.29 to -1.95). However, those who did not comply with the assigned treatment comprised 15.2 % (22/145) of the intervention group and 45.4 % (69/152) of the control group. Thus, the ITT effect would substantially underestimate the true causal effect of the treatment (that is, antidepressant medication). Then, we applied the following two SMMs for estimation of the causal treatment effect on the treated: (i) a one parameter SMM including the main effect only, that is, Z(X, R) = 1; (ii) a CI confidence interval, HDRS Hamilton depression rating scale, ITT intention-to-treat, PROSPECT prevention of suicide in primary care elderly: collaborative trial, SD standard deviation, SMM structural mean model two parameter SMM assuming the effect modification with X, that is, Z(X, R) = (1, X). The baseline covariate X was centered; thus, the main effect parameter for the model (ii) was interpreted to represent the treatment effect at the mean value for the covariates. As shown in Table 2, the two SMMs gave much larger effect estimates than the ITT analysis did, as expected. From the estimation result of the two parameter SMM, the treatment effect was slightly larger for those with higher baseline levels of the baseline HDRS score, although the effect was not statistically significant. We then applied the proposed GOF test using the test statistics V 3n (x) in Sect. 4 as well as the test proposed by Fischer et al. (2011). The p value of the GOF test was large for the one parameter model (i) for both methods, indicating good fitting of the main effect model. No difference was observed between the two GOF tests in this analysis. As noted in Taguri

Discussion
In this article, we have proposed a new global GOF test for the parametric part of the SMMs. The proposed model-checking method is an objective and informative approach for numerically checking the function form of covariates in SMM. Simulation studies demonstrate that the proposed test works well in terms of the type errors and power for both linear and non-linear effect modifications. Although SMMs and g-estimation always provide a valid test of the no treatment effect in the presence of non-compliance (Robins 1994), the correct model specification is a fundamental assumption for consistently estimating the causal treatment effect. In this regard, assessing the GOF of the candidate SMMs is very important. Our GOF test and the model selection criterion proposed by Taguri et al. (2014) can be used as complementary approaches, with the GOF test evaluating the overall fit and the model selection criterion evaluating the relative fit of candidate models.
SMMs have been used to handle repeated measures over time as structural nested mean models (Robins 1994) and related structural distribution models have been developed for survival outcomes (Mark and Robins 1993;Loeys and Goetghebeur 2003). Recently, Wallace et al. (2016) proposed a model assessment technique which can detect misspecifications of nuisance functions in SMMs for dynamic treatment regimens using the property of double robustness in observational studies. It is interesting to investigate as to how to extend our method to these problems.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.