Augmented Weighted Estimators Dealing with Practical Positivity Violation to Causal inferences in a Random Coefficient Model

The inverse probability of treatment weighted (IPTW) estimator can be used to make causal inferences under two assumptions: (1) no unobserved confounders (ignorability) and (2) positive probability of treatment and of control at every level of the confounders (positivity), but is vulnerable to bias if by chance, the proportion of the sample assigned to treatment, or proportion of control, is zero at certain levels of the confounders. We propose to deal with this sampling zero problem, also known as practical violation of the positivity assumption, in a setting where the observed confounder is cluster identity, i.e., treatment assignment is ignorable within clusters. Specifically, based on a random coefficient model assumed for the potential outcome, we augment the IPTW estimating function with the estimated potential outcomes of treatment (or of control) for clusters that have no observation of treatment (or control). If the cluster-specific potential outcomes are estimated correctly, the augmented estimating function can be shown to converge in expectation to zero and therefore yield consistent causal estimates. The proposed method can be implemented in the existing software, and it performs well in simulated data as well as with real-world data from a teacher preparation evaluation study. Electronic supplementary material The online version of this article (10.1007/s11336-018-09657-y) contains supplementary material, which is available to authorized users.


Introduction
Assessing causal relationships using nonexperimental data is challenging, yet central in many educational studies. Within the potential outcome framework (Rubin 1978), inverse probability of treatment weighting (IPTW; Robins et al. 2000) is a popular approach known under two key assumptions: (1) ignorability-treatment assignment mechanism is ignorable given the observed confounders and (2) positivity-treatment and control both have positive probability at each level 449 violations is associated with the heterogeneity among schools, e.g., the trimmed sample has systematically higher or lower average treatment effect.
The literature in handling positivity violation without altering the causal estimand is limited. Notable exceptions include the extrapolation approach that assumes an outcome model holds both inside and outside the positivity region, i.e., both at the levels of the confounders where positivity holds and at levels where it fails (Lechner 2008;Peterson et al. 2010). Hill (2008) and Westreich and Cole (2010) discussed the advantage and risk of extrapolation to deal with practical positivity violations in the absence of theoretical violation. Although not the main focus of Lechner & Strittmatter (2017)'s simulation comparison study, incorporating extrapolation in IPTW estimators was considered as an alternative to the trimming approach, and its potentials have shown in some scenarios. Similar to the idea of extrapolation, Neugebauer & van der Laan (2005) redefined the estimating function by including, for every observation of treatment (or of control) that falls outside the positivity region, an estimated potential outcome of control (or of treatment) to work around the positivity violation in a single-level setting. Given a correctly specified outcome model that holds both inside and outside the positivity regions, the resultant estimator is consistent even when the positivity assumption is violated.
Inspired by Neugebauer and van der Laan (2005)'s idea, we assume a random coefficient model that holds for both intern-teaching and student-teaching potential outcomes across all schools, and propose to augment the IPTW estimating function (Raudenbush 2014;Raudenbush & Schwartz 2016) by an estimated intern-teaching potential outcome for every school k that does not have any intern-teaching observation, i.e., if T ik ≡ 1 for all i's in school k, and an estimated student-teaching potential outcome if T ik ≡ 0 for all i's in school k. We show the augmented weighted estimating function converges in expectation to zero as long as the school-specific potential outcome can be correctly estimated. Thus, the corresponding estimator, that we call "AIPTW", is consistent even when some schools only have student-teaching observations or only intern-teaching observations in the sample.
The rest of the article is organized as follows. In Sect. 2, we introduce the potential outcomes and the causal estimand of our interest. Section 3 specifies the theoretical model, random coefficient model, for the potential outcomes, and Sect. 4 describes the model of the observed data as well as the assumptions made to identify causal estimand using the observed data. Section 5 shows that solving the conventional IPTW estimating equations yields consistent causal estimates only if all schools in the sample display variations in the observed values of T ik . In Sect. 6, we redefine and augment the IPTW estimating function and specify the condition under which the augmented weighted estimating function can be used to yield consistent causal estimates. In Sect. 7, we discuss in the random coefficient model, how the school-specific potential outcomes can be estimated to satisfy the condition specified in Sect. 6. Section 8 presents a simulation study examining the performance of the proposed method, and Sect. 9 illustrates the method with a real data analysis to evaluate the effectiveness of teachers prepared by intern-teaching and student-teaching. We conclude the paper with some discussions and remarks in Sect. 10.

Potential Outcomes and Causal Estimands
To elaborate the relaxed SUTVA (Rubin 1986, Hong & Raudenbush 2006, we step back and reintroduce some notations. Suppose there is binary treatment T i = 1 if student i is instructed by a newly graduated teacher prepared by intern-teaching fieldwork experience, and T i = 0 if this student is instructed by a teacher with student-teaching experience. There is also a school assignment indicator S i = k if student i is observed to have been assigned to school k. Student's learning outcome depends on their school assignments, but student-school assignment is typically far from random. To move forward without modeling the student-school assign-450 PSYCHOMETRIKA ment mechanism, we assume students are first assigned to schools and then, treatments are assigned to students within schools (the intact schools assumption; Hong & Raudenbush 2006, and fix our interest in the event (T i = t | S i = k) that occurs when student i who has been assigned to school k is assigned to treatment t ∈ {0, 1}. This event will be denoted by T ik = t in the rest of the article for notational simplicity. Although the generalization of our causal inference is now restricted to the observed student-school allocation, the resultant estimates have practical value since students would typically attend schools in their neighborhood areas, not any school in the study population.
Then, we adopt a weaker form of the SUTVA to reduce the number of potential outcomes for each student. At the elementary level, the same teacher and students typically stay in the same classroom for all classes throughout the year. Hence, it seems reasonable to assume all students in the same classroom receive the same treatment and there is no interference between classrooms. Given S i = k, student i's has two potential outcomes, defined as Y ik (t), t ∈ {0, 1}.
The difference between student i's two potential outcomes given the average treatment effect of all students who has been assigned to school k. Then, our causal estimand can be expressed as = E(ω k k ), the weighted average of k 's across all k's. If we aim to generalize to a population of schools, each school should be weighted equally and ω k ≡ 1 for all k's. Suppose we are interested in generalizing to a population of students, k will be weighted in proportion to the number of students in school k, e.g. ω k = n k K N where n k , K , and N are, respectively, the number of observed students in school k, the number of observed schools, and the total number of observed students across all k's, assuming all schools and students in each school have equal probability to be observed in the sample.

Theoretical Model for the Potential Outcomes
Hierarchical linear models (HLM), also known as multilevel models or linear mixed effect models, is commonly used to accommodate the clustered structure of educational outcomes (Raudenbush & Bryk 2002;Goldstein 2011). To take into account the important role schools play in student learning without overcomplicating the exposition of the proposed methodology, we consider a simple two-level HLM-random coefficient model-for the potential outcomes of students i who has been assigned to school k: where ik (t) is the random error assumed independently and identically distributed as N (0, σ 2 ) for t ∈ {0, 1}, and β k1 is the school k's average intern-teaching outcome and β k0 the school k's average student-teaching outcome that vary among schools as a function of the school random effects b k1 and b k0 : ρσ 1 σ 0 ρσ 1 σ 0 σ 2 0 , and β 1 and β 0 are, respectively, the population average intern-teaching outcome and the population average student-teaching outcome. The difference of school k's averages (β k1 − β k0 ) corresponds to the k defined in Sect. 2, and the difference of population averages (β 1 − β 0 ) corresponds to our causal estimand with ω k incorporated in the estimation stage, as shown in the latter sections. Although not the focus of this article, this model also supplies the following estimands: σ 2 1 the variance of the average intern-teaching outcome across schools, σ 2 0 the variance of the average student-teaching outcome across schools, and −1 < ρ < 1 the correlation between average intern-teaching outcome and student-teaching outcome across schools.

Model for the Observed Data
The fundamental problem in estimating (β 1 − β 0 ), or equivalently , is the fact that we only observe one of the two potential outcomes for each student. The observed outcome for student i in school k can be written as a function of the observed T ik , where e ik = T ik ik (1) + (1 − T ik ) ik (0). This model also has the form of a random coefficient model, but the conventional maximum likelihood estimation (Raudenbush & Bryk 2002;West et al. 2014;Bates et al. 2015) does not yield consistent estimates of β 1 and β 0 unless T ik is independent of ik (1), ik (0), b k1 and b k0 for all i's and k's, i.e., the treatment assignments are completely randomized (Ebbes et al. 2004;Wooldridge 2010). In our observational study, we impose the following two assumptions to proceed: (Ignorability) Random treatment assignment within each school, or equivalently, since b k is controlled, although not directly observed, once the school identity is given. In other words, T ik might be correlated with b k , but is independent of ik (1) and ik (0). (Positivity) Define the probability of treatment as Pr(T ik = 1 | b k ) = π k for i = 1, . . . , n k in school k, then, 0 < π k < 1 for all k s.
Since treatment assignment is random within each school, π k can be consistently estimated by the proportion of the sample assigned to T ik = 1 in school k (Arpino & Mealli 2011;Li et al. 2013;Raudenbush 2014;Raudenbush & Schwartz 2016): where n k1 is the number of intern-teaching observations in school k. When n k1 = 0,π k = 0, andπ k = 1 if n k1 = n k , causing the so-called practical violation of the positivity violation and problematic IPTW estimates, as shown in the next section.

IPTW Estimating Function Under Practical Positivity
The IPTW method, proposed by Robins et al. (2000) in single-level settings, has been integrated into a broad class of HLM to study causal effects in multilevel settings (Hong & Raudenbush 2008). Similar to the single-level setting, each observation is weighted in proportion to the inverse probability of its assigned treatment to create a pseudo-sample that approximates a sample collected under randomization. Specifically, Hong & Raudenbush (2008) showed that given the value of the variance components, like the unweighted complete-data score function from randomized treatment assignments, the weighted complete-data score function also has expectation zero. Therefore, equating the weighted complete-data score function to zero and jointly solving for fixed effects and random effects yields consistent causal estimates. In our example, the complete data for student Given and σ 2 , the weighted complete-data score functions for θ = (β 1 , β 0 , . . . , b k1 , b k0 , . . .) can be written as (Hong & Raudenbush 2008;Bates 2014), with a constant c chosen to normalize the weights such that (4) and (5), given and σ 2 , equating (7) to zero and jointly solving for θ yields consistent estimates of β 1 and β 0 if practical positivity holds, i.e., 0 < n k1 < n k for all k's.

Theorem 1. Under the assumptions of ignorability and positivity in
Proof. When 0 < n k1 < n k for all k's, we have (2 + 2K ) score functions in (7) associated with the observed data. Equating them to zero results in (2 + 2K ) estimating equations. Then, the consistency of the resultant estimates follows by showing that the weighted complete-data score function in (7) has expectation zero (see Appendix A).
However, when n k1 = 0 or n k1 = n k for some k's, the number of score functions in (7) associated with the observed data reduces to (2 + K +K ), whereK is the number of schools that have variations in the observed values of T ik . This is because in (7.1), n k i=1 w ik σ 2 T ik e ik = 0 when n k1 = 0, and in (7.2), n k i=1 w ik σ 2 (1 − T ik )e ik = 0 when n k1 = n k . Equating them to zero results in a system of (2 + K +K ) estimating equations as follows, whereθ is a length (2 + K +K ) vector that includes all elements in θ , except for b k0 if n k1 = n k and b k1 if n k1 = 0. The left hand side of these estimating equations does not have expectation zero, because E(b k1 | n k1 ) = 0 and E(b k0 | n k1 ) = 0, causing bias in the resultant estimates. If theoretical positivity holds, practical positivity is less likely to be violated as sample size increases in n k , i.e., n k1 is unlikely to be 0 or n k , as n k approaches infinity. But in finite samples, n k1 can equal 0 or n k by chance. In the next section, we propose to augment the weighted score function to correct the bias that occurs in such situations.

Augmented IPTW Estimating Function when Positivity is Practically Violated
When n k1 = n k or n k1 = n k for some k's, we consider the following augmented weighted complete-data score function for θ : is the difference between an estimate of the school-specific potential outcome derived from the observed data and their true expected value based on the model assumption in (1) and (2). Similarly, (8) differs from (7) only in (8.1) and (8.2), and (8) becomes (7) when 0 < n k1 < n k for all k's. (4) and (5), given and σ 2 , equating (8) to zero and jointly solving for

Theorem 2. Under the assumptions of ignorability and positivity in
Proof. As seen in (8.3) and (8.4), all of the (2 + 2K ) score functions in (8) are associated with the observed data, whether or not 0 < n k1 < n k for all k's. Equating them to zero results in (2 + 2K ) estimating equations. The resultant estimates are consistent if the augmented weighted complete-data score function in (8) can be shown to converge in expectation to zero: The values of the variance components and σ 2 are usually unknown and need to be estimated. Following Hong & Raudenbush (2008), we adopt a maximum pseudo-likelihood approach and make use of existing software program for implementation, with further details provided in Appendix B. In brief, an augmented data A = (A 1 , . . . , A K ) is created that includes, for every schools k, having n k rows if 0 < n k1 < n k , and n k + 1 rows if n k1 = 0 or n k1 = n k . Then, the estimates of β 1 , β 0 , and σ 2 that maximize the likelihood function corresponding to the augmented weighted complete-data score function in (8) can be obtained by first calculatingπ a k based on (6) as if A k is observed in school k, and then feeding A into the standard HLM estimation procedure with assigned as the weights. We call this the AIPTW estimator in the rest of the article.

Estimating the School-Specific Potential Outcomes
when n k1 = n k is challenging because information regarding the unobserved b k1 and b k0 is limited for these schools. In a random intercept model, including the school-specific average T ik as an additional covariate in the model (Kim & Frees 2006;Bafumi & Gelman 2006;Raudenbush 2009) has been used to obtained consistent fixed-effect estimates when T ik is not independent of the random intercepts. In that spirit, we re-parameterize model (3) as follows: It can be shown thatb k1 andb k0 are close to independent of T ik , in large K (see Appendix C). Therefore, standard maximum likelihood estimation can be used to obtain consistent estimates ofβ 1 ,β 0 and γ 1 and γ 0 . In the standard maximum likelihood estimation, random effect estimates shrink toward their marginal expectation, zero, when school has little or no relevant observations. Specifically, when n k1 = 0,T k = 0 andb k1 = 0, resulting in school k's estimated potential outcomeÊ[Y ik (1) | S i = k] =β 1 +b k1 +γ 1Tk =β 1 , and Q(1, k) =β 1 − (β 1 +b k1 ). Similarly, when n k1 = n k , and has expectation zero, as sample size increases. Similarly, E[Q(0, k)] approaches E(b k0 ) and has expectation zero.
Furthermore, sinceb k1 andb k0 are close to independent of T ik in large K, We call the model in (10) a school-average-T-corrected model, denoted by "SATC" in the rest of the article. To improve efficiency, we also consider a simplified version, called reduced SATC (RSATC), with one parameter less than SATC: . Therefore, RSATC is expected to be correct and more efficient when cov(b k1 , T ik ) and cov(b k0 , T ik ) are close enough. We will compare the performance of AIPTW based on SATC and RSATC using simulated data in the next section.

Simulation
We conducted a simulation study to explore the moderate sample size performance of the AIPTW when SATC or RSATC are used in estimating Q(1, k) and Q(0, k), denoted by AIPTW-SATC and AIPTW-RSATC, respectively, and to compare their performance with the IPTW using the original sample (denoted by IPTW-orig), and the IPTW using the trimmed sample (denoted by IPTW-trim). Two simulation settings were chosen which mimicked the real data example described in Sect. 9, and 1000 replicated data sets were generated for each setting using the random coefficient model specified in (1) and (2). In the first setting, we generated K = 150 clusters and within each cluster n k observations where n k follows a discrete uniform distribution ranging from 1 to 19 such that 26% of the schools have no more than 5 observations. The binary treatment indicator both ζ k and ξ ik generated from a standard normal distribution representing other unknown schoollevel and student-level factors in the treatment assignment mechanism; constants c 1 , c 2 , c 3 and c 4 were chosen to have the correlation coefficient between T ik and b k0 : r 0 = 0.4, the correlation coefficient between T ik and b k1 : r 1 = 0.4, the overall probability of treatment: p = 0.3, and 26% or 80% of the schools have practical positivity violations, i.e., n k1 = 0 or n k1 = n k in these schools. Then, the outcome Y ik was generated based on Model (1) and (2) with β = (β 0 , β 1 ) = (35, 40), σ 0 = σ 1 = 8, ρ = 0.8 and σ ε = 45. In the second setting, K = 200, n k follows a discrete uniform distribution ranging from 1 to 49 such that 10% of the schools have no more than 5 observations, β = (12, 15), σ 0 = σ 1 = 8, and σ = 35. And for T ik , c 1 , c 2 , c 3 and c 4 were chosen to have various combinations of (r 0 , r 1 , ρ) as detailed below, p = 0.3, and 80% of the schools have practical positivity violations. Simulation results in evaluating IPTW and AIPTW in dealing with school-level confounders and practical positivity violations; β = (35, 40), σ 0 = σ 1 = 8, ρ = 0.8 and σ = 45; T ik = 1 if g(b k ) > 0 and T ik = 0 if g(b k ) < 0 where g(b k ) = c 1 + c 2 b k0 + c 3 b k1 + c 4 ζ k + ξ ik and c1-c4 were chosen to have r 0 = 0.4, r 1 = 0.4, p = 0.8, and 26% or 80% of the schools have practical positivity violations.

PB%
T We focus on obtaining an estimate for (β 1 − β 0 ) to be generalized to a population of students. In other words, we have ω k = n k K N , or equivalently, v k ≡ 1. For each data set, we obtain β = (β 0 ,β 1 ) directly by feeding the original sample, the trimmed sample, the SATC augmented data and the RSATC augmented data into the R function lmer in the lme4 package (Bates et al. 2015) with corresponding w ik , or w a ik for the augmented data, assigned in its weights argument. Forπ k = n k1 n k , we choose c = N 1 N to normalize the weights where N 1 is the total number of the intern-teaching observations because they help to neutralize the impact of observations with extremely small or extremely large n k1 n k . For the standard error ofβ in IPTW-orig and IPTW-trim, we calculated the square root of the following robust estimator (Hong & Raudenbush 2008) using (σ 2 0 ,σ 2 1 ,ρ,σ 2 ) returned from the lmer function, To estimate the standard error ofβ in AIPTW-SATC and AIPTW-RSATC, we employed the bootstrap procedure by resampling the clusters with replacement 30 times (Field & Welsh 2007) and then calculated the sample standard deviation of the 30 AIPTWβ's from these bootstrap samples. Readers can find in the supplementary materials, the program code in R with a generic function AIPTW-HLM that can be used to obtain the IPTW-orig, IPTW-trim, AIPTW-SATC and AIPTW-RSATC estimates, and the sample code to generate the simulated data and obtain the simulation results for one of the settings.  Simulation results in evaluating IPTW and AIPTW in dealing with school-level confounders and practical positivity violation; β = (12, 15), σ 0 = σ 1 = 8, ρ = 0.3, and σ = 35; T ik = 1 if g(b k ) > 0 and T ik = 0 if g(b k ) < 0 where g(b k ) = c 1 + c 2 b k0 + c 3 b k1 + c 4 ζ k + ξ ik and c1-c4 were chosen to have p = 0.3, 80% of the schools have practical positivity violations, and (r 0 , r 1 ) = (0.4,0.4), (0.2,0.6), (0.4, −0.4).
The simulation results for the first setting are presented in Table 1, including the following quantities summarized from the 1000 sets of estimates: percentage bias calculated as the average difference betweenβ and β divided by β (PB%), the average estimated standard error ofβ (T.SE), the sample standard deviation of the 1000β (S.SE) and the percentage of 95% confidence intervals covering the true β (95% CP). In Table 1, estimates of all approaches had nominal bias and satisfactory 95% CP when practical positivity violations occurred in only 26% of the schools. But when 80% of the schools had practical positivity violations, the IPTW-orig and IPTW-trim had larger bias and lower 95% CP, while the bias of AIPTW-SATC and AIPTW-RSATC remained nominal. The T.SE and S.SE are consistent with each other, indicating that theβ standard errors can be estimated by the bootstrap procedure reasonably well.
The simulation results for the second setting are presented in Tables 2 and 3, including the PB% and S.SE forβ. The average of the 1000σ 0 ,σ 1 , andρ returned directly from the lmer function (Avg. Est.) and their S.SE's are also reported, just to explore the potential of estimating these parameters using the AIPTW approaches, but they are not the main focus of this article. In Table 2, we examined the performance of AIPTW-SATC and AIPTW-RSATC when b k0 and b k1 are correlated with T ik with the same or different correlation coefficients: (r 0 , r 1 ) = (0.4,0.4), (0.2,0.6) and (0.4, −0.4). When r 0 = r 1 = 0.4, AIPTW-RSATC yielded smaller bias and standard errors forβ than AIPTW-SATC. When r 0 = 0.2 and r 1 = 0.6, AIPTW-RSATC yielded larger bias forβ than AIPTW-SATC. When r 0 = 0.4 and r 1 = −0.4, the bias inβ 1 yielded by the AIPTW-RSATC is even larger than their bias using the IPTW-trim and IPTW-orig while AIPTW-SATC managed to reduce much of the bias in bothβ 1 andβ 0 . Simulation results in evaluating IPTW and AIPTW in dealing with school-level confounders and practical positivity violation; β = (12, 15), σ 0 = σ 1 = 8, and σ = 35
In Table 3, we investigated the performance of AIPTW-SATC and AIPTW-RSATC when b k0 and b k1 are moderately or strongly correlated with each other, and when they are moderately or strongly correlated with T ik : (r 0 , r 1 , ρ) = (0.4, −0.4, −0.3), (0.4, −0.4, −0.8) and (0.6, −0.6, −0.8). The bias ofβ 1 in the AIPTW-SATC and its S.SE in estimating ρ are slightly reduced when b k0 and b k1 are strongly correlated with each other, i.e., (r 0 , r 1 , ρ) = (0.4, −0.4, −0.8) compared to (0.4, −0.4, −0.3). A reasonable explanation is that outcomes made of b k1 (or b k0 ) help to estimate b k0 (or b k1 ) more when |ρ| is large. When b k0 and b k1 are strongly correlated with T ik , larger bias inβ was yielded by all estimators, but AIPTW-SATC was able to correct proportionally more of the bias and returned reasonable results. In estimating the β of all simulation settings we examined, IPTW-trim yielded smaller bias but larger standard errors than the IPTW-orig, i.e., completely ignoring the practical positivity violation and using the original sample as is. The AIPTW-SATC outperformed both the IPTW-trim and IPTW-orig in all cases and also outperformed the AIPTW-RSATC when r 0 and r 1 were different. The AIPTW-RSATC, however, outperformed the AIPTW-SATC when r 0 and r 1 were close. The best AIPTW, i.e., AIPTW-SATC when r 0 and r 1 were different and AIPTW-RSATC when r 0 and r 1 were close, also yielded better estimates of σ 0 , σ 1 , and ρ in general, but σ 1 tended to be underestimated, and ρ had large S.SE; further work is needed to make inferences about these parameters.

Real Data Analysis
The research reported here was partially motivated by a teacher preparation evaluation study conducted by the Center of Teacher Quality (CTQ) of the California State University (CSU). The evaluation is a large-scale observational study aiming to evaluate the effects of teacher preparation on K-12 student learning and to identify potential ways of improvement. Outcomes of teacher preparation such as the student test scores were collected from partner school districts together with student's demographic information and linked to the CSU credential programs where the teachers were prepared. Understanding how features of teacher preparation programs such as fieldwork pathways influence teacher effectiveness might suggest ways to improve. In one particular analysis, we compare the effectiveness of newly graduated grade 3 teachers who were prepared by two different fieldwork pathways in the CSU credential programs: student-teaching (T ik = 0) and internteaching (T ik = 1). During student-teaching, credential candidates were closely supervised by an experienced teacher. During student-teaching, credential candidates were the solely responsible teacher in the classroom. Teachers in their first two years of classroom teaching after earning a teaching credential are considered "newly graduated," and their effectiveness was measured by the difference of the student-level California Achievement Test (CAT-6) scores before and after the instruction, i.e., score gain from grade 2 to grade 3. More than 6860 student score gains from 218 K-12 schools in California were used in this analysis, derived from the grade 2 to 3 CAT-6 scores for two cohorts of students during the academic year of 2002-2003 through 2004-2005. Descriptive statistics of the test score gains and results of a naive two sample t test can be found in Table 4. Teachers are not typically assigned to schools at random, and the school characteristics that affect the selection between teachers and schools often also affect the student score gains in that school. Moreover, as shown in Table 5, over 64% (16%) schools hired only newly graduated grade 3 teachers with student-teaching experiences (intern-teaching) during the academic year of 2003-2004 and 2004-2005. In other words, practical positivity violation occurred in over 80% of the schools. Hence, the IPTW may not yield proper results for these data. Assuming that these schools are likely to hire any teachers with either kind of fieldwork experiences in the long run, the AIPTW we proposed is expected to address the practical positivity violations found in our sample.
Separate analyses were performed for the subjects of language, reading, spelling and math, and for the Hispanic and non-Hispanic students. Table 6 presents the analysis results from the IPTW-orig, the IPTW-trim, the AIPTW-SATC and the AIPTW-RSATC, including the fixed-effect estimates (β 0 ,β 1 ,β 1 −β 0 ), their standard errors, and p values for the Hispanic students. All approaches produced significantly positiveβ 0 andβ 1 ( p < 0.001), indicating one year of newly graduated teacher's instruction significantly improved the CAT-6 scores of the Hispanic students in all subject areas. However, these approaches generated differentβ 1 −β 0 for describing the relative effectiveness of teachers with intern-teaching experience compared to teachers with studentteaching experience. The IPTW-orig showed significant effectiveness of the teachers with internteaching experience in teaching spelling ( p = 0.02), but this trend was not significant when the IPTW-trim or AIPTW-RSATC was used. Using the AIPTW-SATC, teachers with intern-teaching experience appeared to be significantly more effective than the teachers with student-teaching experience in teaching both reading ( p = 0.07) and spelling ( p = 0.04) to the Hispanic students. None of the approaches had significant results for math and language. Analysis results for the non-Hispanic students are presented in Table 7. The benefit of one year of instruction was obvious in spelling and math as indicated by significantly positiveβ 0 and β 1 by all estimation approaches. But both groups of teachers showed less effectiveness in teaching language and reading to the non-Hispanic students, as indicated by insignificantβ 0 in reading using IPTW-trim ( p = 0.28), insignificantβ 1 in reading using IPTW-trim ( p = 0.54) and AIPTW-SATC ( p = 0.43), and insignificantβ 1 in language using AIPTW-SATC ( p = 0.12). As such, no significant difference is found between the two groups of teachers in teaching language or reading by any approach. In spelling and math, the difference between teachers with intern-teaching experience and teachers with student-teaching experience was also insignificant using the IPTWorig. But at 0.10 level, the difference in teaching spelling was significant in favor of the teachers with intern-teaching experience when the IPTW-trim ( p = 0.07), AIPTW-SATC ( p = 0.09) or AIPTW-RSATC ( p = 0.09) was used. Moreover, the AIPTW-SATC revealed an insignificant but important effectiveness of the teachers with intern-teaching experience in teaching math to the non-Hispanic students ( p = 0.19). Conceptually, the trends especially supported by AIPTW-SATC are interesting because during the 1-2 years of intern-teaching experience, credential candidates receive less supervision, but gain more independence as the solely responsible teacher in the classroom. Further investigation is warranted.

Discussion
Clustered data structure provides a way to make causal inferences without having to observe all the cluster-level confounders, e.g., an IPTW with probability of treatment estimated byπ k = 462 PSYCHOMETRIKA β 0 : the overall effectiveness of teachers prepared by student-teaching. β 1 : the overall effectiveness of teachers prepared by intern-teaching. β 1 − β 0 : the relative effectiveness of teachers prepared by intern-teaching compared to teachers prepared by student-teaching. n k1 n k for all i's in cluster k. However, even when the theoretical positivity holds, it can be quite common for the finite sample of some clusters to have no variation in T ik , i.e., n k1 = 0 or n k1 = n k for some k's, causing practical positivity violations and bias in the resultant IPTW estimates. Based on a simple two-level HLM assumed for the potential outcome, we propose an augmented IPTW (AIPTW) that basically includes in the estimation procedure an estimated potential outcome of treatment for every cluster that has no treatment observed, and an estimated potential outcome of control for every cluster with no control observed. In the form of an augmented weighted HLM score function, we show that the resultant estimates are consistent if the cluster-specific potential outcomes can be estimated correctly. Embedding AIPTW in a simple two-level HLM results in a causal estimate that is essentially the same as a nonparametric version of the AIPTW, β 0 : the overall effectiveness of teachers prepared by student-teaching. β 1 : the overall effectiveness of teachers prepared by intern-teaching. β 1 − β 0 : the relative effectiveness of teachers prepared by intern-teaching compared to teachers prepared by student-teaching.
But sinceÊ[Y ik (1) | S i = k] andÊ[Y ik (0) | S i = k] in (11) are obtained based on the HLM assumption, not much robustness can be gained by using (11). In addition, embedding AIPTW in HLM has the potential to supply other estimands of interest, e.g. σ 0 , σ 1 , and ρ, and to include other covariates for the purpose of increasing precision or adjusting for student-level confounders. For example, we assume in our real data analysis that at the elementary levels, the assignments of teachers and students to classrooms within each school are relatively random compared to the assignments of teachers and students to schools (Harris 2011), although controversial. The proposed AIPTW-HLM can be extended to include the student-level confounders, if they exist and measurements are available, as covariates in the HLM to address further confounding bias. Moreover, AIPTW-HLM is also extendable to make causal inference in data of more than two levels, with confounders at any level higher than the level where treatments are assigned and implemented. Pfeffermann et al. (1998) and Hong & Raudenbush (2008) discussed specifically how weights of various levels can be incorporated in HLM. Further theoretical development for causal inference specialized in the educational context (McCaffrey et al. 2004, Hill 2013, accompanied by software program to facilitate the implementation, is worth continuing effort. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution,