Effects of Single-Sex Schooling in the Final Years of High School: A Comparison of Analysis of Covariance and Propensity Score Matching
Cite this article as: Nagengast, B., Marsh, H. W., & Hau, K. (2013). Sex Roles, 69, 404. doi:10.1007/s11199-013-0261-8
Typically, the effects of single-sex schooling are small at best, and tend to be statistically non-significant once pre-existing differences are taken into account. However, researchers often have had to rely on observational studies based on small non-representative samples and have not used more advanced propensity score methods to control the potentially confounding effects of covariates. Here, we apply optimal full matching to the large historical longitudinal dataset best suited to evaluating this issue in US high schools: the nationally representative High School and Beyond study. We compare the effects of single-sex education in the final 2 years of high school on Grade 12 and post-secondary outcomes using the subsample of students attending Catholic schools (N = 2379 students, 29 girls’ schools, 22 boys’ schools, 33 coeducational schools) focusing on achievement-related, motivational and social outcomes. We contrast conventional Analysis of Covariance (ANCOVA) with optimal full matching based on the propensity score that provides a principled way of controlling for selection bias. Results from the two approaches converged: When background and Year 10 covariates were controlled, uncorrected apparent differences between the school types disappeared and the pattern of effects was very similar across the two methods. Overall, there was little evidence for positive effects of single-sex schooling for a broad set of outcomes in the final 2 years of high school and 2 years after graduation. We conclude with a discussion of the advantages of propensity score methods compared to ANCOVA.
Keywords: Single-sex schooling · Propensity score · Causal inference · High School and Beyond study
The effects of attending single-sex versus coeducational high schools have been a topic of intense debate in educational research (see e.g. Lee and Bryk 1986, 1989; Marsh 1989a, b). However, no ultimate conclusion has been reached, and the issue continues to be controversial (see e.g. Mael et al. 2005; Smithers and Robinson 2006). In fact, although single-sex schools largely disappeared in many countries during the 1980s and 1990s, Title IX (U.S. Department of Education 2006) now allows publicly-funded schools to create non-vocational single-sex classes if they are meant to “improve educational achievement … or to meet … particular, identified educational needs of … students” (Sec.106.34(b)(1)(i)). As a consequence, the number of public single-sex schools has started to increase again (see Bigler and Signorella 2011, for an overview of the number of public single-sex schools based on data reported by the National Association for Single-Sex Public Education). This change, however, is based on thin empirical evidence. In their comprehensive review of research on the effects of single-sex schools in elementary, middle or high schools in “Westernized countries … somewhat comparable to American public-sector schools” (Mael et al. 2005, p. 4), the authors concluded that “there is a dearth of quality studies (i.e., randomized experiments or correlational studies with adequate statistical controls) across all outcomes” (p. xvii). Nevertheless, the research on single-sex schooling has influenced public policy more broadly: “the extant evidence—supporting many contradictory conclusions—has been used to support widely differing policy recommendations” (Caspi 1995, p. 58).
As is often the case with research questions that are not amenable to random assignment, one of the key battlegrounds in this debate has been methodological questions, such as: What covariates should be controlled when evaluating the effects of single-sex schooling? What kind of hypothesis test (one-tailed or two-tailed) is appropriate? (see e.g., Lee and Bryk 1989; Marsh 1989a, b). In this paper, we extend these methodological discussions by introducing an explicit conceptual framework for causal inference (Rubin 1977) that has been sorely missing from the analysis of single-sex schooling effects. Conventional analytical approaches based on Analysis of Covariance (ANCOVA) do fit into this framework, but have important limitations when used for causal inference (e.g., Schafer and Kang 2008; Rubin 2001; Rubin and Thomas 2000). We introduce propensity score methods (see Rosenbaum and Rubin 1983a; Morgan and Winship 2007; Stuart 2010, for recent overviews) as a principled alternative. These approaches can be highly effective in removing bias (Rosenbaum and Rubin 1985; Rubin 2001; Rubin and Thomas 1996) under some circumstances (Shadish et al. 2008), and overcome many of the drawbacks of conventional ANCOVA analysis (see Schafer and Kang 2008; Stuart 2010). Propensity score methods have recently become popular in educational and psychological research (see Thoemmes and Kim 2011, for a review). However, none of the studies in Mael et al.’s (2005) comprehensive review of research on single-sex schooling, and none of the more recent methodologically strong studies, used propensity score matching at the individual level to control for possible confounding effects of covariates. One of the historically strongest available databases for investigating the effects of single-sex schools is the High School and Beyond (HSB, National Center for Educational Statistics 1986) longitudinal study commissioned by the U.S. Department of Education in the early 1980s.
HSB oversampled Catholic single-sex schools and provided possibly the largest nationally representative database for the evaluation of single-sex schooling effects ever obtained in the U.S., thus yielding unprecedented (and unlikely to ever recur) opportunities to study the effects of single-sex schooling. Revisiting this classic database, we study the effects of single-sex schooling in the final 2 years of high school on a comprehensive set of outcome measures, using propensity score matching to test the stability of the estimated effects across adjustment techniques.
Effects of Single-Sex Schools: A Substantive-Methodological Synergy
The present investigation is a substantive-methodological synergy, bringing to bear new and evolving methodology to address new – or revisit old – substantive issues with important policy implications (Marsh and Hau 2007). Here we begin with a review of the substantive issues and then move to a discussion of methodological issues.
The Substantive Focus: Effects of Single-Sex Schools
The effects of single-sex schooling on achievement, career choices and social adjustment remain a controversial topic. Much of the justification for single-sex schooling seems to rest on the assumption that it would lead to gender equity (see the review in Bigler and Signorella 2011), remediate failing schools in urban areas and support at-risk students (e.g. Riordan 1994, 1998) or speak to supposed gender differences in the brain (Eliot 2011). In the public debate, girls are assumed to profit more from single-sex schooling (see the review in Bigler and Signorella 2011). However, the theoretical basis for these purported differences is thin (see e.g. Eliot 2011 for a rebuttal of gender differences in the brain), empirical evidence based on strong research designs is rare, and research findings are equivocal. Mael et al.’s (2005) systematic review of studies into single-sex schooling effects found a mixture of small effects favouring single-sex schools and no significant differences between the school types, with almost no evidence suggesting that coeducational schools are more advantageous (but see Marsh et al. 1988, 1989 for evidence in an Australian high school sample). They also reported a “pronounced tendency to study girls’ schools more than boys’ schools” (Mael et al. 2005, p. xvii) in line with the hypotheses of preferential effects of single-sex education on girls. As their review constitutes the basis for our expectations and hypotheses about the direction of single-sex schooling effects, we briefly review their major findings.
Mael et al. (2005) analyzed a sample of studies predominantly from the U.S., U.K., Australia, and New Zealand (see Mael et al. 2005 for the specific countries in each sample). In regard to concurrent academic outcomes, they concluded that the limited positive effects of single-sex schooling on concurrent academic outcomes, reported by some studies (e.g., Carpenter and Hayden 1987; Caspi 1995; Lee and Bryk 1986; Riordan 1994; Spielhofer et al. 2004; Woodward et al. 1999) but not by others (e.g., Baker et al. 1995; Daly 1996; Daly and Shuttleworth 1997; Harker 2000; Lee and Marks 1990; LePore and Warren 1997; Marsh 1989a, 1991), are not sustained in long-term post-secondary academic achievement (Marsh 1989a, 1991; but see Riordan 1990). For adaptational and motivational outcomes, evidence again was mixed. In terms of academic self-concept, both positive effects of single-sex schooling (Riordan 1990) and mixed or null effects (Lee and Bryk 1986; Marsh 1989a, 1991) were found. Using a series of quasi-experimental intervention studies, Marsh and colleagues (Marsh et al. 1988, 1989) reported positive effects of coeducation on academic self-concept for Australian high school students, but no effects on achievement. Coeducational schools appear to be somewhat more beneficial for self-esteem, particularly for boys (Riordan 1994), but again, a number of studies found no differences between school types (LePore and Warren 1997). Educational aspirations appeared to be better fostered in single-sex schools (Lee and Bryk 1986; Lee and Marks 1990; Watson et al. 2002; but see Marsh 1989a, 1991), with mixed evidence and null effects for school track and subject preferences (Daly and Shuttleworth 1997; Lee and Bryk 1986; Lee and Lockheed 1990; Marsh 1989a; Spielhofer et al. 2004). Although Mael et al. (2005) reported some evidence for positive long-term effects of single-sex schools on adaptation and socio-emotional development (Woodward et al. 1999), the number of methodologically sound studies was too small to allow firm conclusions.
As a consequence of the methodological shortcomings of the papers included in their review, Mael et al. (2005) called for more “quality research on extant outcomes, the refinement of methodology, better statistical reporting, and the expansion of the theoretical domain” (p. xviii). As there are likely to be substantial pre-existing differences between intervention and comparison groups, the results of most studies will be systematically biased in the same direction, due to the failure to control these differences fully, rendering dubious even results from meta-analysis. In particular, there are likely to be many pre-existing differences in favour of students attending single-sex schools, which tend to select students from more privileged backgrounds who are expected to achieve better in any schooling context (see for example, Hoffnung 2011, for an analysis of how differences in socio-economic background explain positive occupational outcomes for women in single-sex schools). In this situation, it is preferable to rely on large, representative studies specifically designed to overcome these limitations.
From this perspective, the High School and Beyond (HSB) study commissioned by the U.S. Department of Education in the early 1980s is arguably the strongest available database for evaluating the effects of single-sex education versus coeducation in the US. It is based on a large, nationally representative sample of U.S. high school students. The design intentionally oversampled Catholic schools at a time when single-sex schools were still common in the Catholic school sector (for a more detailed description of the study design see NCES 1986). This allows controlling for some of the differences between single-sex schools (predominantly Catholic or elite private) and coeducational schools (predominantly in the public sector), which typically confound observed school-type effects. HSB also included a wide range of education’s most important outcomes—achievement tests and psychosocial variables—at both Grade 10 and Grade 12, as well as post-secondary outcomes like high school completion. Nevertheless, there is one shortcoming inherent in the design of HSB that should be mentioned upfront: As the first data collection took place in Grade 10, it is only possible to study schooling effects in the final 2 years of high school. Some of the variables assessed in Grade 10 will invariably have been affected by attending a specific type of high school. How best to handle this particularity of HSB in analyses of single-sex schooling effects has been hotly debated.
The High School and Beyond Debate
Lee and Bryk (1986) were the first to analyze data from HSB with a focus on single-sex schooling in the final two high school grades. They statistically controlled for stable individual characteristics of students and families (e.g., SES and race) and reported a mix of positive and non-significant effects of single-sex schooling on achievement in Grade 10 and 12, achievement gains between Grades 10 and 12, course choices, attitudinal outcomes (such as interest and academic self-concept) and educational aspirations. Their results showed comparatively more advantages of single-sex schooling for girls. But even for boys, the statistically significant effects favoured single-sex education. Using the third follow-up of HSB and a similar analytical approach, Lee and Marks (1990) reported sustained positive effects of single-sex education during college: Students educated in single-sex schools were more likely to attend more selective four-year colleges and reported more ambitious plans for their post-baccalaureate education. Girls who had attended single-sex schools also reported higher college satisfaction.
Marsh (1989a) strongly critiqued the analyses by Lee and Bryk (1986; also see the ensuing debate, Lee and Bryk 1989; Marsh 1989b) on the basis that the analytical approach and statistical analyses were flawed and inconsistent with the design and intended use of the HSB database. Lee and Bryk’s inappropriate use of one-tailed tests of statistical significance and their failure to control appropriately for design effects associated with clustered sampling rendered many of the “significant” effects statistically non-significant. Of particular relevance to the present investigation, Marsh pointed out that the set of background variables controlled by Lee and Bryk was not sufficient. Most importantly, he argued that pre-existing differences in achievement and motivation were not sufficiently controlled. The unavailability of proper pre-test measures rendered differences in the base year (Grade 10) outcomes meaningless as estimates of causal school-type effects (at Grade 10).
Marsh (1989a; also see Jencks 1985) argued that HSB was specifically designed to study the long-term effects of high school attendance during the last 2 years of high school and that all proper analyses should be focused on the Grade 12 outcomes, controlling for background variables and Grade 10 outcomes. When analyzing the data in this way (and employing two-tailed significance tests), Marsh showed that the effects of single-sex schooling in the final two grades were mostly non-significant for both boys and girls: Out of 40 outcomes analyzed, students from single-sex schools were only favoured in reading achievement and foreign language credits, whilst students from coeducational schools had a larger number of English credits. Marsh also explicitly tested gender main effects and gender-by-school-type interactions to find evidence for differential effectiveness of single-sex schooling. Although there were many main effects of gender, when all covariates were controlled, there was only one significant gender-by-school-type interaction. Somewhat ironically, Mael et al. (2005) counted Lee and Bryk’s (1986) methodologically flawed analyses as showing support for positive effects of single-sex schooling, despite arguing strongly for the importance of appropriately controlling for pre-existing differences in confounding variables.
Further Evidence from Large Scale Longitudinal Studies
LePore and Warren (1997) studied the effects of Catholic single-sex schooling in U.S. high schools using the successor of HSB, the National Educational Longitudinal Study of 1988 (NELS:88). Overcoming a crucial limitation of HSB, NELS:88 started data collection in Grade 8, typically before students actually attended high school. However, NELS:88 did not oversample Catholic and single-sex schools, so the available sample of students of both genders who attended single-sex schools was very small: The effective weighted sample size after controlling for clustering was always below 300 students, from a small number of schools. This problem notwithstanding, LePore and Warren’s findings supported Marsh’s (1989a) contention that Lee and Bryk (1986) had overestimated single-sex schooling effects: When pre-existing differences between students were not accounted for, LePore and Warren (1997) reported mean differences in favour of single-sex schooling particularly for boys. Once they appropriately controlled for differences in background and outcome variables assessed prior to high school, none of the effects remained statistically significant.
Billger (2009) also used data from NELS:88 to study the effects of single-sex schooling, focusing on postsecondary and labour market outcomes. She used a slightly larger sample of students, including also students who attended private, but not Catholic, single-sex schools. Overall, her findings did not show clear support for the benefits of single-sex schooling: Although there were positive effects of single-sex schooling for some subgroups of students, there were no positive effects on the degrees achieved by single-sex school students, who were in fact slightly less likely to meet their educational expectations.
Studying the effects of single-sex schooling in England, Spielhofer et al. (2004) used the National Pupil Database, a large population level database of English students. They studied the effects of single-sex secondary education on achievement levels in Grade 9 (Key Stage 3 in the English school system) and Grade 11 (GCSE exams taken by 16-year-old students at the end of the compulsory schooling age) assessments. In their analyses, they controlled background characteristics assessed at the end of primary school (Key Stage 2 in the English educational system) and a variety of school-level variables. Spielhofer et al. (2004) reported some advantages of single-sex schooling for girls in comprehensive schools (as compared to selective grammar schools), and for low attaining boys in single-sex comprehensive schools and boys in single-sex grammar schools. However, their results did not control for socio-economic status, an important confounder that determines school choice in England (Burgess et al. 2009) and influences educational achievement around the globe (e.g. OECD 2010).
Sullivan (2009) and Sullivan et al. (2010) studied the effects of single-sex schooling in secondary schooling in Britain (England, Scotland and Wales) using data from the National Child Development Study (NCDS), a birth cohort study starting in 1958. The design allowed them to control for a variety of individual and parental background variables and to consider long-term effects of single-sex schooling. However, because the data were not based on sampling of schools, they could not control school-level covariates. Sullivan et al. (2010) reported that while girls attending single-sex schools at age 16 were more likely to obtain higher qualifications, there was no difference between coeducational and single-sex schools for boys once background variables and prior test scores were controlled. However, these initial positive effects did not translate into higher levels of educational attainment later in life. Attending single-sex schools was only related to attainment in gender-atypical subjects for both girls and boys. Girls who attended single-sex schools were also more likely to gain their highest qualification in fields dominated by males. Sullivan (2009) found similar results when studying the effects of single-sex schooling on academic self-concept. After controlling for a large number of individual background variables and factors related to school sector and curriculum, girls who attended single-sex schools had higher self-perceptions of their ability in math, a stereotypically male subject, than girls who attended coeducational schools. Boys who attended single-sex schools had higher self-perceptions of their ability in English (a stereotypically female subject) than boys who attended coeducational schools.
Recent studies of single-sex schooling effects have increasingly and appropriately controlled the effects of large sets of potential confounders and—as a consequence—tend to report only small or no beneficial effects of single-sex schooling on achievement and psychosocial outcomes. A very important limitation of the research on the effects of single-sex schooling is of a methodological nature. Almost all studies rely on observational data (even quasi-experimental intervention designs are rare, but see Marsh et al. 1988, 1989). However, apparently no study has used more advanced methods designed to mimic randomized experiments with observational data by matching individual students from single-sex and coeducational schools to form comparable samples. In the present investigation we overcome this shortcoming by introducing propensity score methods based on the strongest historical dataset available, the HSB study.
The Methodological Focus: Propensity Score Matching
In studying the effects of single-sex versus coeducational schools, the interest is in estimating the causal effect of school-type, net of other confounding influences, such as pre-existing differences between the student populations. However, as there is no random assignment of students to school-type (for an exception in the context of Korean high schools see Park et al. 2010), simple mean differences provide biased estimates of the school-type effect. Thus, it is helpful to use a theoretical framework in which conditions for the identification of causal effects in observational studies are outlined. The most prominent framework for the analysis of causal effects is the potential outcomes model (also known as the counterfactual model of causality, or Rubin’s causal model, Holland 1986; Rubin 1974, 1977, 1978, 2005). Originally proposed in the 1970s, with roots extending back to the 1920s (Neyman 1923/1990), the counterfactual model of causality has only recently enjoyed a surge of popularity and begun to fulfil its promise for the evaluation of educational interventions (see e.g., Thoemmes and Kim 2011; for more technical reviews see Morgan and Winship 2007; Rubin 2005; Schafer and Kang 2008; West and Thoemmes 2010).
Potential Outcomes Model
The following thought experiment is at the heart of the counterfactual model of causality (Rubin 1974, 1977, 1978). For each student, we can imagine one hypothetical outcome if the student attended a single-sex school and one hypothetical outcome if the student attended a coeducational school. The difference between these two potential outcomes is the individual causal effect of attending a single-sex versus a coeducational school (Rubin 1974, 1978, 2005). In practice, of course, only one of the hypothetical outcomes is ever observed. Hence, the individual causal effect can never be estimated: this is the fundamental problem of causal inference (Holland 1986).
However, two summary treatment effects can be estimated in observational studies under additional assumptions. The average treatment effect (ATE) is defined as the (unweighted) average of the individual causal effects over the total population (e.g. Rubin 1977). The ATE is estimated by the mean difference between treatment groups in a randomized experiment. The average treatment effect of the treated (ATT) is the average of the individual causal effects of the students in the treatment group (e.g., Rubin 1977), or, more generally, the average in a population of students with the same characteristics as the treatment group. In our case, the ATT would be the effect of attending a single-sex school for the population of students who actually attended a single-sex school. As we describe later, the ATT is the quantity usually estimated with matching techniques (Stuart 2010). In observational studies, the ATE and the ATT will in general differ, unless very restrictive conditions apply (see, e.g., Schafer and Kang 2008).
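The distinction between these estimands can be made concrete in a few lines of code. The sketch below uses a small, entirely hypothetical set of students whose potential outcomes are (unrealistically) both known, so that the ATE, the ATT, and the naive observed-group difference can all be computed and compared:

```python
# Hypothetical toy data: each student has two potential outcomes,
# y1 (if attending a single-sex school) and y0 (if coeducational).
# In real data only one of the two is ever observed.
students = [
    # (y0, y1, treated)  treated = 1 means "attended single-sex school"
    (50, 52, 1),
    (60, 63, 1),
    (55, 56, 0),
    (40, 41, 0),
    (45, 45, 0),
]

# Average treatment effect (ATE): mean of y1 - y0 over *all* students.
ate = sum(y1 - y0 for y0, y1, _ in students) / len(students)

# Average treatment effect of the treated (ATT): the same mean, but
# only over the students who actually received the treatment.
treated = [(y0, y1) for y0, y1, t in students if t == 1]
att = sum(y1 - y0 for y0, y1 in treated) / len(treated)

# Naive estimate from observed data: mean observed outcome of the
# treated minus mean observed outcome of the untreated.
obs_t = [y1 for y0, y1, t in students if t == 1]
obs_c = [y0 for y0, y1, t in students if t == 0]
naive = sum(obs_t) / len(obs_t) - sum(obs_c) / len(obs_c)

print(ate, att, naive)
```

Here the naive difference (about 10.8) is far larger than both the ATT (2.5) and the ATE (1.4), because the treated students in this toy population already had higher baseline outcomes: exactly the kind of selection bias that adjustment methods aim to remove.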
In order to estimate either the ATT or the ATE, the condition of strong ignorability has to hold. Strong ignorability is defined as stochastic independence of the potential outcomes and the assignment to treatment conditions conditional on a set of covariates (Rubin 1978; Rosenbaum and Rubin 1983a). In our case, assignment to single-sex or coeducational schools would have to be independent of the potential outcomes, given background characteristics and pre-test variables of the students. Strong ignorability would be fulfilled if assignment to school-type could be considered random, given a set of background covariates. Strong ignorability requires that all covariates that influence both the potential outcomes and the treatment assignment probabilities are known and included in adjustment models. It cannot be verified in applications, which implies that unmeasured confounders can potentially bias empirical results (Little and Rubin 2002) and that sensitivity analyses are warranted (Rosenbaum and Rubin 1983b; Rosenbaum 2002).
In addition to strong ignorability, identification of average causal effects rests on two further assumptions: overlap (Rubin 1977) and the stable-unit-treatment-value assumption (Rubin 1980, 1986, 1990a). The assumption of overlap requires that the treatment probabilities are between zero and one for all units, in both intervention and comparison groups. In our case, this implies that there are no students who would always attend either a single-sex or a coeducational school. Under such conditions, the other potential outcome would not be defined, and the student would have no meaningful individual causal effect (e.g. Steyer et al. 2000). The stable-unit-treatment-value-assumption requires that there be only one version of the treatment, and that the hypothetical potential outcomes are neither affected by the treatment assignment of the student under consideration nor by the composition of the treatment and control groups (Rubin 1980, 1986, 1990a; for a critical discussion see Hong and Raudenbush 2006; Gitelman 2005; Manski 2010; Nagengast 2009; Sobel 2006; VanderWeele 2008).
In practice, estimating the average causal effect requires a statistical model for the relations between the covariates, the treatment assignment and the outcome variable. Although other methods such as instrumental variable estimation (e.g., Heckman and Vytlacil 2007), fixed effects modelling (e.g., Wooldridge 2005) or differences-in-differences estimators (e.g., Angrist and Pischke 2009) could be used under appropriate circumstances, there are typically two strategies that are used to estimate average causal effects in observational studies (Schafer and Kang 2008; Shadish et al. 2008) within the potential outcomes framework. One can either model the relation between the covariates and the outcome variable, as is typically done with analysis of covariance (ANCOVA) and related methods (e.g. path models based on manifest or latent variables), or one can model the relation between the covariates and the treatment assignment, which is the focus of a set of analytical methods that can be broadly classified as propensity score methods (for recent reviews see, e.g., Morgan and Winship 2007; Schafer and Kang 2008; Stuart 2010). Both approaches rely on the assumption of strong ignorability for the identification of causal effects, i.e., on the inclusion of all covariates that affect both the potential outcomes and the assignment to treatment conditions—the sine qua non of causal effects analysis.
Analysis of Covariance
ANCOVA is the classic method for analyzing the effects of single-sex versus coeducational schools (e.g., Lee and Bryk 1986; Lee and Marks 1990; LePore and Warren 1997; Marsh 1989a). While it is theoretically possible to identify causal effects using ANCOVA (typically the ATE), doing so requires additional strong assumptions that undermine the use of ANCOVA as a technique for causal inference (Rubin 1977). The first shortcoming is that rules for variable selection in multiple regression (explained variance, statistical significance of regression weights) are not specifically tailored for causal inference. Strong ignorability requires controlling all variables that are related to treatment assignment. As these variables will share some explanatory power for the outcome with the treatment variable, they might be non-significant predictors in a multiple regression model, leading to their erroneous exclusion from the set of predictor variables (Schafer and Kang 2008).
The second shortcoming of ANCOVA with respect to causal inference is that overlap in the covariate distribution between treatment groups is usually not assessed. In the case of minimal or no overlap, ANCOVA approaches are highly reliant on extrapolation of the covariate-outcome relation outside of the range of observed data. This is particularly problematic when there are large differences in covariate distributions between the treatment groups (e.g., Rubin 2001; Rubin and Thomas 2000). In studies of single-sex and coeducational schools, this is likely to be a key issue: as discussed earlier, single-sex schools are more likely to be selective private schools than coeducational schools.
Third, ANCOVA relies on the specification of a parametric model for the covariate-outcome relation: Usually a simple linear model is assumed. The appropriateness of the effect estimates obtained from ANCOVA models rests on the assumption that this relation is correctly specified, e.g. that no interaction or non-linear effects are omitted (Cochran and Rubin 1973). In particular, the exclusion of interaction effects between the covariates and the treatment variable is potentially problematic (Rubin 1990b), as there is no a priori justification for why the covariates should be similarly related to both sets of potential outcomes (Imbens and Lemieux 2008; Schafer and Kang 2008).
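The consequences of omitting a treatment-by-covariate interaction can be illustrated numerically. The following sketch (hypothetical data; a small pure-Python least-squares fit) applies a shared-slope ANCOVA model to a population in which the true effect is 1 in one covariate stratum and 5 in the other, with treated students concentrated in the high-effect stratum. The single adjusted coefficient then differs from the true ATE, because OLS implicitly weights the strata by within-stratum treatment variance rather than by stratum size:

```python
# Hypothetical toy data illustrating the omitted-interaction problem.
# x is a binary covariate (say, low vs. high prior achievement); the
# true treatment effect is 1 when x = 0 but 5 when x = 1.
data = [  # (x, t, y)
    (0, 0, 10), (0, 0, 10), (0, 1, 11), (0, 1, 11),
    (1, 0, 12), (1, 1, 17), (1, 1, 17), (1, 1, 17), (1, 1, 17),
]

def solve(A, rhs):
    """Solve A m = rhs by Gaussian elimination (small square system)."""
    n = len(A)
    M = [row[:] + [b] for row, b in zip(A, rhs)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))  # partial pivot
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    xs = [0.0] * n
    for i in range(n - 1, -1, -1):
        xs[i] = (M[i][n] - sum(M[i][j] * xs[j] for j in range(i + 1, n))) / M[i][i]
    return xs

def ols(rows):
    """ANCOVA-style regression y = a + b*t + c*x via the normal equations."""
    X = [[1.0, t, x] for x, t, _ in rows]
    y = [float(yi) for _, _, yi in rows]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
    return solve(XtX, Xty)

a, b, c = ols(data)

# True average treatment effect, from the known stratum-specific effects:
# 4 students with effect 1 (x = 0) and 5 with effect 5 (x = 1).
ate = (4 * 1 + 5 * 5) / 9
print(round(b, 3), round(ate, 3))  # the shared-slope estimate understates the ATE
```

In this construction the shared-slope coefficient is 25/9 (about 2.78) while the true ATE is 29/9 (about 3.22); adding the t-by-x interaction and averaging the stratum effects would recover the ATE exactly.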
Finally, modeling the covariate-outcome relation directly introduces the possibility of intentionally or unintentionally misrepresenting the results (Rubin 2001). As there are no clear criteria for what constitutes a well-specified model for causal inference when ANCOVA is employed, and since all modeling efforts necessarily involve the outcome of interest, the analyst might be tempted to adjust the model until obtaining the desired effect that fits theoretical preconceptions. As Rubin (2001) elegantly summarized: “It is essentially impossible to be objective when a variety of analyses are being done, each producing an answer, either favorable, neutral, or unfavorable to the investigator’s interests” (p. 170). Particularly with highly politicized topics, such as the effects of single-sex schooling, modeling the outcome variable directly is potentially problematic.
Propensity Score Methods
A much stronger alternative, which resolves many of the problems associated with ANCOVA, is to model the assignment of students to the treatment conditions. Methods based on the estimated propensity score, the probability of being assigned to or taking up the treatment given the covariates, have many desirable features (see Schafer and Kang 2008; Stuart 2010; Morgan and Winship 2007, for recent accessible reviews). Most critically, they separate the design and the analysis of observational studies (Rubin 2005). The analyst can specify and optimize the model for treatment assignment (the design), making it “possible to duplicate one crucial feature of a randomized experiment: one can design an observational study without access to the outcome data” (Rubin 2001, p. 170). Once a good model for the treatment assignment has been specified, estimation of the treatment effect (the analysis) can proceed. A very desirable property of the propensity score is that it balances the distribution of all observed covariates if it is based on a correctly specified model (Rosenbaum and Rubin 1983a). Hence, balance of covariates can be used as a criterion for model specification and a test for the tenability of the strong ignorability assumption for the observed covariates; an important advantage over ANCOVA approaches. As balance can be tested without knowledge of the outcome, the propensity score model can be modified until a good balance in the covariate distribution between the treatment and the control group is achieved, without compromising the analysis of the outcome (Rubin 2005). Furthermore, it is possible to assess the overlap of the propensity score distribution between the treatment groups (the region of common support, e.g., Stuart 2010) to identify the areas for which inferences about the treatment effect are justified by the data. 
Lack of overlap between the treatment groups suggests that the groups may differ so much that comparisons are not really warranted and, perhaps, essentially meaningless (e.g. Rubin 2001). Theoretically, regression-based adjustment methods such as ANCOVA and propensity score approaches will yield similar results when the treatment and control groups do not differ substantially with respect to the distribution of covariates (Rubin 2001; Rubin and Thomas 2000).
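As a minimal illustration of the common-support idea (this is not the authors' code; the propensity score values are hypothetical), the overlap region can be computed as the intersection of the propensity score ranges in the two groups, and units falling outside it can be flagged:

```python
# Hedged sketch with made-up propensity scores: identify the region of
# common support as the overlap of the two groups' propensity score ranges.

def common_support(ps_treated, ps_control):
    """Return (low, high) bounds of the overlap region."""
    low = max(min(ps_treated), min(ps_control))
    high = min(max(ps_treated), max(ps_control))
    return low, high

def outside_support(scores, low, high):
    """Units whose propensity scores fall outside the overlap region;
    inferences about treatment effects for these units are not
    justified by the data."""
    return [s for s in scores if s < low or s > high]

ps_treated = [0.35, 0.42, 0.58, 0.71, 0.80]   # hypothetical single-sex group
ps_control = [0.10, 0.22, 0.40, 0.55, 0.68]   # hypothetical coeducational group

low, high = common_support(ps_treated, ps_control)
flagged = outside_support(ps_treated + ps_control, low, high)
```

In this toy example the two treated units above 0.68 and the two control units below 0.35 lie outside the common support, so comparisons involving them would rest on extrapolation rather than data.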
In practice, there are different methods for specifying the propensity score model (e.g. linear logistic regression, boosted regression, non-parametric methods; see Harder et al. 2010; Thoemmes and Kim 2011) and different ways of using the propensity score to create balance between the treatment and the control group (variations of matching, stratification and weighting; see Schafer and Kang 2008; Stuart 2010, for comprehensive reviews; Thoemmes and Kim 2011, for a systematic review of how these methods are applied in the social sciences). In this paper, we focus on linear logistic regression for estimating propensity scores and consider optimal full matching, an advanced matching technique with desirable properties (Hansen 2004; Rosenbaum 1991; Stuart and Green 2008).
The Present Investigation
The present investigation revisits the effects of attending single-sex versus coeducational schools in the final 2 years of high school using the subsample of Catholic schools and data from the first three waves of the U.S. High School and Beyond (HSB) study, historically the most comprehensive and strongest dataset for this purpose. The major research questions are whether there are differences between students attending single-sex or coeducational schools, and whether these differences vary as a function of student gender, i.e. whether there are gender-by-school-type interactions. In line with the literature review and previous research using this same database (Marsh 1989a), we expect that without proper controls there will be large differences between single-sex and coeducational schools that will largely disappear once individual-level covariates are taken into account. We evaluate these research questions in relation to a wide range of 39 academic and psychosocial outcomes designed to represent most of the major outcomes of education (see Table B1 for a listing of the variables, which are described in more detail in supplementary Appendix A1; also see Marsh 1989a; NCES 1986). The outcomes were chosen to parallel earlier analyses with the HSB database. We present findings for a comprehensive set of outcome variables at Grade 12 and post-secondary education, controlling for background variables and largely parallel outcomes collected at Grade 10. In our analyses, we compare findings based on different analytical strategies: ANOVA that does not adjust for pre-existing differences, ANCOVA that adjusts for the effects of the covariates on the outcome variables, and optimal full matching based on the propensity score (Hansen 2004; Rosenbaum 1991; Stuart and Green 2008). We leave it as an open research question whether the two adjustment methods yield different results.
We used data from the first three waves of the sophomore cohort of the HSB study: 1980 (Grade 10 baseline); 1982 (first follow-up, last year of high school); 1984 (second follow-up, 2 years after normal high school graduation). A detailed description of the database is given in the user’s manual (NCES 1986). The initial sample was obtained using a two-stage sampling scheme: Approximately 36 sophomores were sampled from a representative sample of 1015 high schools in the U.S. The total sample in Grade 10 consisted of 14825 students. As there were almost no single-sex public schools, we selected the 2392 students who attended Catholic schools in Grade 10 so that potential differences between the large number of public comprehensive schools and their Catholic counterparts would not confound the difference between single-sex and coeducational schools. We then excluded 13 students who changed schools and went on to a public sector school, leaving a final sample of 2379 students who attended Catholic schools and did not change school between Grade 10 and Grade 12: 29 single-sex girls’ schools (818 girls), 22 single-sex boys’ schools (604 boys), and 33 coeducational schools (508 girls and 449 boys).
Overall, we considered a set of 14 background variables at the student level, 6 background variables at the school level, 31 outcome variables in Grade 10, 36 outcomes in Grade 12, and 3 post-secondary outcomes (see Supplemental Appendix A1 and Table B1). The outcome variables in Grade 10 and Grade 12 could be subdivided into achievement and achievement-related outcomes (e.g. course taking patterns) and psychosocial and behavioral outcomes (such as academic self-concept, educational aspirations and school delinquency). Most of the outcomes were assessed with similar measures in Grade 10 and Grade 12. Missing data were handled using multiple imputation (see Supplementary Appendix A2 for details).
Design and Analysis
Unadjusted differences in the final year and postsecondary outcomes, prima facie effects (Holland 1986; Steyer et al. 2002) of single-sex schooling, were tested with a two-factor MANOVA using gender, school-type and their interaction as independent variables. Both factors were represented with effect-coded indicator variables (gender: female = −1 versus male = 1; school-type: coeducational = −1 versus single-sex = 1). To control for the nesting of students within schools, the analysis was performed in Mplus (Muthén and Muthén 1998–2010) using the design-based correction for standard errors and test statistics. Differences on individual variables were followed up with separate ANOVAs using design-based correction for standard errors in the R-package survey (Lumley 2010). In addition, simple effects of school-type were calculated within gender groups to illustrate potential interaction effects.
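The effect coding described above can be sketched as follows (a minimal illustration, not the authors' code; the function name and record values are hypothetical):

```python
# Sketch of the effect coding used for the two-factor MANOVA:
# gender (female = -1, male = 1), school-type (coeducational = -1,
# single-sex = 1), and their product as the interaction term.

def effect_codes(gender, school_type):
    g = 1 if gender == "male" else -1              # female = -1, male = 1
    s = 1 if school_type == "single-sex" else -1   # coeducational = -1
    return g, s, g * s                             # two main effects + interaction

# A girl in a single-sex school and a boy in a coeducational school:
girl_ss = effect_codes("female", "single-sex")     # (-1, 1, -1)
boy_coed = effect_codes("male", "coeducational")   # (1, -1, -1)
```

With this coding, the regression coefficients of the two indicators estimate (half) the main-effect contrasts, and the product term carries the gender-by-school-type interaction.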
Analysis of Covariance
ANCOVA models were conducted for the outcome variables at Grade 12 and the postsecondary assessment. These models controlled for all background variables and base year outcomes at the student level. In the total sample, all models included effect-coded indicator variables for gender, school-type, and their interaction. In addition, the main effects of all background variables and outcomes assessed at Grade 10 were included as predictors in the model. No further interactions between variables were included. These models were also run separately for boys and girls so that the effects of the covariates were allowed to vary between gender groups, allowing for implicit gender-covariate interactions. Hence, if the covariate effects varied with gender, the resulting estimates could be slightly different from the effects obtained from the total group analysis. Again, standard errors were corrected for the nesting of students within schools, using the R-package survey (Lumley 2010) in all analyses.
Propensity Score Estimation
The probabilities of attending a single-sex school were estimated with linear logistic regression models. Separate models for females and males were fitted, in order to allow for gender differences in the selection into single-sex schools. The predicted propensity scores were linearized by a logit-transformation (Rosenbaum and Rubin 1985; Rubin 2001; Rubin and Thomas 1996; Stuart 2010).
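A minimal sketch of this estimation step (toy data, not the HSB covariates; the gradient-descent fit stands in for the maximum-likelihood routines actually used) illustrates fitting a logistic regression for the treatment probability and linearizing the predictions with the logit transform:

```python
import math

# Hedged sketch: estimate propensity scores with a plain logistic
# regression fitted by gradient descent, then linearize the predicted
# probabilities with the logit transform (Rosenbaum and Rubin 1985).

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Return weights (intercept first) for P(treatment = 1 | x)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi                 # gradient of the log-loss
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def propensity(w, xi):
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Linearized propensity score."""
    return math.log(p / (1.0 - p))

# Toy sample: one covariate (say, a standardized Grade 10 pretest);
# y = 1 indicates attending a single-sex school.
X = [[-1.2], [-0.5], [0.1], [0.4], [0.9], [1.5]]
y = [0, 0, 0, 1, 1, 1]
w = fit_logistic(X, y)
scores = [propensity(w, xi) for xi in X]
lps = [logit(p) for p in scores]
```

In practice, as in this study, separate models would be fitted for girls and boys by running the estimation on each gender subsample.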
Use of the Propensity Score
We compared the performance of three analytical techniques based on propensity scores (stratification, nearest neighbor matching with replacement and optimal full matching), but only report the results for optimal full matching here, as this method yielded the best balance of the covariate distribution. Further details and results of the other methods are reported in Supplementary Appendix A3.
Optimal Full Matching
Optimal full matching is an advanced matching technique that optimizes an overall balance criterion (Hansen 2004; Rosenbaum 1991). Units from the treatment and the control group are placed in subgroups of different sizes that contain at least one unit from each group. The subgroup sizes are optimized to maximize balance between the matched samples (see Hansen 2004, for details). In contrast to conventional nearest neighbor matching, full matching uses the full sample of subjects in the treatment and the control group. Full matching has been shown analytically and empirically to have good properties for creating balance across treatment groups (Hansen 2004; Harder et al. 2010; Stuart and Green 2008). Following Hansen (2004) and Stuart and Green (2008), we used different maximum and minimum ratios for the boys’ (from 1:6 to 3:1) and girls’ (from 1:7 to 2:1) strata to enhance the efficiency of the matching estimator. We used the R-package optmatch (Hansen and Klopfer 2006) to implement full matching separately for girls and boys. Our analyses following optimal full matching identified the average treatment effect on the treated (ATT; Stuart 2010).
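To make the weighting concrete, the following sketch shows how ATT weights follow from a set of matched subgroups (the matched sets here are made up for illustration; optmatch determines them optimally, which is not reproduced here). Treated units receive weight 1, and the controls within each subgroup share that subgroup's treated total:

```python
# Hedged sketch: derive ATT weights from matched sets produced by full
# matching. Each matched set pairs one or more treated units with one or
# more controls; controls in a set split its treated count evenly.

def att_weights(matched_sets):
    """matched_sets: list of (treated_ids, control_ids) tuples.
    Returns a dict of unit id -> weight for estimating the ATT."""
    weights = {}
    for treated, controls in matched_sets:
        for t in treated:
            weights[t] = 1.0
        for c in controls:
            weights[c] = len(treated) / len(controls)
    return weights

# Hypothetical matched sets: one treated matched to two controls,
# then two treated matched to one control.
sets = [(["t1"], ["c1", "c2"]),
        (["t2", "t3"], ["c3"])]
w = att_weights(sets)
```

In this toy configuration "c1" and "c2" each carry weight 0.5, while "c3" stands in for two treated units and carries weight 2.0, so the weighted control group mirrors the treated group within each matched set.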
Assessment of Balance
The success of the propensity score estimation and subsequent matching procedure was assessed by comparing standardized mean differences (similar to effect size measures, Rosenbaum and Rubin 1985) of the covariates before and after matching. Following Rubin (2001) and Harder et al. (2010), balance was deemed acceptable when the absolute standardized mean difference of the linearized propensity score and the covariates was below 0.25 (standardized with respect to the variance in the treatment group prior to matching). However, we always tried to reduce the imbalance between the single-sex and the coeducational schools as much as possible (Stuart 2010). In order to further assess the similarity of the covariate distributions between school-types after matching, we inspected the variance ratios of the propensity score and the covariates between the treatment and control groups after matching (target ratios were between 0.5 and 2; Rubin 2001) and visually inspected quantile-quantile (QQ) plots of the propensity score and covariate distributions in the treatment and control groups (Stuart 2010). Balance assessment was done in R (R Development Core Team 2010) using the package MatchIt (Ho et al. 2011).
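The two numerical balance criteria can be sketched as follows (hypothetical values, not the HSB data; note that the standardized mean difference divides by the treatment-group SD prior to matching, as described above):

```python
import statistics

# Illustrative balance checks: absolute standardized mean difference
# below 0.25 and variance ratio between 0.5 and 2 (Rubin 2001;
# Harder et al. 2010). All numbers below are hypothetical.

def std_mean_diff(treated, control, sd_ref):
    """Mean difference standardized by the treatment-group SD
    prior to matching (sd_ref)."""
    return (statistics.mean(treated) - statistics.mean(control)) / sd_ref

def variance_ratio(treated, control):
    """Ratio of (sample) variances between treatment and control."""
    return statistics.variance(treated) / statistics.variance(control)

treated = [0.4, 0.6, 0.5, 0.7, 0.3]    # covariate values after matching
control = [0.35, 0.55, 0.45, 0.65, 0.40]
sd_before = 0.25                        # hypothetical pre-matching SD

smd = std_mean_diff(treated, control, sd_before)
balanced = abs(smd) < 0.25 and 0.5 < variance_ratio(treated, control) < 2.0
```

For these toy numbers the standardized mean difference is 0.08 and the variance ratio lies within (0.5, 2), so the covariate would be judged acceptably balanced; in the study these checks were complemented by visual inspection of QQ plots.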
Estimation of Treatment Effects After Matching
The analyses after matching controlled for the nesting of students within schools using the R-package survey (Lumley 2010). In addition, the appropriate weights resulting from optimal full matching were used in the estimation of the average treatment effects. Hence, in the present investigation the ATT of single-sex schooling was estimated using a two-way ANOVA model with gender, school-type and their interaction as independent variables. The interaction effects reflect the difference in school-type effects between girls and boys after matching. In addition, mean comparisons were conducted separately in the boys’ and girls’ samples to test simple main effects (i.e., differences between single-sex and coeducational schools separately for each gender).
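At its core, the weighted estimate is a contrast of weighted outcome means; a minimal sketch (toy numbers; the actual analyses used survey-corrected ANOVA in R, which is not reproduced here) is:

```python
# Hedged sketch: after full matching, the ATT is the difference between
# the treated outcome mean and the matching-weighted control outcome
# mean. Outcomes and weights below are hypothetical.

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def att_estimate(y_treated, y_control, w_control):
    """Treated units carry weight 1; controls carry their matching weights."""
    w_treated = [1.0] * len(y_treated)
    return weighted_mean(y_treated, w_treated) - weighted_mean(y_control, w_control)

y_t = [52.0, 48.0, 55.0]       # outcomes of matched treated units
y_c = [50.0, 46.0, 49.0]       # outcomes of matched control units
w_c = [0.5, 0.5, 2.0]          # control weights from the matched sets
effect = att_estimate(y_t, y_c, w_c)
```

The design-based standard-error correction for the school-level clustering, handled in the study by the survey package, is omitted from this sketch.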
Significance Testing and Effect Sizes
As we did not have specific hypotheses about the direction of effects, we opted for two-tailed tests of statistical significance in all analyses (see also the discussion by Lee and Bryk 1989; Marsh 1989a, b). We chose to use the conventional 5 %-level for statistical significance in all analyses. Given the large sample size, all effects were tested using the standard normal distribution as reference distribution. All results are presented in an effect size metric using the variance in the total group of Catholic school students before matching for standardization. This ensured comparability of the findings across methods.
The results section is structured as follows: We begin by briefly presenting unadjusted differences between gender and school-type for outcomes at the final year and post-secondary follow-ups. These results are only descriptive of unadjusted school-type prima facie effects (Holland 1986; Steyer et al. 2002) and clearly do not represent causal effects. Then, we present the school-type effects for the final year and postsecondary outcomes when both background variables and base year outcomes are controlled. Here, we report the estimates obtained from ANCOVA and optimal full matching, the method that led to the best balance of covariates between single-sex and coeducational schools.
Analysis of Variance
An overarching MANOVA followed by simple ANOVA models was used to obtain unadjusted (for covariates) main effects for gender, school-type and for the gender-by-school-type interaction. The MANOVA indicated significant main effects for gender, χ2 (39) = 1197.131, p < 0.001, for school-type, χ2 (39) = 96.318, p < 0.001, and for the school-type-by-gender interaction χ2 (39) = 59.404, p = 0.019. Separate ANOVA analyses were used to disentangle these overall effects.
The following unadjusted prima facie main effects (Holland 1986; Steyer et al. 2002) of school-type emerged from the ANOVAs. It should be emphasized that these effects do not represent causal effects of attending single-sex schools, as they do not control for differences in background variables and earlier outcomes, so that selection effects are not controlled. Students in single-sex schools had more math credits, more English credits, more language credits and more physical science credits. They reported higher self-esteem, higher work and community orientations, and a more internal locus of control. Their academic and social self-concepts were also higher. They also showed lower gender stereotypes, higher educational and parental aspirations, spent more time on homework, and showed more concentrated effort at school. Finally, a large number of significant gender differences emerged. Boys had higher values in: math scores, science scores, vocabulary scores, math credits, physical science credits, working orientation, following an academically oriented course pattern in math and science, school delinquency, physical self-concept, athletic self-concept, gender stereotypes. Girls had higher values in: writing scores, self-reported grades, attending a life science course, following a vocational course pattern, internal locus of control, family orientation, parental involvement, time spent on homework, concentrated effort, academic self-concept. The descriptive statistics for all variables are reported in Table B1 in the supplemental materials, separately for girls and boys.
Controlling for Background Variables and Base Year Outcomes
Analysis of Covariance
The conventional ANCOVA models controlled school-type effects for student-level covariates (background variables and outcomes assessed in Grade 10). There were few effects of school-type after controlling for the comprehensive set of background and baseline covariates. This is also highlighted in Fig. 1, Panel b, where only two main effects of single-sex schools and one interaction effect are shaded: Students in single-sex schools had lower gender stereotypes and higher parental aspirations. The gender-by-school-type interaction was only significant for concentrated effort, where the effect of being in a single-sex school was positive for boys, but non-significant for girls. All other 37 main effects of school-type and 38 school-type-by-gender interaction effects were statistically non-significant. In order to take account of potential interactions between gender and the covariates in predicting outcomes at the final year, we examined simple effects of school-type separately for girls and boys. In the girls’ sample, there was only one statistically significant effect of single-sex schools, on foreign language credits. For boys, attending a single-sex school had a positive effect on math, physical science, English and foreign language credits. In addition, boys attending single-sex schools reported a more internal locus of control, higher parental aspirations and more concentrated effort at school.
Although not a central focus of our study, there were also a large number of significant gender main effects. Boys had higher scores in: math, science, reading, vocabulary, math credits, physical science credits, math orientation, science concentration, academic credits, foreign language credits, work orientation, community orientation, physical self-concept, athletic self-concept, gender stereotypes and trouble at school. Girls had higher scores in: writing, self-reported grades, concentration in vocational studies, family orientation, academic self-concept, time spent on homework (also see Fig. 1, Panel b).
Optimal Full Matching
First, we summarize the balancing properties for the background variables and the base year outcomes; then we present outcome analyses based on the matched sample.
Balance and Overlap Assessment
Standardized mean differences and variance ratios of covariates after optimal full matching

[Table values not reproduced; the covariates assessed include the linearized propensity score, ethnicity (Black, Hispanic), number of repeated grades, college expectations in Grade 8, years in Catholic and in public school, home language (English), physical science, foreign language, and social science credits, locus of control, time spent on homework, TV hours per day, and trouble in school.]
Average Treatment Effects Based on Optimal Full Matching
Only one significant main effect of school-type and one school-type-by-gender interaction emerged from the analyses after full matching (evident also from Fig. 1, Panel c, where only one effect is shaded): Students in single-sex schools had more foreign language credits than students attending coeducational schools. All other school-type effects were not statistically significant when the total sample was analyzed. Again, the interaction for concentrated effort was statistically significant, indicating that single-sex schools led to higher concentrated effort for boys, but not for girls. When the samples for boys and girls were considered separately, only one simple effect for girls reached statistical significance: Girls in single-sex schools had more foreign language credits. For boys, the simple effects of school-type on foreign language credits, math pattern and concentrated effort were statistically significant at the 5 %-level. These findings indicated that boys in single-sex schools received more foreign language credits, took more advanced math courses and showed higher concentrated effort in school. The school-type effects after full matching are presented in Fig. 1, Panel c. The effect estimates based on full matching were highly correlated with the ANCOVA estimates (total sample: r = 0.877; girls: r = 0.886; boys: r = 0.905; for a more detailed comparison of the effect estimates from different methods see Supplemental Appendix A4). These large correlations – as well as our detailed evaluation of the results – indicate substantial agreement between these two approaches in this particular study.
Our study is a substantive-methodological synergy (Marsh and Hau 2007) revisiting classic, unresolved research issues with new and evolving methods in order to gain new insights. Specifically, we applied propensity score methods to an evaluation of the effects of single-sex schooling in the final 2 years of high school, revisiting one of the strongest datasets used to study these effects and addressing the lack of analyses using matching at the individual level that had hampered earlier research (see Mael et al. 2005). Substantively, we investigated a hotly debated issue with important policy implications.
Substantive and Methodological Strengths
In line with the findings of Marsh (1989a) we found almost no differences between single-sex and coeducational schools in the development of achievement and motivational outcomes during the final two years of high school. We applied new and evolving statistical methods to the strongest available nationally-representative dataset for this purpose, the HSB study. We showed that the descriptive (unadjusted) advantages for single-sex Catholic schools in the HSB study almost completely vanished once differences between students in Grade 10 were controlled. This lack of effects was largely consistent over gender and certainly did not show that girls were more advantaged by single-sex schooling than boys.
The pattern of effect estimates was very similar between the ANCOVA and the propensity score methods. This finding points to a possible equality of ATE (i.e., the effect of single-sex schooling on the whole population of students attending Catholic schools) and ATT (i.e. the effect of single-sex schooling on the subpopulation of Catholic school students that attended single-sex schools). Neither the students who attended single-sex schools, nor, hypothetically, the whole population of Catholic high school students, would have profited from attending single-sex schools as compared to coeducational schools during the final 2 years of high school. The convergence of results from propensity score and regression-based effect estimates is in some sense surprising, given that similar comparisons in relation to different substantive issues tended to show larger differences in effect estimates between ANCOVA and propensity score methods (e.g., Dehejia and Wahba 1999; LaLonde 1986; Schafer and Kang 2008; but also see Senn et al. 2007). One reason for the congruence of ANCOVA and propensity score estimates is that the data constellation was relatively benign for regression-based adjustment. In both the boys’ and the girls’ samples, the standardized mean differences on the covariates were not far above the threshold of 0.25, with similar sample sizes, reasonable overlap and similar variances fulfilling the criteria for regression-based adjustment to work (Rubin 2001; Rubin and Thomas 2000). It is also in line with findings from within-study comparisons (Pohl et al. 2009; Shadish et al. 2008) that demonstrate that both regression-based adjustment and propensity score methods can approximate findings from randomized experiments if the covariates that are controlled are theoretically related to selection into control and treatment groups and if the number of covariates controlled is sufficiently large. 
As both ANCOVA and propensity score approaches are based on the assumption of strong ignorability, it is not surprising that they yield similar results when this assumption is met.
However, the similarity of effect estimates should not detract from some of the theoretical advantages of propensity score methods. Only by using propensity scores and the mindset of separating the design of the observational study from the actual analyses of the outcome, was it possible to assess differences in the covariate distribution pre- and post-matching and corroborate the assumption of strong ignorability. Particularly in highly controversial and politicized research fields, such as the effects of single-sex schooling, being able to specify the design of the study without knowledge of the outcome variable is a key advantage of propensity score methods. Rubin (2001), in the context of tobacco litigation, another high-stakes policy and judicial area, argued that “the lack of availability of outcome data when designing experiments is a tremendous stimulus for ‘honesty’ in experiments and can be in well-designed observational studies as well” (p. 169). Only propensity score approaches offer the opportunity to assess two assumptions (balance and overlap) that underlie causal inferences. Research in single-sex and coeducational education should follow the example set by medical research and take full advantage of these methodological advances (see also Foster 2010).
Limitations and Directions for Further Research
Controlling for School-level Covariates
In contrast to some earlier studies of single-sex schooling (e.g. LePore and Warren 1997; Marsh 1989a), we did not control school-level covariates and contextual effects in our analyses. How can the effect estimates of single-sex schooling obtained by controlling only for individual-level covariates be interpreted? Here, a distinction made by Raudenbush and Willms (1995; see also Raudenbush 2004) in the context of school effectiveness research using potential outcome terminology is helpful. Raudenbush and Willms (1995) distinguished three factors that influence a student’s achievement in a particular school: student characteristics, school context and school practice. Based on this distinction, they derived two types of school effects: Type A effects are controlled only for student characteristics and are of interest for school choice when the outcomes of a student should be optimized by the school, with no regard as to what elements of the school (context or practice) are responsible for the outcome. Type B effects are controlled for school context as well and are important in accountability systems when schools are to be held accountable for factors under their control only, not for the context in which they are operating. As we did not include school-level covariates in our analyses, we obtained Type A effects of single-sex schooling that do not control for differences in school context between single-sex and coeducational schools.
In supplementary analyses, we tested the influence of six school-level covariates (enrolment, school-mean socio-economic status, the respective percentages of Black and Hispanic students, urbanicity, and community income) on the effect estimates to control for contextual effects. These analyses revealed that, although there were substantial differences between single-sex and coeducational schools for most of the school-level covariates, the ANCOVA estimates of school-type effects, with or without controlling for the school-level covariates, were very similar. The correlations between the effect estimates of the two sets of analyses were very high (total sample: r = 0.936; girls: r = 0.912; boys: r = 0.908). The statistical significance of specific effects sometimes changed between the two sets of analyses, but the overall pattern remained largely unchanged.
Including the school-level covariates in the propensity score analyses revealed a critical problem. Single-sex and coeducational schools differed substantially in urbanicity for both girls and boys (single-sex schools tended to be predominantly in urban and suburban areas) and in school-average socio-economic status for girls (single-sex schools were considerably lower). When the school-level variables were used to estimate the probabilities of attending a single-sex school, optimal full matching was unable to balance the covariate distributions between single-sex and coeducational schools for both school-level and student-level covariates, indicating that it was not possible to create comparable groups of students who attended single-sex and coeducational schools. This shows a particular strength of the propensity score approach. By explicitly testing the overlap and balance between single-sex and coeducational schools it was possible to uncover a problem that would have gone unnoticed with conventional ANCOVA models. Due to the small overlap in the distribution of some school-level covariates (urbanicity in particular) the ANCOVA estimates rely heavily on extrapolation when adjusting the effects of single-sex schooling. This is an important issue, as many previous analyses (e.g., Marsh 1989a) have included school-level covariates in their adjustment model and hence, potentially have heavily relied on extrapolation. Although there are good theoretical reasons for controlling school-level covariates when analyzing school-type effects (e.g., the desire to separate the effects of school-type from the effects of school context), our findings suggest that these goals might not be easily achievable in US studies where school context and school practice are highly confounded. 
The problems with the propensity score model that included school-level covariates indicated that trying to unconfound school-type (single-sex versus coeducational) from school context analytically was not possible. Hence, our findings generalize to the effects of Catholic single-sex schools in the U.S. on the development of their students in the final 2 years in high school and their specific context at the time of the HSB study.
The HSB study tracked the development of achievement and socio-cultural variables throughout the final 2 years of high school. Hence, our null findings for single-sex schooling are obviously limited to this crucial period. However, the focus on a short span of the total school career also allowed assessment of the effects of single-sex schooling in a representative sample of schools and students. To be better able to assess the effect of single-sex education, life course studies that would track students’ achievement and development over longer time periods are obvious options (see e.g. Sullivan 2009; Sullivan et al. 2010). However, while these studies track students’ performance over longer time periods and have the potential to assess the effects of different educational institutions at different points along the way, they suffer from potentially small and unrepresentative school samples. Also, the number of purely single-sex schools has been dramatically reduced over the last decades in most Western, developed countries as well as the US; hence, comparisons of school-type effects would be very difficult (as evidenced by the lack of overlap of school-level variables found in our sample). In our study we circumvented this problem by restricting our analyses to Catholic schools, using an implicit experimental blocking technique and unconfounding school-type effects from differences between the public and the Catholic school sector. This obviously restricts the generalizability of our findings to Catholic schools, as they were operational at the time of the HSB data collection. However, these limitations are hard to overcome, even with new data and study designs, as the number of single-sex schools has dropped significantly (LePore and Warren 1997), their recent surge in popularity (Bigler and Signorella 2011) notwithstanding.
One particular strength of the HSB database and the current investigation is the large sample size of single-sex schools and students, and the study design that assesses many important achievement and motivational outcomes on two occasions during high school. Marsh (1989a, see also Jencks 1985) argued that the only appropriate analyses for evaluating school-type effects with the HSB database would involve the Grade 12 outcomes, controlling for background variables and Grade 10 outcomes—the approach used here. Although easily the most defensible approach with these data, it is not without problems. First, the Grade 10 outcomes were obtained when students were already enrolled in high school (typically starting with Grade 9 in the U.S. school system) and, hence, do not represent true covariates that are prior to treatment assignment, but are themselves potentially affected by attending a single-sex or a coeducational school. Rosenbaum (1984) discussed issues involved in controlling for such “concomitant variables” (p. 656) for causal inference. He noted that the resulting effect estimates could only be interpreted as causal effects of the original treatment when the concomitant variables were surrogates for pre-test variables that were not affected by the treatment. Clearly, this assumption is not tenable for some of the variables in the present context; in particular, achievement-related and psychosocial measures are likely to have been affected by attending a specific type of high school. So the findings that we reported above cannot be interpreted as effects of single-sex schooling per se. Instead, they have to be interpreted as single-sex school effects during the final 2 years of high school—formative years with respect to career aspirations and further educational choices.
Although this restricts the scope of the current findings relative to an ideal study with covariates assessed prior to attending a specific school type, the alternatives would be even less palatable. Controlling only for background variables that were retrospectively assessed but reasonably stable across the study (such as socio-economic status or race) would render the effect estimates liable to bias due to unmeasured confounders such as prior achievement differences (see Lee and Bryk 1989; Marsh 1989a, b). Hence, although not optimal in an absolute sense, the effect estimates presented here are as close as one might hope to get to causal effects within the restrictions of the HSB study design. Even achievement data obtained before students entered high school (e.g., in 8th grade) would not allow us to single out single-sex school effects over and above the high school period, as students typically would have had different prior educational experiences (public/private, single-sex/coeducational schools).
Methods based on the propensity score offer clear criteria for testing their appropriateness (balance between treatment and control groups after matching, overlap of the propensity score distributions) that have no apparent equivalent in ANCOVA. A number of recent accessible reviews of the application of propensity scores in the social sciences offer guidance for applied researchers (Harder et al. 2010; Morgan and Winship 2007; Schafer and Kang 2008; Stuart 2010; Thoemmes and Kim 2011).
What practical advice can be given to researchers who aim to evaluate policy issues similar to the effects of single-sex schooling, in cases where selection into groups can only be observed and not experimentally manipulated? First, information on a large number of covariates needs to be collected. The research reported here, the theoretical role of strong ignorability, and results from within-study comparisons (Shadish et al. 2008; Pohl et al. 2009) underline the importance of assessing a large set of informative covariates that predict selection into the conditions to be compared (see Austin et al. 2007; Hill et al. 2011, for advice on selecting covariates and dealing with a high-dimensional set of covariates). In the HSB sample, we were able to control for a total of 45 covariates, including demographic background variables and, most importantly, pre-tests for most of the outcomes considered.
Second, the differences in distribution of the covariates between the conditions need to be carefully assessed, focusing on mean differences, variance ratios and potentially Q-Q plots (see Diamond and Sekhon 2006; Ho et al. 2007; Stuart 2010; also see supplemental materials). If the differences are small (see the rules in Harder et al. 2010; Rubin 2001, 2005; Rubin and Thomas 1996), analyzing the outcomes directly with ANCOVA is likely to lead to unbiased results. Otherwise, propensity score methods will be more appropriate.
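The two numeric balance diagnostics named above can be computed directly. The sketch below uses hypothetical covariate values, not HSB data; the 0.1 threshold for the standardized mean difference is one common rule of thumb (see Harder et al. 2010):

```python
from statistics import mean, variance

def standardized_mean_difference(treated, control):
    """Absolute mean difference scaled by the pooled standard deviation.

    Values below roughly 0.1 are often taken to indicate adequate
    balance on a covariate (Harder et al. 2010).
    """
    pooled_var = (variance(treated) + variance(control)) / 2
    return abs(mean(treated) - mean(control)) / pooled_var ** 0.5

def variance_ratio(treated, control):
    """Ratio of group variances; values near 1 indicate similar spread."""
    return variance(treated) / variance(control)

# Hypothetical covariate values for a treated and a control group
treated = [0.8, 1.1, 0.9, 1.3, 1.0, 1.2]
control = [0.5, 0.9, 0.7, 1.0, 0.6, 0.8]

smd = standardized_mean_difference(treated, control)
vr = variance_ratio(treated, control)
```

Here the variance ratio is close to 1, but the standardized mean difference far exceeds 0.1, so this (hypothetical) covariate would be flagged as imbalanced and a propensity score method would be preferred over plain ANCOVA.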
Third, care should be taken in specifying the propensity score model, using different specifications (and, if necessary, different estimation methods) to predict assignment to treatment and control conditions. In our application, a parsimonious linear logistic model with main effects was sufficient to balance the covariate distributions between single-sex and coeducational schools, but this might not always be the case (see e.g. Dehejia and Wahba 1999). Also, different methods for balancing the covariates should be compared with respect to their performance (see Stuart 2010, for guidelines on which matching method to choose in different situations). In our case, optimal full matching (Hansen 2004; Stuart and Green 2008) was most successful in balancing the covariate distributions, but results can differ in other applications. When setting up the model, researchers should take full advantage of exploratory modifications of the propensity-score model (Rubin 2001, 2005).
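A main-effects logistic propensity-score model of the kind described above can be sketched as follows. This is a minimal illustration on hypothetical data, fit by plain gradient ascent; in practice one would use a dedicated statistics package, and with many covariates a more careful optimizer:

```python
import math

def fit_propensity_model(X, z, lr=0.1, iters=5000):
    """Fit a main-effects logistic regression P(z=1 | x) by gradient ascent.

    X: list of covariate rows; z: 0/1 treatment indicators.
    Returns the intercept and the coefficient vector.
    """
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(iters):
        g0, g = 0.0, [0.0] * p
        for xi, zi in zip(X, z):
            eta = b0 + sum(bj * xij for bj, xij in zip(b, xi))
            pi = 1.0 / (1.0 + math.exp(-eta))   # predicted P(treated)
            r = zi - pi                          # gradient contribution
            g0 += r
            for j in range(p):
                g[j] += r * xi[j]
        b0 += lr * g0 / n
        for j in range(p):
            b[j] += lr * g[j] / n
    return b0, b

def propensity_scores(X, b0, b):
    """Estimated probability of treatment for each covariate row."""
    return [1.0 / (1.0 + math.exp(-(b0 + sum(bj * xij
                                             for bj, xij in zip(b, xi)))))
            for xi in X]

# Hypothetical data: one covariate; treated units tend to have higher values
X = [[0.2], [0.4], [0.6], [0.8], [1.0], [1.2]]
z = [0, 0, 0, 1, 1, 1]
b0, b = fit_propensity_model(X, z)
scores = propensity_scores(X, b0, b)
```

The estimated scores then feed the matching step; their distributions in the two groups should also be inspected for overlap before any matching is attempted.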
Finally, once the observed covariates are balanced, the outcomes of interest can be compared across the two matched samples and effect estimates can be calculated. Here, we relied on simple mean comparisons within gender groups, but more complex models including covariates to increase power and control for residual confounding might be useful as well (Ho et al. 2007; Kang and Schafer 2007).
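The simple mean comparison over matched strata described above can be sketched as follows. The strata and outcome values are hypothetical; weighting each stratum's mean difference by its number of treated units yields an effect estimate for the treated group, in the spirit of full matching:

```python
def matched_effect_estimate(strata):
    """Weighted mean-difference estimate across matched strata.

    strata: list of (treated_outcomes, control_outcomes) pairs, as
    produced by a matching procedure such as optimal full matching.
    Each stratum's treated-minus-control mean difference is weighted
    by its number of treated units.
    """
    total_treated = sum(len(t) for t, _ in strata)
    effect = 0.0
    for treated, control in strata:
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        effect += diff * len(treated) / total_treated
    return effect

# Hypothetical matched strata (outcome scores, not HSB data)
strata = [
    ([52.0, 54.0], [51.0]),        # two treated matched to one control
    ([49.0], [48.0, 50.0, 49.0]),  # one treated matched to three controls
]
est = matched_effect_estimate(strata)
```

Regression adjustment within the matched sample (Ho et al. 2007) would replace the simple means here with covariate-adjusted predictions, trading simplicity for added power and protection against residual confounding.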
Conclusion and Implications
We believe that propensity score methods will play an increasingly important role in the evaluation of educational interventions and policies (see Thoemmes and Kim 2011, for a review) as they offer a principled way of evaluating effects obtained from quasi-experimental and observational studies. Many educational interventions, particularly when implemented on a large scale, are not amenable to the “gold standard” (e.g., Rubin 2008, p. 1350) for assessing causal effects, the randomized experiment. Propensity score methods are the strongest and most objective alternative when experimental designs are not feasible (Shadish and Cook 2009). When applied skilfully, they promise to offer new insights into classic, unresolved research questions, leading to methodological-substantive synergies (Marsh and Hau 2007) such as that reported in this paper.
How do our findings relate to the broader picture on single-sex schooling effects? Given the prevailing evidence of mostly minor and at best subtle effects of single-sex education on educational and psychosocial outcomes, summarized in the reviews by Mael et al. (2005) and Smithers and Robinson (2006), the finding of essentially no discernible effects during the last 2 years of high school should not come as a surprise, even though we used a historical database that limits the temporal generalizability of the findings. In fact, even researchers who reported positive effects of single-sex schooling in previous research seem to have backtracked from earlier strong statements. For example, Lee (1998) stated that “the research on single-sex schooling . . . should not be interpreted as favoring the separation of girls and boys for their education” (p. 50). Indeed, whether a school is single-sex or coeducational seems to be relatively unimportant in terms of its effectiveness in producing educational outcomes, compared to more proximal teaching-related variables, a fate shared with other distal variables related to school organisation and structure (Hattie 2008).
This research was supported in part by grants to the second author from the UK Economic and Social Research Council and the King Saud University in Saudi Arabia. Requests for further information about this investigation should be sent to Benjamin Nagengast, Department of Education, Center for Educational Science and Psychology, University of Tübingen, Europastr. 6, 72072 Tübingen, Germany; E-mail: email@example.com