Background

Understanding what language ability is and what it takes to learn and use language is crucial to language testing and pedagogy (Alderson and Banerjee 2002). Motivated by such enquiries, recent years have witnessed a profusion of studies into the structure of second language (L2) ability through analyzing language test data, most of which employed confirmatory factor analysis (CFA) as their analytic methodology (e.g., In’nami and Koizumi 2011; Sawaki et al. 2009; Shin 2005). While most previous factor analytic studies suggest a hierarchical structure of L2 ability (e.g., In’nami et al. 2016; Llosa 2007; Sawaki et al. 2009), other studies have suggested the plausibility of alternative models such as a non-hierarchical multi-componential model (e.g., Gu 2014; In’nami and Koizumi 2011; Kunnan 1995). Despite these differing construct representations, the results of these studies repeatedly suggest that language ability is a multidimensional construct, thus rejecting the strong form of the Unitary Trait Hypothesis, which argues for the indivisibility of language ability (Oller 1976). Although the multidimensional nature of language ability has been widely recognized, researchers have not reached consensus regarding the nature of its constituents and/or the manner in which they interact. Hence, continued research on the nature of language ability is warranted to examine the internal structure of language tests developed for different purposes, of different stakes, and in different contexts (Gu 2014).

Despite the growing body of research on the internal structure of language tests, few studies have taken the further step of modeling the relationship between test takers’ performance on language tests and their ability to use the language in target language use (TLU) domains. Admittedly, the internal structure of language tests assists in understanding the nature of language proficiency. However, modeling the relationship between test performance and language use is arguably more crucial to the validity argument for a language test, as it extrapolates test performance to real-life language tasks within the same domain (e.g., Chapelle et al. 2008). As such, the present study set out to examine the internal structure of the Fudan English Test (FET), a high-stakes university-level English proficiency test in China, and the relationship between test takers’ performance and their ability to use language in TLU domains. In this study, test takers’ language use in TLU domains was captured by their responses to a self-assessment (SA) inventory, which was modeled on the local English teaching syllabus and the FET test specifications, and subsequently validated through Rasch measurement theory (Fan 2016).

Although SA has traditionally been used in formative assessment contexts as an alternative to teacher assessment, it can also be justified as a measure of language use in TLU domains. A typical challenge language test developers face is the collection and examination of predictive validity evidence for a test, i.e., the extent to which test scores predict future or subsequent language performance in TLU domains. More specifically, the challenge lies in determining what constitutes a good measure of language use in TLU domains. While most predictive validity studies have used academic performance or success as the validity criterion, these measures have been criticized as being incongruent with the target construct (e.g., Cho and Bridgeman 2012). Although self-assessment has been questioned in terms of its reliability and validity (e.g., Blanche and Merino 1989), it constitutes a logistically viable and justifiable alternative for measuring real-life language use in TLU domains for the following reasons. First, because of the possibility of constant self-reflection, human beings can achieve a better understanding of the full spectrum of their own language performance in real-life contexts than external evaluators can (Powers and Powers 2015). Second, research has shown acceptable predictive accuracy of SA in different settings (see Ross 1998 for a meta-analysis of the predictive accuracy of SA in L2 research). Third, construct-irrelevant factors that are assumed to affect the reliability and validity of SA can be mitigated through a host of measures, such as using achievement-based statements rather than abstract, proficiency-based ones, providing rich contextual information about the tasks being self-assessed, and, above all, improving the content validity of the SA inventory (e.g., Butler and Lee 2006; Suzuki 2015).

While previous studies have used SA for test validation purposes (e.g., Enright et al. 2008; Powers et al. 2003), most of these studies examined the correlations between test takers’ self-assessments and their test performance. A number of researchers, however, have expressed the view that correlation is inherently difficult to interpret, even for experienced social scientists (e.g., Sackett et al. 2008). Furthermore, to the best of our knowledge, no previous study has modeled the relationships between test takers’ performance and their self-assessments at the latent variable level. To fill this gap, this study utilized structural equation modeling (SEM), a comprehensive statistical methodology that “takes a confirmatory (i.e., hypothesis-testing) approach” to the analysis of a structural theory bearing on some phenomenon (Byrne 2006, p. 3), to test theoretical hypotheses about the relationships among observed and latent variables. In this study, we first used CFA to investigate the internal structure of the FET and the SA inventory by modeling test takers’ test and SA responses. We then used a structural regression model to investigate the relationship between test takers’ performance on the FET and their ability to use language in TLU domains, as captured by their responses to the SA inventory.

The Present Study

The Fudan English Test

In China, the College English Test (CET) has been recognized as a reliable and valid instrument for assessing university students’ English language proficiency and achievement (Zheng and Cheng 2008). However, in recent years the CET has come under heavy criticism from some educators and researchers for its test format (e.g., heavy reliance on multiple-choice questions), the lack of alignment between the CET and the teaching curriculum developed within any particular university, and its rather negative washback effect on English teaching and learning at the tertiary level. In response to these criticisms, some first-tier universities in China have developed local English language tests in the hope of addressing the deficiencies of the CET and better aligning English testing, teaching, and learning within those university settings. It is against this backdrop that the FET was implemented in 2011, following a number of trials and pilot studies (Fan and Ji 2014).

The FET is developed by the College English Center of Fudan University (FDU), one of the first-tier research universities in China, and administered once a year by the university’s Academic Affairs Office (AAO) to non-English-major undergraduates. Modeled on the English Teaching Syllabus at FDU, the FET is aimed at assessing students’ English proficiency in the four modalities of listening, writing, reading, and speaking (FDU Testing Team 2014). The listening part of the FET comprises four sections: spot dictation, conversations, news reports, and academic lectures; the writing part comprises two sections: essay writing and practical writing; the reading part includes the reading comprehension of five passages; and the speaking part consists of three tasks: responding to questions, short comment, and picture/graph description and comment. Given that the teaching syllabus clearly specifies the learning objectives, those who have passed the FET are expected to be able to perform the language activities reflected in the teaching syllabus.

Since September 2011, the AAO has mandated that all undergraduate students pass the FET in order to graduate from FDU. Although repeated attempts are allowed, failure to pass the FET before graduation may result in delay or suspension of the conferment of a bachelor’s degree. Because of these policies, the FET can be regarded as a high-stakes English test. The stakes associated with the FET require that it be validated on a continuous basis, and some validation research has been undertaken on the FET since its inception. Fan and Ji (2014), for example, examined the validity of the FET by surveying students’ views on the quality of the test; the results indicated the generally satisfactory face validity of the FET. In another study, Fan and Bond (2016) interrogated the validity of the analytic rating scale used in the FET speaking part and concluded that, overall, the rating scale functioned as intended, thereby lending support to the validity of the FET speaking part.

Research Questions

The present study intends to contribute to the validity narrative of the FET through modeling students’ performance on the FET and examining its relationship to their ability to use language in real-life situations. To achieve these objectives, the following three research questions were investigated:

RQ1 What is the factor structure of the FET? Is this factor structure consistent with the theoretical configuration of the constructs as defined and operationalized in the FET specifications?

RQ2 What is the factor structure of the SA inventory used in this study? Is the factor structure consistent with relevant theories of language ability?

RQ3 To what extent can students’ performance on the FET predict their ability to use language in the TLU domains?

Method

Data

The data in the present study were obtained from 4162 undergraduate students who took the FET in 2014, consisting of 1985 (47.7%) males and 2177 (52.3%) females. Of the 4162 participants, 244 completed an SA inventory a week before the FET. This SA sample consisted of 74 (30.3%) males and 170 (69.7%) females. At the time of the investigation, 69 (28.3%) students were in their first year at the university, 133 (54.5%) in their second year, and the remaining 42 (17.2%) in their third year. It should be noted that, compared with the test population in 2014 (N = 4162), the SA sample (n = 244) had higher English proficiency, as indicated by the observed mean scores on the test (sample: M = 66.72, SD = 11.12; population: µ = 61.62, σ = 11.74). An independent samples t test showed that the difference was statistically significant (t = −6.94, df = 275.72, p < 0.05, d = 0.45). This difference needs to be considered when generalizing the findings of this study to the test population.
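
For illustration, the comparison reported above can be reproduced from the summary statistics alone. The sketch below (our own, not the authors’ analysis script) computes an unequal-variance (Welch-type) t statistic, which the fractional degrees of freedom reported above suggest was used, together with Cohen’s d based on a pooled standard deviation; the helper names are ours.

```python
# Reproducing the sample-vs-population comparison from the reported summary
# statistics (a sketch; not the authors' original script).
from math import sqrt
from scipy import stats

def welch_t_from_summary(m1, s1, n1, m2, s2, n2):
    """Welch's t, degrees of freedom, and two-tailed p from group summaries."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d using a pooled standard deviation."""
    pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled

# SA sample vs. 2014 test population (values taken from the text)
t, df, p = welch_t_from_summary(66.72, 11.12, 244, 61.62, 11.74, 4162)
d = cohens_d(66.72, 11.12, 244, 61.62, 11.74, 4162)
print(f"t = {t:.2f}, df = {df:.2f}, p = {p:.4f}, d = {d:.2f}")
# yields t ~ 6.94, df ~ 275.7, d ~ 0.44, closely matching the reported values
```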

The test data were section-level raw scores. In accordance with the FET test structure, the data consisted of raw scores on four sections in the listening part (i.e., spot dictation, conversation, news report, and lecture), two sections in the writing part (i.e., essay writing and practical writing), five sections in the reading part (i.e., five reading comprehension passages), and three sections in the speaking part (i.e., responding to questions, short comment, and picture/graph description and comment). In addition to the test data, students’ SA data were collected through an SA inventory. The SA inventory was written in Chinese, the participants’ first language (see “Appendix 1: English Translation of Can-Do Statements in the SA Inventory” section for its English translation). After the inventory was generated, it was subjected to iterative content reviews by language teaching and testing experts, English teachers, and students. A total of 26 can-do statements in the form of a five-point Likert scale survived this a priori validation process, covering a range of content and difficulty levels across the four language modalities of listening (n = 6), reading (n = 7), writing (n = 7), and speaking (n = 6).

All can-do statements in the inventory were modeled on the learning objectives in the teaching syllabus and designed to capture students’ language use in TLU domains. For example, students were asked to self-assess their ability to “understand the details of English lectures which do not involve much subject knowledge” (see “Appendix 1: English Translation of Can-Do Statements in the SA Inventory” section). This ability was included in both the learning objectives and the FET test specifications, and is considered an essential language skill for students’ real-life language use. Following the content reviews, Rasch measurement theory was applied to examine the measurement properties of the SA inventory (Fan 2016). This a posteriori validation study indicated that all statements in the inventory fit the strict expectations of the Rasch model, and the inventory was therefore deemed ready for use in the present study. Following the validation of the SA inventory, the SA data were collected with the assistance of the English teachers in December 2014. All students participated in this study on a voluntary basis and provided written consent. Since students’ perceived difficulty of the test and their performance on it might affect their SA results on certain tasks (e.g., Powers and Powers 2015), the SA inventory was administered to the participants one week before they took the FET.
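
Although the Rasch analyses belong to the earlier validation study (Fan 2016), the infit and outfit mean-square statistics referred to there (and again in the Discussion) can be summarized in a few lines. The sketch below is a generic illustration under the assumption that expected scores and response variances have already been obtained from a fitted rating-scale model; it is not the analysis reported in Fan (2016).

```python
# Generic sketch of Rasch infit/outfit mean-square fit statistics for a single
# item, given observed responses plus the expected scores and response
# variances produced by a fitted rating-scale (Rasch) model.
import numpy as np

def item_fit_mean_squares(observed, expected, variance):
    """observed, expected, variance: 1-D arrays over persons for one item."""
    resid_sq = (np.asarray(observed) - np.asarray(expected)) ** 2
    outfit = np.mean(resid_sq / np.asarray(variance))    # unweighted, outlier-sensitive
    infit = resid_sq.sum() / np.asarray(variance).sum()  # information-weighted
    return infit, outfit

# Values between roughly 0.6 and 1.4 are commonly treated as acceptable for
# rating-scale items (Bond and Fox 2015).
```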

Analysis Procedures

The analyses in this study were performed using EQS 6.3 (Bentler and Wu 2005) and followed three steps. First, CFA was utilized to model the test data in order to investigate the factor structure of the FET (N = 4162). Second, following procedures similar to those in the first step, CFA was performed on the item-level data of the SA inventory (n = 244). Third, structural regression modeling was utilized to examine the relationship between the test data and the SA data simultaneously (n = 244).

To address RQ1, three plausible models were hypothesized to capture the internal structure of the FET based on a review of previous research (e.g., Gu 2014; In’nami et al. 2016; Kunnan 1995; Sang et al. 1986; Sawaki et al. 2009), as well as the constructs claimed to be assessed in the FET test specifications (FDU Testing Team 2014). A correlated four-factor model (Fig. 1) was specified wherein the four skill factors correlated with each other. The configuration of this model is congruent with the section structure of the FET and the results of some previous studies (e.g., Kunnan 1995; Sang et al. 1986). A higher-order factor model (Fig. 2) consisted of four first-order skill factors whose correlations were explained by a higher-order factor representing general EFL ability. This model reflects the current score-reporting practice of the FET, which consists of both a composite score and profile scores on the four subskills. This model has also, as explained earlier, garnered extensive support from previous research (e.g., In’nami et al. 2016; Llosa 2007; Sawaki et al. 2009). A correlated two-factor model (Fig. 3) consisted of a speaking factor and a factor associated with listening, writing, and reading. This model is consistent with the current administrative procedures of the FET, wherein the written test is administered separately from the speaking test. Moreover, this model has been found to exhibit satisfactory fit to other language tests (e.g., Gu 2014). Given that the SA inventory was designed in accordance with the test specifications, the same three models were hypothesized to represent its factor structure and were assessed against the SA data to respond to RQ2. To address RQ3, a structural regression model was built wherein the latent factors in the SA model were regressed on the latent factors in the test model.

Fig. 1 Correlated four-factor model

Fig. 2 Higher-order factor model

Fig. 3 Correlated two-factor model
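
To make the three competing structures concrete, the sketch below specifies them in lavaan-style syntax as accepted by the Python package semopy. This is our illustration rather than the authors’ EQS 6.3 setup, and the observed-variable names (L1–L4, W1–W2, R1–R5, S1–S3) are hypothetical placeholders for the fourteen section-level scores described in the Method.

```python
# Illustrative specification of the three competing CFA models in lavaan-style
# syntax, fitted here with the Python package semopy (the original analyses
# used EQS 6.3). Observed-variable names are hypothetical placeholders.
import pandas as pd
from semopy import Model, calc_stats

measurement = """
# L1-L4: spot dictation, conversation, news report, lecture
Listening =~ L1 + L2 + L3 + L4
# W1-W2: essay writing, practical writing
Writing =~ W1 + W2
# R1-R5: five reading comprehension passages
Reading =~ R1 + R2 + R3 + R4 + R5
# S1-S3: responding to questions, short comment, picture/graph description
Speaking =~ S1 + S2 + S3
"""

# Model 1: the four first-order skill factors covary freely.
correlated_four_factor = measurement + """
Listening ~~ Writing
Listening ~~ Reading
Listening ~~ Speaking
Writing ~~ Reading
Writing ~~ Speaking
Reading ~~ Speaking
"""

# Model 2: a second-order general EFL factor accounts for the factor correlations.
higher_order = measurement + "EFL =~ Listening + Writing + Reading + Speaking\n"

# Model 3: a speaking factor and a single factor covering the written-test sections.
correlated_two_factor = """
Written =~ L1 + L2 + L3 + L4 + W1 + W2 + R1 + R2 + R3 + R4 + R5
Speaking =~ S1 + S2 + S3
Written ~~ Speaking
"""

def fit_and_report(description, data):
    model = Model(description)
    model.fit(data)              # maximum likelihood estimation by default
    return calc_stats(model)     # chi-square, CFI, TLI, RMSEA, and other indices

# Usage, assuming `scores` is a DataFrame holding the 14 section-level scores:
# scores = pd.read_csv("fet_section_scores.csv")
# for desc in (correlated_four_factor, higher_order, correlated_two_factor):
#     print(fit_and_report(desc, scores))
```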

The appropriateness and adequacy of the models were assessed based on three criteria: (1) values of selected global model fit indices; (2) individual parameter estimates; and (3) the principle of model parsimony (Byrne 2006). The global model fit indices and cut-off values used in this study were selected based on Ockey and Choi (2015): values of 0.90 or above for the comparative fit index (CFI), the normed fit index (NFI), and the Tucker-Lewis index (TLI); 0.08 or below for the root mean square error of approximation (RMSEA); and 0.08 or below for the standardized root mean square residual (SRMR). A chi-square difference test was used to compare nested models, and the results were always evaluated in conjunction with the global model fit criteria explained above (i.e., CFI, NFI, TLI, RMSEA, and SRMR). Individual parameter estimates were examined for appropriateness and significance. Moreover, the principle of parsimony favored a simpler model over a more saturated one if the two models fit equivalently (Sawaki et al. 2009).
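
Two of these criteria can be expressed compactly in code. Because Satorra–Bentler scaled χ² values cannot simply be subtracted, nested models are conventionally compared with the scaled difference test of Satorra and Bentler (2001); the sketch below implements that computation along with the fit-index cut-offs listed above (the function names are ours, and the scaling corrections must be taken from the software output).

```python
# Sketch of the Satorra-Bentler (2001) scaled chi-square difference test for
# nested models, together with the global fit cut-offs adopted in this study.
from scipy.stats import chi2

def sb_scaled_chisq_diff(t0, df0, c0, t1, df1, c1):
    """Scaled difference test for nested models.

    t0, df0, c0: S-B chi-square, degrees of freedom, and scaling correction of
    the more restricted model; t1, df1, c1: the same for the comparison model.
    The scaling correction c is (normal-theory ML chi-square) / (S-B chi-square).
    """
    d_df = df0 - df1
    cd = (df0 * c0 - df1 * c1) / d_df       # scaling factor for the difference
    t_diff = (t0 * c0 - t1 * c1) / cd       # scaled chi-square difference
    return t_diff, d_df, chi2.sf(t_diff, d_df)

def acceptable_global_fit(cfi, nfi, tli, rmsea, srmr):
    """Cut-off values used in this study (after Ockey and Choi 2015)."""
    return (cfi >= 0.90 and nfi >= 0.90 and tli >= 0.90
            and rmsea <= 0.08 and srmr <= 0.08)
```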

Results

Preliminary Analyses

As preliminary analyses, the univariate normality of the score distributions was assessed using skewness and kurtosis values. Descriptive statistics of all observed variables in the SEM models indicated that all skewness and kurtosis values were within ±3.30 (the z score at p < 0.01), suggesting univariate normality of the data (e.g., Tabachnick and Fidell 2007). Multivariate normality was assessed using Mardia’s normalized estimate, with values of 5.00 or below considered to indicate multivariate normality (e.g., Byrne 2006). Mardia’s normalized estimates exceeded the criterion of 5.00 in all cases, suggesting multivariate non-normality of the data. Therefore, a corrected normal theory estimation method, the Satorra–Bentler (S–B) estimation, was employed in EQS to correct global fit indices and standard errors for non-normality (Satorra and Bentler 2001). Furthermore, correlations among all pairs of variables were moderate, with no extremely high coefficients observed, suggesting no violation of linearity (Byrne 2006).
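
The normality screening described above follows standard formulas; as an illustration (our own sketch, not the EQS output), univariate skewness/kurtosis z values and Mardia’s normalized multivariate kurtosis estimate can be computed for an n × p data matrix as follows.

```python
# Sketch of the univariate and multivariate normality checks described above,
# for an (n, p) NumPy array X of observed scores.
import numpy as np
from scipy import stats

def univariate_normality_ok(X, crit=3.30):
    """True for columns whose skewness and kurtosis z values fall within +/-crit."""
    n = X.shape[0]
    skew_z = stats.skew(X, axis=0) / np.sqrt(6.0 / n)        # approximate z for skewness
    kurt_z = stats.kurtosis(X, axis=0) / np.sqrt(24.0 / n)   # approximate z for excess kurtosis
    return (np.abs(skew_z) <= crit) & (np.abs(kurt_z) <= crit)

def mardia_normalized_kurtosis(X):
    """Normalized estimate of Mardia's multivariate kurtosis (comparable to EQS output)."""
    n, p = X.shape
    centred = X - X.mean(axis=0)
    s_inv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))
    d2 = np.einsum("ij,jk,ik->i", centred, s_inv, centred)   # squared Mahalanobis distances
    b2p = np.mean(d2 ** 2)                                   # Mardia's multivariate kurtosis
    expected = p * (p + 2)
    return (b2p - expected) / np.sqrt(8.0 * p * (p + 2) / n) # large values => non-normality
```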

Modeling Test Data

To respond to RQ1, the three competing models were assessed against the test data. Table 1 shows the global model fit indices for the three models. Overall, all three models demonstrated acceptable fit statistics; however, the correlated four-factor model and the higher-order factor model fit the data more satisfactorily. Although the S–B χ² statistics were significant for both models (p < 0.05), the other fit indices suggested a satisfactory fit (e.g., the correlated four-factor model: CFI = 0.958, RMSEA = 0.049 [90% confidence interval: 0.046, 0.052], SRMR = 0.030; the higher-order factor model: CFI = 0.950, RMSEA = 0.052 [90% confidence interval: 0.049, 0.055], SRMR = 0.037). Because the two models demonstrated comparably satisfactory fit, the higher-order factor model, being more parsimonious, was selected as the better-fitting of the two. An examination of the parameter estimates in this model indicated that all parameters were significant (p < 0.05). Next, this model was cross-validated on the test data from the subsample who completed the SA inventory (n = 244). The global fit indices suggested that this model also fit these test data satisfactorily (S–B χ² = 124.01, df = 73, CFI = 0.954, RMSEA = 0.051 [90% confidence interval: 0.033, 0.057], SRMR = 0.053). Given its satisfactory fit, this model was accepted as the best-fitting representation of the factor structure of the FET (see “Appendix 2” section for the factor loadings in this model).

Table 1 Global fit indices for the three models (test data, N = 4162)

Modeling SA Data

When modeling the SA data, similar analysis procedures were followed. The global fit indices of the three competing models are presented in Table 2. Similar to the findings derived from modeling the test data, both the correlated four-factor model and the higher-order factor model exhibited acceptable fit to the data (e.g., the correlated four-factor model: CFI = 0.916, RMSEA = 0.069 [90% confidence interval: 0.061, 0.076], SRMR = 0.054; the higher-order factor model: CFI = 0.914, RMSEA = 0.069 [90% confidence interval: 0.062, 0.076], SRMR = 0.057). The correlated two-factor model, however, exhibited poor fit to the data (e.g., CFI = 0.813, RMSEA = 0.101 [90% confidence interval: 0.095, 0.108], SRMR = 0.073) and was therefore rejected and eliminated from further analysis. Although both the correlated four-factor model and the higher-order factor model fit the SA data acceptably well, the latter was accepted as the best-fitting model due to its comparative parsimony. All parameter estimates in this model were significant (p < 0.05). This model was therefore used for the structural regression analysis (see “Appendix 3” section for the factor loadings of this model).

Table 2 Global fit indices for the three models (SA data, n = 244)

Structural Regression Analysis

To model the relationship between test performance and language use (RQ3), a structural regression model was built wherein the higher-order factor in the SA model was regressed on the higher-order factor in the test model (see Fig. 4). This model exhibited satisfactory fit to the data (e.g., S–B χ² = 1229.22, df = 731, CFI = 0.911, RMSEA = 0.053 [90% confidence interval: 0.048, 0.058], SRMR = 0.061). All parameters in this model were significant (p < 0.05); the standardized parameter estimates are displayed in Fig. 4.
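
For transparency, the structural part of this model amounts to a single regression between the two second-order factors. A minimal sketch in lavaan-style syntax (again using semopy rather than the authors’ EQS setup, with hypothetical placeholder names for the test sections and SA items) is given below.

```python
# Sketch of the structural regression model: FET sections and SA items each
# load on four first-order factors, each set of first-order factors loads on a
# second-order factor, and the second-order SA factor is regressed on the
# second-order EFL (test) factor. Variable names are hypothetical placeholders.
from semopy import Model

structural_regression = """
# FET measurement model (14 section-level scores)
Listening =~ L1 + L2 + L3 + L4
Writing =~ W1 + W2
Reading =~ R1 + R2 + R3 + R4 + R5
Speaking =~ S1 + S2 + S3
EFL =~ Listening + Writing + Reading + Speaking

# SA measurement model (26 can-do statements)
SAListening =~ SAL1 + SAL2 + SAL3 + SAL4 + SAL5 + SAL6
SAReading =~ SAR1 + SAR2 + SAR3 + SAR4 + SAR5 + SAR6 + SAR7
SAWriting =~ SAW1 + SAW2 + SAW3 + SAW4 + SAW5 + SAW6 + SAW7
SASpeaking =~ SAS1 + SAS2 + SAS3 + SAS4 + SAS5 + SAS6
SA =~ SAListening + SAReading + SAWriting + SASpeaking

# Structural part: language use in TLU domains predicted by test performance
SA ~ EFL
"""

# model = Model(structural_regression)
# model.fit(combined_data)               # the 244 cases with both test and SA data
# print(model.inspect(std_est=True))     # parameter estimates (standardized)
```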

Fig. 4 Structural regression model with standardized parameter estimates

As shown in Fig. 4, the first-order factor loadings in both the test and the SA model were substantial (all >0.4), suggesting strong linear relationships between the first-order factors and the observed variables. Regarding the higher-order factor loadings in Fig. 4, all first-order factors in both the test and the SA model had high loadings, ranging from 0.72 to 0.91. This supports the presence of a common underlying dimension strongly related to the four subskill factors in each model. The path coefficient from “EFL” to “SA” was 0.52, indicating that a change of one standard deviation in students’ latent test performance was associated with a change of 0.52 standard deviations in the latent SA factor. Meanwhile, the standardized residual term for the latent SA factor was 0.85, indicating that a substantial proportion of the variance of this factor was not explained by the predictor in the model.

Discussion and Conclusion

This study examined the factor structure of a high-stakes English proficiency test as well as its relationship to an SA inventory designed to capture test takers’ ability to use language in TLU domains. RQ1 addressed the factorial configuration of the constructs measured by the FET. Three competing models were hypothesized to represent the factor structure of the test. The selection of the higher-order factor model as the best-fitting model indicates the tenability of a hierarchical structure of language proficiency and the existence of a second-order factor representing general EFL ability (see also Sawaki et al. 2009). This finding lends empirical support to the test developer’s claim that the test assesses students’ general English proficiency, which is further divisible into the four subskills of listening, reading, writing, and speaking (FDU Testing Team 2014). Moreover, this finding supports the current score-reporting practice of the FET, which includes both a composite score and profile scores on the four subskills. The selection of the higher-order factor model concurs with numerous previous studies that investigated the factor structure of language tests (e.g., In’nami et al. 2016; Llosa 2007; Sawaki et al. 2009). The empirical support for the higher-order factor model, based on data from a local high-stakes English test, further confirms the hierarchical nature of L2 ability, which cannot be explained by the unitary trait structure postulated by the Unitary Trait Hypothesis (Oller 1976).

RQ2 addressed the internal factorial structure of the SA inventory developed for this study. Similar to the findings derived from modeling test data, the higher-order factor model was selected as the best-fitting model due to its satisfactory fit and comparative parsimony. The reasonably satisfactory fit of the correlated four-factor model is consistent with Enright et al. (2008), who found that the four subskills were all distinct in the factor structure of the SA inventory used in their validation research. The finding of this study also resonates with that of Bachman and Palmer (1989), suggesting that self-ratings of language ability have a similar factorial configuration to language tests.

Following CFA analyses, the relationship between test performance and language use in TLU domains (RQ3) was examined through a structural regression model. The higher-order factor in the test model, which represents students’ general EFL ability, was found to have a moderately strong relationship to the higher-order factor in the SA model. The path coefficient found in this study resonates with previous research using self-assessments in language test validation and seems to confirm Ross’s (1998) endorsement of using a well-crafted and validated SA inventory as an external criterion measure. For example, Enright et al. (2008) discovered moderately strong correlations between the prototype of TOEFL iBT and test takers’ self-assessments in the four subskills (r ranging from 0.30 to 0.62). Similarly, Powers and Powers (2015) observed similar correlation coefficients between the measures in the TOEIC test and test takers’ self-assessments in the four subskills (r ranging from 0.34 to 0.51). Despite the similar findings, the SEM approach that this study adopted was more rigorous as it examined the relationships at the latent variable level and analyzed all the variables related to test performance and language use simultaneously in one model. Hence, the moderately strong relationship between the latent EFL ability factor and the latent SA factor provides supportive evidence for the predictive validity of the FET, suggesting that test takers who have better test performance tend to demonstrate higher ability to use language in TLU domains.

However, it should be noted that the standardized residual of the latent SA factor was large (0.85), suggesting that a substantial proportion of the factor variance of SA was not explained by the predictor in the model. A plausible explanation lies in the characteristics of the sample used in this study. Previous research into the use of SA in language learning suggests that learner characteristics such as proficiency level and experiential factors might come into play when students self-assess their language abilities (e.g., Blanche and Merino 1989; Oscarson 2013). Due to the convenience sampling approach, the SA sample in this study had higher language proficiency than the test population. Furthermore, at the time of the investigation, most participants were in the first or second year of their undergraduate study and thus may not have had sufficient experience with the relevant language activities to reliably self-assess their ability to use English in the corresponding TLU domains. A closer look at the residual variances of individual statements in the SA inventory appears to confirm that the relatively high residual variance was caused by participants’ lack of direct experience with some activities in the SA inventory. For example, the high residual variances of “SAR3” and “SAS4” (0.78 and 0.75; see Fig. 4) may be attributable to such a lack of experience: these two items asked students to self-assess their ability to read academic books or articles (SAR3) and to conduct basic English–Chinese interpretation (SAS4), which are arguably beyond their current repertoire of language use. This interpretation also corroborates previous Rasch analyses of the item-level data of the SA inventory (Fan 2016), which found that these two items only marginally fit within the Infit Mean Square range of 0.6–1.4 suggested by Bond and Fox (2015). Future research, therefore, may consider including variables such as language proficiency and experiential factors in the full SEM model to more accurately uncover the relationships between test and SA data.

Implications and Limitations

The study reported herein has three main implications. First, the emergence of a hierarchical structure of L2 ability from a locally developed English proficiency test adds empirical support to the current view of the multidimensional nature of L2 ability. The presence of a higher-order factor and four distinct first-order factors is also in line with the current score-reporting practice of the FET, which provides both a composite score and a profile score on each of the four subskills. This finding should facilitate test score interpretation and use, which is crucial for a high-stakes language test (e.g., Sawaki et al. 2009). Second, this study demonstrated the usefulness of a well-crafted and validated SA inventory in test validation research; it is essential to note, however, that a thorough validation of the SA inventory is critical before it is used as an external criterion measure (Ross 1998). Third, unlike previous studies that relied on correlational analyses, this study adopted the SEM approach to model the relationships among multiple variables of interest at the latent variable level simultaneously in one model. By so doing, it provides a statistically more rigorous and theoretically viable approach to examining the nature of, and relationships among, the constructs underlying standardized tests and SA.

That said, this study also has a few limitations. First, the unavailability of item-level test data precluded an examination of whether the section-level parcels satisfied the unidimensionality principle. Such item-level factor analysis is important because datasets can yield different models depending on whether they are analyzed at the item level or the subskill level (In’nami and Koizumi 2011). Second, due to practical constraints, convenience sampling, rather than strictly stratified sampling, was adopted in this study. As noted earlier, the sample in this study represented a slightly higher proficiency level than did the test population. A more rigorous sampling approach could be employed in future research, and variables such as test takers’ proficiency level and experience with the linguistic activities in the SA inventory could be included in future SEM analyses. Finally, although this study demonstrates that a well-crafted and validated SA inventory can be used as a justifiable criterion measure to validate language tests, we concur with Powers and Powers (2015) that multiple measures (e.g., direct observations, teachers’ evaluations) should be used and triangulated to derive a more comprehensive representation of test takers’ real-life language use.