Background

The International English Language Testing System (IELTS) is an admission requirement for immigration and for education abroad, and it focuses on language use in social and academic contexts (Nakatsuhara et al. 2017; Phakiti 2016). In design terms, the IELTS listening comprehension test (LCT) is intensive, i.e., the recording is played only once, and it follows a read-listen-write format (Field 2005). This format is demanding because learners must attend simultaneously to three skills, namely listening, reading, and writing, which places a heavy information-processing load on them and under-represents the IELTS listening construct (Aryadoust 2012).

To date, a considerable body of research has examined listening in a second or foreign language (Alavi and Janbaz 2014; Bodie and Worthington 2010; Harding et al. 2015; Kimura 2016; Kök 2017; Roussel et al. 2017; Vandergrift 1997, 2006, 2007). Research on IELTS in particular has grown steadily since the IELTS research program started in 1995, with more than 110 empirical studies funded so far (Nakatsuhara et al. 2017). However, research on IELTS listening remains scarce, and, as far as we are aware, only a few studies (e.g., Aryadoust 2011, 2012, 2013; Badger and Yan 2006; Field 2005; Harding et al. 2015; Phakiti 2016; Winke and Lim 2014) have addressed the IELTS LCT.

The current study addresses the (construct) validity of the IELTS LCT; construct validity is a crucial element in language testing and in large-scale public tests (Cronbach and Meehl 1955; Kane 2013, 2016). Over the years, several distinguished language testing scholars (Kane 2013, 2016; Messick 1974, 1986, 1995, 1996; Newton and Shaw 2015; Newton and Baird 2016; Sireci 2017) have accepted the evidence-based definition of validity as a unitary concept (Messick 1974): validity refers to the degree to which evidence and theory support the meaningfulness, usefulness, and appropriateness of the inferences and decisions made on the basis of a test. However, judging whether the construct measured by a test (the IELTS LCT, for example) is valid requires a validation procedure and multiple sources of evidence. Researchers recognize various types of evidence, such as test content, response processes, internal structure, relations to other variables, and test consequences (Sireci 2017). To this end, this study first examined the construct validity of the IELTS LCT using structural equation modeling (SEM), i.e., confirmatory factor analysis (CFA) with LISREL software; then, in phase 2, it assessed differential item functioning (DIF) using cognitive diagnostic modeling (CDM) and the Mantel-Haenszel (MH) method. Two methods were used for item analysis in order to place more confidence in the accuracy of the DIF findings.

Literature review

SEM and factor analysis

Factor analysis is a multivariate technique (Sawaki 2012; Schmitt 2011) used to interpret a large number of correlations (Field 2009; Khine 2013); it is a statistical method for testing and estimating the relations (Alavi and Ghaemi 2011; Ockey and Choi 2015) among a group of variables in order to gain insight into the underlying causal processes (In'nami and Koizumi 2011; Kunnan 1994, 1998).

A review of the literature reveals numerous studies conducted with SEM (Alavi and Ghaemi 2011; Alavi et al. 2011b; Cai 2013; Carr 2006; Phakiti 2008; Sawaki et al. 2009; Schoonen 2005; Song 2008). However, very few studies have examined the IELTS LCT with SEM. A recent SEM study on IELTS listening was conducted by Phakiti (2016); his findings suggested complex structural relationships among test-takers' confidence, calibration, trait, strategy use, IELTS listening difficulty, and performance on IELTS listening. Another study on IELTS listening was carried out by Field (2005), who explored the cognitive validity of lecture-based questions in the IELTS LCT; his findings support the cognitive validity of the test. In the same vein, Badger and Yan (2006) investigated IELTS listening strategies, and their findings also supported the construct validity of IELTS listening.

Differential item functioning

DIF exists when different groups of learners have different probabilities of successfully answering an item (Drabinova and Martinkova 2017; Ferne and Rupp 2007; Li and Wang 2015); therefore, if test takers have more or less the same knowledge, they should perform similarly on test items. DIF analysis is thus needed to support test validity and test fairness (Fidalgo et al. 2014; Hou et al. 2014; Pae 2004, 2012; Su and Wang 2005; Zumbo 2003, 2007).

A review of the literature indicates an abundance of DIF studies covering various DIF-related factors, such as gender (e.g., Abbott 2006; Amirian et al. 2014; Aryadoust 2012; Li and Suen 2013; Pae 2012; Rezaee and Shabani 2010; Song et al. 2015), age (e.g., Geranpayeh and Kunnan 2007), academic background (e.g., Alavi et al. 2011a; Pae 2004), text familiarity (e.g., Ahmadi and Jalili 2014), field of study (Barati et al. 2006), and language background (e.g., Harding 2011; Kim 2001; Kim and Jang 2009). As noted in the introduction, among all these studies, the only one directly addressing DIF in the IELTS LCT, as far as we are aware, was conducted by Aryadoust (2012); his research indicates some construct under-representation in the IELTS LCT. The present study therefore continues this line of DIF detection.

Two DIF-detection methods: MH and CDM

The MH statistic (Mantel and Haenszel 1959) is one of the most widely used procedures for DIF detection: it is relatively easy to calculate, does not require large sample sizes, includes a test of statistical significance, and reports an effect size (Monahan and Ankenmann 2005, 2010; Su and Wang 2005). The Mantel-Haenszel statistic compares item performance across groups by matching examinees of similar proficiency levels instead of comparing overall group performance on an item (Michaelides 2008). It needs to be acknowledged, however, that MH does not behave optimally in all situations, which might lead to errors in DIF detection (Guilera et al. 2013).
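To make the logic concrete, the following minimal Python sketch (our own illustrative code, not the software used in this study) stratifies examinees by total score, builds a 2 × 2 table per stratum, and returns the MH common odds ratio together with the ETS delta metric often reported as an effect size; all names are hypothetical.

```python
import numpy as np

def mantel_haenszel_dif(item, total, group):
    """Illustrative MH DIF check for one dichotomous item.

    item  : 0/1 responses to the studied item
    total : matching criterion, e.g., total test score
    group : 0 = reference group, 1 = focal group
    """
    item, total, group = map(np.asarray, (item, total, group))
    num = den = 0.0
    for s in np.unique(total):                 # one 2x2 table per score stratum
        m = total == s
        a = np.sum((group[m] == 0) & (item[m] == 1))   # reference, correct
        b = np.sum((group[m] == 0) & (item[m] == 0))   # reference, incorrect
        c = np.sum((group[m] == 1) & (item[m] == 1))   # focal, correct
        d = np.sum((group[m] == 1) & (item[m] == 0))   # focal, incorrect
        t = a + b + c + d
        if t > 0:
            num += a * d / t
            den += b * c / t
    alpha_mh = num / den                        # MH common odds ratio
    delta_mh = -2.35 * np.log(alpha_mh)         # ETS delta metric (effect size)
    return alpha_mh, delta_mh
```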

Another method is CDM, a family of psychometric models developed for assessing examinees' mastery and non-mastery of skills or attributes (Chen et al. 2013; de la Torre 2011). Various kinds of CDMs are in use, such as the deterministic inputs, noisy "and" gate model (DINA; Junker and Sijtsma 2001) and the deterministic inputs, noisy "or" gate model (DINO; Templin and Henson 2006). As for the significance of CDM, George and Robitzsch (2014) recommend it as one of the more recent statistical tools for detecting DIF, and many psychometric questions related to DIF can be addressed with CDM (Hou et al. 2014). To date, only a few studies have assessed DIF within the CDM framework (Drabinova and Martinkova 2017; Li and Wang 2015; Hou et al. 2014; Li 2008; Zhang 2006). Although an extensive body of research has addressed the cognitive diagnosis of students' learning (Li and Wang 2015; de la Torre 2011; de la Torre and Douglas 2004; Junker and Sijtsma 2001), no study has so far detected DIF in the IELTS LCT with CDMs, even though some researchers (e.g., George and Robitzsch 2014) suggest using CDM for DIF detection.
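In the DINA model, for instance, an examinee is expected to answer an item correctly only after mastering every attribute that the item's Q-matrix row requires, with slip and guess parameters absorbing the remaining noise. The short Python sketch below illustrates that item response function; it is a didactic example under our own naming, not the CDM software used later in this study.

```python
import numpy as np

def dina_prob_correct(alpha, q_row, slip, guess):
    """DINA item response probability (Junker and Sijtsma 2001).

    alpha : 0/1 vector of the examinee's attribute mastery
    q_row : 0/1 Q-matrix row listing the attributes the item requires
    slip  : probability of slipping despite mastery of all required attributes
    guess : probability of guessing correctly without full mastery
    """
    alpha, q_row = np.asarray(alpha), np.asarray(q_row)
    eta = int(np.all(alpha[q_row == 1] == 1))   # 1 only if all required attributes are mastered
    return (1 - slip) ** eta * guess ** (1 - eta)

# An examinee who has mastered both required attributes answers with P = 1 - slip:
print(dina_prob_correct([1, 1, 0], [1, 1, 0], slip=0.1, guess=0.2))   # 0.9
```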

Research questions

Based on the review of literature, this study investigates the following research questions:

RQ1: Does the factor structure of IELTS LCT reflect the design of the test in terms of task types, i.e., gap filling, diagram labeling, multiple choice, and short answer?

RQ2: Does group membership (gender) introduce any bias into the participants' performance on the items of the IELTS LCT, as investigated with the Mantel-Haenszel (MH) method?

RQ3: Does group membership (gender) introduce any bias into the participants' performance on the items of the IELTS LCT, as investigated with cognitive diagnostic modeling (CDM)?

Method

Participants and context

The study was carried out at various English language institutes in Iran that mainly administer monthly IELTS mock tests to prospective IELTS candidates. The participants in both phases of the present study were attending IELTS preparation courses. As for sample size, it determines the quality of an SEM study (Ockey and Choi 2015); the recommended minimum sample size for SEM is 100 to 150 subjects (Ding et al. 1995; Khine 2013) and the recommended maximum is 400 subjects (Boomsma 1987), or, alternatively, 5-10 subjects for every item or variable (Bentler and Chou 1987), which for the 40-item test used here implies 200 to 400 participants.

Therefore, in this study, 480 participants took a proficiency test adopted from Cambridge IELTS books; the performances of 17 participants were excluded because they introduced extraneous variance. Thus, an adequate sample of 463 participants (Table 1) took part in the study. They had studied English (with the ultimate goal of passing IELTS) for approximately 4 years and shared the same cultural, societal, native-language, and educational background. The researchers strove to obtain access to real IELTS LCT data; however, this was not possible for confidentiality reasons associated with the IELTS organization. As such, the participants took the test under IELTS mock-test conditions. In addition, 18 teachers (8 female and 10 male), all majoring in ELT and teaching IELTS preparation courses, took part in the study; five held a B.A. in English language, nine held an M.A. in TEFL, and four were PhD candidates in TEFL.

Table 1 Demographic data of participants

Materials

Two IELTS LCTs adopted from IELTS test books (Cambridge IELTS 2016, 2017) were used: a proficiency test and a main test; the first served to establish proficiency and the second was used to probe the (construct) validity of the IELTS LCT. Each IELTS LCT was played for 30 min, and the participants were given 10 min to transfer their answers to the answer sheet (IELTS Handbook 2007). In addition, a summarized handout adapted from Tips for IELTS (McCarter 2006), Action Plan for IELTS (Jakeman and McDowell 2006), and Step Up to IELTS (Jakeman and McDowell 2004) was used for strategy instruction. These techniques and strategies were taught in five sessions, in context and with the related listening subsections of the IELTS LCT. The IELTS LCT demands its own strategies (McCarter 2006; London Teacher Training College 2005), testwiseness maximizes performance on a test (Rogers and Yang 1996), and the questions on real IELTS tests are susceptible to testwiseness strategies; this instruction was therefore intended to keep the test takers' condition in the mock-test setting approximately similar to that of real test takers on the real test. The specifications of the main test appear below (Table 2).

Table 2 Specifications of the main test

Procedures and data analysis

First, the participants signed a consent form for participation in the study. Then, a proficiency test, i.e., an IELTS LCT, was administered and the reliability of the measurement tool was investigated; that is, we computed Cronbach's alpha for the IELTS LCT, which reached 0.66 (0.73 for the males and 0.58 for the females). Of course, all 13 volumes of the Cambridge IELTS books carry the phrase "authentic papers," which was the main impetus for this investigation.
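As a point of reference, Cronbach's alpha can be reproduced directly from a 0/1 item-score matrix; the Python sketch below shows the standard formula (the variable names and the per-gender split are illustrative, not the actual analysis files used in the study).

```python
import numpy as np

def cronbach_alpha(scores):
    """scores: (n_examinees, n_items) matrix of 0/1 item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# e.g., overall and by gender, assuming `responses` and a 0/1 `is_male` vector exist:
# alpha_all   = cronbach_alpha(responses)
# alpha_men   = cronbach_alpha(responses[is_male == 1])
# alpha_women = cronbach_alpha(responses[is_male == 0])
```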

Next, the teachers taught the strategies and, finally, the main test was administered. We ran confirmatory factor analysis in LISREL to probe the construct validity of the test, and we analyzed the data for gender-related DIF using MH and CDM.

Results and discussion

Phase 1

Data analysis showed that the participants' performances on the proficiency test (M = 22.94, SD = 4.62) and on the main test (M = 23.34, SD = 5.06) were approximately the same (Tables 3 and 4). As Table 3 shows, 480 participants took the proficiency test, and the performances of 463 participants (Table 4) on the items of the main test were analyzed for confirmatory factor analysis and item bias.

Table 3 Descriptive statistics for the proficiency test
Table 4 Descriptive statistics for performance on the main test

If the absolute values of the skewness and kurtosis statistics are lower than 2, the univariate normality of the items is met (Bae and Bachman 2010). As is evident in Table 5, this assumption was met. Also, the Mardia coefficient of multivariate normality (−6.31) was lower than the critical value of 1680, so the assumption of multivariate normality was also met. The critical value was calculated with the formula p × (p + 2) = 40 × (40 + 2) = 1680 (Khine 2013), where p stands for the number of observed variables, which was 40 in this study.
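Both checks in this paragraph are easy to reproduce; the sketch below (illustrative Python, assuming an item-response matrix passed in as `responses`) applies the |skewness| and |kurtosis| < 2 rule of thumb and computes the p × (p + 2) critical value against which the Mardia coefficient is compared.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def univariate_normality_ok(responses, cutoff=2.0):
    """Bae and Bachman's (2010) rule of thumb: |skewness| and |kurtosis| < 2 for every item."""
    sk = skew(responses, axis=0)
    ku = kurtosis(responses, axis=0)      # excess kurtosis (Fisher's definition)
    return bool(np.all(np.abs(sk) < cutoff) and np.all(np.abs(ku) < cutoff))

p = 40                                    # number of observed variables (items)
mardia_critical = p * (p + 2)             # 40 * 42 = 1680; a Mardia coefficient below this passes
print(mardia_critical)
```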

Table 5 Tests of univariate and multivariate normality

Figure 1 displays the 40 items (shown in squares) of the IELTS LCT. Four subsets of items, i.e., gap filling (GF), diagram labelling (DL), multiple choice (MC), and short answer (SA), measure four latent variables (the four ovals), which in turn measure total IELTS LCT performance (the oval labelled listen). Based on the statistical analysis outlined in Fig. 1 and Table 6, among the 14 GF items, eight (items 4 to 10 and 12) were significant, i.e., their loadings were at least .30. Five of the six DL items (items 16 to 20) were above .30. Eight MC items (items 21, 23, and 25 to 30) were significant. Finally, just three SA items (items 31, 32, and 37) were significant. The four latent variables of gap filling (b = 1.10), diagram labelling (b = .43), multiple choice (b = .60), and short answer (b = .91) all contributed significantly to the total IELTS LCT (Fig. 1).
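The hypothesized higher-order structure in Fig. 1 can be written out as a measurement model. The sketch below expresses it in lavaan-style syntax and fits it under two assumptions of ours: that the Python package semopy (not LISREL, which the study itself used) is available and accepts this syntax, and that the items are stored under hypothetical column names item1 to item40 grouped as described above.

```python
import pandas as pd
import semopy  # assumption: semopy is installed and accepts lavaan-style model syntax

# Hypothetical item grouping mirroring Fig. 1: GF = items 1-14, DL = 15-20, MC = 21-30, SA = 31-40
model_desc = """
GF =~ {}
DL =~ {}
MC =~ {}
SA =~ {}
listen =~ GF + DL + MC + SA
""".format(
    " + ".join(f"item{i}" for i in range(1, 15)),
    " + ".join(f"item{i}" for i in range(15, 21)),
    " + ".join(f"item{i}" for i in range(21, 31)),
    " + ".join(f"item{i}" for i in range(31, 41)),
)

data = pd.read_csv("responses.csv")    # hypothetical file of 0/1 item responses
model = semopy.Model(model_desc)
model.fit(data)
print(model.inspect())                 # factor loadings
print(semopy.calc_stats(model))        # chi-square, RMSEA, CFI, and other fit indices
```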

Fig. 1 Model for factor structure of IELTS listening

Table 6 Standardized regression weights of IELTS listening comprehension test

As seen in Tables 6 and 7, the chi-square result (χ2 (736) = 1226.49, p < .001) indicated a poor fit of the model. However, chi-square is sensitive to sample size (Hooper et al. 2008), which is why its ratio to the degrees of freedom (1226.49/736 = 1.66) should be consulted. Since this ratio is lower than 3, it can be concluded that the overall model enjoys a good fit.

Table 7 Model fit indices

As Table 7 reveals, the root mean square error of approximation (RMSEA) of .038 with its 90% confidence interval [90% CI (.034, .042)], both lower than .05, as well as the closeness-of-fit statistic (PCLOSE), which was higher than .50, supported the fit of the model. Further evidence confirming the fit of the model comes from the non-normed fit index (NNFI = .91), comparative fit index (CFI = .92), incremental fit index (IFI = .92), and goodness of fit index (GFI = .90), all of which were equal to or higher than .90. Also, the critical N (CN = 312.97), which was higher than 200, indicated the sampling adequacy of the model and further supported its fit.
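As a quick arithmetic check on the reported values, the normed chi-square and the RMSEA can be recomputed from the chi-square, the degrees of freedom, and the sample size, assuming the conventional RMSEA formula; the snippet below reproduces the RMSEA of about .038 reported in Table 7.

```python
import math

chi2, df, n = 1226.49, 736, 463           # chi-square and df from Tables 6-7, plus the sample size

normed_chi2 = chi2 / df                   # ratio of chi-square to degrees of freedom (cut-off: 3)
rmsea = math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))   # about .038, matching Table 7
print(round(normed_chi2, 2), round(rmsea, 3))
```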

Phase 2

Differential item functioning with use of Mantel-Haenszel method

Mantel-Haenszel's method (Table 8) flagged 15 significant DIF items, of which eight (items 4, 8, 10, and 16 to 20) showed large effect sizes. The effect-size benchmarks for Mantel-Haenszel DIF are weak = 0, moderate = 1, and large = 1.5 (Mantel and Haenszel 1959).

Table 8 Differential item functioning through Mantel-Haenszel’s method

Differential item functioning with use of CDM

As seen in Table 9, the results of the chi-square tests identified 12 DIF items (items 2, 4, 5, 8 to 10, 14, 18, 20, 21, 26, 33, and 36). One of the main features of CDM DIF is that the original p values (fourth column) are recalculated using Holm's adjustment, in which the p values are penalized for multiple pair-wise comparisons. The Holm-adjusted p value is computed with the following formula (Wright 1992, p. 1008): p-Holm = 1 − (1 − p)^K.

Table 9 Differential item functioning through CDM

In this formula, K stands for the number of items minus the number of comparisons ranked above the current one; K is 40 for the first item, 39 for the second, 38 for the third, and, finally, 1 for the last item. Based on Holm's adjusted p values, six DIF items (2, 8, 10, 14, 18, and 20) were flagged.
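For clarity, the adjustment can be applied as in the Python sketch below, which implements the formula exactly as described here (sorting the raw p values and decreasing K from the number of items down to 1); the step-down monotonicity enforcement used by some implementations is deliberately omitted, and the function name and input are illustrative.

```python
import numpy as np

def holm_adjust(p_values):
    """Adjust raw DIF p values with p_holm = 1 - (1 - p) ** K (Wright 1992).

    K equals the number of items minus the number of comparisons ranked above
    the current one: 40 for the smallest raw p, then 39, ..., down to 1.
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                  # smallest raw p value first
    adjusted = np.empty_like(p)
    for rank, idx in enumerate(order):
        k = m - rank                       # 40, 39, ..., 1 for a 40-item test
        adjusted[idx] = 1 - (1 - p[idx]) ** k
    return adjusted
```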

The unsigned area (UA) is the effect size for CDM DIF, with three bands (weak = lower than .059; moderate = .059 to .088; large = higher than .088) (George and Robitzsch 2014, p. 414). The results identified 13 items with large effect sizes (items 2, 4, 5, 8, 10, 11, 14, 18, 20, 21, 26, 30, and 36). A summary of the findings appears in Table 10.

Table 10 Comparison of DIF detection methods

Discussion

The purpose of the present study was to investigate the validity of the IELTS LCT. In this validation context, the hypothesized variables were associated with a construct or test method (gap filling, multiple choice, diagram labelling, and short answer); the researchers first hypothesized a model and then examined whether it was supported by the present sample. The overall model fit in phase 1 of the study provided evidence of construct validity for the IELTS LCT, as the hypothesized SEM model enjoyed a good fit. The analysis thus indicates that the individual items served as valid indicators of their assumed factors or constructs, i.e., gap filling, diagram labelling, multiple choice, and short answer.

The findings of the first phase of the study are consistent with those of Phakiti (2016), Badger and Yan (2006), and Zhang (2015), which provide some positive evidence in support of the construct validity of the IELTS LCT; the statistical support for construct validity is in keeping with the statement that there can be no validity without construct validity (Messick 1974, 1986). However, to argue for the validity of a test, we need rich evidence (Kane 2016; Messick 1974, 1986, 1995, 1996; Sireci 2017), for example, differential item functioning, consistency of measurement, response processes, internal structure, content, context, test consequences, and cognitive data. Therefore, the fact that our study supported the construct validity of the test does not mean that the test is fully valid, as no test is inherently valid or invalid (Sireci 2017), and rarely will it be possible for a test to fully capture a definite construct (Cronbach and Meehl 1955; Messick 1986); in other words, construct-related evidence is not the whole of validity (Messick 1974, 1986), so no single piece of evidence for probing construct validity is sufficient on its own. Clearly, given the challenging nature of validity, the IELTS LCT, as a global test with both macro- and micro-level impact, needs to be viewed and investigated in light of multiple sources of evidence. Accordingly, investigating the validity of the IELTS LCT with reference to DIF was also required, which is why phase 2 of the study provided another piece of evidence.

Based on some evidence, the IELTS LCT suffers from some degree of invalidity (Aryadoust 2012). Along the same lines, in our study the Mantel-Haenszel method detected 15 DIF items, while CDM flagged at most 12 and at least 6 DIF items (Tables 9 and 10). A closer look at the sub-sections reveals that all six items of diagram labelling were flagged by MH, whereas only two diagram-labelling items were detected as DIF by CDM; this difference in the number of DIF items detected by the two methods calls for some reflection. Similarly, on gap filling, seven DIF items (half of the items) were detected by MH and five by CDM.

The findings of phase 2 are consistent with Aryadoust's (2012) findings; his research revealed that the first construct in the test was under-represented, and construct under-representation is apparent in gap filling and diagram labelling in our study too. Moreover, gap filling (with 14 items) and diagram labelling (with 6 items) are both sub-tests of the social dimension of the IELTS LCT, yet the numbers of items are not equally distributed (Table 2), as the former has more items than the latter. Given all this, it seems that some unwanted or construct-irrelevant variance may interfere with sections 1 and 2 of the IELTS LCT; these sections therefore need further investigation. As for the analysis of item bias, item-internal evidence for probing validity is likewise not sufficient on its own. Therefore, as Cronbach and Meehl (1955) stated, the stability of test scores, i.e., measurement consistency, is related to construct validation and, together with other cognitive and contextual evidence, can inform any decision about the validity of the IELTS LCT.

DIF items can threaten the validity of the IELTS LCT; however, there is effect-size-based evidence that DIF is not equivalent to bias, and DIF is unavoidable in international tests such as IELTS (Le 2006). Not all cases of DIF necessarily have to be interpreted as item bias (Tatsuoka et al. 1988); the effect size of a DIF item should be consulted before a final decision on improvement, revision, or removal. Based on the findings from the DIF analysis (Tables 9 and 10), we do not claim that the DIF items detected in phase 2 of the study severely pollute the IELTS LCT, because larger claims require further study; nor do we suggest generalizing the findings beyond this context, as the study was conducted in an Iranian EFL context, where language learners have little (or no) exposure to listening input in a social context or in the state school setting; they learn English mainly at private institutes, and they receive a very restricted amount of live audio and visual input from mass media owing to educational policy and governmental decisions in the Iranian EFL setting.

As for the construct validity of IELTS listening, the recording is played just once (Field 2005), and the candidates must pay simultaneous attention to three skills, listening, reading, and writing, because of the read-listen-write design. As Aryadoust (2012) maintains, if test takers draw on skills such as reading or writing beyond the intended skill, this might pollute the use and interpretation of IELTS listening scores (the latter gloss is ours). Likewise, some (or most) real-world characteristics of listening are missing from the IELTS LCT: in real life, listeners perceive the message through scaffolding elements such as lip-reading, facial expressions, body language, gestures, and posture. Their absence can under-represent the IELTS listening construct (Aryadoust 2012). This creates a further reservation and a motivation for continued investigation into the IELTS LCT.

Overall, the IELTS LCT seems to be a good indicator of listening proficiency as assessed by the University of Cambridge. On a more impressionistic note, the 15 years of IELTS teaching experience of the third author of this paper may provoke some thought: judging by the performance of hundreds of IELTS candidates on IELTS preparation courses, candidates who obtain a band score of 6 or 6.5 are able to communicate easily and meet their academic and social needs. This suggests a close correspondence between the IELTS listening construct and the demands of the real world. IELTS therefore seems to be an effective assessment tool; since it has global impact, nothing should be taken for granted and more research should be conducted on it. Of course, as mentioned, our findings, other researchers' findings, and the third author's impressionistic and personal judgment all need to be further investigated and thoroughly documented. Nevertheless, the findings of our study call into question Pilcher and Richards' (2017) tone regarding the power of IELTS; their strong claim is that the power of IELTS needs to be challenged. In contrast, our findings indicate that IELTS needs further investigation; its invalid sub-parts and sub-constructs need to be improved and revised, and, if necessary, removed or replaced, rather than challenged.

Conclusion

In terms of implications, the findings of the study can be thought-provoking; they can motivate researchers, materials developers, IELTS listening test designers, and curriculum designers to be more aware of the psychological construct underlying the test. Since the IELTS LCT is an example of a public test used to make crucial decisions about huge numbers of people around the globe, its washback effect and consequential validity must be taken into account.

In conclusion, owing to its international nature and worldwide evaluative role, IELTS needs a stable factor structure that is invariant across populations and cultures. Naturally, a test that is highly valid in one context might suffer from some degree of invalidity with respect to related constructs in another context. With this in mind, our perspective in this research should not be taken as a one-size-fits-all model, and no generalization or strong claim is made on the basis of the present study. The study is limited in scope because the test takers were not real IELTS test takers and were not drawn from a very large international population. Further research should draw on larger samples across worldwide educational and cultural contexts, as other evidence is needed to warrant further examination of the validity of the IELTS LCT.