Empirical validity evidence
The first element of empirical validity evidence was constructed around the internal structure of the MBPA, which was examined through a psychometric analysis [based on classical test theory (CTT)] of the students' answers. We decided to use CTT rather than item response theory (IRT) for two reasons. First, the number of participants was too low to perform IRT analyses. Second, we believe that the goal of validation can also be reached using CTT, as the CTT indices (e.g., reliability and item indices) provide enough information about the quality of a test. As mentioned earlier, the assessment was composed of 35 items, and students received one point for each correct answer. The mean score on the test was 22.5 (σ = 3.44), with a 95% confidence interval [21.6, 23.6], which indicates that, compared to the PBA (for which students on average received more than 17 out of 19 points), the test was more difficult for the students. The maximum score (obtained by two students) was 30, whereas the minimum score was 14 (N = 1). The standard deviation is rather low, which means that most students achieved a score close to the mean. The average time that students needed to complete the assessment was 29 min (σ = 8). The shortest time spent on the assessment was 19 min and the longest was 58 min. The high standard deviation and the wide range between minimum and maximum indicate considerable variation in the time students spent on the assessment. Table 1 provides the mean, standard deviation, and confidence intervals for the MBPA.
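As an illustration, descriptive statistics of this kind can be reproduced from a vector of total scores. The following sketch uses simulated data with the reported mean and standard deviation, since the raw scores are not available; the sample size of 55 is inferred from the degrees of freedom reported further on (t(53), χ² with N = 55).

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the 55 MBPA total scores (raw data not available);
# the real analysis would use the observed score vector.
rng = np.random.default_rng(0)
scores = np.clip(np.round(rng.normal(22.5, 3.44, 55)), 0, 35)

mean, sd = scores.mean(), scores.std(ddof=1)
se = sd / np.sqrt(len(scores))                      # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=se)
print(f"mean = {mean:.1f}, sd = {sd:.2f}, 95% CI = [{lo:.1f}, {hi:.1f}]")
```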
Table 2 provides other characteristics of the MBPA. The variance of the scores is relatively high (11.9). The distribution of the scores is not skewed (0.014), but the kurtosis is high (0.488). This indicates that a relatively large portion of the variance is caused by extreme values on both sides of the distribution, with most students' scores clustered around the mean. In addition, the standard error of measurement (SEM), which was calculated by multiplying the standard deviation of the assessment scores by the square root of 1 minus the reliability, is 0.83. The SEM is defined as the standard deviation of a hypothetical normal distribution of scores that would result from many administrations of the same test to the same student. In other words, it represents the possible distance between the observed score and the true score in a CTT context. For the MBPA, the SEM was relatively low, which means that the students' true scores were relatively close to their observed scores. To be precise, 95% of the scores on the hypothetical distribution for a student with a mean score of 22.5 fall between 20.8 and 24.2. The reliability of the MBPA is high, with a Greatest Lower Bound (GLB) of 0.94. The high reliability is, of course, consistent with the low SEM. We report the GLB because it is the index automatically calculated by the software used (TiaPlus; Cito 2006) and because this index is regarded as one of the most accurate estimates of the reliability of a test (Verhelst 2000; Sijtsma 2009).
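In formula form, using the reported standard deviation and GLB reliability (which matches the reported SEM of 0.83 up to rounding):

```latex
\mathrm{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}} = 3.44 \times \sqrt{1 - 0.94} \approx 0.84,
\qquad
22.5 \pm 1.96 \times \mathrm{SEM} \approx [20.8,\ 24.2].
```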
To establish further support for the internal structure of the MBPA, we also looked at CTT indices for the individual tasks and assignments in the MBPA. Table 3 displays the CTT indices for the 35 items in the test. The second and third columns indicate which part of the content was assessed with the item and which item format was used. The p value of an item is the proportion of students who answered the item correctly. That is, a high p value is associated with a relatively easy item, whereas a low p value points to an item that was difficult for this group of students.
The GLB reliability index is presented for the MBPA. This was not possible for the other measures because of their small number of items; for these measures, Cronbach's alpha is presented. We cannot report the reliability of the multiple-choice test, because its items are randomly selected from the item bank and, therefore, each student received a different test.
The mean p value of the test was 0.62, which indicates that the assessment was rather difficult for this group of students. The rit, the point-biserial correlation between the item score and the total test score (with the item score included in the total), gives an indication of the discriminative power of an item. The higher the value, the better the item discriminates between good and poor performers. A negative rit means that students who score high on the overall test score low on the item, whereas students who score low on the overall test score high on the item; conversely, a high rit means that good performers do well on the item and poor performers do worse. The mean rit of the MBPA is 0.22. There is reason to believe that the mean rit could be improved, as the quality of the individual items in the MBPA fluctuates (see Table 3). To summarize, we have provided evidence supporting the internal structure of the MBPA: although there is room for improvement, all indices fall within acceptable levels.
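A minimal sketch of this item analysis, again on simulated 0/1 item scores (the actual analysis was run in TiaPlus on the observed responses):

```python
import numpy as np

# Simulated 0/1 item-score matrix: 55 students (rows) x 35 items (columns).
rng = np.random.default_rng(1)
items = (rng.random((55, 35)) < 0.62).astype(int)

p_values = items.mean(axis=0)          # item difficulty: proportion correct
total = items.sum(axis=1)              # total test score per student

# rit: point-biserial correlation of each item with the total score,
# with the item itself included in the total, as described above.
rit = np.array([np.corrcoef(items[:, j], total)[0, 1]
                for j in range(items.shape[1])])

print(f"mean p = {p_values.mean():.2f}, mean rit = {rit.mean():.2f}")
```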
The next element of empirical validity evidence is used to support the external structure of the MBPA, in particular its convergent validity (based on the PBA scores) and its discriminant validity (based on the questionnaire and the multiple-choice knowledge test). In other words, the MBPA scores are expected to correlate with the PBA scores, but not with the questionnaire and to a lesser extent with the multiple-choice test. Although the multiple-choice test also measures aspects of the CSG construct, it does so on a different level. The correlations (Spearman's rho) between the measures of the experiment are presented in Table 4. Spearman's rho is used because there is a monotonic relationship between the variables and because the measures do not meet the assumptions of normality and linearity. For example, on the 19-point rubric, most students score 17–19 of the criteria as correct. It is, therefore, better to look at the rank order of the scores on the different measures than at the linear correlation. As can be seen, the correlations between the MBPA and the rubrics used in the performance assessment are 0.37 (p < 0.01) and 0.38 (p < 0.01) for the 19-point rubric and the 12-point rubric, respectively; both are significant. We also combined students' scores on both rubrics into a total rubric score. The correlation between the total rubric score and the MBPA score is also significant (rs = 0.43, p < 0.01). We then applied a correction for attenuation and found that the correlations improve considerably (to 0.92, 0.59, and 0.70, respectively). This indicates that the correlations are strongly diluted by measurement error. Given the low reliability established for the performance-based assessment, this increase was to be expected.
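The correction for attenuation divides the observed correlation by the square root of the product of the two instruments' reliabilities. As an illustration for the total rubric score, with the MBPA's GLB of 0.94 and a rubric reliability of roughly 0.40 (an assumption back-solved from the reported corrected value; the rubric reliabilities themselves are not given here):

```latex
r^{*}_{xy} = \frac{r_{xy}}{\sqrt{\rho_{xx'}\,\rho_{yy'}}}
           \approx \frac{0.43}{\sqrt{0.94 \times 0.40}} \approx 0.70 .
```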
There is, of course, also a strong significant correlation between the two rubrics used in the assessment (rs = 0.68, p < 0.001). We also performed a linear regression analysis to see the extent to which performance in the MBPA could predict performance in the PBA. Because of the negative skew of the distribution of the rubric scores, especially the 19-point rubric, we first subtracted each score from the highest score obtained plus one, and then performed a log transformation (see Field 2009).
We did this for the 12-point rubric, the 19-point rubric, and the total rubric score to obtain a reliable comparison. The regression analysis for the 19-point rubric showed a significant effect (F(1,53) = 4.365, p < 0.05), indicating that the MBPA score can account for 7.6% of the variation in the PBA score. We performed the same analysis for the 12-point rubric, which was also significant (F(1,46) = 5.544, p < 0.05), with an explained variance of 10.1%. Finally, we performed a regression analysis for the total rubric score, which was also significant (F(1,46) = 5.905, p < 0.05), with an explained variance of 11.4%. Of the three, the total rubric score was thus best predicted by the MBPA score. Unfortunately, the rater forgot to complete the 12-point rubric for one assessment, which explains the lower number of students in the analyses involving that rubric.
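A sketch of this transformation and regression, following Field's (2009) reflect-and-log recipe on simulated data (the variable names and simulated values are placeholders, not the study data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mbpa = rng.normal(22.5, 3.44, 55)                                # MBPA totals
rubric = np.clip(np.round(19 - rng.gamma(2.0, 0.8, 55)), 0, 19)  # negatively skewed 19-point rubric

# Reflect-and-log transform for negative skew (Field 2009): subtract each
# score from the highest observed score plus one, then take the logarithm.
log_rubric = np.log((rubric.max() + 1) - rubric)

# Simple linear regression of the transformed rubric score on the MBPA score.
res = stats.linregress(mbpa, log_rubric)
print(f"R^2 = {res.rvalue**2:.3f}, p = {res.pvalue:.3f}")
```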
To provide further evidence: if MBPA performance is related to PBA performance, then we would expect students who had failed their PBA to score significantly lower on the MBPA than students who had passed it. Unfortunately, because of the high scores on the PBA, the group of students who failed their PBA was very small (N = 8). This makes the results difficult to interpret, and the following conclusions should therefore be viewed with some caution. However, because the analyses seem to indicate interesting results, we decided to include them in the article. The group of students who passed the PBA had a mean score of 23.2 (σ = 0.46) and the group of students who failed had a mean score of 20.1 (σ = 1.1). We used an independent samples t test to check whether the groups differed significantly, which was the case (t(53) = −2.563, p < 0.001, d = 0.70). We then performed a logistic regression analysis to check the extent to which the MBPA score could predict whether a student will pass or fail the PBA. The MBPA score is treated as a continuous predictor in the logistic regression analysis, and the dependent variable (success in the PBA) is a dichotomous outcome variable (0 = failed, 1 = passed). The results of the analysis can be found in Table 5. The analysis demonstrated that the MBPA score made a significant contribution to predicting whether students failed or passed the PBA (χ2(1, N = 55) = 5.09, p < 0.05). The odds ratio (eβ), as a measure of effect size, for the MBPA score is 1.39 with a 95% confidence interval [1.04, 1.86]. This means that each one-point increase in the MBPA score multiplies the odds of passing the PBA by 1.39. To summarize, the correlations and regression analyses provide evidence for the convergent validity of the MBPA.
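A minimal sketch of such a logistic regression with statsmodels, again on simulated stand-in data; the odds ratio is obtained by exponentiating the slope coefficient:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
mbpa = rng.normal(22.5, 3.44, 55)                        # continuous predictor
passed = (mbpa + rng.normal(0, 3, 55) > 19).astype(int)  # hypothetical 0/1 PBA outcome

X = sm.add_constant(mbpa)             # intercept column + MBPA score
fit = sm.Logit(passed, X).fit(disp=0)

odds_ratio = np.exp(fit.params[1])    # e^beta for the MBPA slope
or_ci = np.exp(fit.conf_int()[1])     # 95% CI for the odds ratio
print(f"odds ratio = {odds_ratio:.2f}, 95% CI = [{or_ci[0]:.2f}, {or_ci[1]:.2f}]")
```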
The absence of a correlation between students' MBPA scores and their background characteristics, the questionnaire ratings (computer experience and usability of the MBPA), and the multiple-choice test results should provide evidence for discriminant validity. The background characteristics are age, education, and ethnicity. Age was not correlated with assessment scores (rs = 0.00, p > 0.05). For education, we calculated the biserial correlation coefficient, which is used when one of the variables is a dichotomy with an underlying continuum (Field 2009).
First, we divided the students into two groups (low vs. high education). The low education group consisted of students whose education continued up to high school or lower vocational education (N = 26, MMBPA = 21.83, σ = 3.07), whereas the high education group consisted of students whose education continued from middle-level vocational education upwards (N = 27, MMBPA = 23.08, σ = 3.60). We calculated the point-biserial correlation [which is for true dichotomies (Field 2009)] and then converted it into the biserial correlation. Although education and MBPA score were positively correlated, this effect was not significant (rb = 0.19, p > 0.05). For ethnicity, we were especially interested in two groups: students with Dutch ethnicity (N = 40, MMBPA = 22.80, σ = 3.35) and students with another ethnicity (N = 15, MMBPA = 22.78, σ = 3.41). We calculated the point-biserial correlation between ethnicity (0 = Dutch, 1 = other) and the students' MBPA scores. Again, we did not find a significant correlation (rpb = −0.01, p > 0.05). Overall, students' background characteristics were not related to their MBPA performance, which supports the discriminant validity of the MBPA.
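The point-biserial-to-biserial conversion multiplies rpb by √(pq)/y, where p and q are the group proportions and y is the ordinate of the standard normal density at the split point (Field 2009). A sketch, using the group sizes above and a hypothetical rpb (only the converted rb = 0.19 is reported, not the underlying rpb):

```python
import numpy as np
from scipy import stats

def biserial_from_point_biserial(r_pb: float, p: float) -> float:
    """Convert a point-biserial correlation to a biserial correlation.

    p is the proportion of cases in one group; y is the standard normal
    ordinate at the quantile that splits the distribution into p and 1 - p.
    """
    q = 1.0 - p
    y = stats.norm.pdf(stats.norm.ppf(p))
    return r_pb * np.sqrt(p * q) / y

# Group sizes from the text: 26 low-education vs. 27 high-education students.
# The r_pb of 0.15 is a hypothetical value for illustration only.
print(round(biserial_from_point_biserial(0.15, 26 / 53), 2))  # ~0.19
```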
We found further support for discriminant validity: there is no significant correlation between students' MBPA scores and their computer experience (rs = 0.09, p > 0.05). Additionally, the MBPA score and students' ratings of the usability of the MBPA are not correlated (rs = 0.14, p > 0.05). It is interesting to note that there is a significant correlation between students' computer experience and their rating of the usability of the MBPA (rs = 0.42, p < 0.01), but no significant correlation between the time spent on the MBPA and the score obtained (r = 0.07, p > 0.05).
However, there is a significant correlation between the multiple-choice knowledge test and the MBPA (rs = 0.30, p < 0.05), which may indicate that, at least to some extent, the multiple-choice test and the MBPA measure the same construct(s). Interestingly, there is no significant correlation between the PBA scores and the multiple-choice test scores (rs = 0.09, p > 0.05).
Finally, we determined the number of misclassifications at six different MBPA cutoff scores (50, 55, 60, 65, 70, and 75%). No misclassifications would mean that all students who failed their PBA (N = 8) would also fail the MBPA, and that all students who passed the PBA (N = 47) would also pass the MBPA. The results are presented in Table 6. Although the lowest cutoff percentage (50%) results in the fewest misclassifications, which can be explained by the small group of students who failed their PBA, the difference in fail–fail classifications between the 55 and 60% cutoff points is most interesting. At the 55% cutoff point, only two students who failed their PBA would also fail the MBPA, whereas this number increased to seven at the 60% cutoff score. Therefore, a cutoff score of approximately 60% would be most defensible empirically. In addition, we looked at the number of misclassifications at the different cutoff levels using Cronbach's alpha in TiaPlus (Cito 2006). This analysis indicates that the fewest misclassifications occur when the cutoff score is placed at 50% (see also Table 6). In TiaPlus, the GLB reliability coefficient is the point of departure for estimating the misclassifications at the different cutoff levels.
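A sketch of this misclassification count; the pass/fail vector is simulated to mimic the 47/8 split, and the raw cutoff is taken as the stated percentage of the 35-item maximum:

```python
import numpy as np

rng = np.random.default_rng(4)
mbpa = rng.normal(22.5, 3.44, 55)              # simulated MBPA totals
pba_pass = rng.permutation(np.r_[np.ones(47), np.zeros(8)]).astype(bool)

for pct in (50, 55, 60, 65, 70, 75):
    cut = 35 * pct / 100                        # raw-score cutoff on 35 items
    mbpa_pass = mbpa >= cut
    n_mis = int(np.sum(mbpa_pass != pba_pass))  # pass/fail disagreements
    print(f"cutoff {pct:2d}% ({cut:4.1f} points): {n_mis} misclassifications")
```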
In the previous section, we presented validity evidence. In this section, that evidence is used and evaluated. The argument-based approach is applied to evaluate the proposed interpretation of the MBPA (Kane 2006, 2013). Wools et al. (2010, 2016) distinguish three criteria to evaluate the validity of an assessment and the process of validation: the first criterion evaluates the interpretive argument, the second evaluates the different elements of validity evidence, and the third evaluates the validity argument as a whole.
With regard to the first criterion, we can say that there is a substantial number of inferences. Following the chain of inferences, we have to go from a student's performance and accompanying raw scores to meaningful statements regarding performance in a practice domain, and then to a final certification decision. This indicates the complexity of the inferences that we wish to make with the MBPA. Nevertheless, these inferences are required to ensure that the MBPA can be used for its intended purpose. The question is whether the interpretive argument addresses the correct inferences and assumptions (Wools et al. 2010, 2016). We specified the interpretive argument in sufficient detail to keep the chance of gaps or inconsistencies in our chain of reasoning to a minimum. According to the extended argument-based approach to validation, each inference in the chain (the arrows in Fig. 2) should have at least one warrant, a supporting warrant (or backing), and rejected rebuttals. A rebuttal indicates a circumstance in which the warrant or backing would not hold (Wools et al. 2010). We can demonstrate this by looking at each inference in the chain individually. The first inference is from performance to score, or the scoring inference. The same performance always leads to the same score (warrant), but this only holds if the MBPA is correctly programmed (rebuttal). Furthermore, there has to be an objective scoring system (backing), which needs to be used objectively (rebuttal). In our case, the MBPA has a standardized and objective scoring structure. Scoring was already addressed in the assessment skeleton, which was developed in collaboration with SMEs.
The second inference is from score to test domain, or the generalization inference. The tasks in the MBPA provide a representative sample of the whole test domain (warrant), but this only holds true if there are enough tasks in the MBPA (rebuttal). The use of the framework and the collaboration with SMEs ensure that there are enough tasks in the MBPA. This is also demonstrated by the validity evidence, because we have shown that the reliability of the assessment is high.
The third inference is from test domain to competence domain, or the first extrapolation inference. The tasks in the MBPA provide an adequate measure of CSG skills (warrant), but this will only be the case if the MBPA does not suffer from construct underrepresentation or construct-irrelevant variance (rebuttals). The MBPA is a good representation of the content, authenticity, and complexity of the domain (backing). The fact that we did not find variables that correlated with the MBPA score, except for the rubric scores, which are expected to correlate with MBPA scores, means that we can reject construct underrepresentation and construct-irrelevant variance for this MBPA. In addition, the tasks in the MBPA were designed on the basis of an extensive construct analysis, in collaboration with SMEs, which ensures that the MBPA is representative with regard to authenticity and complexity.
The fourth inference is from the competence domain to the practice domain, or the second extrapolation inference. The practice domain is correctly operationalized within the competence domain (warrant), but only if all relevant aspects of CSG performance are represented in the competence domain (rebuttal). Again, evidence is provided by the fact that design and development followed the well-defined, structured process of the framework, in collaboration with SMEs. All tasks and assignments that currently take place in the PBA were transformed into a computer-based equivalent, which indicates sufficient representation.
The last inference is from the practice domain to the final certification decision. There should be a cutoff score (warrant), but this only holds if the cutoff score is set correctly (rebuttal). We did not apply a formalized standard-setting procedure; instead, we provided several possible cutoff scores with the accompanying number of misclassifications, which SMEs can use when deciding on a cutoff score.
The second criterion for validity evaluation relates to the validity evidence itself (Kane 2006, 2013). Is the presented validity evidence plausible and representative of the assumptions that we wish to make on the basis of the MBPA scores? In other words, are the inferences justified by our validation study (Wools et al. 2010, 2016)? Each element of validity evidence should relate to and substantiate one or more inferences in the chain of reasoning. Wools et al. (2010, 2016) indicate that an evaluation status should then be assigned to the inference as a whole: the status is justified when the warrants and backings for the validity elements are accepted and possible rebuttals are rejected. With the evidence presented above, we argue that the validity elements give enough support for all the inferences in the interpretive argument.
Finally, the third criterion focuses on the outcome of the validation process, or the validity argument as a whole. The question to be answered is: is the validity argument as a whole plausible (Wools et al. 2010, 2016)? The validity argument can only be plausible when the first and second criteria are met. As with the second criterion, the third criterion is somewhat subjective, but it comes down to taking all elements of validity evidence into account and then deciding whether the argument is strong enough to substantiate the validity of the assessment scores and final interpretations. In our case, we can say that all criteria have been met. The validity evidence provided in this article is plausible because every inference in the chain of reasoning, from performance to decision, can be substantiated by evidence.