General standards in the form of quality criteria can be used to assess the quality of an instrument and/or to construct a high-quality test. Three main indicators, the so-called “core quality criteria,” have emerged: objectivity, reliability and validity (e.g. Bühner, 2011; Ebel & Frisbie, 1991; Linn, 2011; Miller, Linn & Gronlund, 2009; Rost, 2004). These criteria must not be considered separately; rather, there is a logical relationship between them: objectivity is a prerequisite for reliable measurement, and reliable measurement is a prerequisite for the validity of the instrument. These primary quality criteria and selected secondary quality criteria (fairness and usability) are examined in more detail below using a data set of 349 pre-service teachers for secondary education from several German universities.

1 Objectivity

Objectivity is understood, in a narrower sense, as the degree of independence of the test results from the test instructor (Miller et al., 2009), while, in a broader sense, it refers to the degree of independence of the test results from any influences other than those of the participants themselves (Rost, 2004). Furthermore, depending on the phase of the testing process, a distinction is made between implementation objectivity, evaluation objectivity and interpretation objectivity (Bühner, 2011). In order to ensure implementation objectivity, the conditions under which the test is performed and the instructions provided must be as standardised as possible; in other words, the administration of the test must not vary between different examinations. One way to achieve this is to minimise the interaction of the test instructor with the participants. Implementation objectivity was ensured in this study by examining all participants under comparable conditions and using a pre-established standardised written introduction and printed instructions. Independent processing of the test was ensured through supervision, and no further assistance was given or additional aids allowed. In addition, random checks on the implementation of the instructions did not reveal any deviations from the requirements.

In the context of the analysis of the test results, which is usually supported by software, the term evaluation objectivity refers to the independence of the test evaluation from the person (Good, 1973) and/or the program used for this purpose. Closed task formats, in which participants have to choose between predefined answer alternatives, are the least prone to such influences, although these cannot be completely excluded. With open task formats, on the other hand, participants respond with their own free formulations, the analysis of which depends, to a certain extent, on the subjective impressions of the coders (Miller et al., 2009). Evaluation objectivity was ensured by an automated evaluation of the tests, which was carried out after the answers had been coded according to a manual. The manual was created on the basis of six expert ratings from the German-speaking modelling community; in the process, critical items were discussed until consensus on their evaluation was reached. In addition, part of the test logs was double-coded and checked for input errors.

Interpretation objectivity means the independence of the interpretation of the test results from the analysing person (VandenBos, 2015). It is ensured in this study by the fact that, as part of the test, each participant can be assigned numerical values for their respective ability levels on a fixed scale. In addition, the effects can be interpreted on the basis of internationally accepted standards (Cohen, 1988).

2 Reliability

Reliability refers to the measurement accuracy or dependability of a test. Accordingly, a measurement is reliable precisely when it captures the personality or behavioural trait being measured accurately, that is, without measurement error (Miller et al., 2009). Mathematically, the degree of reliability is determined by a so-called reliability coefficient, which describes the ratio of the variance of the true measured values to the variance of the observed, and thus error-prone, measured values (Bühner, 2011). In analogy to objectivity, in practice there are also different ways of describing the reliability of a measurement and the specified variance ratio (Ebel & Frisbie, 1991). Since the variance of the true values is unknown, the reliability of a test can only be estimated from the responses of the participants. This is done by means of methods that estimate reliability under certain conditions via a correlation between two comparisons, whereby split-half reliability, parallel-forms reliability and test–retest reliability are the most common methods for estimating the accuracy of measurement (Bühner, 2011). For this purpose, in the first case, the test is divided into two equivalent halves, the results are determined separately for each half and participant, and then both sub-test results are correlated. In the second case, the results of two strictly comparable tests that capture the same construct are correlated, whereas in the third case the results of the test are correlated with a repeat measurement (assuming that the characteristic to be captured has not itself changed) (Ebel & Frisbie, 1991). Another method for estimating reliability is internal consistency, which is essentially a generalisation of split-half reliability in which each item is considered a separate test part (Bühner, 2011). The standard for the numerical realisation of this method is the coefficient \(\alpha\) developed by and named after Cronbach (1951), which sets the sum of the variances of the individual items in relation to the total variance of the test. Accordingly, the greater the number of items and the stronger the positive correlation between the items, the higher the internal consistency (Bühner, 2011).
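For illustration, the reliability coefficient and Cronbach's \(\alpha\) can be written in standard notation (the formulas are not quoted from the sources cited but follow the usual definitions):

\[
\mathrm{Rel} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E},
\qquad
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^2_{Y_i}}{\sigma^2_X}\right),
\]

where \(\sigma^2_T\) denotes the variance of the true values, \(\sigma^2_E\) the error variance, \(\sigma^2_X\) the variance of the observed test values, \(k\) the number of items and \(\sigma^2_{Y_i}\) the variance of item \(i\).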

The concept of reliability described above has been defined in the framework of classical test theory and is applied there by default (Miller et al., 2009). In contrast, reliability is rarely reported in probabilistic test theory or item response theory (for a deeper look, see, e.g. van der Linden & Hambleton, 1997), which forms the basis for the analysis of the facets of the modelling-specific pedagogical content knowledge, despite the favourable conditions for its calculation. The required variance components can be estimated directly in the Rasch model: the variance of the latent variable (i.e. of the true measured values) is estimated as a model parameter in the course of the Marginal Maximum Likelihood Estimation (MMLE), while the variance of the observed values corresponds to the variance of the estimated person parameters, and the error variance of the measured values can be calculated from the standard errors of estimation of the ability estimates. The latter two variances, however, depend on the choice of the estimation procedure for the person parameters: in addition to the person parameters resulting from Unconditional Maximum Likelihood Estimation (UMLE, often also called Joint Maximum Likelihood Estimation, JMLE), which are inappropriate due to their overestimated variance, the more suitable Expected a Posteriori (EAP) and Weighted Likelihood Estimation (WLE) estimators can also be calculated. The resulting reliabilities (EAP and WLE reliability), of which the EAP reliability is comparable to the reliability measure from classical test theory as specified by Cronbach's \(\alpha\) (Rost, 2004), therefore represent adequate options for determining the measurement accuracy of a test within the scope of item response theory. For the facets of the modelling-specific pedagogical content knowledge, reliability values are determined which indicate how accurately the corresponding person parameters (EAP or WLE estimators) are measured in the second part of the test. The corresponding dichotomous items were scaled using simple Rasch models and the scales were thus checked for their adequacy. Using the eRm package (Mair & Hatzinger, 2007) of the software R, the item difficulty parameters were estimated on the basis of the solution rates of the items and the person ability parameters on the basis of the performance of the participants. Various scale parameters were calculated to evaluate scalability (see Table 4.1). In the course of the review of model validity, items 6.1.3 and 6.2.3 for knowledge about concepts, aims and perspectives as well as items 7.1.4, 7.2.5, 7.2.6, 7.6.4 and 7.6.5 for knowledge about interventions were excluded due to insufficient discrimination and therefore not included in the scales.
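The following sketch illustrates how such a scaling could be set up with the eRm package mentioned above; the data object `responses`, the call sequence and the item-screening step are assumptions for illustration and do not reproduce the analysis actually reported (note that eRm estimates item parameters by conditional maximum likelihood, while EAP and WLE reliabilities are typically obtained with other packages, e.g. TAM).

```r
# Minimal sketch of a dichotomous Rasch scaling with the eRm package (illustrative only).
# 'responses' is a hypothetical 0/1 data frame (rows = participants, columns = items).
library(eRm)

rm_fit <- RM(responses)                 # item difficulty parameters (conditional ML)
pp_fit <- person.parameter(rm_fit)      # person ability parameters

# Andersen likelihood ratio test as a global check of model fit;
# a non-significant result indicates fit of the one-dimensional Rasch model.
summary(LRtest(rm_fit, splitcr = "median"))

# Item fit statistics for screening poorly discriminating items
itemfit(pp_fit)

# Corrected point-biserial (item-total) correlations; items below 0.30 would be flagged
item_total <- sapply(seq_len(ncol(responses)), function(j)
  cor(responses[, j], rowSums(responses[, -j]), use = "complete.obs"))
```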

Table 4.1 Dichotomous Rasch models for knowledge scales

In general, reliability coefficients between 0.50 and 0.70 can be considered adequate for group comparisons (Ebel & Frisbie, 1991), while coefficients of at least 0.70 are characteristic of good test instruments (Bühner, 2011). All EAP reliability values are above 0.70 and are therefore acceptable. All Andersen tests for assessing model fit are non-significant and therefore indicate a fit of the one-dimensional Rasch models. Furthermore, all point-biserial correlations of the remaining items are greater than 0.30 and therefore also of acceptable quality.

In addition, reliabilities were calculated according to the classical approach for the scales of self-reported prior experiences, beliefs and self-efficacy expectations for mathematical modelling from the first part of the test (see Table 4.2).

Table 4.2 Reliabilities for the (sub-)scales of self-reported prior experiences, beliefs and self-efficacy expectations for mathematical modelling

Except for that of the transmissive beliefs scale (0.65), all reliability values of the scales considered are above 0.80 and can therefore be described as good.
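Such a classical reliability estimation can be illustrated, for example, with the psych package; the package choice and the data frame `beliefs_items` are assumptions for illustration rather than the analysis actually carried out.

```r
# Illustrative Cronbach's alpha estimation for one subscale (hypothetical data frame).
library(psych)

alpha(beliefs_items)   # returns Cronbach's alpha together with item-total statistics
```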

3 Validity

While reliability describes the trustworthiness or measurement accuracy of a test, validity describes the extent to which the test measures what it is supposed to measure (Miller et al., 2009). A test is considered completely valid if its results allow accurate and immediate conclusions to be drawn about the individual characteristics of the abilities or behaviour of the participants to be captured (Ebel & Frisbie, 1991). Here, too, three concepts are distinguished: content validity, criterion validity and construct validity, which are explained below (Bühner, 2011).

For content validity, it is fundamental whether the test instrument as a whole, as well as each of its individual items, represents the characteristic to be captured sufficiently well. This is not checked by numerical parameters, but rather by didactic and logical considerations (Ebel & Frisbie, 1991). Accordingly, content validity was ensured in the present study by a rational and effective design of the test tasks (see Sect. 3.1), that is, by a theory-based operationalisation closely aligned with the definitions of the aspects, areas and facets of professional competence to teach mathematical modelling. In addition, the tasks developed in this way were discussed extensively with several experts from the German-speaking modelling community to determine whether the constructs considered were adequately covered.

Criterion validity refers to the validation of a test on the basis of its association with an external manifest criterion that should correlate with the characteristic to be captured (Bühner, 2011). Depending on the time at which this criterion is available (before, simultaneously, later), a distinction is made between retrospective validity, concurrent validity and predictive validity. In the first case, the focus is on the relationship of the test result to an already known criterion of interest; in the second case, on the relation of the measured values to a criterion collected at the same time; and in the third case, on the prediction of a future characteristic. These correlations can therefore be used as numerical indicators of criterion validity (Kane, 2011). In order to consider criterion validity in the field of professional competence for teaching mathematical modelling, the design of the study makes it possible to rely primarily on retrospective validity. However, as there is almost no knowledge about which criteria generally correlate with modelling-specific professional competence, the main focus is on the links between the facets of modelling-specific pedagogical content knowledge and the school-leaving examination grade, following the results of the COACTIV study (Krauss et al., 2008) (see Table 4.3), whereas other aspects are considered in the subsequent explanations of construct validity.

Table 4.3 Correlations between facets of pedagogical content knowledge and school-leaving examination grade

It turns out that the school-leaving examination grade is not indicative of the facets of modelling-specific pedagogical content knowledge considered (negative correlations indicate a positive relationship here, owing to the German grade scale, in which lower numerical values represent better grades). These results replicate the correlations with pedagogical content knowledge found by Krauss et al. (2008).
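A retrospective criterion validation of this kind essentially amounts to a simple correlation analysis; the following sketch uses hypothetical variable names (`wle_knowledge`, `abitur_grade`) and is not the analysis underlying Table 4.3.

```r
# Illustrative criterion validity check (hypothetical variables):
# person estimates for one knowledge facet against the school-leaving grade
# (1 = best grade, so a negative coefficient indicates a positive relationship).
cor.test(wle_knowledge, abitur_grade)
```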

In the context of construct validity, the question is examined whether the instrument actually captures the theoretical construct that is to be captured. In this respect, many authors increasingly regard construct validity as a general concept encompassing all aspects of validity, while, in a narrower sense, only convergent, discriminant and factorial validity are included among the aspects of construct validity (Bühner, 2011). Instead of naming individual external manifest criteria, as with criterion validity, diverse hypotheses are formulated about the structure and context of the construct and its relationships to manifest as well as latent variables. These hypotheses can relate, on the one hand, to which other construct-related variables the test to be validated is closely related (convergent validity) and, on the other hand, to which construct-unrelated variables it is not or only weakly related (divergent validity) (Ebel & Frisbie, 1991). In addition, factorial validity often involves checking the test model (or even structural model) established prior to test construction with the help of confirmatory factor analyses and other model-fit procedures, which, on the one hand, examine the defined assignment of individual test items to specific construct areas and facets and, on the other hand, test the assumption of uncorrelated measurement errors (Bühner, 2011). The possibility of verifying convergent validity is limited, since the lack of accessible comparative tests did not allow the use of instruments other than those described in Sect. 3. Thus, correlations between the considered aspects of professional competence to teach mathematical modelling are calculated, also drawing on the results of the COACTIV study (Krauss et al., 2013), which may give indications of convergent validity (see Tables 4.4 and 4.5).

Table 4.4 Correlations of facets of modelling-specific pedagogical content knowledge
Table 4.5 Correlations of aspects of modelling-specific professional competence

Significant correlations are consistently shown which, in almost all cases, are comparable with the COACTIV results in terms of their significance and thus support the validity of the designed test.

With regard to discriminant validity, it would have been desirable to use additional tests that measure merely similar constructs, in order to ensure that these are not captured by the present instrument, in other words, that the correlation between the present test results and those of the other tests is minimal. However, this was not possible for reasons of economy. Instead, comprehensive efforts were made to verify the factorial validity of the constructs under consideration. As described in Sect. 2.4.4, the model of professional competence for teaching mathematical modelling was largely confirmed by structural equation models and/or confirmatory factor analyses (see Klock et al., 2019; Wess et al., 2021). In the context of structural analyses, various Rasch models for the professional knowledge to teach mathematical modelling were also specified, checked by means of model tests and compared with each other to ensure factorial validity (for a deeper look, see Greefrath, Siller, Klock and Wess, submitted).
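A confirmatory factor analysis of the kind referred to above could, for instance, be specified as sketched below; the lavaan package, the factor structure and the item names are assumptions for illustration and do not reproduce the models reported in the studies cited.

```r
# Hypothetical CFA sketch for facets of modelling-specific pedagogical content
# knowledge with the lavaan package (illustrative, not the reported model).
library(lavaan)

cfa_model <- '
  interventions =~ int_1 + int_2 + int_3
  concepts      =~ con_1 + con_2 + con_3
  tasks         =~ tsk_1 + tsk_2 + tsk_3
'
fit <- cfa(cfa_model, data = pck_data)
fitMeasures(fit, c("chisq", "df", "cfi", "rmsea", "srmr"))
```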

In addition to the main quality criteria discussed, scalability is sometimes mentioned as a further criterion. It is considered fulfilled if the formation of the test value follows a valid scoring rule, that is, if a sufficient statistic exists (Bühner, 2011); this is ensured here by the use of fitting Rasch models (see Sects. 2.4.4 and 4.2). In addition, other so-called secondary quality criteria such as fairness and usability can be listed (e.g. Bühner, 2011; Ebel & Frisbie, 1991; Miller et al., 2009), which are briefly explained below.

4 Secondary Quality Criteria

One of the best-known secondary quality criteria is the fairness of a test. The quality criterion of fairness is met precisely when the results of a test do not systematically discriminate against groups of participants on the basis of external characteristics (e.g. ethnic, socio-cultural or gender-specific ones) (Zieky, 2011). When designing the test tasks, for example, care was taken to ensure gender-appropriate language in the formulations. In addition, the tests used to capture the professional competence to teach mathematical modelling were checked for differential item functioning in the course of two dissertation projects, in order to determine whether certain sample groups are significantly disadvantaged by individual items, that is, whether a systematic item bias exists (see Klock, 2020; Wess, 2020).
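Such a check for differential item functioning can, for instance, be carried out with a group-specific Andersen test in the eRm package; the grouping variable `gender` and the call sequence below are assumptions for illustration and do not reproduce the analyses in the dissertation projects cited.

```r
# Illustrative DIF check (hypothetical grouping variable, not the reported analysis):
# Andersen likelihood ratio test with the sample split by gender.
library(eRm)

rm_fit <- RM(responses)
summary(LRtest(rm_fit, splitcr = gender))   # 'gender' is a factor of group membership

# Item-wise Wald tests for the same split highlight potentially biased items.
Waldtest(rm_fit, splitcr = gender)
```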

The usability criterion is met by a test if it requires relatively little financial and time expenditure in relation to the diagnostic knowledge gained (Miller et al., 2009). For this purpose, it is important to keep the implementation time as short as possible, to minimise the material requirements, to make the test instrument easy to handle and to realise it as a group test as far as possible (Bühner, 2011). Accordingly, the instrument used can be described as economical, since it has a relatively short implementation time (approximately 60 min), consumes little material, is easy to use and can be implemented as a group test.

In addition to the quality criteria presented, various authors cite further criteria of test quality. These include, among other things, acceptability, usefulness and resistance to faking. For more detailed explanations, please refer to the relevant literature (e.g. Bühner, 2011; Downing & Haladyna, 2011; Miller et al., 2009).