1 Rationale for developing tests to assess aspects of adult and continuing education teachers’ professional competence

As lifelong learning gains in importance (BMAS and BMBF 2019; Europäische Kommission 2007), questions about the effectiveness and efficiency of schools, vocational training, higher education, and increasingly adult and continuing education (ACE) are becoming more pressing (Kuper and Schrader 2019). As has long been empirically established for the school context (Lipowsky 2006; Rivkin et al. 2005), it is also assumed for ACE that teachers are significantly responsible, via their professional competence, for the learning success of their participants (Collins and Pratt 2011; Kraft et al. 2009; Siebert 2012).

A total of around 700,000 people currently work in ACE in Germany, of whom around 530,000 are teachers. Of these teachers, a large proportion are not employed full-time, but are engaged in other activities on a full-time basis (Autorengruppe wb-personalmonitor 2016). More recent studies have even revised these figures upward (Schrader and Martin 2021). Trainers working in continuing vocational training are regularly difficult to include in surveys due to their relatively low degree of organization in professional associations (Autorengruppe wb-personalmonitor 2016). If these are also taken into account, the number of teachers in ACE is likely to be larger than that of teachers in the general and vocational school system (Statistisches Bundesamt 2021a, b) and significantly larger than in the higher education system (Statistisches Bundesamt 2021c).

Access to and practice of the profession in ACE are hardly regulated by the state, collective agreements, or professional associations and have thus always been “open” for teachers with different educational backgrounds. In this respect, there are also no generally binding degrees or certificates that can be uniformly interpreted by practitioners as indicators of existing competencies. Barely one third of teachers working in ACE have a major or minor degree in educational sciences (Autorengruppe wb-personalmonitor 2016).

Against this double background—the expectations of effectiveness of ACE on the one hand and the wide range of teachers with diverse professional backgrounds on the other—it is understandable that in recent years there has been an increased interest in training and development as well as in assessing the competencies of teaching staff, e.g., with a view to recruiting teachers (Goeze and Schneider 2014). It is no coincidence that numerous initiatives originate from practice or are implemented in close cooperation with practice. For example, standardized surveys, such as those on training needs, or evaluations of the advice literature are used to gain insights into the existing and desired competencies of teachers (Hippel and Tippelt 2009; Schöb et al. 2016; Strauch et al. 2021). Instruments and procedures are being developed to make visible and recognize non-formally or informally acquired competencies (Schläfli and Sgier 2008; Steiner 2010; Strauch et al. 2020; Vinepac-Project 2008). Finally, digital learning platforms have been (www.videofallarbeit.de; www.wb-web.de) and are being (the projects EULE and KUPPEL) developed for initial and in-service training for ACE teachers. These address, among other things, the pedagogical-psychological knowledge (PPK) of teachers.

It is striking, however, how slowly the development of empirically based instruments to assess the professional competence of ACE teachers has progressed. So far, the focus has been on facets of PPK (Marx et al. 2017; Rohs et al. 2017). Instruments—e.g., for self-assessment, for trainer selection, or for the informed choice/assignment of teachers to online learning paths on learning platforms for initial and further training—should be able to capture PPK in its conceptual breadth (Schöb et al. 2016). Researchers also lack a test that captures the PPK of ACE teachers in a comprehensive and differentiated way. Only such a test would make it possible, for example, to verify the widespread assumption in practice that the more experience a teacher has, the more successful he or she will be in implementing continuing education courses. Success in ACE (unlike in school) implies not only the learning success of the participants, but also, for example, the subjective fulfillment of their benefit expectations, the solution of action problems faced by the participants or the clients outside of the training, or the continuation of the training activities.

This paper continues the development and validation of a test for the assessment of facets of PPK that was started in the project ThinK (Marx et al. 2017; Voss et al. 2017) and focuses on knowledge about methods and concepts of teaching and learning. This knowledge is assumed to be particularly relevant for designing situations of teaching and learning. Since methods and concepts of teaching and learning exist in large numbers for different goals and phases within learning processes, the question arises whether a unidimensional conceptualization is adequate (“knowledge of methods”) or whether theoretical-conceptual considerations and empirical findings argue for a multidimensional conceptualization and measurement (“knowledge of learning processes, communication and interaction with learners, leading groups …”). Thus, the question of the uni- or multidimensional conceptualizability and measurement of knowledge about methods and concepts of teaching and learning is the subject of this paper. The aim is to gain indications for the further development of the test from the ThinK project.

2 Research and development on the pedagogical-psychological knowledge of teachers in adult and continuing education

PPK is one aspect of teachers’ professional knowledge, alongside content knowledge (CK) and pedagogical content knowledge (PCK) (Baumert and Kunter 2011). In German adult education research, it is pointed out that in addition to this basic “body of knowledge” (Tietgens 1988, p. 37), there is also a need for both declarative and procedural experiential or professional knowledge (Dewe et al. 2002). “Professional knowledge” in ACE implies a particular form of experiential knowledge that refers to the specific norms, acculturated standards, or routinized practices of the professional field for learning with adults in the context of organized continuing education. Unifying these approaches, PPK is defined here as knowledge for designing and improving situations of teaching and learning in and across different subjects and educational fields. The definition includes declarative and procedural knowledge, as well as experiential knowledge acquired during a professional career (Marx et al. 2014; Voss et al. 2015, 2011).

The COACTIV competency model (Kunter et al. 2011), which was developed for the school context and has been empirically proven there, is increasingly being adapted for the field of ACE as well (Marx et al. 2017; Rohs et al. 2017; Strauch et al. 2021), because it addresses generic areas of teachers’ professional competence that also seem relevant for ACE. While the model has stimulated empirical research in the school setting, comparable studies in ACE are still lacking. Although the PPK to be applied in concrete situations of teaching and learning is regularly considered in competence models in different conceptualizations (Ziep 1990), most empirical approaches have so far been limited to instruments for self-assessment and/or peer assessment (Vinepac-Project 2008).

In recent years, there has been a growing interest in researching concrete situations of teaching and learning in ACE, their conditions for success, and thus the professional competencies of teachers (Kraft et al. 2009; Lattke and Jütte 2014; Vinepac-Project 2008). Thus, studies have been dedicated to the description of individual professionalization processes, didactical acts, and the metacognitive competencies of teaching staff. Maier-Gutheil (2012) used biographical case studies to identify differentiated processes of the formation of professionalism. Bastian (1997) created qualitative competence profiles of course instructors, to each of which specific knowledge foci and topic profiles were related. Several studies have been devoted to the planning activities of teachers: Hof (2001), for example, asked about the connection between teachers’ subjective understanding of knowledge and their teaching concepts, while Stanik (2016) inquired into the decision fields for and the factors influencing microdidactic planning; Haberzeth (2010) identified the planning logics of teachers’ actions according to which content is selected for their own courses, and Pachner (2013) investigated teachers’ competence in self-reflection. However, these studies did not make use of any competency-theoretical modeling and operationalization of PPK. The instruments developed so far to measure (facets of) PPK were conceptualized for the school context (Beck et al. 2008; König and Blömeke 2009; Lenske et al. 2015; Linninger et al. 2015; Seifert et al. 2009; Voss et al. 2011), not for ACE.

The ThinK project follows on from the COACTIV competence model and the work of Voss et al. (2011) on PPK, with the aim of developing a test to measure PPK for ACE teachers as well. The measurement theory approach used in the ThinK project is Item Response Theory (IRT). A prominent model of IRT is the Rasch model (Rasch 1960). The importance of conceptualizing the dimensionality of PPK can be illustrated using the specific objectivity of comparisons of person and item scores, one of the central model assumptions of the Rasch model: Specific objectivity exists when statements about persons’ abilities do not depend on which tasks are used for the comparison. Consider the following example: task A is easier to solve than task B, and both tasks measure the same construct (knowledge domain). If person X has a lower ability than person Y, then person X has a lower probability than person Y of solving task A as well as task B, and the comparison of the two persons turns out the same regardless of which of the two tasks is used. Without empirical dimensionality testing of the theoretical ideas about the conceptualization of PPK, it could happen that a test leads to wrong conclusions (e.g., Rost 2004; Strobl 2012)—which in turn can lead to poor decisions in educational practice.
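In formal terms, and as a sketch using the notation θ for person ability and b for task difficulty (symbols introduced here for illustration), the Rasch model implies that the comparison of persons X and Y is independent of the task used:

\[
P(\text{solved} \mid \theta, b) = \frac{\exp(\theta - b)}{1 + \exp(\theta - b)},
\qquad
\operatorname{logit} P_{X}(\text{task } i) - \operatorname{logit} P_{Y}(\text{task } i) = \theta_X - \theta_Y \quad \text{for } i \in \{A, B\}.
\]

This invariance only holds if tasks A and B indeed measure the same latent dimension; if they do not, the comparison of persons can change depending on the task, which is precisely why the dimensionality of PPK needs to be tested empirically.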

In the ThinK project, PPK is conceptualized across disciplines and educational domains and dimensionalized into eight knowledge facets (Marx et al. 2014, 2017; Voss et al. 2015):

  1. knowledge about learning processes of learners

  2. knowledge about heterogeneity of learners

  3. knowledge about methods and concepts of teaching and learning

  4. knowledge about objectives of teaching and learning

  5. knowledge about classroom/courseroom management

  6. knowledge about communication and interaction with learners

  7. knowledge about the design of learning environments

  8. knowledge about the diagnostics of individuals and their learning processes

The conceptualization was developed following Voss et al. (2011) and is based on a systematic literature review with content analysis of relevant sources and an expert survey (Marx et al. 2017). This conceptualization is thus consistent with the assumption of the multidimensionality of PPK (Voss et al. 2015). It is extended by describing the eight facets using a total of 30 subfacets. The subfacets indicate a possible multidimensionality within the eight facets that could be of practical importance for the purposes addressed above (e.g., for the use of the test in the context of a learning platform).

With this in mind, the following chapter will address the question of how a unidimensional, a multidimensional, and a hierarchical conceptualization can be justified for the knowledge about methods and concepts of teaching and learning.

3 Unidimensional, multidimensional, and hierarchical conceptualizations of knowledge about methods and concepts of teaching and learning

Methods of teaching and learning result from a specific combination of a form of work (e.g., an exercise) and a social form (individual, partner, group, or plenary work). The distinction, which is often blurred or misunderstood, can be illustrated by two questions (Jank and Meyer 2008): 1) Who works together with whom? (= social form); and 2) What patterns of action (e.g., giving a lecture, reproducing something) are to be performed? (= forms of work). An investigation of the literature reveals many combinations of social forms with forms of work (which can be expanded almost at will), which can be used in different phases and for different goals in situations of teaching and learning.

3.1 Unidimensional conceptualization

A unidimensional conceptualization of knowledge about methods and concepts of teaching and learning raises the question of what general knowledge might exist across different combinations of social forms and forms of work. This could be knowledge about the “orchestration” of situations of teaching and learning, i.e., the conditionality of visible and deep structures (Oser and Baeriswyl 2001). In the context of Oser’s concept of teaching and learning, visible and low-inference features of situations of teaching and learning are described by the term visible structures. These include social forms and forms of work as well as learning materials or tasks and thus the content that is the subject of a learning situation. The presence of certain visible structures does not initially allow any conclusions to be drawn about the process quality of teaching and learning, since within the same visible structure the quality of the interaction between teachers and learners can be completely different (Oser and Baeriswyl 2001). Deep structures refer to the learning processes of learners that are not directly observable (Oser and Baeriswyl 2001) and that account for learning success. The deep-structural quality of teaching and learning results from the teacher’s knowledge of the necessity to relate two aspects to each other in a pedagogically and didactically judicious way, i.e., aiming at an adequate teaching goal on the one hand and choosing the ‘appropriate’ visible-structural method to reach it on the other. The knowledge about this conditionality of visible and deep structures then represents, on a theoretical-conceptual level, the commonality across all tasks of the knowledge facet. This is because this knowledge could be useful for answering all tasks of the facet: If a teacher knows that visible structure features (e.g., a teacher lecture or group work) are neither effective nor ineffective as such, but can differentiate under which conditions a visible structure is more or less helpful in enabling specific deep-structural and thus cognitive processes in learners, then she will have a higher probability of solving all tasks in this facet.

3.2 Multidimensional conceptualization

The knowledge about methods and concepts of teaching and learning was subdivided by Marx et al. (2017) into two subfacets: a) “social forms and forms of work as well as their combination and their target-adequate and effective use in situations of teaching and learning”; and b) “concepts for individualized, cooperative, or open forms of learning arrangements and their implementation”. This was justified by the fact that methods mostly represent smaller units of (inter)action than the “larger” concepts of teaching and learning. Operationalizing these two subfacets does not seem to make sense, however, because subfacet a) is a subset of subfacet b). A better way to conceptualize knowledge about methods and concepts of teaching and learning in a multidimensional way is to identify key methods that are frequently addressed in the advice literature and to conceptualize each method as one subfacet of this knowledge. Such a conceptualization accommodates the idea that this knowledge might be acquired and (re)presented in ACE teachers in a less formally curricular way; it may be more experiential and thus “insulated” and “scattered”. The fact that different methods of teaching and learning require different knowledge to implement them will be exemplified by three essential combinations of social forms and forms of work: the teacher lecture, group work, and the feedback method.

Knowledge about the design of a teacher lecture requires cognitive-psychological knowledge (e.g., knowledge about the capacity of working memory) in combination with knowledge about methods and concepts of teaching and learning: Through the teacher’s lecture, information can be offered on a topic and this can be helpful in subjectively laying out or acquiring a basic cognitive structure (subtasks A and B; see Table 1 below). For enabling more complex cognitive operations (in the concrete situation), however, the method is usually less helpful (subtasks C, D, E). Knowledge about the teacher’s lecture was operationalized in the ThinK project by the task in Table 1.

Table 1 Task to measure the knowledge about the teacher lecture

When designing group work, knowledge of social psychology in combination with knowledge about methods and concepts of teaching and learning is necessary. Here, for example, knowledge is important about different types of interdependencies of the learners involved and, related to this, knowledge about the (non-)suitability of different task types (disjunctive, conjunctive, or additive tasks) for group work in order to avoid phenomena that are demotivating for the learners, such as the free-rider effect (Wecker and Fischer 2014). This knowledge was operationalized by the task in Table 2, which was semantically adapted from Voss et al.’s (2011) COACTIV test.

Table 2 Task to measure the knowledge about the design of group work

Giving feedback is one of the most frequently used methods of teaching and learning to support learning processes and behavior change (Strijbos and Müller 2014). Giving feedback requires, among other things, knowledge of motivation psychology in combination with knowledge about methods and concepts of teaching and learning. Knowledge about the influence of personal factors (e.g., attributions) on the processing of feedback is relevant here. Attribution of feedback is significant because motivational and emotional effects, as well as beliefs regarding possible individual scopes for action, are associated with the assumed causes of an event. Central here is the question of whether a particular action outcome is judged by a person to be influenceable currently and in the future. This corresponds to an internal, variable, and controllable attribution of causes as conceptualized in Weiner’s (1985) classification scheme of reasons for action outcomes. This knowledge was operationalized by the task in Table 3, which was semantically adapted from the COACTIV test of Voss et al. (2011).

Table 3 Task to measure knowledge of the feedback method

3.3 Hierarchical conceptualization

The examples given above show that different knowledge is relevant for different combinations of social forms and forms of work. A third way to conceptualize knowledge about methods and concepts of teaching and learning is a hierarchical conceptualization. A hierarchical conceptualization includes general and specific knowledge for different combinations of social forms and forms of work. General knowledge can be knowledge about the “orchestration” of situations of teaching and learning, i.e., the conditionality of visible and deep structures (Oser and Baeriswyl 2001). Specific knowledge can be, for example, the knowledge presented above about the teacher lecture, group work, or the feedback method. Considering the idea that knowledge about methods and concepts of teaching and learning might be acquired and (re)presented in ACE teachers mostly in an experiential way and thus be “insulated” and “scattered”, it seems plausible to conceptualize the general knowledge and the specific subfacets, as well as the specific subfacets among themselves, as independent of each other.

4 Research questions and assumptions

The questions addressed in this paper are whether (1) unidimensional, multidimensional, or hierarchical modeling better represents ACE teachers’ knowledge about methods and concepts of teaching and learning, and (2) how reliably this knowledge is measured.

We assume that a hierarchical conceptualization explains the data best. This assumption rests on two considerations: the tasks were developed on the basis of the theoretical-conceptual considerations elaborated in Chap. 3 to tap a general knowledge facet, and at the same time each task addresses a concrete method of teaching and learning and thus specific content, for which no systematic connections across tasks are to be expected among ACE teachers.

5 Methods

5.1 Sample

A total of N = 212 ACE teachers participated in the main study of the ThinK project; this is an ad hoc sample. The teachers were aged 24 to 77 years (M = 48.47; SD = 12.16), had between 0.2 and 50 years of teaching experience at the time of the survey (M = 12.71; SD = 10.38), and taught a mean of 12.13 h per week (SD = 11.10). The proportions of university graduates (72.77%), of individuals who work predominantly part-time for multiple educational institutions (64.34%), and of female teachers (51.89%) indicate that this sample is typical of ACE (Autorengruppe wb-personalmonitor 2016). The teachers in this sample are active in different ACE reproduction contexts (Schrader 2010) (“companies” and “free market”: 29%; value- or interest-based “communities”: 21%; publicly (co-)funded educational institutions with a legal mandate (“state”): 50%). They offer events on the full range of adult education topics. Only about one-third of teachers working in ACE have a formal major or minor degree in educational sciences (Autorengruppe wb-personalmonitor 2016). This group includes school teachers as well as graduates of educational sciences programs, whose commonality is their training in educational science content that includes PPK (Linninger et al. 2015). A second group consists of teachers without a major or minor in education sciences, who make up about two-thirds of teachers in ACE and who qualify for a teaching position primarily through their subject expertise (Autorengruppe wb-personalmonitor 2016). Roughly in line with these proportions, teachers with a degree in educational sciences are included in the sample with N = 54 (25.47%); teachers without a degree in educational sciences comprise the rest of the sample (N = 158; 74.53%).

5.2 Implementation

The sample was recruited with the help of two large adult education centers in Bavaria and North Rhine-Westphalia, as well as the Chamber of Industry and Commerce in North Rhine-Westphalia and various trainer networks. Participation took place individually on a prepared laptop on the premises of the institutions, was voluntary, and was rewarded. 42.45% of the sample was recruited as described; the remaining part of the sample completed the questionnaire online, likewise voluntarily and rewarded. All teachers were presented with all 67 test items of the entire PPK test. In those cases where the ACE teachers’ test time went far beyond the agreed-upon frame of approximately 90 min, they were given the option to skip the remaining test items and complete only the socio-demographic information. For this reason, the test items were presented in three versions with different item orders.

5.3 Instrument

The tasks analyzed in this paper represent a subset of the entire test. The subjects of the analyses are tasks from the ThinK project, some of which were developed following on from the tasks of Voss et al. (2011) and thus overlap with, but are not identical to, them. All five tasks that are the subject of the analyses in this paper are in the same task format as the tasks in Tables 1, 2 and 3, comprising four or five subtasks in true-false answer format. The true-false responses share a common task stem and thus resemble testlets (Wainer and Kiely 1987). Each subtask was scored 0 (unsolved) or 1 (solved). A total of eight subtasks were excluded from the analyses reported below. The exclusion was made for empirical reasons (these subtasks were solved by almost all ACE teachers or remained unsolved by a large number of subjects). The five tasks differ in their content focus, each addressing a different method of teaching and learning.

5.4 Dealing with missing values

Especially in so-called low-stakes assessments, test takers skip items (omitted items) or break off, thereby producing not-reached items, for various reasons. Rubin (1976) proposed the distinction between missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) for the classification of missing values, which has become widely accepted in empirical educational research (Lüdtke et al. 2007). In the literature on missing data, MAR and MCAR are also referred to as so-called ignorable nonresponses. The full information maximum likelihood (FIML) estimation procedure and multiple imputation (MI) are currently considered state-of-the-art methods for dealing with these two cases.
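Stated formally, and as a standard rendering of Rubin’s (1976) taxonomy with Y = (Y_obs, Y_mis) denoting the observed and missing parts of the data and R the missingness indicators:

\[
\text{MCAR: } P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R), \qquad
\text{MAR: } P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R \mid Y_{\mathrm{obs}}),
\]

and MNAR otherwise, i.e., when the probability of a value being missing also depends on the unobserved values themselves.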

Since it is not certain which of Rubin’s (1976) types our missing values belong to, missing values are treated as missing in this paper. This is because simulation studies indicate that IRT models are relatively robust when only a few values are missing (Rose et al. 2010). For the five tasks presented here, 114 of 3180 values are missing, or only 3.6%, which is considered a fairly unproblematic relative proportion (Kline 2016). To avoid having to exclude individuals with missing values on individual variables from the analyses, the FIML estimation procedure was used in Mplus (Muthén and Muthén 1998–2015), as listwise case exclusion can lead to erroneous findings in addition to a loss of efficiency in parameter estimation (Lüdtke et al. 2007).

5.5 Measurement framework: Multidimensional item response theory (MIRT)

Item response theory (IRT) is often referred to as modern probabilistic test theory; here, the probability of solving a task is modeled by one or more latent variables (constructs). Before fitting an IRT model, the dimensionality of the tasks used should be carefully considered. For categorical data, as is often the case with knowledge tests, MIRT is a good choice for analyzing the dimensionality of a construct (Reckase 2009; Kelava et al. 2020). This paper focuses on the dimensionality analysis of knowledge about methods and concepts of teaching and learning. To this end, the conceptualizations presented in Chap. 3 are specified as models and their fit is compared using information-theoretic measures. Further steps, including checking the specific objectivity of the tasks and differential item functioning (see e.g. Mair 2018), are not the subject of this paper, but were undertaken.
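As a common point of reference for the models compared below, the compensatory MIRT model for dichotomous items (Reckase 2009) can be sketched as follows, with \(\mathbf{a}_i\) the vector of discrimination parameters (loadings) and \(d_i\) the item intercept (notation chosen here for illustration):

\[
P(X_{pi} = 1 \mid \boldsymbol{\theta}_p) = \frac{\exp(\mathbf{a}_i^{\top}\boldsymbol{\theta}_p + d_i)}{1 + \exp(\mathbf{a}_i^{\top}\boldsymbol{\theta}_p + d_i)} .
\]

The models specified in Sect. 5.6 differ only in which entries of \(\mathbf{a}_i\) are fixed to zero or constrained to be equal and in whether and how the latent variables in \(\boldsymbol{\theta}_p\) may correlate.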

5.6 Model specification

5.6.1 Specification of the unidimensional model

The unidimensional conceptualization, as presented in Chap. 3.1, is modeled by a unidimensional model. Under the assumption of unidimensionality, the differences between individuals in the indicators are explained by one common latent variable, and one measurement error is modeled per indicator. We assume that the measurement errors are uncorrelated (Fig. 1a) and that the subtasks of the tasks index the respective thematized knowledge equally, which is specified by equal factor loadings. If person scores are reported, they clearly refer to the one specified latent variable.
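Under these assumptions, the unidimensional 1PL model can be sketched as follows, with a single latent variable \(\theta_p\) and item difficulties \(b_i\) (notation for illustration):

\[
\operatorname{logit} P(X_{pi} = 1 \mid \theta_p) = \theta_p - b_i .
\]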

Fig. 1 Unidimensional model (a), correlated-factors model (b), and hierarchical model (c)

5.6.2 Specification of the multidimensional model

Consistent with the proposed multidimensional conceptualization in Sect. 3.2 is a correlated-factors model, which assumes knowledge dimensions that are separable from one another but correlated (Fig. 1b). In a correlated-factors model, a construct originally conceptualized as unidimensional is often split into different latent variables (Brown 2006). However, in the correlated-factors model, no latent variable is introduced into the model that captures the knowledge unifying the facets across all items (Reise et al. 2010).

We assume that the subtasks of the tasks index the respective thematized knowledge equally, which is specified by equal factor loadings. We also assume that the measurement errors of the indicators are uncorrelated. Which correlations ultimately emerge between the subfacets, i.e., the individual factors, is an open empirical question. However, due to the different content areas addressed in the tasks and the low level of standardized pedagogical training and continuing education of ACE teachers, these correlations are likely to be low. If person scores are reported for individual latent variables of a correlated-factors model, then these are an amalgam of the latent variable that is common to all indicators but not co-modeled and the specific latent variable modeled in each case (Reise et al. 2010; Reise 2012).
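A sketch of the 1PL correlated-factors model under these assumptions: each subtask i loads only on the latent variable of its task d(i), within-dimension loadings are equal, and the dimension-specific abilities may correlate (notation for illustration):

\[
\operatorname{logit} P(X_{pi} = 1 \mid \boldsymbol{\theta}_p) = \theta_{p,d(i)} - b_i,
\qquad \operatorname{Cov}\!\left(\theta_{p,d}, \theta_{p,d'}\right) \text{ freely estimated for } d \neq d' .
\]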

5.6.3 Specification of hierarchical models

Fitting the assumption of a hierarchical conceptualization with general and specific knowledge are models of the bifactor family (Fig. 1c): “Bifactor models are potentially applicable when (a) there is a general factor that is hypothesized to account for the commonality of the items; (b) there are multiple domain specific factors, each of which is hypothesized to account for the unique influence of the specific domain over and above the general factor; and (c) researchers may be interested in the domain specific factors as well as the common factor that is of focal interest.” (Chen et al. 2006). In addition to the intended use outlined by Chen et al. (2006), bifactor models are also used to control for “nuisance factors”, e.g., when only the general factor is in focus but not the specific factors due to, for example, different task stimuli (DeMars 2013).

For each subtask, two loading parameters are specified for models of the bifactor family, for which (at least) three different assumptions can be made:

  1. All loading parameters are freely estimated (Reise 2012)—hereafter referred to as the Bifactor Model.

  2. The loading parameters of the subtasks are proportional to each other for the general factor and the specific factor (Bradlow et al. 1999)—hereafter referred to as the Testlet Model.

  3. The loading parameters are equal across the subtasks of a task stem, respectively for the general factor and the specific factor (Wang and Wilson 2005)—hereafter referred to as the Rasch Testlet Model.

We assume that the measurement errors of the indicators are uncorrelated. In accordance with the considerations in Chap. 3.3, the Rasch Testlet Model seems to be the most appropriate, since it is assumed that the subtasks equally index the respective thematic knowledge.
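In the same notation, the Rasch Testlet Model can be sketched as follows: in addition to the general ability \(\theta_p\), each subtask i receives a testlet-specific effect \(\gamma_{p,d(i)}\) for its task stem d(i); the testlet effects are specified as orthogonal to the general factor and to each other, and the loadings are constrained to be equal (notation for illustration):

\[
\operatorname{logit} P(X_{pi} = 1 \mid \theta_p, \gamma_{p,d(i)}) = \theta_p + \gamma_{p,d(i)} - b_i,
\qquad \operatorname{Cov}(\theta_p, \gamma_{p,d}) = \operatorname{Cov}(\gamma_{p,d}, \gamma_{p,d'}) = 0 .
\]

In the Bifactor Model, by contrast, both loadings of each subtask are estimated freely; in the Testlet Model, they are constrained to be proportional within a task stem.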

When interpreting the specific factors, note the aspect b) addressed above by Chen et al. (2006): The specific factors explain additional variance to the general factor. If one reports person scores, then it should be noted that the general factor scores would refer to the general knowledge outlined in Sect. 3.1 and the specific factor scores would refer to the specific knowledge about methods of teaching and learning outlined in Sect. 3.2.

5.7 Statistical analyses

All analyses were performed using Mplus version 7.4 (Muthén and Muthén 1998–2015) within the MIRT framework; accordingly, the information-theoretic measures AIC, AICc, CAIC, BIC, and SABIC are reported for model comparison. An open research issue is what the penalty function of information-theoretic measures should ideally look like (Rost 2004). Therefore, the two most common measures, AIC and BIC, as well as variants of these (AICc, CAIC, SABIC) that additionally take sample size into account, are reported.
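For orientation, the definitions of these measures are standard, with ln L the maximized log-likelihood, k the number of estimated parameters, and N the sample size; the SABIC uses the sample-size adjustment \(N^{*} = (N+2)/24\) (formulas stated here for reference):

\[
\mathrm{AIC} = -2\ln L + 2k, \qquad
\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{N - k - 1}, \qquad
\mathrm{BIC} = -2\ln L + k\ln N,
\]
\[
\mathrm{CAIC} = -2\ln L + k(\ln N + 1), \qquad
\mathrm{SABIC} = -2\ln L + k\ln\!\left(\frac{N+2}{24}\right).
\]

Lower values indicate a better trade-off between fit and complexity; the measures differ only in how strongly additional parameters are penalized.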

6 Results

To answer the question “Does a unidimensional, a multidimensional, or a hierarchical model better explain ACE teachers’ knowledge about methods and concepts of teaching and learning?”, the models specified in Sect. 5.6 were fitted to the data: a 1PL correlated-factors model (Fig. 1b) and a unidimensional 1PL model (Fig. 1a), as well as (Fig. 1c) a Bifactor Model (Reise 2012), a Testlet Model (Bradlow et al. 1999), and a Rasch Testlet Model (Wang and Wilson 2005). The majority of the information-theoretic measures (AIC, AICc, CAIC, and BIC) indicate the superiority of the Rasch Testlet Model over the other models (see Table 4), whereas the SABIC indicates the superiority of the 1PL correlated-factors model.

Table 4 Model comparisons: Correlated-factors model, unidimensional model, and hierarchical models (Bifactor Model, Testlet Model, and Rasch Testlet Model)

In order to present as complete a picture of the results as possible, item difficulties and standard errors (of the subtasks) as well as reliabilities are reported below for the 1PL correlated-factors model, the unidimensional 1PL model, and the Rasch Testlet Model. For all multidimensional models, the loading structure is reported. For the Bifactor Model and the Testlet Model, item difficulties, standard errors, loading structure, and reliabilities can be found in the Appendix (Tables 10, 11 and 12).

6.1 Correlated-factors model and unidimensional model

The loading structure of the correlated-factors model as well as the item difficulties and standard errors of the unidimensional model are shown in Table 5. For the five factors, the loading parameters range between 0.475 and 0.720 and are all statistically significant. Combined with the predominantly low latent correlations between the factors (as hypothesized in Sect. 5.6), which range between r = 0.011 and r = 0.391 and of which three out of ten turned out to be statistically significant (see Table 6), this is an indication that the assumption of unidimensionality of the knowledge seems questionable and multidimensionality seems likely. In order to assume unidimensionality or to combine the tasks into an overall score, substantially higher (latent) correlations between the factors, not differing substantially from 1, would have to be found.

Table 5 Standardized loading structure and item difficulties of the correlated-factors model and item difficulties of the unidimensional 1PL model
Table 6 Correlation-matrix (standardized) for the correlated-factors model

For the correlated-factors model, the item difficulties range from −4.115 to 0.997 logits; for the unidimensional 1PL model, they range from −4.708 to 1.091 logits.
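To give these logit values a substantive interpretation under the 1PL parameterization sketched in Sect. 5.6.1, the expected solution probability for a person of average ability (θ = 0) can be computed (worked example for illustration):

\[
P = \frac{1}{1 + \exp(b_i - \theta)}: \qquad
b_i = -4.708 \;\Rightarrow\; P \approx 0.99, \qquad
b_i = 1.091 \;\Rightarrow\; P \approx 0.25 .
\]

Even the most difficult subtask is thus expected to be solved with a probability of about one in four by a respondent of average ability, which anticipates the discussion of the test’s comparatively low difficulty in Chap. 7.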

For the 1PL correlated-factors model and the unidimensional 1PL model, the Kuder-Richardson Formula 20 (KR-20) was used to determine the internal consistency of the subtasks. KR-20 is essentially equivalent to Cronbach’s alpha for dichotomous data such as those given here (Lienert and Raatz 1998). For the unidimensional model, internal consistency was low at 0.56 (see Table 7). This is another indication that the property of unidimensionality does not hold for the test. The internal consistency of factors A–E is also low (between 0.29 and 0.60), but it is important to bear in mind that only three subtasks per factor were included in the analyses, which may contribute to the low internal consistency.

Table 7 Reliabilities for the unidimensional model and the correlated-factors model

6.2 Rasch testlet model

The loading structure and item difficulties of the Rasch Testlet Model are shown in Table 8. The item difficulties range from −4.066 to 0.987 logits. Statistically significant factor loadings are shown for the general factor (g) for all subtasks, as well as for the five specific factors (A–E). The factor loadings on the specific factors are consistently higher than those on the general factor. These findings, too, indicate that one should not assume unidimensionality, but rather a general factor alongside specific factors.

Table 8 Factor loadings (standardized) of the Rasch Testlet Model and item difficulty with standard error

For the general factor and the specific factors, the question arises as to what proportion of the variance in response behavior each explains. To answer this question, the coefficients omega hierarchical (ωh) and omega subscale (ωs) are used (Reise 2012; Revelle and Zinbarg 2009). For ωh, the effects of the specific factors (and measurement error) are controlled; for ωs, the effect of the general factor and the effects of the other specific factors (and measurement error) are controlled. ωh is an important indicator of how reliably a test measures a construct (Revelle and Zinbarg 2009).
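Both coefficients are computed from the standardized loadings of the hierarchical solution; with \(\lambda_{gi}\) the loading of subtask i on the general factor and \(\lambda_{si}\) its loading on its specific factor s, and assuming orthogonal factors, they can be sketched as follows (standard formulas following Reise 2012, stated here for orientation):

\[
\omega_h = \frac{\left(\sum_i \lambda_{gi}\right)^2}{\left(\sum_i \lambda_{gi}\right)^2 + \sum_s \left(\sum_{i \in s} \lambda_{si}\right)^2 + \sum_i \left(1 - \lambda_{gi}^2 - \lambda_{si}^2\right)},
\qquad
\omega_s = \frac{\left(\sum_{i \in s} \lambda_{si}\right)^2}{\left(\sum_{i \in s} \lambda_{gi}\right)^2 + \left(\sum_{i \in s} \lambda_{si}\right)^2 + \sum_{i \in s} \left(1 - \lambda_{gi}^2 - \lambda_{si}^2\right)} .
\]

ωt, reported further below, replaces the numerator of ωh by the variance attributable to all factors, i.e., \(\left(\sum_i \lambda_{gi}\right)^2 + \sum_s \left(\sum_{i \in s} \lambda_{si}\right)^2\).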

For the Rasch Testlet Model, the general factor—general knowledge across different methods of teaching and learning—showed a comparatively low value of ωh = 0.41 (see Table 9). This is another indication that the unidimensionality of the test is questionable. ωs indexes how reliably the specific factors—the specific knowledge about a method of teaching and learning—are measured. Low to satisfactory values were shown, with ωs ranging from 0.29 to 0.67, although interpretation should take into account that only three subtasks per factor were included in the analyses, which may contribute to low internal consistency.

Table 9 Reliabilities Rasch Testlet Model

Omega total (ωt) represents the proportion of variance attributable to all factors. ωt = 0.75 indicates that 75% of the differences in response behavior are explained by the six factors in total and 25% of the differences are due to measurement error. Omega (ω) is the reliability of the specific factors but, unlike ωs, is not adjusted for the effect of the general factor (Reise 2012). ω is low to good, with values between 0.46 and 0.75.

7 Summary and discussion of the results

This paper addressed the question of whether a unidimensional, a multidimensional, or a hierarchical modeling of ACE teachers’ knowledge about methods and concepts of teaching and learning is more plausible from a theoretical-conceptual perspective and fits the data better empirically. This question arises against the background of a large number of methods and concepts of teaching and learning serving different goals and phases within learning processes, and a presumably more experiential and less standardized knowledge acquisition by the majority of ACE teachers not formally qualified in educational sciences. The aim of this paper is to gain indications for the further development of the present test, focusing on the facet of knowledge about methods and concepts of teaching and learning.

In Chap. 3, a unidimensional, a multidimensional, and a hierarchical conceptualization of knowledge about methods and concepts of teaching and learning were presented. The assumption of a hierarchical structure of this knowledge seems plausible, which assumes general knowledge about different methods of teaching and learning and specific knowledge for specific methods of teaching and learning. Empirically, we tested whether a correlated-factors model or a unidimensional model on the one hand, and a Bifactor Model, Testlet Model, or Rasch Testlet Model on the other fit the data best. Consistent with assumptions, four out of five information-theoretic measures point to the superiority of the Rasch Testlet Model.

For the Rasch Testlet Model, the general factor—general knowledge across different methods of teaching and learning—showed a low omega hierarchical (ωh). One explanation for this may be that only a few tasks were used to measure the knowledge and that heterogeneous topics were addressed. This is an indication that a further development of the subscales will be useful. These findings are plausible if one takes into account that, against the background of their mostly unsystematic pedagogical training, ACE teachers have “insulated” and “scattered” knowledge. In view of this “insulated” knowledge, i.e., uncorrelated knowledge of single aspects of specific methods, it seems sensible to consider the subscales and their further development into multiple unidimensional tests (or tests and methods of analysis) that already explicitly take the within-item multidimensionality into account during construction (Sorrel et al. 2016).

For hierarchical models, the question always arises which constructs are modeled in the general factor and which in the specific factors. Whether the general factor models knowledge about the necessity of relating visible and deep structures to each other, or rather domain-independent cognitive abilities, cannot be answered conclusively in this paper. The same applies to the specific factors: Whether these actually reflect true-score variance or are construct-irrelevant needs to be further substantiated by validation studies. Indications that construct-relevant variance is modeled in the general factor and the specific factors are provided by the study of Marx et al. (2018).

One of the information-theoretic measures indicates the superiority of the correlated-factors model. When modeling the data by a correlated-factors model, out of ten possible correlations between the factors, statistically significant correlations are shown for only three. This can be interpreted as a further indication that the knowledge addressed in the tasks should not be conceptualized and measured in a unidimensional way. In addition, a unidimensional modeling showed a low internal consistency, which indicates a low homogeneity of the tasks and is an indication of the usefulness/necessity of a multidimensional or hierarchical conceptualization and measurement.

A limitation to all our findings is the overall small sample that was available for the study. A consequence of this is that only a selection of tasks was included in the analyses in order to keep the number of parameters to be estimated in reasonable proportion to the sample size (see Bentler and Chou 1987). What structure emerges for a larger sample and with the inclusion of additional tasks and facets is an open question. Similarly, it is unknown whether and to what extent the factor loading structure might change if the knowledge test was embedded in a learning situation (Mislevy 2018).

Nevertheless, insights for the further development of the test can be gained from the presented study: In the ThinK project, tasks were adapted based on preliminary work and were not explicitly constructed according to one of the presented measurement models. For future projects, the development of multiple unidimensional tests to capture knowledge about methods and concepts of teaching and learning seems reasonable. Another possibility is to develop a test that explicitly addresses multidimensionality. One way to conceptualize knowledge about these methods and concepts for ACE teachers in multidimensional terms is to map key methods, each as one dimension of that knowledge. A challenge here will be the multitude of different methods and concepts of teaching and learning. The further developed test must then be checked for its dimensionality and, as addressed in Chap. 5, among other things for the specific objectivity of the tasks and for differential item functioning (DIF). Further validation steps could be, e.g., the examination of the connection between teachers’ knowledge and the (learning) success of ACE participants, mediated by the quality of teaching situations. In addition, it would be useful to examine whether (facets of) PPK can be distinguished from pedagogical beliefs and how they might be related to these beliefs, to other aspects of professional knowledge, or to professional experience.

Another challenge in the further development of the test will be to develop tasks that are more difficult to solve in order to better differentiate between persons with a great deal of knowledge about methods and concepts of teaching and learning. The comparatively low difficulty of the presented test can be explained both by the answer format and by the omission of technical terms in the test.

Overall, the present work can contribute to the development of tests to assess aspects of ACE teachers’ professional competence that are needed in the context of educational policy and practice initiatives. Provided the specific factors as well as the general factor are interpreted as construct-relevant, a short scale can be offered to assess ACE teachers’ knowledge about methods of teaching and learning. However, before the further validation steps pointed out in this paper have been conducted, we recommend this short scale only for formative assessment and not for summative “high-stakes” assessments.