1 Introduction

Teachers have been found to be an important factor for student learning (Hattie 2003), and, unsurprisingly, teacher education research has a long tradition of modelling and measuring what constitutes “teacher competence”. For many years, research followed Shulman (1986) and focused on professional knowledge and various motivational-affective factors as the core of teacher competence. Accordingly, knowledge tests and self-reports were the main kinds of assessment instruments in use (Frey 2006). Ever since these instruments came into use, there has been criticism that such assessment approaches may not be appropriate for certain purposes of teacher education research. Specifically, it was argued that typical knowledge tests might be limited in testing the ability to apply knowledge to real-life problems (Shavelson 2010). For example, to turn students’ misconceptions into learning opportunities in the classroom, it is not enough for teachers to know which typical misconceptions exist for a topic. Instead, based on their professional knowledge, they must use students’ learning processes to diagnose their actual understanding and flexibly apply appropriate strategies to overcome these misconceptions. Knowledge tests, however, are often not informative in this respect.

To address this challenge, a growing number of research endeavours have followed a complementary approach in recent years (von Aufschnaiter and Blömeke 2010; Kaiser and König 2019; Kunter et al. 2011; Lindmeier 2011). New perspectives on teacher competence as a complex ability construct led to the development of new, “situated” assessment formats that aim at testing competence close(r) to teaching performance (Kaiser et al. 2017). At the same time, ‘performance assessment’ in general gained increasing attention in higher education in the German-speaking countries (DACH), and teacher education research has been identified as leading in this field (Zlatkin-Troitschanskaia et al. 2016). Several recent national funding initiatives with a high impact on teacher education research contributed to these developments.

These new kinds of situated, close-to-performance, or performance assessment instruments were developed in parallel, following different approaches. The varying understandings and reporting practices, however, make it difficult to build on existing research and interfere with the coherent development of the teacher education research field, which is traditionally fragmented (e.g., according to school subjects) in the DACH region. A cross-subject overview is missing. This scoping literature review addresses this gap and aims to provide an overview of the current state of research regarding what can be subsumed under so-called ‘performance assessment’ instruments in DACH teacher education research.

2 Theoretical background

2.1 Frameworks for modelling and assessing teacher competence

How to assess teacher competence is a long-standing question in educational research. Questions regarding teaching effectiveness, teacher education outcomes, or professional development design, for example, hinge on how teacher competence is conceptually modelled and on the instruments used to assess it.

Many conceptualisations of teacher competence in the DACH region are influenced by an understanding of competence as a “complex ability construct”: learnable cognitive abilities and skills for solving specific problems, together with the associated motivational, volitional, and social dispositions (e.g., Koeppen et al. 2008). Competence is hence always defined in relation to specific real-life problems, here to problems of teaching. On this basis, different models of teacher competence have been derived.

A prominent example is the COACTIV model (Fig. 1, left), which analytically delineates factors contributing to professional competence, such as professional knowledge, motivational orientations, or self-regulation skills (Baumert and Kunter 2011). Teachers’ professional knowledge constitutes the core of teacher competence in this and similar models, so Kaiser et al. (2017) speak of a cognitive strand of teacher competence frameworks. The corresponding assessment instruments are typically tests or self-reports, e.g., for teacher knowledge or motivational orientations. Because of the modelling strategy of analytically delineating factors of competence (dispositions), these approaches may also be called analytical approaches.

Fig. 1 Analytical and holistic approaches to model teacher competence and their relation to the competence-as-continuum model (Blömeke et al. 2015). The illustration contains the COACTIV model (Kunter et al. 2011) on the left as an example of an analytical approach and the model according to Lindmeier (2011) on the right as an example of a holistic approach (RC in dashed lines and AC requirements in continuous lines)

Studies using such analytical approaches led to many important and influential findings, e.g., regarding the structure of teacher knowledge, the relevance of teacher knowledge for student learning (Kunter et al. 2011), or the international variability of outcomes of teacher education (Blömeke et al. 2009). To date, they constitute the majority of research in the area of teachers’ professional competence (see, e.g., König et al. 2022; Voss et al. 2015). Nevertheless, it has been questioned whether these approaches are sufficiently powerful to answer certain questions related to teacher competence, particularly regarding teaching performance. This has led to calls for alternative approaches.

As a result, several researchers suggested an understanding of teachers’ professional competence closer to professional action and the ability to master different situational demands (Kaiser et al. 2015; Shavelson 2010; Zlatkin-Troitschanskaia et al. 2016). Instead of analytically decomposing teacher competence into contributing factors, they consider the professional demands that need to be mastered as unbreakable entities and seek to model competence accordingly. For example, teachers have to lead discussions in the classroom, prepare lessons, explain concepts, or support students emotionally. Accordingly, these frameworks seek to specify areas of competence by distinguishing and describing different demands of teaching (Kaiser et al. 2015). Because this strategy deals with the complexity of modelling teacher competence by identifying different demands rather than decomposing it into factors, we will call this second strand holistic approaches, in contrast to the analytical approaches (Fig. 1, right).

An example of the holistic approach is the structure model of subject-specific teacher competence according to Lindmeier (2011). Building on dual-process theories of cognition (Evans 2008), it conceptualises two components of competence: Reflective Competence (RC), pertaining to pre- and post-instructional demands, and Action-related Competence (AC), pertaining to in-instructional demands. Teachers are considered to hold AC if they master professional demands such as diagnosing student understanding on the fly during instruction. Similarly, if they master the professional demands of preparing for or reflecting on instruction, they are considered to hold RC (e.g., Jeschke et al. 2021).

Holistic approaches have been proposed more frequently only recently, so there is no common framework yet. However, the different proposals have one thing in common: they assess competence close to teaching performance. They require, for instance, participants to diagnose a student’s understanding in a simulated situation (Kron et al. 2021) or to provide an explanation in direct response to a specific (videotaped) student question under time pressure (Hecker et al. 2020b; Jeschke et al. 2021). The researchers may use demands with reduced complexity compared to real-life teaching, but the assessment tasks may still be understood as “approximations of practice” (Grossman et al. 2009). These so-called performance assessments (PA) are typical of the holistic approach and have often been contrasted with the “decontextualised” knowledge tests of the analytical approach (Kaiser et al. 2017; König et al. 2022).

A framework synthesising the alternative perspectives considers teachers’ professional competence as a continuum (Blömeke et al. 2015). Professional competence is understood to rely on specific professional dispositions (e.g., professional knowledge, motivational orientations), as proposed by the analytical approaches. However, the observable real-life behaviour of a teacher (teaching performance) is influenced by the conditions of the respective situation and requires the integration and application of dispositions, as emphasised by the holistic approaches. Situation-specific skills, like noticing skills, were proposed as mediating between dispositions and performance (Stahnke and Blömeke 2021). Figure 1 summarises how the two contrasted modelling and assessment approaches might be related to the competence-as-a-continuum model and highlights their complementary nature: professional action needs professional dispositions (e.g., professional knowledge), but it may not be sufficient to understand action merely as the effect of these dispositions. It should be emphasised that the two ways of modelling, and consequently assessing, teacher competence should be seen as complementary approaches with certain strengths and shortfalls, like the practical and theoretical tests used for obtaining a driving licence (analogy by Zlatkin-Troitschanskaia et al. 2015).

To sum up, although the lion’s share of high-impact teacher research in recent decades has been done with knowledge tests and self-reports, a significant branch of research using assessment methods close to performance has developed. While there is a long history of research and experience with the former, the recent PA approaches have developed independently and so far lack a common perspective.

2.2 Performance assessment and other assessment methods close to teaching performance

Standardised assessment methods that strive to approximate real-life demands of a profession are generally seen as having several advantages: first, like knowledge tests, they are objective measures and thereby superior to subjective measures like self-reports. Second, unlike knowledge tests, they are less prone to testing for inert knowledge because the test situations are close(r) to performance (McClelland 1973). Third, they seem to be especially suited for assessing the ability to master complex professional demands, which depend particularly on integrating various areas of knowledge and skills (Shavelson 2013). In the US, for example, teacher PA is used not only for research but also for summative assessment for certification and placement purposes (e.g., PACT; Pecheone and Chung 2006).

When developing assessments close to performance, educational researchers may draw on the experience of adjacent research fields, like medical education, where such methods have a long tradition (Miller 1990). For instance, tests based on role plays involving trained actors as standardised patients are regularly used to assess the performance of medical students. Systematising the various assessment approaches in medical education, Miller (1990) delineated four hierarchical levels with varying degrees of proximity to the professional action: instruments may target assessing knowledge (knows), competence (knows how), performance (shows how), or action (does). Whereas action refers to the practitioner’s way of dealing with a professional demand, performance refers to its approximation in an assessment situation, e.g., with standardised patients. This framework’s competence and knowledge levels pertain to increasingly decontextualised assessment methods.

It must be stressed that Miller’s terminology partially conflicts with current terminology in teacher education research (see, e.g., Blömeke et al. 2015, where real-world performance corresponds to Miller’s action level). However, his distinction aligns with the understanding of the term “PA” as used in other contexts. In psychology, “the defining characteristic of a performance assessment is the close similarity between the type of performance that is actually observed and the type of performance that is of interest” (Kane et al. 1999). With similar intentions, Shavelson (2013) postulates that assessment in education should “produce observable performance” on tasks with “a high fidelity” regarding the “real world (criterion)” situations to which inferences about competences are to be drawn (p. 75).

However, looking at research practices, a wide variety of different interpretations can be observed. A review by Palm (2008) shows that a common understanding of the defining characteristics of PA is missing, and various definitions (and measurement approaches with different names but similar intentions) are used in parallel. For instance, some researchers tie their understanding of PA to a certain type of response (e.g., constructed responses) or simulations that allow the observation of solution processes (Palm 2008).

The perspectives also vary broadly within the specific field of teacher education research. Therefore, when preparing this study, we first had to answer the question: what is the current conceptual understanding of PA in teacher education in the DACH region, and which criteria might be appropriate? Bartels et al. (2019), for example, delineate three necessary criteria: PA instruments should have professional relevance regarding the content (e.g., choice of subject matter), be authentic in the sense that participants need to show professional action, and be interactive. With the latter, they refer to the use of dynamics or adaptations, highlighting an aspect of authenticity. Consequently, they consider assessment approaches that fail the interactivity criterion to be merely performance-oriented.

Regarding the assessment of AC, Lindmeier (2011) discusses, without referring to the term PA, the necessity of mirroring as closely as possible the complex, immediate, uncertain, and interactive nature of teaching. However, the standardisation of tests competes with their interactive design. Considering this, AC assessment instruments use video vignettes that require participants to respond verbally under time pressure and thus trade off interactivity in favour of standardisation (Jeschke et al. 2021; Knievel et al. 2015; Lindmeier 2011). These examples show that interactivity may not be mandatory for assessments close to performance.

The need to balance potentially competing factors is evident at many points in the context of PA, and it is essential to consider which aspects should serve as defining characteristics. For example, Codreanu et al. (2020) highlight that in developing simulation-based approximations of practice, it is necessary to balance the complexity of the cognitive demands with the authenticity of the simulation in terms of the referenced real-life demands. In particular, they argue that it is important not only to expose participants to authentic situations but also to involve them cognitively and motivationally in the situation. Again, however, it may not be necessary to treat this aspect of authenticity (e.g., a high level of attained involvement) as a defining aspect of PA. Rather, motivational and affective involvement may be understood as contributing to validity in terms of the cognitive processes elicited during the assessment, which may have to be balanced against other aspects of validity.

Integrating across the given perspectives, we suggest the following definition of PA for this review: we understand PA as referring to standardised methods to assess teacher competence (as a complex ability construct) objectively, close to performance (“shows how” according to Miller 1990), with a focus on observable behaviour (performance) regarding teaching demands. We include standardisation as an essential test quality criterion, especially regarding test requirements and scoring criteria. Likewise, we include the criterion of objective measurement, mainly referring to objective scoring procedures. The criterion of observable behaviour regarding teaching demands reflects the specific character of the holistic approach to modelling teacher competence and is in line with the psychological understanding of PA. Following the suggestion by Blömeke et al. (2015), we distinguish situation-specific skills, such as teacher noticing, from performance and understand situation-specific skills as a prerequisite of performance. However, in contrast to Blömeke et al. (2015) but in line with Shavelson (2013) and Miller (1990), we acknowledge that performance may refer to part-demands with reduced complexity and may be observed in settings other than real-life teaching situations.

The criterion of observable behaviour regarding teaching demands may often be associated with direct measures and open-ended (constructed) responses, both of which relate to questions of validity. It should be kept in mind, however, that for certain target competences, or in the case of well-advanced research, valid tests based on closed (selected) responses or indirect measurements (e.g., the advocatory approach) may also be conceivable (Oser et al. 2010). Our definition may be conceived as leading to a comparably broad category with considerable room for interpreting what counts as “teaching demands”. Notably, we also decided to omit any explicit criterion related to authenticity. In line with Shavelson (2013), we consider the core aspect of the holistic approaches, namely the sampling of test tasks from the real-life reference, to be sufficiently mirrored by the reference to teaching demands.

According to our understanding, PA instruments may consequently use various methods. The few examples of PA instruments portrayed above already show considerable variation regarding the presentation of tasks (e.g., video, role play), the type of response (e.g., verbal response, enactment, multiple choice (MC)), the degree of interactivity, or the type and grain size of the targeted competences. However, instruments without a clear focus on teaching demands as a real-life criterion (e.g., tests of decontextualised knowledge without situation references, assessments of distinct situation-specific skills), as well as subjective (e.g., self-reports) or unstandardised measurements (e.g., scoring by untrained persons, Schütze et al. 2017), are not considered PA from this perspective.

2.3 Prior reviews with relevance to this review

There are a few prior reviews relevant to this review. Frey (2006) provides an overview of methods and instruments assessing teachers’ professional competence from the DACH region from 1991 through 2005. Although PA instruments had already been reported, only 6% of the 47 identified instruments were classified as such. However, these did not target teacher-specific professional competences but ones with broader professional relevance, like managerial competence (Etzel and Küppers 2000).

In 2016, a review with an international perspective on assessment in higher education was issued as an update and extension of a white paper from 2010 (Zlatkin-Troitschanskaia and Kuhn 2010; Zlatkin-Troitschanskaia et al. 2016). The main objective of this review was to provide a comprehensive overview (covering 2010 to 2014) of objective, reliable, and valid instruments to assess higher education learning outcomes, with a focus on large-scale assessments in a holistic understanding (Zlatkin-Troitschanskaia et al. 2016, p. 3). Although varying reporting practices present a challenge throughout the field of PA, the review shows that assessment approaches vary considerably and include testing for knowledge as well as self-assessment. Hence, only some of the instruments identified would pass the criterion of observable behaviour regarding teaching demands suggested above. However, the authors found particularly innovative developments in teacher education research, both nationally and internationally, and identified 18 instruments in our focus region (Zlatkin-Troitschanskaia et al. 2016, p. 56 ff., list not intended to be exhaustive), mainly with a focus on STEM subjects. Teacher education research is recognised as a pioneer in the field of PA in this overview, but a detailed examination of the instruments is beyond its scope.

To our knowledge, the most recent review with certain relevance is from 2020 (Zlatkin-Troitschanskaia et al. 2020a). It focuses on the state of research regarding the assessment of teaching competences (and teaching quality) in higher education between 2012 and 2018. Again, teacher education was identified as a pioneering field but not the focus of this review.

One may also consult reviews on teacher competence constructs located at other points of the competence-as-a-continuum model. One example is the review on pedagogical knowledge by Voss et al. (2015) (analytical approach, PK as a disposition). Other examples are the reviews focusing on teacher noticing or situation-specific skills of (mathematics) teachers (König et al. 2022; Santagata et al. 2021; Stahnke and Blömeke 2021). Each shows a broad variety of assessment approaches, including some sharing certain characteristics with PAs. This indicates that the boundaries between constructs and related assessment approaches may not always be drawn sharply. What is still missing is a systematic perspective on PA in teacher education research that details its current state of research.

3 Research questions

PA is still an emerging topic in DACH teacher education research, as new instruments have been developed in recent years by different research groups in different contexts. However, it is unclear whether there is a common understanding of PA in this still fragmented field, which is also affected by varying reporting practices.

Therefore, this contribution aims to provide an overview of recent advances and the current state of PA instruments in teacher education research in the DACH region by conducting a scoping literature review and investigating the following research questions:

RQ1

What are the context characteristics (e.g., school subjects, intended purposes, theoretical frameworks) of the PA instruments?

RQ2

How do PA instruments differ in terms of methods and alignment with PA criteria?

To uncover potentially emerging strands within the new research field, it may be of interest whether the designs of the instruments follow certain patterns, so we ask:

RQ3

What types of PA can be distinguished?

In detail, we aim to describe the PA instruments’ variation across different subjects concerning contexts (RQ1), methods, and their ways of reflecting the criteria established for PA (RQ2). We aim to identify and coherently describe characteristics using a category system. We expect this to allow us to systematically display the diversity of instruments as well as potential similarities and differences. Based on our systematisation of instrument characteristics, we expect that different types can be distinguished (RQ3).

4 Methods

In autumn/winter 2021, we conducted a systematic literature search to identify the relevant PAs. These were subsequently systematically analysed for differences. Based on the existing literature, analytical categories were refined or complemented inductively and compiled into a system which was applied to all identified instruments to answer RQ1 and RQ2. Finally, we used clustering methods to uncover different types of instruments (RQ3).

4.1 Literature search and processes of instrument identification

The literature search and instrument identification followed several steps (Petticrew and Roberts 2006; Fig. 2). First, we searched for relevant publications in the education-specific literature databases ERIC and peDOCS, as well as the general scientific literature database Web of Science (restricted to the categories Education and Educational Research). Since relevant reviews, although partly with a broader perspective, cover the period up to 2016 and found that PA was rarely used in the target region (e.g., Zlatkin-Troitschanskaia et al. 2016), we only considered the period 2016 to 2021. In line with our focus on the DACH region, the following English and German keywords were used in any possible combination (a simple pairing scheme is sketched after the lists):

  • English: performance test, performance assessment, measurement, teacher education, performance, teacher, teaching, teaching skills, performance-based testing

  • German: Performanz, Test, Kompetenzmessung, Lehrkräfteausbildung, Lehrerausbildung, Lehrkräfte, Lehrer, Simulation
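As a simple illustration of how such keyword lists can be expanded into pairwise search strings, the sketch below combines every two keywords of the same language with a Boolean AND. The exact query syntax used for ERIC, peDOCS, and Web of Science is not reported here, so the pairing format is an assumption for illustration only.

```python
from itertools import combinations

# Keyword lists as reported above; the AND-pairing format below is an
# illustrative assumption, since the exact query syntax per database
# (ERIC, peDOCS, Web of Science) is not reported.
ENGLISH = ["performance test", "performance assessment", "measurement",
           "teacher education", "performance", "teacher", "teaching",
           "teaching skills", "performance-based testing"]
GERMAN = ["Performanz", "Test", "Kompetenzmessung", "Lehrkräfteausbildung",
          "Lehrerausbildung", "Lehrkräfte", "Lehrer", "Simulation"]

def pairwise_queries(keywords):
    """Combine every two keywords of one language into an AND query."""
    return [f'"{a}" AND "{b}"' for a, b in combinations(keywords, 2)]

queries = pairwise_queries(ENGLISH) + pairwise_queries(GERMAN)
print(len(queries), "candidate search strings, e.g.", queries[0])
```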

Fig. 2 Schematic illustration of the research process

Publications in German, as well as English, were included. Since authors may use other synonyms that do not match our search criteria, it is difficult to identify all relevant measurement instruments with pre-set keywords. Thus, we additionally used the snowball principle during the next steps of literature processing and identified further publications by investigating their references. Newly identified publications were added during the process until no more additional publications surfaced. In total, 242 publications were considered.

In the second step, these publications were examined on the basis of the title, abstract, and keywords, which were analysed in terms of content by the first author. If publications proved to be not relevant to our focus (e.g., not from DACH teacher education research; based on self-assessments), they were excluded from further analysis. This resulted in a corpus of 42 publications.

In the third step, we scoped the publications. Starting with the methods section and including, if necessary, other sections, we conducted a content analysis and manually extracted relevant information (Mayring 2015).

According to our research questions, we used the PA instruments as units of analysis. As there are multiple publications on most of the instruments and as some report on several instruments, we determined the unique assessment instruments covered by the identified publications. Accordingly, we restructured the information extracted from the publications per instrument and developed descriptions of every instrument. In cases where the project homepages provided relevant information, we also included it. In a few cases, we contacted the corresponding authors to get further information that was not published (yet) but relevant to the coding process (e.g., time limits, see category system).

To exclude instruments that are not within the scope of PA as defined above, we used three criteria: only instruments with objective measurement methods (e.g., no self-evaluations or evaluation by untrained raters), standardised methods (e.g., no observation of naturally occurring instruction), and observable behaviour regarding teaching demands (e.g., no decontextualised knowledge tests) were shortlisted. In this step, we examined the assessment tasks and whether they were profession-specific for teachers. For example, assessment instruments may aim to capture subject-independent competences that teachers need in their daily work (general pedagogical competences), like classroom management skills. Alternatively, they may capture profession-specific competences related to the subjects taught, like diagnosing a student’s mathematical error. However, we also found instruments with criterion demands that must be considered not specifically relevant for teaching (e.g., related to working within the subjects). For example, we excluded an instrument for experimental competence, which was reportedly not teaching-specific (see, e.g., Bruckermann et al. 2017). In total, the procedure led to a final corpus of 20 different PA instruments (Table 1).
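A minimal sketch of this screening step, assuming the three criteria are recorded as Boolean flags per candidate instrument; the flags and the example entries are illustrative assumptions, not the authors' actual screening records.

```python
from dataclasses import dataclass

@dataclass
class CandidateInstrument:
    """Hypothetical record of one candidate instrument during screening."""
    name: str
    objective_measurement: bool          # e.g., no self-evaluations, trained raters
    standardised_method: bool            # e.g., no naturally occurring instruction
    behaviour_on_teaching_demands: bool  # observable behaviour re teaching demands

def shortlist(candidates):
    """Keep only instruments meeting all three PA criteria."""
    return [c for c in candidates
            if c.objective_measurement
            and c.standardised_method
            and c.behaviour_on_teaching_demands]

# Invented examples (not the actual corpus)
examples = [
    CandidateInstrument("video-vignette diagnosis test", True, True, True),
    CandidateInstrument("experimental competence test", True, True, False),
]
print([c.name for c in shortlist(examples)])  # -> ['video-vignette diagnosis test']
```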

Table 1 Identified performance assessment instruments

4.2 Development of the category system

The category system to characterise the identified PA instruments was developed partly inductively using qualitative content analysis (Mayring 2015). We started with three groups of categories based on the literature: context characteristics (e.g., the school subject), test methods, and alignment with PA criteria. We examined the instruments to evaluate their suitability and added categories and codes to describe differences as they emerged. In a cyclical manner, we went back and forth between already examined instruments and new instruments; where necessary, we refined categories or discarded those that were redundant or not applicable. Both authors extensively discussed the categories and codes as they evolved during this structured process. The development of the category system was terminated when the inspection of new instruments did not require any further substantial changes in categories. We further allowed codes to be added if they had not appeared so far and their inclusion did not impact other codes within the category (e.g., adding school subjects).

In the following, we present the resulting category system (Table 2) and describe the steps of its development. Regarding context characteristics of the assessment instruments, we started with categories capturing the underlying theoretical framework, the targeted competence, the referenced school subject, and the purpose of the PA (e.g., certification or research). Comparing the instruments’ underlying theoretical frameworks proved unfeasible, as they partly remained non-transparent and partly were incompatible, e.g., across subjects. Instead, we included the targeted competence, which is closely related to the theoretical frameworks yet turned out to be more clearly reported. We inductively generated (nominal) codes for similar competences, even if researchers used different terminology (e.g., “lesson reflection” (Kempin et al. 2019) and “lesson analysis” were both coded as “instructional analysis”). We added a category intended level of outcomes (group or individual), as the instruments differed on this aspect.

Table 2 Category system to describe characteristics of the performance assessment instruments

For test methods, the category system distinguishes between different characteristics of the stimulus representation mode (e.g., text vignette, role play), the response type of the task (e.g., verbal response, acting out), the stimulus delivery medium and response medium (each, e.g., audio device, pen-and-paper). In addition, we identified the administration mode (individual or group), potential implementation of a time limit (e.g., strict time limit), and the degree of openness of tasks (e.g., predominantly open-ended or closed formats). We further extracted the work sample implementation used in the measurement process (Do participants have to complete one or more tasks with one/multiple measurement(s)?).

The third group of categories intends to capture the extent to which the instruments match the criteria for PA. Since the target performance level (Miller 1990) was part of the in-/exclusion criteria, it was not remapped in the category system. In addition, we intended to rate the degree to which the instruments represent authentic teaching demands (Bartels et al. 2019; Shavelson 2013). In detail, we delineated three categories relevant to authenticity in the literature: 1. proximity to real-life situations, 2. degree of interaction, and 3. attained feelings of involvement (Bartels et al. 2019; Codreanu et al. 2020; Kron et al. 2021; Shavelson 2013). However, empirical information about the attained feeling of involvement was absent or reported only unsystematically, so this category could not be applied.

The final category system covers 14 categories (5 regarding context, 7 regarding test methods, 2 regarding PA criteria) with 71 codes (Tables 2 and 3). Whereas for some categories the codes were mutually exclusive (single select), we allowed multiple code assignments for other categories.
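To illustrate what an instrument-level coding under this category system could look like, the following sketch represents one hypothetical instrument as sets of dichotomously assigned codes, with single-select categories restricted to exactly one code; category and code labels are abbreviated and partly assumed.

```python
# One hypothetical instrument coded against the category system: each code is
# rated dichotomously, so an instrument is represented by the set of codes that
# apply per category. Labels are shortened and partly assumed.
SINGLE_SELECT = {"administration_mode", "work_sample_implementation"}

instrument_coding = {
    "school_subject": {"mathematics"},
    "stimulus_representation_mode": {"video vignette", "text vignette"},
    "response_type": {"verbal response"},
    "administration_mode": {"individual"},
    "time_limit": {"strict time limit"},
    "work_sample_implementation": {"multiple work samples"},
    "proximity_to_real_life": {"specific teaching situation"},
    "degree_of_interaction": {"no dynamics"},
}

def single_select_respected(coding):
    """Check that single-select categories carry exactly one code."""
    return all(len(codes) == 1
               for category, codes in coding.items()
               if category in SINGLE_SELECT)

assert single_select_respected(instrument_coding)
```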

Table 3 Distribution of code frequencies regarding context characteristics (RQ1), characteristics of test methods, and performance assessment criteria (RQ2)

4.3 Coding and data analysis

Coding was done by the authors with the support of three trained student assistants using MaxQDA, applying the category system per PA instrument. For this purpose, the text passages relating to the implementation of the PA instrument were analysed, and each code was rated dichotomously (agree/disagree). To ensure reproducible coding, text passages indicative of the coders’ judgements were marked. Two coders examined all documents. After coding the first twelve instruments, discrepancies were discussed to make sure the coders had the same understanding of the category system. The interrater reliabilities per PA instrument at the level of the 14 categories were substantial to almost perfect (Cohen’s Kappa M = 0.91, Min = 0.61; Landis and Koch 1977). Discrepancies, which were found to be caused mainly by missing data (i.e., one coder may have missed a piece of information), were resolved in the master coding.
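The interrater check can be mirrored computationally: for each instrument, the two coders' dichotomous ratings are compared and Cohen's Kappa is averaged across instruments. The sketch below uses scikit-learn's cohen_kappa_score; the rating vectors are invented placeholders rather than the study's data, and the exact aggregation level (per category vs over the full code vector) is an assumption.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_per_instrument(coder_a, coder_b):
    """Cohen's Kappa per instrument over two coders' dichotomous ratings."""
    return {name: cohen_kappa_score(coder_a[name], coder_b[name])
            for name in coder_a}

# Invented placeholder ratings (0/1 per code) for two instruments
rng = np.random.default_rng(0)
coder_a = {"instrument_1": rng.integers(0, 2, 20),
           "instrument_2": rng.integers(0, 2, 20)}
coder_b = {name: ratings.copy() for name, ratings in coder_a.items()}
coder_b["instrument_1"][:2] ^= 1  # simulate two coding discrepancies

kappas = kappa_per_instrument(coder_a, coder_b)
print({k: round(v, 2) for k, v in kappas.items()},
      "mean =", round(float(np.mean(list(kappas.values()))), 2))
```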

The resulting data set (unit of analysis: assessment instrument, see Table 1 and Table A in the supplementary materials) was used descriptively to answer RQ1 and RQ2. To identify types of assessment instruments (RQ3), we used triangulation and conducted a qualitative type-building content analysis (Kuckartz 2014) complemented by a statistical explorative hierarchical cluster analysis. Both methods identify types/clusters based on similarities in the data set. A qualitative approach (type-building content analysis) was first chosen to explore the data. An explorative quantitative approach (cluster analysis) was conducted afterwards to check whether a different approach would yield the same types.

Two independent raters conducted the qualitative type-building content analysis. Following Kuckartz (2014), we aimed at the formation of homogeneous types (“monothetic types”), i.e., groups of instruments that vary only with regard to a few categories and codes. We reduced the category system by removing context characteristics not relevant for identifying types of PA (school subject, purpose, level of outcomes). In addition, we observed certain dependencies between the codes of the method-related categories. For instance, the variation in the dataset between the categories stimulus delivery medium, response medium, and stimulus representation mode could be captured without loss of information by two dichotomous categories: use of digital devices and use of video(s) to represent teaching situations.

Finally, we screened categories with high variance for possible simplifications. Regarding the targeted competence, we decided to restructure the codes following the differentiation by Lindmeier (2011). Hence, we used a new category type of demands to distinguish between instruments targeting in-instructional demands (AC), instruments targeting pre- and post-instructional demands (RC), and instruments targeting mixed demands (AC + RC). For example, instructional analysis competences (Kramer et al. 2020) require the participants to analyse a lesson on the fly from the perspective of a second-row observer, which can be related to in- as well as post-instructional demands. Likewise, diagnostic competence is relevant both during class and before or after class, in different manners. As the previous nominal coding could not simply be mapped onto the new ordinal one, the first author revisited the instruments to assign the type of demands accordingly.

After reducing and simplifying the variables, we started (partly exploratively) clustering the instruments. We first selected the variables which, based on considerations related to test-score interpretation (validity argument), appeared to be especially important for the classification of PAs (e.g., the variable informing about the type of demands to be mastered is considered more relevant than one holding information about organisational differences, like the administration mode). Clustering the instruments according to type of demands, response type, and time pressure turned out to be effective for grouping. The resulting groups were further investigated by adding and removing variables until the groups proved to be stable. Dichotomously coded variables were prioritised to allow more precise group separations. Since the instrument ELMaWi RC could not be definitively assigned due to its multifaceted nature, it was removed as an outlier for further analyses. Two raters conducted the qualitative type-building content analysis independently, which led to the same cluster variables and generated identical results.

Second, we validated our results by performing an exploratory cluster analysis. As we did not aim for a predefined number of clusters, we applied a hierarchical (instead of partitioning) agglomerative cluster analysis. Because we had no metric distance measures, the Ward method could not be used, so we applied the outlier-resistant complete-linkage method with Euclidean distance measures in SPSS (Backhaus et al. 2021).
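The original analysis was run in SPSS; the following sketch shows a roughly equivalent procedure in Python using SciPy's agglomerative clustering with complete linkage and Euclidean distances. The instrument-by-variable matrix is a random stand-in for the coded data; the dimensions (19 instruments after removing the outlier, 8 cluster variables) follow the description in the text.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Placeholder matrix: 19 instruments (ELMaWi RC removed as an outlier) x
# 8 ordinal/dichotomous cluster variables; the values are random stand-ins.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(19, 8)).astype(float)

# Complete-linkage agglomerative clustering with Euclidean distances,
# mirroring the analysis reported above (originally run in SPSS).
Z = linkage(X, method="complete", metric="euclidean")

# Cut the tree into three clusters (corresponding to types A, B, C)
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# The tree structure underlying a figure such as Fig. 3
tree = dendrogram(Z, no_plot=True)
```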

As this data-based clustering approach requires ordinal variables, we rescaled the nominal variable response type according to the responses’ proximity to real-life teacher actions. As many instruments were found to use more than one response type, six codes of increasing proximity to teacher action were needed (MC, MC & written response, written response, written & verbal response, verbal response, acting out). We further reduced the number of variables by eliminating those that covaried substantially with others (Spearman’s ρ > 0.5, Cohen 1988). For example, the use of digital devices was strongly associated with the “group” administration mode, so retaining both variables would not inform about differences between the instruments beyond what retaining one provides. We applied this process until we reached a minimal set of variables representing the variability within our data. Six variables proved sufficient following this procedure (work sample implementation, response type, use of video(s), time limit, degree of interaction, and proximity to real-life situations). However, we decided from a content perspective to re-include two variables that were seen as important for describing the clusters (type of demands, use of digital device(s)). As we observed no difference between the clusters resulting from the analyses based on six or eight cluster variables, we report the resulting types based on eight variables.
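A sketch of this variable-preparation step, assuming the coded data are held in a pandas DataFrame: the nominal response type is mapped onto the six-step ordinal scale reported above, and variables that covary strongly (|Spearman's ρ| > 0.5) are dropped one at a time. Column names and the order in which covarying variables are dropped are assumptions; in the study, content considerations also informed which variables were retained or re-included.

```python
import pandas as pd
from scipy.stats import spearmanr

# Six-step ordinal rescaling of response type by proximity to teacher action
RESPONSE_ORDER = {"MC": 1, "MC & written response": 2, "written response": 3,
                  "written & verbal response": 4, "verbal response": 5,
                  "acting out": 6}

def reduce_variables(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Iteratively drop variables that covary strongly (|Spearman rho| > threshold)."""
    keep = list(df.columns)
    changed = True
    while changed:
        changed = False
        for i, first in enumerate(keep):
            for second in keep[i + 1:]:
                rho, _ = spearmanr(df[first], df[second])
                if abs(rho) > threshold:
                    keep.remove(second)  # drop the second variable of the pair
                    changed = True
                    break
            if changed:
                break
    return df[keep]

# Usage (hypothetical DataFrame of coded instruments):
# df["response_type"] = df["response_type"].map(RESPONSE_ORDER)
# reduced = reduce_variables(df)
```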

5 Results

The developed category system informs about differences between PA instruments in teacher education research in the DACH region. Table 3 provides an overview of code frequencies summarised across the identified instruments. The coding for each instrument may be found in Table A in the supplementary material.

5.1 RQ1: What are the context characteristics of the instruments?

Regarding context characteristics, the following categories could be coded: school subject, targeted competence, type of demands, and purpose.

The 20 identified PA instruments are spread across six school subjects. In line with prior findings, research related to STEM subjects dominates the field of teacher PA, and many school subjects are not represented at all.

The targeted competences fall into different classes related to different teacher demands. Diagnostic and lesson analysis competence are focal. Action-related and explaining competences, as well as competences for lesson planning, are likewise frequently represented. Considering the results against the Lindmeier model (type of demands), 9 refer to in-instructional demands (AC), 7 to pre- and post-instructional demands (RC), and 4 instruments contain aspects from both.

Considering the purposes, all instruments are used for research purposes. Half are also used for instructional purposes (e.g., as learning opportunities for prospective teachers). Different research purposes could be delineated: in 8 cases, the research is related to instrument development, while more advanced purposes are reported in 12 cases (e.g., examining competence constructs). Two instruments are used to assess students’ competence. However, no instrument is used for grading or certification.

5.2 RQ2: Differences in terms of test methods and alignment with performance assessment criteria

As expected, the instruments show considerable differences with respect to test methods. Regarding the stimulus representation mode, many instruments are based on teaching vignettes, presented as videos, text vignettes, and/or role plays. Only one instrument could not be assigned to the category system because, instead of presenting a stimulus, it uses the (standardised) context of a teaching practicum (König et al. 2020). Across instruments, participants mostly have to submit their responses in an open (written or verbal) form. Occasionally, closed answer formats (e.g., rankings) were found. Accordingly, the degree of openness of tasks varies, sometimes even within an instrument, but open formats are dominant. A similar situation applies to time limits, which are predominantly implemented (except for 4 instruments), albeit in different forms (time for consideration vs a strict time limit). The medium used to deliver the stimulus (stimulus medium) is typically also used as the response medium; 15 instruments used the computer as the main medium. Regarding the administration mode, most instruments are administered in group settings. Regarding the work sample implementation, all but 4 instruments ask for more than one work sample.

The degree of proximity to real-life situations and the degree of interaction could be coded as indications of the alignment of the instruments with PA criteria. In 16 cases, tasks are situated by providing a specific teaching situation (e.g., videos, handwritten student solutions). One instrument uses a situation that the participants experienced in real life (an original lesson plan; König et al. 2020). Two instruments use only loose references to teaching (Lachner and Nückles 2016; Schröder et al. 2020), and one does not use any situating at all (Lachner et al. 2019). Most instruments do not use dynamic elements.

5.3 RQ3: Types of performance assessment instruments

The codes developed to answer RQ1 and RQ2 inform about certain similarities between the instruments (e.g., similar approaches regarding methods in ELMaWi-AC math, ELMaWi-AC wiwi, DaZKom video; regarding types of demands in DiKoBi, Profile-P+ (reflect), TEDS-Validierung). However, they also show considerable differences (e.g., three different assessment strategies for explanation competence in Profile-P-DET, ESCOP, Profile-P+ (explain)).

To uncover patterns between the instruments, we identified types of PAs. The qualitative type-building content analysis and the statistical exploratory cluster analysis both converged into three clusters (Fig. 3), which we describe below.

Fig. 3 Dendrogram of the cluster analysis with the optimal solution of 3 clusters

Type A (action): instruments of this type are characterised by formats with high action demands. They usually require participants to answer in a fast and immediate manner. The proximity to real-life situations is high: specific approximations of practice are presented via videos or role plays in the assessment. Response types are typically open-ended and require verbal answers or even acting out. Interactive settings are common but do not apply to all instruments. The instruments aim to capture the targeted competence close to teaching performance, with the underlying mindset of “Show me what you do!”.

Type B (analysis): instruments of this type are characterised by focusing on an analysis or diagnosis conducted by the participants. Typically, video or text vignettes with teaching situations are delivered in a computer-based assessment environment. These situations clearly reference real-life situations but can take a generalised, abstract perspective. The assessment often asks the participants to take the perspective of a second-row observer. For example, participants must evaluate instructional situations or teaching materials and/or produce (multiple) possible teaching actions. Response types and the degree of openness differ between instruments: participants have to provide written responses, select the best option, or rate given statements. The instruments elicit answers to the question “What could be done?” in order to (indirectly) obtain indications of the target competence.

Type C (product): instruments of this type are distinct from the others, as they focus on a certain product created by the participants in written form; in two of three cases, this is a plan for a specific lesson. The third instrument asks for a written explanation of a specific (mathematical) task. The researchers extract different performance indicators through the analysis of the product. Hence, measures are nested within one work sample. All three instruments reference real-life situations, two of them providing a hypothetical situation. The remaining instrument uses a lesson plan that participants had to create during a teaching practicum. In a nutshell, these instruments have in common that they take a specific (given or naturally occurring) teaching problem as a starting point and ask “What will you do?”; hence, they aim at foreshadowing teaching performance.

6 Discussion

This study provides an overview of the current state of PA instruments across different subjects in teacher education research in the DACH region. The aim was to systematically identify and record differences and similarities of approaches in the still fragmented research field, which so far lacks a common understanding and varies regarding reporting practices. We first identified and synthesised similarities and differences between underlying theoretical approaches and suggested criteria to delineate PA from other assessment approaches. We argued that PA pertains to standardised, objective measures of observable behaviour regarding teaching demands. Authenticity was found to be a central umbrella term closely related to questions of the validity of assessment tasks.

Based on our definition, a scoping literature review was conducted. The literature search yielded 20 relevant instruments, whose differences could be described by a category system related to context characteristics, methods, and alignment with PA criteria. In terms of context characteristics (RQ1), we observed a focus on STEM subjects, confirming earlier observations. Two instruments (the ELMaWi instruments) were aligned across two subjects (mathematics and economics), indicating a certain transferability of the approach across subjects. For many school subjects, however, we could not yet find any instrument.

Comparing the instruments against their underlying theoretical frameworks was not possible. Although teacher education is seen as the pioneering field in PA, the target competences were diverse and described with varying terminology. Mapping the assessment targets onto types of demands, i.e., pre-/post-instructional or in-instructional demands (see AC/RC, Lindmeier model), helped to structure them. So far, all instruments have been used for research purposes, although some were also used for instructional purposes in university teacher education. Other purposes (e.g., certification) have not been reported in the DACH region, which is unsurprising, given that teacher certification has traditionally been a state affair. However, as publications may primarily report on research purposes, the actual purposes (e.g., instructional purposes, grading purposes) may be more varied than the results suggest.

Looking at the methods (RQ2), we found widespread use of vignettes representing teaching situations that are digitally delivered and combined with open-ended response formats. However, the instruments again showed considerable differences, so a sophisticated category system regarding methods had to be developed. In turn, this shows that developers of instruments have to deal with many degrees of freedom and that summarising research findings in the future may be challenging.

Finally, the fine-grained analysis of characteristics enabled us to map the similarities and differences and thus uncover patterns. Clustering led to three types of PA instruments in current DACH teacher education research. Type A (action) instruments focus on in-instructional demands with close real-world approximation. In contrast, type B (analysis) instruments take a more reflective stance and test for the ability to provide (various) action possibilities. A few instruments (type C, product) ask participants to plan and hence foreshadow their actions.

Regarding the alignment of these types with the PA criteria, type A instruments show close similarity between the type of performance that is observed and the type of performance that is of interest (Kane et al. 1999). Despite this, instruments of type A differ in design (e.g., static methods based on video vignettes vs interactive role play; Jeschke et al. 2021; Wiesbeck et al. 2017). Typically, these assessments are very complex, for instance, concerning cognitive demands and scoring costs (Gartmeier et al. 2015; Kron et al. 2021). Davey et al. (2015) refer to the tension between the competing aspects of complexity and closeness to real-life criteria as the “elephant in the room” of PA, which must be addressed with increasing use (e.g., by reducing the cognitive demands and scoring costs with closed answer formats). Indeed, instruments of type B are typically of reduced complexity but nonetheless clearly related to teaching demands (see in-/exclusion criteria). These instruments show similarities to instruments measuring situation-specific skills that include decision-making, for example, under the notion of teacher noticing (Santagata et al. 2021). Hence, instruments of type B might provide interesting prototypes for further instrument development.

It should be noted that the typology reflects more than just how different groups of researchers have chosen to design instruments; in fact, within some projects, researchers developed instruments with different targets and methods (e.g., Profile-P). Similarly, the typology does not merely reflect our choice of grouping types of demands according to the Lindmeier model: whereas type A instruments cohere strongly with the in-instructional demands (AC), type B and C instruments reflect different aspects related to the RC component (pre-/post-instructional demands). Notably, the ELMaWi RC instrument could not be classified due to its multifaceted design with text and video vignettes as well as written and verbal responses (Jeschke et al. 2021).

6.1 Limitations

The fact that, to date, there is no common understanding of PA in educational research made it necessary to first synthesise a variety of approaches and suggest a definition of PA for this review. Our definition hinges on the understanding of the suggested criterion of observable behaviour regarding teaching demands. Although this criterion proved intersubjectively applicable in our process, it is an open question whether it is generally suited to delineate what should be regarded as PA in teacher education research.

Teacher education research spreads across different disciplines and is partially compartmentalised. To address the danger of missing relevant work, we started with broad search parameters and applied snowballing, but we cannot guarantee that we surfaced all relevant studies.

This review focuses on the diversity of instruments in an emerging field. Therefore, we took an inclusive stance, refrained from rating the quality of the instruments, and did not exclude any instruments, even if the reporting suggested a lack of quality. In some cases, we faced the problem of incomplete, inconsistent, or unreported essential information, which made applying the category system difficult. Thus, despite the thorough double-coding process, we cannot rule out that other coders might partly come to different conclusions. For future applications, it will be necessary to consider whether the category system needs to be inductively adapted.

Our intensive search yielded reports on only 20 instruments within the last five years, and many still focus on instrument development. Accordingly, our findings primarily reflect the current state of research regarding the design characteristics of instruments. Conclusions regarding other essential criteria, including the psychometric quality or the practical usefulness of approaches for different purposes, cannot currently be drawn.

6.2 Looking back and looking forward

The study aimed to systematically record the current state of PA instruments in DACH teacher education research. Compared to the findings of previous reviews, we see continuity as well as development. Regarding continuity, many of the instruments we identified are adapted or extended versions of the instruments identified by Zlatkin-Troitschanskaia et al. (2016). Still, the number of available instruments is small, and their use is limited to specific research purposes. At the same time, new developments and significant advances were made in connection with the German research program Modeling and Measuring Competencies in Higher Education (Zlatkin-Troitschanskaia et al. 2020b) funded by the Federal Ministry of Education and a Munich-based research unit funded by the German national science foundation (F. Fischer and Opitz 2022). This indicates that developing PA instruments requires prolonged, sustained research effort and highly specialised knowledge.

When we started this review, we hoped to also report on findings attained with PA instruments and to provide a systematic literature review (e.g., evidence for the convergent or discriminant validity of approaches through contrasting instruments assessing teacher competence from a holistic and an analytical approach). When scoping the first publications, we quickly had to realise that the reported research is very diverse in every respect. In most cases, the development process so far does not cover systematic investigations of psychometric quality, which would be mandatory before the instruments could be used at a larger scale in research and teacher education. Our review hence focused on surfacing differences and similarities of existing instruments, a perspective that was also missing in the earlier reviews. Researchers new to the field of PA might benefit from the proposed definition of PA and the category system, which identifies possible design options and might inform future work on important reporting demands. The three types of instruments may be read as three major strands within the new research field and can help to summarise future findings appropriately. In any case, working towards a common understanding of the emerging topic of PA will help researchers (across domains) to learn from each other and collaborate more efficiently.