Abstract
In recent years, a new branch of teacher competence research has emerged in which competence is measured close to teaching performance. Such so-called performance assessment approaches are gaining increasing attention, but the research field is still fragmented. A lack of overview and varying reporting practices interfere with its coherent development. This scoping literature review provides an overview of recent advances and the current state of performance assessment instruments in teacher education research in German-speaking countries. We examined assessment instruments that provide objective, standardised measurement procedures based on observable behaviour regarding teaching demands. Based on 20 assessment instruments, a category system with 14 categories was inductively developed, capturing their characteristics in terms of context, test methods, and alignment with criteria for performance assessment. Despite the considerable variation, three types of teacher performance assessment instruments could be identified through qualitative and exploratory statistical analyses. The results show continuity as well as development compared to previous reviews and provide suggestions on advancing the still-emerging research field. For example, they can be used to foster the coherence of the research field by providing information on typical instrument differences and similarities as well as essential reporting demands.
Zusammenfassung
In den letzten Jahren ist ein neuer Zweig der Lehrkräftekompetenzforschung entstanden, bei dem die Kompetenz nahe am tatsächlichen beruflichen Handeln von Lehrkräften gemessen wird. Solche sogenannten handlungsnahen Erhebungsmethoden (performance assessments) gewinnen zunehmend an Aufmerksamkeit, jedoch ist das Forschungsfeld immer noch fragmentiert. Diese Übersichtsarbeit gibt einen methodischen Überblick über die jüngsten Fortschritte und den aktuellen Stand der Messinstrumente zur handlungsnahen Kompetenzmessung in der Lehrkräftebildungsforschung im deutschsprachigen Raum. Untersucht wurden Testinstrumente, die objektive, standardisierte Messverfahren auf der Grundlage von beobachtbarem Verhalten bezüglich beruflicher Anforderungssituationen von Lehrkräften bieten. Ausgehend von 20 Beurteilungsinstrumenten wurde induktiv ein Kategoriensystem mit 14 Kategorien entwickelt. Dieses erfasst verschiedene Charakteristika in Bezug auf den Kontext, die Testverfahren und die Passung der Instrumente zu den Kriterien für handlungsnahe Erhebungsmethoden. Trotz beobachtbarer Unterschiede konnten sowohl durch qualitative wie auch explorative statistische Verfahren drei Typen von Instrumenten identifiziert werden. Die Ergebnisse zeigen sowohl Kontinuität als auch neue Entwicklungen im Vergleich zu früheren Reviews und geben Anregungen, wie das noch junge Forschungsfeld weiter vorangebracht werden kann. Die Ergebnisse können genutzt werden, um die Kohärenz des Forschungsfeldes zu fördern, da sie über typische Unterschiede und Gemeinsamkeiten der Instrumente sowie über wichtige Anforderungen an die Berichterstattung informieren.
1 Introduction
Teachers have been found to be an important factor for student learning (Hattie 2003), and, unsurprisingly, teacher education research has a long tradition of modelling and measuring what constitutes “teacher competence”. For many years, research focussed on professional knowledge and various motivational-affective factors as the core of teacher competence following Shulman (1986). Accordingly, knowledge tests and self-reports were the main kinds of assessment instruments in use (Frey 2006). Ever since these instruments came into use, however, they have been criticised as potentially inappropriate for certain purposes of teacher education research. Specifically, it was argued that typical knowledge tests might be limited in testing the ability to apply knowledge to real-life problems (Shavelson 2010). For example, to turn students’ misconceptions into learning opportunities in the classroom, it is not enough if teachers know which typical misconceptions might exist for a topic. Instead, based on their professional knowledge, they must use students’ learning processes to diagnose their actual understanding and flexibly apply appropriate strategies to overcome these misconceptions. However, knowledge tests are often not informative in this respect.
To address this challenge, a growing number of research endeavours have followed a complementary approach in recent years (von Aufschnaiter and Blömeke 2010; Kaiser and König 2019; Kunter et al. 2011; Lindmeier 2011). New perspectives on teacher competence as complex ability constructs led to the development of new, “situated” assessment formats that aim at testing competence close(r) to teaching performance (Kaiser et al. 2017). At the same time, ‘performance assessment’, in general, gained increasing attention in higher education in German-speaking countries (DACH) and teacher education research has been identified as leading in this field (Zlatkin-Troitschanskaia et al. 2016). Several recent national funding initiatives with a high impact on teacher education research contributed to these developments.
These new kinds of situated, close-to-performance, or performance assessment instruments were developed in parallel, following different approaches. However, the varying understandings and reporting practices make it difficult to build on existing research and interfere with the coherent development of the teacher education research field, which is traditionally fragmented (e.g., according to school subjects) in the DACH region. A cross-subject overview is missing. This scoping literature review addresses this gap and aims to provide an overview of the current state of research regarding what can be subsumed under so-called ‘performance assessment’ instruments in DACH teacher education research.
2 Theoretical background
2.1 Frameworks for modelling and assessing teacher competence
How to assess teacher competence is a long-standing question in educational research. For example, questions regarding teaching effectiveness, teacher education outcomes, or professional development design hinge on the understanding of how teacher competence is conceptually modelled and which instruments are used for assessing it.
Many conceptualisations of teacher competence in the DACH region are influenced by an understanding of competence as a “complex ability construct” that includes cognitive abilities and skills to solve specific problems that are associated with motivational, volitional, and social dispositions and which can be learned (e.g., Koeppen et al. 2008). Competence is hence always defined in relation to specific real-life problems, here to problems of teaching. Based on this, different models of teacher competence have been derived.
A prominent example is the COACTIV model (Fig. 1, left), which analytically delineates factors contributing to professional competence, like professional knowledge, motivational orientations, or self-regulation skills (Baumert and Kunter 2011). The teachers’ professional knowledge constitutes the core of teacher competence in this and similar models, so Kaiser et al. (2017) speak of a cognitive strand of teacher competence frameworks. Assessment instruments typically cover tests or self-reports, e.g., for teacher knowledge or motivational orientations. Due to the modelling strategy of analytically delineating factors of competence (dispositions), they may also be called analytical approaches.
Fig. 1: Analytical and holistic approaches to model teacher competence and their relation to the competence-as-continuum model (Blömeke et al. 2015). The illustration contains the COACTIV model (Kunter et al. 2011) on the left as an example of an analytical approach and the model according to Lindmeier (2011) on the right as an example of a holistic approach (RC in dashed lines and AC requirements in continuous lines)
Studies using such analytical approaches led to many important and influential findings, e.g., regarding the structure of teacher knowledge, the relevance of teacher knowledge for student learning (Kunter et al. 2011), or the international variability of outcomes of teacher education (Blömeke et al. 2009). To date, they constitute the majority of research in the area of teachers’ professional competence (see, e.g., König et al. 2022; Voss et al. 2015). Nevertheless, it has been questioned whether these approaches are sufficiently powerful to answer certain questions related to teacher competence, particularly regarding teaching performance. This has led to calls for alternative approaches.
As a result, several researchers suggested an understanding of teachers’ professional competence closer to professional action and the ability to master different situational demands (Kaiser et al. 2015; Shavelson 2010; Zlatkin-Troitschanskaia et al. 2016). Instead of analytically decomposing teacher competence into contributing factors, they consider the professional demands that need to be mastered as unbreakable entities and seek to model competence accordingly. For example, teachers have to lead discussions in the classroom, prepare lessons, explain concepts, or support students emotionally. Accordingly, these frameworks seek to specify areas of competence by distinguishing and describing different demands of teaching (Kaiser et al. 2015). Depending on the strategy dealing with the complexity of modelling teacher competence by identifying different demands but not decomposing it into factors, we will call this second strand holistic approaches in contrast to the analytical approaches (Fig. 1, right).
An example of the holistic approach is the structure model of subject-specific teacher competence according to Lindmeier (2011). On the basis of cognitive dual-processing theories (Evans 2008), it conceptualises two components of competence: so-called Reflective Competence (RC), pertaining to pre- and post-instructional demands, and Action-related Competence (AC), pertaining to in-instructional demands. Teachers are considered to hold AC if they master professional demands like diagnosing student understanding in instruction on the fly. Similarly, if they master the professional demands of preparing for or reflecting on instruction, they are considered to hold RC (e.g., Jeschke et al. 2021).
Holistic approaches have only recently been proposed more frequently, so there is no common framework yet. However, the different proposals have one thing in common: they assess competence close to teaching performance. They require, for instance, participants to diagnose a student’s understanding in a simulated situation (Kron et al. 2021) or directly provide an explanation to a specific (videotaped) student question under time pressure (Hecker et al. 2020b; Jeschke et al. 2021). The researchers may use demands with reduced complexity compared to real-life teaching, but still, the tasks used for assessment may be understood as “approximations of practice” (Grossman et al. 2009). These so-called performance assessments (PA) are typical of the holistic approach and have often been contrasted against “decontextualised” knowledge tests of the analytical approach (Kaiser et al. 2017; König et al. 2022).
A framework synthesizing the alternative perspectives considers teachers’ professional competence as a continuum (Blömeke et al. 2015). Professional competence is understood to rely on specific professional dispositions (e.g., professional knowledge, motivational orientations) as proposed by the analytical approaches. However, the observable real-life behaviour of a teacher (teaching performance) is influenced by the conditions of the respective situation and requires the integration and application of dispositions according to holistic approaches. Situation-specific skills, like noticing skills, were proposed as mediating between dispositions and performance (Stahnke and Blömeke 2021). Figure 1 summarises how the two contrasted modelling and assessment approaches might be related to the competence-as-a-continuum model and highlights their complementary nature: Professional action needs professional dispositions (e.g., professional knowledge), but it may not be sufficient to understand the former as a mere effect of the latter. It should be emphasised that the two ways of modelling, and consequently assessing, teacher competence should be seen as complementary approaches with certain strengths and shortfalls, like the practical and theoretical tests used for obtaining a driving licence (analogy by Zlatkin-Troitschanskaia et al. 2015).
To sum up, although the lion’s share of high-impact teacher research in recent decades has relied on knowledge tests and self-reports, a significant branch of research using assessment methods close to performance has developed. While there is a long history of research and experience with the former, the recent PA approaches have developed independently and so far lack a common perspective.
2.2 Performance assessment and other assessment methods close to teaching performance
Standardised assessment methods that strive to approximate real-life demands of a profession are generally seen as having several advantages: first, like knowledge tests, they are objective measures and thereby superior to subjective measures like self-reports. Second, unlike knowledge tests, they are less prone to test for inert knowledge because the test situations are close(r) to performance (McClelland 1973). Third, they seem to be especially suited for assessing the ability to master complex professional demands, which depend particularly on integrating various areas of knowledge and skills (Shavelson 2013). In the US, for example, teacher PA is used not only for research but also for summative assessment for certification and placement purposes (e.g., PACT; Pecheone and Chung 2006).
When developing assessments close to performance, educational researchers may draw on the experience of adjacent research fields, like medical education, where such methods have a long tradition (Miller 1990). For instance, tests based on role plays involving trained actors as standardised patients are regularly used to assess the performance of medical students. Systematising the various assessment approaches in medical education, Miller (1990) delineated four hierarchical levels with varying degrees of proximity to the professional action: instruments may target assessing knowledge (knows), competence (knows how), performance (shows how), or action (does). Whereas action refers to the practitioner’s way of dealing with a professional demand, performance refers to its approximation in an assessment situation, e.g., with standardised patients. This framework’s competence and knowledge levels pertain to increasingly decontextualised assessment methods.
It must be stressed that Miller’s terminology partially conflicts with the current one in teacher education research (e.g., see Blömeke et al. (2015), where real-world performance can be compared to action level). However, his distinction aligns with the understanding of the term “PA” as used in further contexts. In psychology, “the defining characteristic of a performance assessment is the close similarity between the type of performance that is actually observed and the type of performance that is of interest.” (Kane et al. 1999). With similar intentions, Shavelson (2013) postulates that assessment in education should “produce observable performance” on tasks with “a high fidelity” regarding the “real world (criterion)” situations to which inferences of competences are to be drawn (p. 75).
However, looking at research practices, a wide variety of different interpretations can be observed. A review by Palm (2008) shows that a common understanding of the defining characteristics of PA is missing, and various definitions (and measurement approaches with different names but similar intentions) are used in parallel. For instance, some researchers tie their understanding of PA to a certain type of response (e.g., constructed responses) or simulations that allow the observation of solution processes (Palm 2008).
To date, the perspectives also vary broadly in the specific field of teacher education research. Therefore, when preparing this study, we were challenged first to answer the question: what is the current conceptual understanding of PA in teacher education in the DACH region, and which criteria might be appropriate? Bartels et al. (2019), for example, delineate three necessary criteria: PA instruments should have professional relevance regarding the content (e.g., choice of subject matter), be authentic in the sense that participants need to show professional action, and be interactive. With the latter, they refer to the use of dynamics or adaptations, highlighting an authenticity aspect. Therefore, they consider assessment approaches that fail the interactivity criterion to be only performance-oriented.
Regarding the assessment of AC, Lindmeier (2011) discusses—without referring to the term PA—the necessity to mirror as closely as possible the complex, immediate, uncertain, and interactive nature of teaching. However, the standardisation of tests competes with their interactive design. Considering this, AC assessment instruments use video vignettes that require participants to respond verbally under time pressure and so trade off interactivity in favour of standardisation (Jeschke et al. 2021; Knievel et al. 2015; Lindmeier 2011). These examples show that interactivity may not be mandatory for assessments close to performance.
Such needs to balance potentially competing factors are evident at many points in the context of PA. It is essential to consider which aspects should be used as defining characteristics. For example, Codreanu et al. (2020) highlight that in developing simulation-based approximations of practice, it is necessary to balance the complexity of the cognitive demands with the authenticity of the simulation in terms of the referenced real-life demands. Particularly, they argue that it is important not only to expose participants to authentic situations but to involve them cognitively and motivationally in the situation. Again, however, it may not be necessary to see this aspect of authenticity (e.g., as high achieved involvement) as a defining aspect of PA. Rather, motivational and affective involvement may be understood as contributing to validity in terms of cognitive processes during the assessment that may have to be balanced against other aspects of validity.
Integrating across the given perspectives, we suggest the following definition for PA in this review: we understand PA as referring to standardised methods to assess teacher competence (as complex ability constructs) objectively, close to performance (“shows how” according to Miller 1990), with a focus on observable behaviour (performance) regarding teaching demands. We include standardisation as an essential test quality criterion, especially regarding test requirements and scoring criteria. Likewise, we include the criterion of objective measurement, mainly referring to objective scoring procedures. The criterion of observable behaviour regarding teaching demands relies on the specificity of the holistic approach of modelling teacher competence. It is in line with the psychological understanding of PA. Following the suggestion by Blömeke et al. (2015), we distinguish situation-specific skills, such as teacher noticing, from performance and understand situation-specific skills as a prerequisite of performance. However, in contrast to Blömeke et al. (2015) but in line with Shavelson (2013) and Miller (1990), we acknowledge that performance may refer to part-demands with reduced complexity and may be observed in settings other than real-life teaching situations.
The criterion observable behaviour regarding teaching demands may often be associated with direct measures and open-ended (constructed) responses, which are both related to questions of validity. It should be kept in mind, however, that for certain target competences or in the case of well-advanced research, valid tests based on closed (selected) responses or indirect measurements (e.g., advocatory approach) may also be thought of (Oser et al. 2010). Our definition may be conceived as leading to a comparably broad category with considerable room for interpreting what counts as “teaching demands”. Remarkably, we also decided to omit any explicit criterion related to authenticity. In line with Shavelson (2013), we consider the core aspect of the holistic approaches, namely the sampling of test tasks from the real-life reference, to be sufficiently mirrored by the reference to teaching demands.
According to our understanding, PA instruments may consequently use various methods. The few portrayed examples of PA instruments show already considerable variation regarding the presentation of tasks (e.g., video, role play), the type of response (e.g., verbal response, enactment, multiple choice (MC)), the degree of interactivity, or the type and grain size of targeted competences. However, instruments without a clear focus on teaching demands as a real-life criterion (e.g., tests of decontextualised knowledge without situation references, assessment of distinct situation-specific skills), subjective (e.g., self-reports), or unstandardised measurements (e.g., scoring by untrained persons, Schütze et al. 2017) are not considered as PA under this perspective.
2.3 Prior reviews with relevance to this review
There are a few prior reviews relevant to this review. Frey (2006) provides an overview of methods and instruments assessing teachers’ professional competence from the DACH region from 1991 through 2005. Although PA instruments had already been reported, only 6% of the 47 identified instruments were classified as such. However, these did not target teacher-specific professional competences but ones with broader professional relevance, like managerial competence (Etzel and Küppers 2000).
In 2016, a review with an international perspective on assessment in higher education was issued as an update and extension of a white paper from 2010 (Zlatkin-Troitschanskaia and Kuhn 2010; Zlatkin-Troitschanskaia et al. 2016). The main objective of this review was to provide a comprehensive overview (range 2010 to 2014) of objective, reliable and valid instruments to assess higher education learning outcomes with a focus on large-scale assessments in a holistic understanding (Zlatkin-Troitschanskaia et al. 2016, p. 3). Despite varying reporting practices presenting a challenge throughout the field of PA, the review shows that assessment approaches vary considerably and include testing for knowledge as well as self-assessment. So only some of the instruments identified would pass the criterion of observable behaviour regarding teaching demands we suggested. However, the authors found particularly innovative developments in teacher education research, both nationally and internationally, and surfaced 18 instruments in our focus region (Zlatkin-Troitschanskaia et al. 2016, p. 56 ff., list not intended to be exhaustive) mainly with a focus on STEM subjects. Teacher education research is recognised as a pioneer in the field of PA in this overview, but a detailed examination of the instruments is beyond its scope.
To our knowledge, the most recent review with certain relevance is from 2020 (Zlatkin-Troitschanskaia et al. 2020a). It focuses on the state of research regarding the assessment of teaching competences (and teaching quality) in higher education between 2012 and 2018. Again, teacher education was identified as a pioneering field but not the focus of this review.
One may also consult reviews on teacher competence constructs located at other points of the competence-as-a-continuum model. One example is the review on pedagogical knowledge by Voss et al. (2015) (analytical approach, PK as a disposition). Other examples are the reviews focusing on teacher noticing or situation-specific skills of (mathematics) teachers (König et al. 2022; Santagata et al. 2021; Stahnke and Blömeke 2021). Each shows a broad variety of assessment approaches, including some sharing certain characteristics with PAs. This indicates that boundaries between constructs and related assessment approaches may not always be drawn sharply. What is still missing is a systematic perspective on PA in teacher education research that details its current state of research.
3 Research questions
PA is still an emerging topic in DACH teacher education research, as new instruments have been developed in recent years by different research groups in different contexts. However, it is unclear whether there is a common understanding of PA in this still fragmented field, which is also impacted by varying reporting practices.
Therefore, this contribution aims to provide an overview of recent advances and the current state of PA instruments in teacher education research in the DACH region by conducting a scoping literature review and investigating the following research questions:
RQ1
What are the context characteristics (e.g., school subjects, intended purposes, theoretical frameworks) of the PA instruments?
RQ2
How do PA instruments differ in terms of methods and alignment with PA criteria?
To uncover potentially emerging strands within the new research field, it may be of interest whether the designs of the instruments follow certain patterns, so we ask:
RQ3
What types of PA can be distinguished?
In detail, we aim to describe the PA instruments’ variation across different subjects concerning contexts (RQ1), methods, and their ways of reflecting the criteria established for PA (RQ2). We aim to identify and coherently describe characteristics by means of a category system. We expect that this will allow us to display the diversity of instruments as well as potential similarities and differences systematically. Based on our systematisation of instrument characteristics, we expect that different types can be distinguished (RQ3).
4 Methods
In autumn/winter 2021, we conducted a systematic literature search to identify the relevant PAs. These were subsequently systematically analysed for differences. Based on the existing literature, analytical categories were refined or complemented inductively and compiled into a system which was applied to all identified instruments to answer RQ1 and RQ2. Finally, we used clustering methods to uncover different types of instruments (RQ3).
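The clustering step described above can be illustrated with a minimal sketch under simplifying assumptions: each instrument is represented as a binary vector of category codes, and hierarchical clustering with a Jaccard distance groups instruments with similar code profiles. The matrix below is hypothetical illustration data, not the study's actual codings, and the choice of average linkage is an assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical dichotomous codings: rows = instruments, columns = codes
# (1 = code applies). The actual study codes 20 instruments on 71 codes.
codings = np.array([
    [1, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 1],
    [1, 0, 1, 1, 0, 1],
])

# Jaccard distance suits binary code profiles: shared absences of a code
# do not count as similarity.
distances = pdist(codings, metric="jaccard")

# Agglomerative clustering with average linkage; cut the tree into 2 types.
tree = linkage(distances, method="average")
types = fcluster(tree, t=2, criterion="maxclust")
print(types)  # one type label per instrument
```

With this toy data, instruments 1, 2, and 5 share one type and instruments 3 and 4 the other; in practice, the number of clusters would be chosen by inspecting the dendrogram rather than fixed in advance.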
4.1 Literature search and processes of instrument identification
The literature search and instrument identification followed several steps (Petticrew and Roberts 2006; Fig. 2). First, we searched for relevant publications in the education-specific literature databases ERIC and peDOCS, as well as the general scientific literature database Web of Science (restricted to the categories Education and Educational Research). Since relevant reviews, although partly with a broader perspective, cover the period up to 2016 and found that PA was rarely used in the target region (e.g., Zlatkin-Troitschanskaia et al. 2016), we considered only the period 2016 to 2021. In line with our focus on the DACH region, the following English and German keywords were used in any possible combination:
- English: performance test, performance assessment, measurement, teacher education, performance, teacher, teaching, teaching skills, performance-based testing
- German: Performanz, Test, Kompetenzmessung, Lehrkräfteausbildung, Lehrerausbildung, Lehrkräfte, Lehrer, Simulation
Publications in German, as well as English, were included. Since authors may use other synonyms that do not match our search criteria, it is difficult to identify all relevant measurement instruments with pre-set keywords. Thus, we additionally used the snowball principle during the next steps of literature processing and identified further publications by investigating their references. Newly identified publications were added during the process until no more additional publications surfaced. In total, 242 publications were considered.
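The phrase “in any possible combination” can be made concrete with a small illustrative sketch that pairs the English keywords into two-term queries. The `AND` query syntax is an assumption for illustration; the exact syntax depends on the database interface.

```python
from itertools import combinations

# English keywords as listed in the search strategy.
english_keywords = [
    "performance test", "performance assessment", "measurement",
    "teacher education", "performance", "teacher", "teaching",
    "teaching skills", "performance-based testing",
]

# All unordered keyword pairs, rendered as two-term AND queries.
queries = [f'"{a}" AND "{b}"' for a, b in combinations(english_keywords, 2)]
print(len(queries))  # 9 keywords yield 36 pairwise queries
```

Higher-order combinations (triples and beyond) could be generated the same way by varying the second argument of `combinations`.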
In the second step, these publications were examined on the basis of the title, abstract, and keywords. These were analysed in terms of content by the first author. If they proved not to be relevant to our focus (e.g., not from DACH teacher education research; based on self-assessments), they were excluded from further analysis. This resulted in a corpus of 42 publications.
In the third step, we scoped the publications. Starting with the methods section and including, if necessary, other sections, we conducted a content analysis and manually extracted relevant information (Mayring 2015).
According to our research questions, we used the PA instruments as units of analysis. As there are multiple publications on most of the instruments and as some report on several instruments, we determined the unique assessment instruments covered by the identified publications. Accordingly, we restructured the information extracted from the publications per instrument and developed descriptions of every instrument. In cases where the project homepages provided relevant information, we also included it. In a few cases, we contacted the corresponding authors to get further information that was not published (yet) but relevant to the coding process (e.g., time limits, see category system).
To exclude instruments that are not within the scope of PA as defined above, we used three criteria: only instruments with objective measurement methods (e.g., no self-evaluations or evaluation by untrained raters), standardised methods (e.g., no observation of naturally occurring instruction), and observable behaviour regarding teaching demands (e.g., no decontextualised knowledge tests) were shortlisted. In this step, we examined the assessment tasks and whether they were profession-specific for teachers. For example, assessment instruments may aim to capture subject-independent competences that teachers need in their daily work (general pedagogical competences), like classroom management skills. They may otherwise capture profession-specific competences related to the subjects taught, like diagnosing a student’s mathematical error. However, we also found instruments with criterion demands that must be considered not specifically relevant for teaching (e.g., related to working within the subjects). For example, we excluded an instrument for experimental competence, which was reportedly not teaching-specific (see, e.g., Bruckermann et al. 2017). In total, the procedure led to a final corpus of 20 different PA instruments (Table 1).
4.2 Development of the category system
The category system to characterise the identified PA instruments was developed partly inductively using qualitative content analysis (Mayring 2015). We started with three groups of categories based on the literature: context characteristics (e.g., the school subject), test methods, and alignment with PA criteria. We examined the instruments to evaluate their suitability and added categories and codes to describe differences as they emerged. In a cyclical manner, we went back and forth between already examined instruments and new instruments; if necessary, we refined categories or discarded those that were redundant or not applicable. Both authors extensively discussed categories and codes as they evolved during the structured process. The development of the category system was terminated when the inspection of new instruments did not require any further substantial changes in categories. We further allowed adding codes if they had not appeared so far and if their inclusion did not impact other codes within the category (e.g., adding school subjects).
In the following, we present the resulting category system (Table 2) and describe steps of its development. Regarding context characteristics of the assessment instruments, we started with the categories to capture the underlying theoretical framework, the targeted competence, the referenced school subject, and the purpose of the PA (e.g., certification or research). Comparing the instrument’s underlying theoretical frameworks proved unfeasible, as they partially remained non-transparent and partially were incompatible, e.g., across subjects. Instead, we included the targeted competence, which is closely related to the theoretical frameworks yet revealed to be more clearly reported. Therefore, we inductively generated (nominal) codes for similar competences, even if researchers used different terminology (e.g., “lesson reflection” (Kempin et al. 2019) and “lesson analysis” were both coded as “instructional analysis”). We added a category intended level of outcomes (group or individual) as the instruments differed on this aspect.
For test methods, the category system distinguishes characteristics of the stimulus representation mode (e.g., text vignette, role play), the response type of the task (e.g., verbal response, acting out), and the stimulus delivery medium and response medium (each, e.g., audio device, pen-and-paper). In addition, we identified the administration mode (individual or group), the potential implementation of a time limit (e.g., strict time limit), and the degree of openness of tasks (e.g., predominantly open-ended or closed formats). We further extracted the work sample implementation used in the measurement process (Do participants have to complete one or more tasks with one/multiple measurement(s)?).
The third group of categories is intended to capture the extent to which the instruments match the criteria for PA. Since the target performance level (Miller 1990) was part of the in-/exclusion criteria, it was not remapped in the category system. In addition, we intended to rate the degree to which the instruments represent authentic teaching demands (Bartels et al. 2019; Shavelson 2013). In detail, we delineated three categories relevant to authenticity in the literature: 1. proximity to real-life situations, 2. degree of interaction, and 3. attained feelings of involvement (Bartels et al. 2019; Codreanu et al. 2020; Kron et al. 2021; Shavelson 2013). However, empirical information about the attained feelings of involvement was absent or reported only unsystematically, so this category could not be applied.
The final category system covers 14 categories (5 regarding context, 7 regarding test methods, 2 regarding PA criteria) with 71 codes (Tables 2 and 3). Whereas for some categories the codes were mutually exclusive (single select), we allowed multiple code assignments for other categories.
4.3 Coding and data analysis
Coding was done by the authors with the support of three trained student assistants in MAXQDA by applying the category system per PA instrument. To this end, the text passages relating to the implementation of the PA instrument were analysed, and each code was rated dichotomously (agree/disagree). To ensure reproducible coding, text passages indicative of the coders’ judgements were marked. Two coders examined all documents. After coding the first twelve instruments, discrepancies were discussed to ensure that the coders had the same understanding of the category system. The interrater reliabilities per PA instrument on the level of the 14 categories were substantial to almost perfect (Cohen’s Kappa M = 0.91, Min = 0.61; Landis and Koch 1977). Discrepancies, which were found to be caused mainly by missing data (i.e., one coder may have missed a piece of information), were adjusted in the master coding.
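As an illustrative sketch of the reported reliability statistic (not the authors’ actual MAXQDA workflow), Cohen’s Kappa for two raters over dichotomous codes can be computed as follows; the example ratings are invented:

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa: chance-corrected agreement of two raters."""
    n = len(ratings_a)
    assert n == len(ratings_b) and n > 0
    categories = set(ratings_a) | set(ratings_b)
    # Observed agreement: share of items both raters coded identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement from the raters' marginal distributions.
    p_e = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
              for c in categories)
    if p_e == 1:
        return 1.0  # both raters used a single, identical category
    return (p_o - p_e) / (1 - p_e)

# Invented example: 3 of 4 agree/disagree codings match.
kappa = cohens_kappa([1, 1, 1, 0], [1, 1, 0, 0])  # = 0.5
```

Values above 0.61 are conventionally read as substantial and above 0.81 as almost perfect agreement (Landis and Koch 1977), which is the scale applied in the text.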
The resulting data set (unit of analysis: assessment instrument, see Table 1 and Table A in the supplementary materials) was used descriptively to answer RQ1 and RQ2. To identify types of assessment instruments (RQ3), we used triangulation and conducted a qualitative type-building content analysis (Kuckartz 2014) complemented by a statistical explorative hierarchical cluster analysis. Both methods identify types/clusters based on similarities in the data set. A qualitative approach (type-building content analysis) was first chosen to explore the data. An explorative quantitative approach (cluster analysis) was conducted afterwards to check whether a different approach would yield the same types.
Two independent raters conducted the qualitative type-building content analysis. Following Kuckartz (2014), we aimed at the formation of homogeneous types (“monothetic types”), i.e., groups of instruments that vary only with regard to a few categories and codes. We removed from the category system those context characteristics not relevant for identifying types of PA (school subject, purpose, level of outcomes). In addition, we observed certain dependencies between the codes of the method categories. For instance, the variation in the dataset across the categories stimulus delivery medium, response medium, and stimulus representation mode could be displayed without loss of information by two dichotomous categories: use of digital devices and use of video(s) to represent teaching situations.
Finally, we screened categories with high variance for possible simplifications. Regarding the targeted competence, we decided to restructure the codes following the differentiation by Lindmeier (2011). Hence, we used a new category type of demands to distinguish between instruments targeting in-instructional demands (AC), instruments targeting pre- and post-instructional demands (RC), and instruments targeting mixed demands (AC + RC). For example, instructional analysis competences (Kramer et al. 2020) require the participants to analyse a lesson on the fly from the perspective of a second-row observer, which can be related to in- as well as post-instructional demands. Likewise, diagnostic competence is relevant during class as well as before or after class, albeit in different ways. As the previous nominal coding could not simply be mapped onto the new ordinal one, the first author revisited the instruments to assign the type of demands accordingly.
After reducing and simplifying the variables, we started (partly exploratively) clustering the instruments. We first selected the variables which, based on considerations related to test-score interpretation (validity argument), appear to be especially important for the classification of PAs (e.g., the variable that informs about the type of demands to be mastered is considered more relevant than one that holds information about organisational differences, such as the administration mode). Clustering the instruments according to type of demands, response type, and time pressure turned out to be effective for grouping. The resulting groups were further investigated by adding and removing variables until the groups proved to be stable. Dichotomously coded variables were prioritised to allow more precise group separations. Since the instrument ELMaWi RC could not be definitively assigned due to its multifaceted nature, it was removed as an outlier from further analyses. Both raters conducted the qualitative type-building content analysis independently, arrived at the same cluster variables, and generated identical results.
Second, we validated our results by performing an exploratory cluster analysis. Since we did not aim for a predefined number of clusters, we applied a hierarchical (instead of a partitioning) agglomerative cluster analysis. Since our data provided no metric distance measures, the Ward method could not be used; we therefore applied the outlier-resistant complete-linkage method with Euclidean distance measures in SPSS (Backhaus et al. 2021).
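The clustering itself was run in SPSS; as a minimal sketch of the complete-linkage logic (all data invented, not the authors’ dataset), an agglomerative version in pure Python could look like this:

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python >= 3.8)

def complete_linkage(points, n_clusters):
    """Agglomerative clustering with complete linkage: the distance
    between two clusters is the LARGEST pairwise member distance,
    which makes merges resistant to chaining via single outliers."""
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(c1, c2):
        return max(dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest linkage distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]))
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

# Invented toy data: items 0/1 and 2/3 form tight pairs, 4 is isolated.
groups = complete_linkage([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], 3)
```

A hierarchical (rather than partitioning) method fits the stated goal of not fixing the number of clusters in advance: the full merge history can be inspected before deciding where to cut.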
As this data-based clustering approach requires ordinal variables, we rescaled the nominal variable response type according to its proximity to real-life teacher actions. As many instruments were found to use more than one response type, six codes of increasing proximity to teacher action were needed (MC, MC & written response, written response, written & verbal response, verbal response, acting out). We further reduced the number of variables by eliminating those that covaried substantially with others (Spearman’s ρ > 0.5; Cohen 1988). For example, the use of digital devices was strongly associated with the “group” administration mode, so retaining both variables would not inform about differences between the instruments beyond retaining one. We repeated this process until we reached a minimal set of variables representing the variability within our data. Six variables proved sufficient following this procedure (work sample implementation, response type, use of video(s), time limit, degree of interaction, and proximity to real-life situations). However, from a content perspective, we decided to re-include two variables that were seen as important for describing the clusters (type of demands, use of digital device(s)). As we observed no difference between the clusters resulting from the analyses based on six or eight cluster variables, we report the resulting types based on eight variables.
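The redundancy filter described above (eliminate a variable if it correlates with a retained one at Spearman’s ρ > 0.5) can be sketched as follows; the variable names and data are invented for illustration, and ties are handled via average ranks, as is standard for Spearman’s ρ:

```python
def average_ranks(xs):
    """Rank values 1..n, assigning tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):              # ties share the average of
            ranks[order[k]] = (i + j) / 2 + 1  # positions i+1 .. j+1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def drop_redundant(variables, threshold=0.5):
    """Greedily keep a variable only if it does not correlate
    substantially (|rho| > threshold) with any variable kept so far."""
    kept = []
    for name, values in variables.items():
        if all(abs(spearman_rho(values, variables[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept
```

With invented codings such as `{"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 2, 1, 2]}`, the filter drops "b" (perfectly rank-correlated with "a") and keeps "a" and "c"; the order of the input determines which member of a correlated pair survives, so content-based re-inclusion, as done in the study, remains a separate step.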
5 Results
The developed category system informs about differences between PA instruments in teacher education research in the DACH region. Table 3 provides an overview of code frequencies summarised across the identified instruments. The coding for each instrument may be found in Table A in the supplementary material.
5.1 RQ1: What are the context characteristics of the instruments?
Regarding context characteristics, the following categories could be coded: school subject, targeted competence, type of demands, and purpose.
The 20 identified PA instruments are spread across 6 school subjects. In line with prior findings, research related to STEM subjects dominates the field of teacher PA; many school subjects are not represented at all.
The targeted competences fall into different classes related to different teacher demands. Diagnostic and lesson analysis competences are focal. Action-related and explaining competences, as well as competences for lesson planning, are likewise frequently represented. Considering the results against the Lindmeier model (type of demands), 9 instruments refer to in-instructional demands (AC), 7 to pre- and post-instructional demands (RC), and 4 contain aspects of both.
Considering the purposes, all instruments are used for research. Half are also used for instructional purposes (e.g., as learning opportunities for prospective teachers). Different research purposes could be delineated: in 8 cases, the research relates to instrument development, while advanced purposes are reported in 12 cases (e.g., examining competence constructs). Two instruments are used to assess students’ competence. However, no instrument is used for grading or certification.
5.2 RQ2: Differences in terms of test methods and alignment with performance assessment criteria
As expected, the instruments show considerable differences with respect to test methods. Regarding the stimulus representation mode, many instruments are based on teaching vignettes, delivered as videos, text vignettes, and/or role plays. Only one instrument could not be assigned within the category system because, instead of presenting a stimulus, it uses the (standardised) context of a teaching practicum (König et al. 2020). Across instruments, participants mainly have to submit their responses in an open (written or verbal) form. Occasionally, closed answer formats (e.g., rankings) were found. Accordingly, the degree of openness of tasks varies, sometimes even within an instrument, but open formats are dominant. A similar situation applies to time limits, which are predominantly implemented (in all but 4 instruments), albeit in different forms (time for consideration vs a strict time limit). The medium used to deliver the stimulus (stimulus medium) is typically also used as the response medium; 15 instruments used the computer as the main medium. Regarding the administration mode, most instruments are administered in group settings. Regarding the work sample implementation, all but 4 instruments ask for more than one work sample.
The degree of proximity to real-life situations and the degree of interaction could be coded as indications of the instruments’ alignment with PA criteria. In 16 cases, tasks are situated by providing a specific teaching situation (e.g., videos, handwritten student solutions). One instrument uses a situation that the participants experienced in real life (their original lesson plan) (König et al. 2020). Two instruments use only loose references to teaching (Lachner and Nückles 2016; Schröder et al. 2020), and one does not use any situating at all (Lachner et al. 2019). Most instruments involve no dynamic interaction.
5.3 RQ3: Types of performance assessment instruments
The codes developed to answer RQ1 and RQ2 inform about certain similarities between the instruments (e.g., similar approaches regarding methods in ELMaWi-AC math, ELMaWi-AC wiwi, DaZKom video; regarding types of demands in DiKoBi, Profile-P+ (reflect), TEDS-Validierung). However, they also show considerable differences (e.g., three different assessment strategies for explanation competence in Profile-P-DET, ESCOP, Profile-P+ (explain)).
To uncover patterns between the instruments, we identified types of PAs. The qualitative type-building content analysis and the statistical exploratory cluster analysis both converged into three clusters (Fig. 3), which we describe below.
Type A (action): instruments of this type are characterised by formats with high action demands. They usually require participants to answer in a fast and immediate manner. The proximity to real-life situations is high: specific approximations of practice are presented via videos or role plays in the assessment. Response types are typically open-ended and require verbal answers or even acting out. Interactive settings are common but do not apply to all instruments. The instruments aim to capture the targeted competence close to teaching performance, with the underlying mindset of “Show me what you do!”.
Type B (analysis): instruments of this type are characterised by a focus on analysis or diagnosis conducted by the participants. Typically, video or text vignettes of teaching situations are delivered in a computer-based assessment environment. These situations clearly reference real-life situations but can take a generalised, abstract perspective. The assessment often asks the participants to take the perspective of a second-row observer. For example, participants must evaluate instructional situations or teaching materials and/or produce (multiple) possible teaching actions. Response types and the degree of openness differ between instruments: participants have to provide written responses, select the best option, or rate given statements. The instruments elicit answers to the question “What could be done?” to (indirectly) obtain indications of the target competence.
Type C (product): instruments of this type are distinct from the others, as they focus on a product created by the participants in written form; in two of three cases, this is a plan for a specific lesson. The third instrument asks for a written explanation of a specific (mathematical) task. The researchers extract different performance indicators through the analysis of the product; hence, measures are nested within one work sample. All three instruments reference real-life situations, two of them by providing a hypothetical situation. The last instrument uses a lesson plan that participants had to create during a teaching practicum. In a nutshell, these instruments take a specific (given or naturally occurring) teaching problem as a starting point and ask “What will you do?”; hence, they aim at foreshadowing teaching performance.
6 Discussion
This study provides an overview of the current state of PA instruments across different subjects in teacher education research in the DACH region. The aim was to systematically identify and record differences and similarities of approaches in the still fragmented research field, which so far lacks a common understanding and varies regarding reporting practices. We first identified and synthesised similarities and differences between underlying theoretical approaches and suggested criteria to delineate PA from other assessment approaches. We argued that PA pertains to standardised, objective measures of observable behaviour regarding teaching demands. Authenticity was found to be a central umbrella term closely related to questions of the validity of assessment tasks.
Based on our definition, a scoping literature review was conducted. The literature search yielded 20 relevant instruments whose differences could be described by a category system related to context characteristics, methods, and alignment with performance assessment criteria. In terms of context characteristics (RQ1), we observed a focus on STEM subjects, confirming earlier observations. Two instruments (the ELMaWi instruments) were implemented in parallel in two subjects (mathematics and economics), indicating a certain transferability of the approach across subjects. For many school subjects, we could not yet find any instrument.
Comparing the instruments against their underlying theoretical frameworks was not possible. Although teacher education is seen as the pioneering field in PA, the targeted competences were diverse and described with varying terminology. Mapping against the types of demands as pre-/post-instructional or in-instructional demands (see AC/RC, Lindmeier model) helped to structure the assessment targets. So far, all instruments have been used for research purposes, although some were also used for instructional purposes in university teacher education. Other purposes (e.g., certification) have not been reported in the DACH region, which is unsurprising, given that teacher certification has traditionally been a state affair. However, as publications may primarily report on research purposes, other purposes (e.g., instructional or grading purposes) may be more varied than the results suggest.
Looking at the methods (RQ2), we found widespread use of vignettes representing teaching situations that are digitally delivered and combined with open-ended response formats. However, the instruments again showed considerable differences, so a sophisticated category system regarding methods had to be developed. This shows, in turn, that developers of instruments have to deal with many degrees of freedom, and that summarising research findings in the future may be challenging.
Finally, the fine-grained analysis of characteristics enabled us to map the similarities and differences and thus uncover patterns. Clustering led to three types of PA instruments in current DACH teacher education research. Type A (action) instruments focus on in-instructional demands with close real-world approximation. In contrast, type B (analysis) instruments take a more reflective stance and test for the ability to provide (various) action possibilities. Few instruments (type C, product) ask participants to plan and hence foreshadow their actions.
Regarding the alignment of these types with the PA criteria, type A instruments show close similarity between the observed type of performance and the type of interest (Kane et al. 1999). Nevertheless, instruments of type A differ in design (e.g., static methods based on video vignettes vs interactive role play; Jeschke et al. 2021; Wiesbeck et al. 2017). Typically, these assessments are very complex, for instance concerning cognitive demands and scoring costs (Gartmeier et al. 2015; Kron et al. 2021). Davey et al. (2015) refer to the tension between the competing aspects of complexity and closeness to real-life criteria as the “elephant in the room” of PA, which must be addressed as its use increases (e.g., by reducing cognitive demands and scoring costs through closed answer formats). Indeed, instruments of type B are typically of reduced complexity but nonetheless clearly related to teaching demands (see in-/exclusion criteria). They show similarities to instruments measuring situation-specific skills that include decision-making, for example under the notion of teacher noticing (Santagata et al. 2021). Hence, instruments of type B might provide interesting prototypes for further instrument development.
It should be noted that the typology reflects more than just how different groups of researchers have chosen to design instruments; in fact, some projects developed instruments with different targets and methods (e.g., Profile-P). Similarly, the typology does not merely reflect our choice of grouping types of demands according to the Lindmeier model: whereas type A instruments show a strong coherence with the in-instructional demands (AC), type B and C instruments reflect different aspects related to the RC component (pre-/post-instructional demands). Notably, the ELMaWi RC instrument could not be classified due to its multifaceted design with text and video vignettes as well as written and verbal responses (Jeschke et al. 2021).
6.1 Limitations
The fact that, to date, there is no common understanding of PA in educational research made it necessary to first synthesise a variety of approaches and suggest a definition of PA for this review. Our definition hinges on the understanding of the suggested criterion of observable behaviour regarding teaching demands. Although this criterion proved intersubjectively applicable in our process, it is an open question whether it is generally suited to delineate what should be regarded as PA in teacher education research.
Teacher education research spreads across different disciplines and is partially compartmentalised. To address the danger of missing relevant work, we started with broad search parameters and applied snowballing, but we cannot guarantee that we surfaced all relevant studies.
This review focuses on the diversity of instruments in an emerging field. Therefore, we took an inclusive stance, refrained from rating the quality of the instruments, and did not exclude any instruments, even if the reporting suggested a lack of quality. In some cases, we faced the problem of incomplete, inconsistent, or unreported essential information. Accordingly, applying the category system was difficult in some cases. Thus, despite the thorough double-coding process, we cannot rule out that other coders might partly come to different conclusions. For future applications, it will be necessary to consider whether the category system needs to be inductively adapted.
Our intensive search yielded reports on only 20 instruments within the last five years, many of which still focus on instrument development. Accordingly, our findings primarily reflect the current state of research regarding the design characteristics of instruments. Conclusions regarding other essential criteria, including the psychometric quality or the practical usefulness of approaches for different purposes, cannot currently be drawn.
6.2 Looking back and looking forward
The study aimed to systematically record the current state of PA instruments in DACH teacher education research. Compared to the findings of previous reviews, we see continuity as well as development. Regarding continuity, many of the instruments we identified are adapted or extended versions of the instruments identified by Zlatkin-Troitschanskaia et al. (2016). Still, the number of available instruments is small, and their use is limited to specific research purposes. At the same time, new developments and significant advances were made in connection with the German research programme Modeling and Measuring Competencies in Higher Education (Zlatkin-Troitschanskaia et al. 2020b), funded by the Federal Ministry of Education and Research, and a Munich-based research unit funded by the German Research Foundation (F. Fischer and Opitz 2022). This indicates that developing PA instruments requires prolonged, sustained research effort and highly specialised knowledge.
When we started this review, we had hoped also to report on findings attained with PA instruments and to provide a systematic literature review (e.g., evidence for the convergent or discriminant validity of approaches by contrasting instruments that assess teacher competence from a holistic and an analytical approach). When scoping the first publications, we quickly realised that the reported research is very diverse in every respect. In most cases, the development process so far does not cover systematic investigations of psychometric quality, which would be mandatory before the instruments could be used at a larger scale in research and teacher education. Our review hence focused on surfacing the differences and similarities of existing instruments, which earlier reviews also lacked. Researchers new to the field of PA might benefit from the proposed definition of PA and the category system, which identifies possible design options and might inform future work on important reporting demands. The three types of instruments may be read as three major strands within the new research field and help to summarise future findings appropriately. In any case, working towards a common understanding of the emerging topic of PA will help researchers (across domains) to learn from each other and collaborate more efficiently.
Notes
Quality Offensive for Teacher Education (Qualitätsoffensive Lehrerbildung), Modeling and Measuring Competencies in Higher Education (KoKoHS).
Inert knowledge is knowledge a person shows in a decontextualised test situation but cannot use in a performance situation (Renkl et al. 1996).
References
Aufschnaiter, C., & Blömeke, S. (2010). Professionelle Kompetenz von (angehenden) Lehrkräften erfassen – Desiderata [Assessing professional competence of pre-service teachers—desiderata]. Zeitschrift Für Didaktik Der Naturwissenschaften, 16, 361–367.
Backhaus, K., Erichson, B., Gensler, S., Weiber, R., & Weiber, T. (2021). Multivariate Analysemethoden: Eine anwendungsorientierte Einführung [Multivariate Analysis Methods: An Application-Oriented Introduction] (16th edn.). Wiesbaden: Springer Gabler. https://doi.org/10.1007/978-3-658-32425-4.
Bartels, H., Geelan, D., & Kulgemeyer, C. (2019). Developing an approach to the performance-oriented testing of science teachers’ action-related competencies. International Journal of Science Education, 41(14), 2024–2048. https://doi.org/10.1080/09500693.2019.1658241.
Baumert, J., & Kunter, M. (2011). Das Kompetenzmodell von COACTIV [The COACTIV competence model]. In M. Kunter, J. Baumert, W. Blum, U. Klusmann, S. Krauss & M. Neubrand (Eds.), Professionelle Kompetenz von Lehrkräften: Ergebnisse des Forschungsprogramms COACTIV (pp. 29–53). Münster: Waxmann.
Blömeke, S., Kaiser, G., Lehmann, R., König, J., Döhrmann, M., Buchholtz, C., & Hacke, S. (2009). TEDS-M: Messung von Lehrerkompetenzen im internationalen Vergleich [TEDS-M: Measuring teacher competences in international comparison]. In O. Zlatkin-Troitschanskaia, K. Beck, D. Sembill, R. Nickolaus & R. H. Mulder (Eds.), Beltz-Bibliothek. Lehrprofessionalität: Bedingungen, Genese, Wirkungen und ihre Messung (pp. 181–209). Weinheim: Beltz.
Blömeke, S., Gustafsson, J.-E., & Shavelson, R. J. (2015). Beyond Dichotomies. Zeitschrift Für Psychologie, 223(1), 3–13. https://doi.org/10.1027/2151-2604/a000194.
Bruckermann, T., Aschermann, E., Bresges, A., & Schlüter, K. (2017). Metacognitive and multimedia support of experiments in inquiry learning for science teacher preparation. International Journal of Science Education, 39(6), 701–722. https://doi.org/10.1080/09500693.2017.1301691.
Casale, G., Strauß, S., Hennemann, T., & König, J. (2016). Wie lässt sich Klassenführungsexpertise messen? Überprüfung eines videobasierten Erhebungsinstruments für Lehrkräfte unter Anwendung der Generalisierbarkeitstheorie [How can classroom management expertise be measured? Review of a video-based survey instrument for teachers using generalisability theory]. Empirische Sonderpädagogik, 8(2), 119–139. https://doi.org/10.25656/01:12300.
Codreanu, E., Sommerhoff, D., Huber, S., Ufer, S., & Seidel, T. (2020). Between authenticity and cognitive demand: finding a balance in designing a video-based simulation in the context of mathematics teacher education. Teaching and Teacher Education, 95, 103146. https://doi.org/10.1016/j.tate.2020.103146.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd edn.). Hillsdale: Erlbaum. https://doi.org/10.4324/9780203771587.
Davey, T., Holland, P. W., Shavelson, R., Webb, N. M., & Noreen, M. L. L. (2015). Psychometric considerations for the next generation of performance assessment. Center for K–12 Assessment & Performance Management at ETS. https://www.ets.org/media/research/pdf/psychometric_considerations_white_paper.pdf. Accessed 10 January 2022.
Etzel, S., & Küppers, A. (2000). Managementarbeitsprobe (MAP).
Evans, J. S. B. T. (2008). Dual-processing accounts of reasoning, judgment, and social cognition. Annual Review of Psychology, 59, 255–278. https://doi.org/10.1146/annurev.psych.59.103006.093629.
Fischer, F., & Opitz, A. (2022). Learning to diagnose with simulations: examples from teacher education and medical education (1st edn.). Springer eBook Collection. Springer. https://doi.org/10.1007/978-3-030-89147-3.
Fischer, J., Jansen, T., Möller, J., & Harms, U. (2021a). Measuring biology trainee teachers’ professional knowledge about evolution—introducing the Student Inventory. Evolution: Education and Outreach. https://doi.org/10.1186/s12052-021-00144-0.
Fischer, J., Machts, N., Möller, J., & Harms, U. (2021b). Der Simulierte Klassenraum Biologie – Erfassung deklarativen und prozeduralen Wissens bei Lehramtsstudierenden der Biologie [The Simulated Biology Classroom—Capturing Declarative and Procedural Knowledge in Biology Student Teachers]. Zeitschrift Für Didaktik Der Naturwissenschaften, 27(1), 215–229. https://doi.org/10.1007/s40573-021-00136-z.
Frey, A. (2006). Methoden und Instrumente zur Diagnose beruflicher Kompetenzen von Lehrkräften. Eine erste Standortbestimmung zu bereits publizierten Instrumenten [Methods and instruments for the diagnosis of teachers’ professional competences. An initial appraisal of the status quo of instruments already published.]. In C. Allemann-Ghionda & E. Terhart (Eds.), Zeitschrift für Pädagogik Beiheft: Vol. 51. Kompetenzen und Kompetenzentwicklung von Lehrerinnen und Lehrern: Ausbildung und Beruf (pp. 30–46). Weinheim: Beltz.
Frommelt, M., Hugener, I., & Krammer, K. (2019). Fostering teaching-related analytical skills through case-based learning with classroom videos in initial teacher education. Journal for Educational Research Online, 11(2), 37–60. https://doi.org/10.25656/01:18002.
Gartmeier, M. (2018). Gespräche zwischen Lehrpersonen und Eltern: Herausforderungen und Strategien der Förderung kommunikativer Kompetenz [Conversations between teachers and parents: Challenges and strategies for promoting communicative competence]. SpringerLink Bücher. Wiesbaden: Springer Fachmedien Wiesbaden. https://doi.org/10.1007/978-3-658-19055-2
Gartmeier, M., Bauer, J., Fischer, M. R., Hoppe-Seyler, T., Karsten, G., Kiessling, C., Möller, G. E., Wiesbeck, A., & Prenzel, M. (2015). Fostering professional communication skills of future physicians and teachers: effects of e‑learning with video cases and role-play. Instructional Science, 43(4), 443–462. https://doi.org/10.1007/s11251-014-9341-6.
Grossman, P., Compton, C., Igra, D., Ronfeldt, M., Shahan, E., & Williamson, P. W. (2009). Teaching practice: a cross-professional perspective. Teachers College Record: The Voice of Scholarship in Education, 111(9), 2055–2100. https://doi.org/10.1177/016146810911100905.
Hattie, J. (2003). Teachers make a difference: What is the research evidence? Australian Council for Educational Research.
Hecker, S.-L., Falkenstern, S., & Lemmrich, S. (2020a). Zwischen DaZ-Kompetenz und Performanz [Between DaZ competence and performance]. Herausforderung Lehrer*innenbildung – Zeitschrift zur Konzeption, Gestaltung und Diskussion, 3(1), 565–584. https://doi.org/10.4119/HLZ-3374.
Hecker, S.-L., Falkenstern, S., Lemmrich, S., & Ehmke, T. (2020b). Zum Verbalisierungsdilemma bei der Erfassung der situationsspezifischen Fähigkeiten von Lehrkräften [On the verbalisation dilemma when recording teachers’ situation-specific skills]. Zeitschrift Für Bildungsforschung, 10(2), 175–190. https://doi.org/10.1007/s35834-020-00268-1.
Jeschke, C., Kuhn, C., Lindmeier, A., Zlatkin-Troitschanskaia, O., Saas, H., & Heinze, A. (2019). Performance assessment to investigate the domain specificity of instructional skills among pre-service and in-service teachers of mathematics and economics. The British Journal of Educational Psychology, 89(3), 538–550. https://doi.org/10.1111/bjep.12277.
Jeschke, C., Lindmeier, A., & Heinze, A. (2021). Vom Wissen zum Handeln: Vermittelt die Kompetenz zur Unterrichtsreflexion zwischen mathematischem Professionswissen und der Kompetenz zum Handeln im Mathematikunterricht? Eine Mediationsanalyse [From knowledge to action: Does the competence to reflect on teaching mediate between mathematical professional knowledge and the competence to act in mathematics teaching? A mediation analysis]. Journal für Mathematik-Didaktik, 42(1), 159–186. https://doi.org/10.1007/s13138-020-00171-2.
Kaiser, G., & König, J. (2019). Competence measurement in (mathematics) teacher education and beyond: implications for policy. Higher Education Policy, 32(4), 597–615. https://doi.org/10.1057/s41307-019-00139-z.
Kaiser, G., Busse, A., Hoth, J., König, J., & Blömeke, S. (2015). About the complexities of video-based assessments: Theoretical and methodological approaches to overcoming shortcomings of research on teachers’ competence. International Journal of Science and Mathematics Education, 13(2), 369–387. https://doi.org/10.1007/s10763-015-9616-7.
Kaiser, G., Blömeke, S., König, J., Busse, A., Döhrmann, M., & Hoth, J. (2017). Professional competencies of (prospective) mathematics teachers—cognitive versus situated approaches. Educational Studies in Mathematics, 94(2), 161–182. https://doi.org/10.1007/s10649-016-9713-8.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17. https://doi.org/10.1111/j.1745-3992.1999.tb00010.x.
Kempin, M., Kulgemeyer, C., & Schecker, H. (2019). Wirkung von Professionswissen und Praxisphasen auf die Reflexionsfähigkeit von Physiklehramtsstudierenden [Effect of professional knowledge and practical phases on the reflective ability of physics students]. In Naturwissenschaftliche Kompetenzen in der Gesellschaft von morgen. Gesellschaft für Didaktik der Chemie und Physik Jahrestagung (pp. 439–442). Wien.
Knievel, I., Lindmeier, A. M., & Heinze, A. (2015). Beyond knowledge: measuring primary teachers’ subject-specific competences in and for teaching mathematics with items based on video vignettes. International Journal of Science and Mathematics Education, 13(2), 309–329. https://doi.org/10.1007/s10763-014-9608-z.
Koeppen, K., Hartig, J., Klieme, E., & Leutner, D. (2008). Current issues in competence modeling and assessment. Zeitschrift für Psychologie/Journal of Psychology, 216(2), 61–73. https://doi.org/10.1027/0044-3409.216.2.61.
König, J., Bremerich-Vos, A., Buchholtz, C., Fladung, I., & Glutsch, N. (2020). Pre-service teachers’ generic and subject-specific lesson-planning skills: On learning adaptive teaching during initial teacher education. European Journal of Teacher Education, 43(2), 131–150. https://doi.org/10.1080/02619768.2019.1679115.
König, J., Santagata, R., Scheiner, T., Adleff, A.-K., Yang, X., & Kaiser, G. (2022). Teacher noticing: a systematic literature review of conceptualizations, research designs, and findings on learning to notice. Educational Research Review, 36, 100453. https://doi.org/10.1016/j.edurev.2022.100453.
Kramer, M., Förtsch, C., Stürmer, J., Förtsch, S., Seidel, T., & Neuhaus, B. J. (2020). Measuring biology teachers’ professional vision: development and validation of a video-based assessment tool. Cogent Education, 7(1), 1823155. https://doi.org/10.1080/2331186X.2020.1823155.
Kramer, M., Förtsch, C., Boone, W. J., Seidel, T., & Neuhaus, B. J. (2021a). Investigating pre-service biology teachers’ diagnostic competences: relationships between professional knowledge, diagnostic activities, and diagnostic accuracy. Education Sciences, 11(3), 89. https://doi.org/10.3390/educsci11030089.
Kramer, M., Förtsch, C., & Neuhaus, B. J. (2021b). Can pre-service biology teachers’ professional knowledge and diagnostic activities be fostered by self-directed knowledge acquisition via texts? Education Sciences, 11(5), 244. https://doi.org/10.3390/educsci11050244.
Kron, S., Sommerhoff, D., Ufer, S., Siebeck, M., Stürmer, K., & Wecker, C. (2020). Rollenspielbasierte simulierte Diagnoseinterviews [Role-play based simulated diagnostic interviews in teacher education]. In Beiträge zum Mathematikunterricht 2020.
Kron, S., Sommerhoff, D., Achtner, M., & Ufer, S. (2021). Selecting mathematical tasks for assessing student’s understanding: pre-service teachers’ sensitivity to and adaptive use of diagnostic task potential in simulated diagnostic one-to-one interviews. Frontiers in Education, 6, Article 604568. https://doi.org/10.3389/feduc.2021.604568.
Kuckartz, U. (2014). Qualitative text analysis: A guide to methods, practice & using software (A. Mcwhertor, Trans.). SAGE.
Kuhn, C., Zlatkin-Troitschanskaia, O., Saas, H., & Brückner, S. (2018). Konstruktion, Implementation und Evaluation eines multimedialen Assessmenttools in der wirtschaftspädagogischen Ausbildung [Construction, implementation and evaluation of a multimedia assessment tool in business education training]. In J. Schlicht & U. Moschner (Eds.), Berufliche Bildung an der Grenze zwischen Wirtschaft und Pädagogik: Reflexionen aus Theorie und Praxis (pp. 339–355). Wiesbaden: Springer VS.
Kulgemeyer, C., & Riese, J. (2018). From professional knowledge to professional performance: the impact of CK and PCK on teaching quality in explaining situations. Journal of Research in Science Teaching, 55(10), 1393–1418. https://doi.org/10.1002/tea.21457.
Kulgemeyer, C., & Tomczyszyn, E. (2015). Physik erklären – Messung der Erklärensfähigkeit angehender Physiklehrkräfte in einer simulierten Unterrichtssituation [Explaining physics—measuring the ability of prospective physics teachers to explain in a simulated teaching situation]. Zeitschrift für Didaktik der Naturwissenschaften, 21(1), 111–126. https://doi.org/10.1007/s40573-015-0029-5.
Kulgemeyer, C., Kempin, M., & Weißbach, A. (2021). Entwicklung von Professionswissen und Reflexionsfähigkeit im Praxissemester [Development of professional knowledge and reflection skills in the practical semester]. In Naturwissenschaftlicher Unterricht und Lehrerbildung im Umbruch? Gesellschaft für Didaktik der Chemie und Physik online Jahrestagung (pp. 262–265).
Kunter, M., Baumert, J., Blum, W., Klusmann, U., Krauss, S., & Neubrand, M. (2011). Professionelle Kompetenz von Lehrkräften: Ergebnisse des Forschungsprogramms COACTIV [Professional competence of teachers: results of the COACTIV research programme]. Münster: Waxmann. https://doi.org/10.31244/9783830974338.
Lachner, A., & Nückles, M. (2016). Tell me why! Content knowledge predicts process-orientation of math researchers’ and math teachers’ explanations. Instructional Science, 44(3), 221–242. https://doi.org/10.1007/s11251-015-9365-6.
Lachner, A., Backfisch, I., & Stürmer, K. (2019). A test-based approach of modeling and measuring technological pedagogical knowledge. Computers & Education, 142, 103645. https://doi.org/10.1016/j.compedu.2019.103645.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310.
Lindmeier, A. (2011). Modeling and measuring knowledge and competencies of teachers: a threefold domain-specific structure model for mathematics (Empirische Studien zur Didaktik der Mathematik, Vol. 7). Münster: Waxmann.
Mayring, P. (2015). Qualitative content analysis: theoretical background and procedures. In A. Bikner-Ahsbahs (Ed.), Advances in mathematics education Ser. Approaches to qualitative research in mathematics education: examples of methodology and methods (pp. 365–380). Springer. https://doi.org/10.1007/978-94-017-9181-6_13.
McClelland, D. C. (1973). Testing for competence rather than for “intelligence”. The American Psychologist, 28(1), 1–14. https://doi.org/10.1037/h0034092.
Miller, G. E. (1990). The assessment of clinical skills/competence/performance. Academic Medicine, 65(9 Suppl), S63–S67. https://doi.org/10.1097/00001888-199009000-00045.
Oser, F., Forster-Heinzer, S., & Salzmann, P. (2010). Die Messung der Qualität von professionellen Kompetenzprofilen von Lehrpersonen mit Hilfe der Einschätzung von Filmvignetten. Chancen und Grenzen des advokatorischen Ansatzes [Measuring the Quality of Professional Teaching Competency Profiles by Means of Evaluating Film-Vignettes]. Unterrichtswissenschaft, 38(1), 5–28.
Palm, T. (2008). Performance assessment and authentic assessment: A conceptual analysis of the literature. Practical Assessment, Research & Evaluation, 13, Article 4. https://doi.org/10.7275/0QPC-WS45.
Pecheone, R. L., & Chung, R. R. (2006). Evidence in teacher education. Journal of Teacher Education, 57(1), 22–36. https://doi.org/10.1177/0022487105284045.
Petticrew, M., & Roberts, H. (2006). Systematic reviews in the social sciences. Hoboken: Blackwell. https://doi.org/10.1002/9780470754887.
Renkl, A., Mandl, H., & Gruber, H. (1996). Inert knowledge: analyses and remedies. Educational Psychologist, 31(2), 115–121. https://doi.org/10.1207/s15326985ep3102_3.
Santagata, R., König, J., Scheiner, T., Nguyen, H., Adleff, A.-K., Yang, X., & Kaiser, G. (2021). Mathematics teacher learning to notice: a systematic review of studies of video-based programs. ZDM—Mathematics Education, 53(1), 119–134. https://doi.org/10.1007/s11858-020-01216-z.
Schröder, J., Riese, J., Vogelsang, C., Borowski, A., Buschhüter, D., Enkrott, P., Kempin, M., Kulgemeyer, C., Reinhold, P., & Schecker, H. (2020). Die Messung der Fähigkeit zur Unterrichtsplanung im Fach Physik mit Hilfe eines standardisierten Performanztests [Measuring the ability to plan lessons in physics using a standardised performance test]. Zeitschrift für Didaktik der Naturwissenschaften, 26(1), 103–122. https://doi.org/10.1007/s40573-020-00115-w.
Schütze, B., Rakoczy, K., Klieme, E., Besser, M., & Leiss, D. (2017). Training effects on teachers’ feedback practice: the mediating function of feedback knowledge and the moderating role of self-efficacy. ZDM, 49(3), 475–489. https://doi.org/10.1007/s11858-017-0855-7.
Shavelson, R. J. (2010). On the measurement of competency. Empirical Research in Vocational Education and Training, 2(1), 41–63. https://doi.org/10.1007/BF03546488.
Shavelson, R. J. (2013). On an approach to testing and modeling competence. Educational Psychologist, 48(2), 73–86. https://doi.org/10.1080/00461520.2013.779483.
Shulman, L. S. (1986). Those who understand: knowledge growth in teaching. Educational Researcher, 15(2), 4. https://doi.org/10.2307/1175860.
Stahnke, R., & Blömeke, S. (2021). Novice and expert teachers’ situation-specific skills regarding classroom management: what do they perceive, interpret and suggest? Teaching and Teacher Education, 98, 103243. https://doi.org/10.1016/j.tate.2020.103243.
Voss, T., Kunina-Habenicht, O., Hoehne, V., & Kunter, M. (2015). Stichwort Pädagogisches Wissen von Lehrkräften: Empirische Zugänge und Befunde [Keyword: Teachers’ pedagogical knowledge: Empirical approaches and findings]. Zeitschrift für Erziehungswissenschaft, 18(2), 187–223. https://doi.org/10.1007/s11618-015-0626-6.
Wiesbeck, A. B., Bauer, J., Gartmeier, M., Kiessling, C., Möller, G. E., Karsten, G., Fischer, M. R., & Prenzel, M. (2017). Simulated conversations for assessing professional conversation competence in teacher-parent and physician-patient conversations. Journal for Educational Research Online, 9(3), 82–101. https://doi.org/10.25656/01:15302.
Wildgans-Lang, A., Scheuerer, S., Obersteiner, A., Fischer, F., & Reiss, K. (2020). Analyzing prospective mathematics teachers’ diagnostic processes in a simulated environment. ZDM, 52(2), 241–254. https://doi.org/10.1007/s11858-020-01139-9.
Zlatkin-Troitschanskaia, O., & Kuhn, C. (2010). Messung akademisch vermittelter Fertigkeiten und Kenntnisse von Studierenden bzw. Hochschulabsolventen: Analyse zum Forschungsstand [Measuring academically taught skills and knowledge of students and graduates: Analysis of the state of research]. Mainz: Johannes Gutenberg-Universität.
Zlatkin-Troitschanskaia, O., Shavelson, R. J., & Kuhn, C. (2015). The international state of research on measurement of competency in higher education. Studies in Higher Education, 40(3), 393–411. https://doi.org/10.1080/03075079.2015.1004241.
Zlatkin-Troitschanskaia, O., Pant, H. A., Kuhn, C., Toepper, M., & Lautenbach, C. (2016). Messung akademisch vermittelter Kompetenzen von Studierenden und Hochschulabsolventen: Ein Überblick zum nationalen und internationalen Forschungsstand [Measuring academically taught competencies of students and graduates: An overview of the national and international state of research] (1st ed.). Wiesbaden: Springer Fachmedien Wiesbaden. https://doi.org/10.1007/978-3-658-10830-4.
Zlatkin-Troitschanskaia, O., Fischer, J., & Pant, H. A. (2020a). Messung von Lehrkompetenzen – Analyse des nationalen und internationalen Forschungsstandes [Measurement of teaching competences—analysis of the national and international state of research]. In I. M. Welpe, J. Stumpf-Wollersheim, N. Folger & M. Prenzel (Eds.), Leistungsbewertung in wissenschaftlichen Institutionen und Universitäten (pp. 108–133). De Gruyter Oldenbourg. https://doi.org/10.1515/9783110689884-006.
Zlatkin-Troitschanskaia, O., Pant, H. A., Nagel, M.-T., Molerov, D., Lautenbach, C., & Toepper, M. (Eds.). (2020b). KoKoHs Assessment-Portfolio-Testverfahren zur Modellierung und Messung generischer und domänenspezifischer Kompetenzen bei Studierenden und Hochschulabsolventen [KoKoHs assessment portfolio test procedures for modelling and measuring generic and domain-specific competences in students and graduates]. Dannstadt-Schauernheim: pfalzdruck.
Funding
No funds, grants, or other support was received. Open Access funding enabled and organized by Projekt DEAL.
Ethics declarations
Conflict of interest
C. Albu and A. Lindmeier declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Albu, C., Lindmeier, A. Performance assessment in teacher education research—A scoping review of characteristics of assessment instruments in the DACH region. Z Erziehungswiss 26, 751–778 (2023). https://doi.org/10.1007/s11618-023-01167-7