1 Introduction

Structuring has emerged as a key feature of effective instruction in several empirical studies and meta-analyses (e.g., Brophy 2000; Hattie 2009; Seidel and Shavelson 2007; Walberg and Paik 2000). Generally, structuring as a planning aspect refers to a clear, recognizable organization of instruction into separate phases and sections, in which the teacher gradually builds up the complexity of the knowledge to be acquired and ensures a smooth flow of instruction through appropriate sequencing (Doyle 2006; Kounin 1970).

However, only a few empirical studies have modelled and measured planning competence (e.g., Cochran-Smith and Villegas 2016), particularly structuring (König et al. 2015). Only recently have projects emerged that focus on modelling and assessing teachers’ planning competence in a standardized manner. Representative projects include PlanvoLL (Planning Competence of Teachers [Planungskompetenz von Lehrerinnen und Lehrern], König et al. 2015) and PlanvoLL‑D (The Role of Professional Knowledge of Pre-Service German Teachers in their Lesson Planning [Die Bedeutung des professionellen Wissens angehender Deutschlehrkräfte für ihre Planung von Unterricht], König et al. 2020a, b).

Following this work, Krepf and König (in press) investigated whether structuring, as a situation-specific skill of planning competence, could be measured and modelled while maintaining validity and reliability. In their study, written lesson plans created by more than 100 pre-service teachers in Berlin were analyzed at two measurement points from the PlanvoLL project (König et al. 2015). The lesson plans were evaluated based on a newly developed category system, which comprised two different structuring factors: (1) a context factor and (2) a process factor. Furthermore, the explication levels of the planning decisions made (naming, reasoning, and linking) also were examined. The coding results then were quantified and assessed using IRT scaling (Rasch scaling).

The results indicate reliability and validity of the test instrument. However, the question remains as to how robust these findings are and whether they can be replicated. Replication studies are necessary to safeguard observed findings against chance (Rost 2007). Findings can only be viewed as validated if several studies on the same research question arrive at similar results (Döring and Bortz 2016). Therefore, we conducted a scaling-up study to examine whether structuring as a dimension of planning can be modelled reliably using a larger sample. In our replication study, the database comprises 310 written lesson plans from pre-service teachers during induction in North Rhine-Westphalia (NRW) and Berlin. Based on these data, various analyses (reliability and IRT scaling) will be conducted, presented, and subsequently discussed.

2 Theoretical background

2.1 Planning as a dimension of teacher competence

Broad agreement exists that knowledge is a key component of teachers’ professional competence (Baumert and Kunter 2006; Shulman 1986). In a number of empirical educational studies, researchers have distinguished between general pedagogical knowledge, content knowledge, and pedagogical content knowledge (e.g., Baumert and Kunter 2006). In contrast, research on lesson planning skills is scarce, and a shared (competence) concept of lesson planning cannot be identified (König et al. 2015). One reason might be that lesson planning depends on the teaching context (John 2006). For example, teachers usually plan their lessons for a specific learning group. How they plan a concrete lesson also depends on the classroom routines and organizational structures with which a specific learning group is familiar.

Because situation-specific aspects of teaching are relevant, many researchers today agree on an extended understanding of teacher competence. In the model of “competence as a continuum” (Blömeke et al. 2015), competence is characterized as existing between two poles: latent dispositions (cognitive and affective-motivational skills) at one pole and manifest performance (observed behavior) at the other. Blömeke et al. (2015) distinguished between (underlying) dispositions (teacher knowledge) and situation-specific skills, with the latter assumed to be more proximal to classroom teaching (performance) than teacher knowledge as discussed in previous competence models (Baumert and Kunter 2006).

In terms of this classification, we view planning competence as a situation-specific skill comprising perception, interpretation, and decision-making (detailed in König et al. 2020a, b). Planning competence develops in a planning situation that is characterized, in turn, by various conditions.

2.2 Theoretical modelling of structuring as an aspect of planning competence in PlanvoLL-D

Lesson planning is a central task of teachers (Baumert and Kunter 2006); therefore, it is an important part of both the first phase of university teacher education (Hohenstein et al. 2014) and the second phase of teacher training (the induction phase: Referendariat, Vorbereitungsdienst). As part of this training, pre-service teachers should receive scientific and practical training in lesson planning. One focus concerns imparting knowledge on lesson planning (Rey et al. 2018).

Extant research has demonstrated that pre-service teachers have difficulties in planning lessons (Calderhead 1996). One reason is that they lack the broad knowledge base necessary for planning, as well as the flexibility to react appropriately to the unexpected. While experienced teachers take a holistic approach to teaching, pre-service teachers often view teaching as a chronological sequence of only partially connected events.

Westermann (1991) found that significant differences exist between experts and novices in lesson planning. While novices tend to be oriented toward specific instructional objectives in developing structured lessons, experts place great emphasis on adopting the student perspective and orienting their planning toward students’ specific (learning) needs. This orientation toward students’ needs in lesson planning is linked closely to the aspect of structuring lesson content. Considering that the knowledge to be acquired is usually too complex for the students to absorb at once, it must be processed in a structured manner regarding both content and form.

Structuring is both a central aspect of effective teaching (see, e.g., Brophy 2000; Muijs and Reynolds 2011; Walberg and Paik 2000) and a core issue of general didactics and empirical teaching-learning research (see König 2012). In line with central didactic models and findings from empirical classroom research, at least two meanings of structuring can be distinguished (e.g., Doenau 1987; Kleickmann 2012; Lipowsky 2015): (1) content structuring (unit contextualization) and (2) procedural/organizational structuring (phasing).

2.2.1 Content structuring (unit contextualization)

In the context of lesson planning, the specific lesson to be presented is not only planned, but also located in a larger overall context to demonstrate how individual lessons are interrelated (John 2006; Yinger 1980). In order to help students build a well-organized knowledge base, it is necessary to assess their prior knowledge and to decompose the lesson unit’s content (not just that of one lesson) into appropriate sub-parts (Leinhardt and Greeno 1986; Muijs and Reynolds 2011). These sub-parts must be connected (e.g., Drollinger-Vetter 2011).

The subdivision of the learning process into sub-parts and the connection of these sub-parts usually are done from a temporal perspective (Scholl and Plöger 2020; Shavelson 1983; Yinger 1980). A lesson’s outcome should build on previous lessons to allow for cumulative knowledge growth because knowledge, skills, and understanding usually emerge gradually through interaction among several individual lessons, forming a unit (Shavelson 1987). Such units are spread over different periods, usually one to four weeks (Borich 2004). The unit contains the topics to be covered in each lesson (Plöger 2008). When planning a unit, teachers must determine the subject area and material relevant to learning and formulate learning objectives (Maier 2012). Teachers also must select the content and objectives of the unit according to individual students’ needs (Jäger and Maier 2019; Shulman 1987). To do this, teachers must consider students’ knowledge levels and integrate their needs into the planning process (Berliner 2004; Kagan 1992). The learning content and associated learning objectives are distributed over several lessons, with a decisive impact on the content and methods of individual lessons. Thus, the individual lesson is “part of a larger system of interrelated learning” (Borich 2004, p. 126). Therefore, the requirement for “unit contextualization” reflects that students often are expected to accumulate knowledge in a gradual, structured, or cumulative manner (Leinhardt and Greeno 1986; Muijs and Reynolds 2011).

Therefore, the written lesson plan should clarify the topic chosen for the unit and how this topic will be organized into subtopics. Furthermore, the individual lesson must be embedded in this larger context (unit). The lesson plans should specify what is expected of students at a particular point within the unit or what they are expected to accomplish, that is, what content will be discussed and how the content aspects relate to each other. Both the unit and each of the lessons need a clear, coherent thematic structure aimed at facilitating students’ learning (Seidel et al. 2005). Three essential aspects should be elaborated upon:

  1.

    In accordance with the German didactic tradition (for an overview, see Arnold and Koch-Priewe 2011), a content analysis should be conducted first to select a topic and legitimize it with reference to the curriculum (Jäger and Maier 2019). The teacher then should think about how this topic should be developed in terms of content within the framework of unit planning.

  2.

    The teacher must clarify each individual lesson’s function in this unit, that is, the individual lesson’s position in the unit (Pauli and Reusser 2003). For example, an introductory lesson on a new topic will be designed differently from a lesson primarily intended to apply knowledge and transfer it to some other context. The teacher also should clarify how individual lessons will be connected, what function the lesson will have within the unit, how it will relate to preceding and subsequent lessons (Arnold and Koch-Priewe 2011), and whether the whole arrangement enables cumulative learning.

  3.

    The content must be selected in a well-founded manner and prepared didactically; thus, a didactic analysis (Klafki 1995) is needed. This analysis serves as the didactic justification and legitimation of the (individual) lesson content. The teacher clarifies what content will be covered during the individual lesson and the new content’s meaning for students. In addition to justifying the content (“what”), the teacher must consider “how” the relevant facts and content can be taught to the students. Here, the “how” refers primarily to the naming, reasoning, and (at best) linking of methodological choices to one another. The various methodological decisions should be linked for a structural connection to emerge between content, methods, and lesson goals (see Heimann 1972; Klafki 2007; Maier 2012).

2.2.2 Procedural/organizational structuring (phasing)

Furthermore, teachers must consider how learning content will be developed during each lesson. In terms of the teaching process, the need for procedural/organizational structuring of the lesson includes dividing the lesson into phases, marking transitions, and specifying the behavior students are expected to exhibit during transitions. Considering these aspects of the teaching process is necessary to maintain the lesson’s flow (e.g., Doyle 2006; Kounin 1970).

For the flow of instruction to occur smoothly, the teacher must plan the organizational flow and individual student activities (Shavelson and Stern 1981; Yinger 1980). “Activity refers to the timing and sequencing (or pace or flow) of content and materials during instruction” (Shavelson 1987, p. 485). These activities and tasks should have “stimulating variety and optimal challenge that help students maintain their engagement in the task and minimize interruptions due to boredom or distraction” (Brophy 2000, p. 11). For momentum to be generated within an instructional activity, a teacher must anticipate before the lesson where momentum might be lost and think about how to prevent it. An essential aspect of this is the clear, recognizable structuring of the lesson into individual phases and sections and breaking down the lesson content into individual components. This includes the teachers’ management of transitions between phases.

Lesson structure has become an important facet of instructional quality (Brophy 2000; Kyriakides et al. 2018). It is crucial to have a clear, identifiable structuring of instruction into discrete phases and sections in which the teacher ensures a coherent flow of instruction with minimal interruptions through proper sequencing (Muijs et al. 2014). Phasing typically refers to activity patterns (e.g., seatwork, presentation) that support the organization of instruction (e.g., Doyle 2006).

In written lesson plans, the phasing of the lesson is addressed in at least two places: in the lesson plan itself and in an articulation scheme. In the lesson plan, the didactic-methodological decisions can be described in detail. The articulation scheme (also called a progress plan) then reflects the condensed decisions, usually in tabular form, and puts teaching-learning processes into a chronological sequence. The articulation scheme not only has an orientation function, but also presents an overview of “the temporal, factual, and social order of the planned lessons” (Plöger 2008, p. 170).

2.3 Previous study

A previous study (Krepf and König in press) used the data from the PlanvoLL project and examined whether structuring could be modelled as an aspect of lesson planning. In line with the theoretical explanations of the different meanings of structuring (see Sect. 2.2), we designed a category system that comprised the two factors contextualization and phasing.

Additionally, the category system is hierarchically structured in formal terms because lesson planning is a systematic, extensive decision-making process (Zahorik 1970). The (formal) hierarchy reflects the explication level of the planning decisions taken. At the lowest level (naming), it is checked whether the relevant information is included in the plan. The middle level (reasoning) entails checking whether the decisions taken are described and justified. At the highest level (linking), the question is whether and how the individual decisions made are related to each other.

Table 1 presents the eight categories (items) for the scale of contextualization. The categories describe the content that links the individual lesson with the unit.

Table 1 Contextualization (C) subscale categories (Krepf and König in press)

At the lowest explication level of the planning decisions (naming), the topics of the unit and the lesson are only named and derived from the framing standards.

At the second level (reasoning), the structure is described and explained. Indicators of reasoning could be explanations of what has happened within the unit, i.e., which content already has been discussed and how the content aspects are related to each other. At least two statements must be made: first, how the unit is structured, and second, what function the lesson to be planned has in this unit.

At the highest level (linking), didactic decisions are linked together so that a coherent overall picture can emerge. Therefore, a didactic analysis should be conducted (Klafki 1995). The didactic analysis deals with the didactic interpretation, justification, and structuring of the content in relation to the specific lesson. It therefore should clarify what the content of the lesson is and what relevance the content has for the students. In addition, the methodological decisions need to be related to the content to clarify how appropriate facts and content are presented to the students.

The scale of phasing includes five upper categories and 15 sub-categories (items) to capture the phased structure of the lesson being planned and how these phases are connected (see Table 2).

Table 2 Phasing (P) subscale categories (Krepf and König in press)

As in our description of the contextualization facet of lesson planning, we distinguish between three explication levels. Indicative of the lowest level (naming) would be the mention of an articulation scheme (PN1). The following categories (of the naming level) focus on the lesson’s phase structure. The function of dividing the lesson into phases is to provide meaningful lesson phases and, thus, to represent factual and temporal significance as a process structure (Plöger 2008). Myriad phase models can be found in the literature and in empirical studies. These phase models provide a basic framework through which instruction can be structured. Although a three-phase structure often is used in larger international studies, such as TIMSS 1999 (Review—Introducing New Content—Practicing/Applying/Consolidation; Hiebert et al. 2003) or TEDS‑M 2008 (König and Blömeke 2009a), and the three-phase structure represents a minimal consensus that can be derived from the U.S. and German-language literature on lesson planning, we assumed a four-phase structure. Herbart’s four-phase model, on which this structure is based, still is used today in modified form (Maulana et al. 2012). Its phases—including introduction, working on the topic, and checking results and providing feedback—can be found in relevant textbooks for (pre-service) teachers (see Meyer 2004). In our case, we chose a four-phase lesson structure comprising introduction (PN2P1), working on the topic (PN3P2), checking results and providing feedback (PN4P3), and application/transfer (PN5P4). The coding of the phase structure was done in two steps: (1) First, following TEDS‑M (König and Blömeke 2009a, b), the phase names were coded low-inferentially using a comprehensive coding guide that included different phase names deductively derived from the German- and English-language literature (e.g., Good and Brophy 2007; Kaiser and Kaiser 1991; Klingberg 1982; Meyer 1999). (2) Second, the individual coded phase labels were assigned to the superordinate four-phase structure.

The next level (reasoning) indicates whether the functions of the individual phases have been described or reasoned (Rosenshine and Stevens 1986). For example, for the lesson introduction, “activation of prior knowledge” could be mentioned as a function, followed by an explanation of how this activation of prior knowledge will be achieved.

At the highest level (linking), the individual phases should be related to each other. Given the prominent role of managing transitions (Arlin 1979; Doyle 2006), the plan should contain concrete information about how each activity (or phase) is linked to the next during instruction. We focused on simple transitions. For example, the transition from the introductory phase (social form: plenary) to the elaboration phase (social form: group work) can be addressed in a lesson plan. The focus should be on the change in social form (e.g., “What rules apply to group work? Are certain rules that apply to group work repeated/discussed?”) and/or the didactic approach (e.g., formulation of the work assignment; anticipation of possible comprehension problems). Linking individual decisions can occur at two different points: in the formulated lesson plan and/or in the articulation scheme. Both presentation areas were recorded separately to gain information about the exact location in the lesson plan where such links are addressed.

In the previous study by Krepf and König (in press), 106 Berlin pre-service teachers took part. With the exception of one, each of them provided two lesson plans, resulting in a total sample of 210 lesson plans from the two measurement time points. The written lesson plans were evaluated content-analytically using deductively formed categories (see Tables 1 and 2). Dichotomous codes (1 or 0) were given to indicate whether each criterion was met. A maximum of 23 points could therefore be scored. The respective coding events subsequently were quantified. IRT scaling was used to estimate item and person parameters (Rasch 1960). One-, two-, and three-dimensional IRT scaling analyses were conducted to test dimensionality, and different models were evaluated to determine which model better reflected the data. The three central results were (Krepf and König in press):

  1.

    The scales contextualization and phasing could be separated empirically (scale contextualization EAP = 0.66; Theta = 1.49; scale phasing EAP = 0.75; Theta = 1.79). The two-dimensional modelling showed a better fit than the one-dimensional modelling (chi-square test of the difference between the two deviance values = 88.14; df = 2; p < 0.001).

  2.

    The scales were organized hierarchically. The hierarchization corresponds to the explication levels of the planning decisions made. The three explication levels could be separated empirically (subscale naming EAP = 0.61; Theta = 0.87; subscale reasoning EAP = 0.73; Theta = 1.96; subscale linking EAP = 0.68; Theta = 1.18). The difference between the two deviance values (40.91) was significant for df = 5 (p < 0.001), implying a better fit of the three-dimensional model.

  3.

    An increase was demonstrated in measured planning competence under the aspect of structuring during the induction phase. The total score increased significantly (t(103) = −5.77, p < 0.001) from the beginning (T1: M = 10.01, SE = 0.34, SD = 3.46) to the end (T2: M = 12.46, SE = 0.29, SD = 3.01) of the induction. An increase also could be observed for each of the three explication levels (naming T1: M = 5.12, SE = 0.11, SD = 1.1; naming T2: M = 5.45, SE = 0.1, SD = 0.97; reasoning T1: M = 2.45, SE = 0.16, SD = 1.6; reasoning T2: M = 3.23, SE = 0.13, SD = 1.31; linking T1: M = 2.44, SE = 0.17, SD = 1.8; linking T2: M = 3.77, SE = 0.16, SD = 1.68). By the end of induction, pre-service teachers were better able to explain their thoughts and justify their decisions (tNaming(103) = −2.41, p = 0.02; tReasoning(103) = −4.39, p < 0.001; tLinking(103) = −6.19, p < 0.001). The results to date suggest that a reliable and valid test instrument has been developed.

2.4 Research questions

The present study was designed as a replication study to test the validity and reliability of the findings of the previous study using a new and larger sample. Because a replication study is presented, the three research questions (RQs) of the previous study were investigated again. A fourth RQ was added to take into account that pre-service teachers from two German states participated in the replication study.

RQ 1

Can structuring as an aspect of planning competence be measured and mapped reliably on a continuum within the IRT scaling framework? Is it a one- or two-dimensional construct?

RQ 2

Can the different explication levels (naming, reasoning, and linking) be mapped through an item-level analysis?

RQ 3

Does structuring as a part of planning competence increase over time (during induction)? Which explication levels (naming, reasoning, and linking) change throughout the induction and, thus, are a prominent aspect of competence development during the induction of pre-service teachers?

RQ 4

Can the empirically found structure of structuring as an aspect of planning competence be confirmed for different subsamples (NRW and Berlin)?

3 Method

3.1 Sample

In the project PlanvoLL‑D, pre-service teachers who entered the second phase of teacher education (induction) were surveyed in two German states, North Rhine-Westphalia (NRW) and Berlin, in spring 2016. The sample (NRW) included pre-service teachers who attended a teacher education program to qualify for lower secondary school (Haupt‑/Real‑/Gesamtschule) or for lower and upper secondary school (Gymnasium/Gesamtschule). In Berlin, the corresponding teacher education qualifies pre-service teachers for lower and upper secondary school (Integrierte Sekundarschule/Gymnasium). In both states, these teacher education programs are the only ones that qualify teachers of German for secondary schools.

The data consisted of 310 written lesson plans from pre-service teachers at two measurement points: T1 at the beginning (first lesson plan) and T2 at the end (last lesson plan) of their induction. The time between T1 and T2 was approximately 1.5 years. At T1, 172 pre-service teachers participated. They were approximately 29 years old (M = 28.9, SD = 4.9), and 85% were female. 35% (60 pre-service teachers) came from Berlin and were qualifying to teach at Secondary Levels I and II; 65% (112) came from NRW, of whom 73 (42% of the total sample) were qualifying to teach at Secondary Levels I and II and 39 (23%) at Secondary Level I. At T2, 138 pre-service teachers submitted a lesson plan (response rate: 80%). From these, a panel sample of 116 pre-service teachers who submitted their lesson plans at both time points could be formed.

3.2 Instruments

Altogether, the evaluation instrument comprised 23 categories (items; see Tables 1 and 2). Regarding content, the two subscales contextualization (eight items) and phasing (15 items) are distinguished. Furthermore, three different explication levels (naming, reasoning, and linking) were taken into consideration.

3.3 The written lesson plan as an authentic planning document

Various methodological approaches have been established to measure lesson-planning competence (see Rothland 2021), including self-assessment procedures, vignette tests, planning knowledge tests, and analyses of authentic planning documents (written lesson plans). We focused on written lesson plans, which are a product of lesson planning (Grzesik 1979) and long have been an integral part of practical teacher training (Besa et al. 2020).

Lesson planning requires a series of decisions. Writing a lesson plan represents a mental effort to put these decisions into written form. Written lesson plans “document the teacher’s planning considerations; they provide information about the choice and justification of the topic, about the learning goals to be achieved, about the methods, media, and social forms used” (Plöger 2008, p. 170). The teacher must critically reflect on their own approach and justify the decisions (Esslinger-Hinz et al. 2013). These justifications should be comprehensible to external readers (e.g., subject teachers) because written lesson plans are part of the grading process. Therefore, the plan should describe explicitly what will happen during the lesson. The lesson plan does not reflect the entire planning process (e.g., weighing alternatives), but it does contain the teacher’s final planning considerations. Thus, the written lesson plan represents a first product of lesson planning. The category system (instrument) should be used to reconstruct pre-service teachers’ planning decisions in written lesson plans.

3.4 Measures and data analyses

Data analysis was undertaken in two steps. First, the lesson plans were analyzed by means of qualitative content analysis (Mayring 2014), using the deductively formed category system. Overall, 23 categories were used, each of which was defined, as well as made concrete, using a prototypical text passage derived from written lesson plans (see Tables 1 and 2). These descriptions and text examples together formed a category system (Mayring 2014). Conducting the content analysis required a correct and precise application of each category. Therefore, it was necessary to prepare the coders for the coding process. Intensive practice periods were used to train the coders in applying the category system. At first, text passages were coded together, then separately. Afterward, errors, discrepancies, and deviations were discussed. Once the coders had gained sufficient confidence in using the categories, the coding process began.

Two trained raters independently coded the written lesson plans using dichotomous items. If a corresponding criterion was covered in a lesson plan, the raters assigned a code of 1. A code of 0 was assigned if the criterion was not found in the written plan. If the written plans did not contain sufficient information on the analysis criterion, a value of 9 was assigned (at T1, one lesson plan was coded 9 throughout, so 171 lesson plans could be validly analyzed).

The category system used was not only an important prerequisite for the coding process itself, but also for the evaluation of the coding’s stability and reproducibility. To ensure intercoder reliability, the two coders double-coded approximately 15% of the lesson plans, and these results were compared with each other. A Kappa between 0.60 and 0.75 was established in advance as a sufficient level of (intercoder) reliability (e.g., Fleiss and Cohen 1973). The overall Kappa was good (κ > 0.8). For the subscale phasing, the Kappa was 0.86; for the subscale contextualization, it was slightly lower (0.71). Overall, acceptable-to-good Kappa values for intercoder reliability were obtained, indicating reliable and stable coding.
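For illustration, such an agreement check can be computed with standard tools. The following is a minimal sketch, assuming the two coders’ dichotomous codes for the double-coded subset are available as flat vectors; the data shown are invented:

```python
# Minimal sketch: Cohen's kappa for two coders' dichotomous codes.
# The vectors below are invented; in the study, they would hold the
# 0/1 codes for the double-coded ~15% of lesson plans, item by item.
from sklearn.metrics import cohen_kappa_score

coder_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
coder_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")  # >= 0.60 was set in advance as sufficient
```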

Second, the recorded codings were processed quantitatively. For each coded text passage, one point was awarded—like a correct answer on a test. The categories used thus were treated like scored test items. Accordingly, the introduced category system contained a total of 23 categories (items) with which the lesson plans were examined: eight items for the subscale contextualization and 15 for the subscale phasing. Both scales can be interpreted—analogous to item scales—as subscales of a construct containing a set of defined items. In accordance with this, the category system formed an overall scale of 23 variables (categories), comprising two content subscales (contextualization and phasing) and three explication-level subscales (naming, reasoning, and linking).

Due to the size of the sample, the virtual case approach was used for scaling (e.g., Rost 2004). This approach is also known as “concurrent calibration” (von Davier et al. 2006): longitudinal subjects (i.e., subjects for whom two measurements are available) represent two cases in a scaling file (one “real” and one “virtual”), thereby increasing the estimate’s power (Bond and Fox 2007). The first measurement (at the beginning of induction) included 171 pre-service teachers’ lesson plans; the second measurement included 138 pre-service teachers’ lesson plans. The result was a scaling file of 309 cases. With this increased number of cases, IRT scaling, which is problematic for small samples (n < 150), was possible (e.g., Bond and Fox 2007).
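The data arrangement behind this approach can be sketched as follows; file and column names are hypothetical, and this illustrates the stacking idea rather than the project’s actual pipeline:

```python
# Sketch of the "virtual cases" (concurrent calibration) setup: the
# T1 and T2 item-response matrices are stacked row-wise, so each
# longitudinal subject contributes one "real" and one "virtual" case
# to a single scaling file.
import pandas as pd

t1 = pd.read_csv("codes_t1.csv")   # hypothetical: 171 plans x 23 items (0/1)
t2 = pd.read_csv("codes_t2.csv")   # hypothetical: 138 plans x 23 items (0/1)
t1["wave"] = 1
t2["wave"] = 2

scaling_file = pd.concat([t1, t2], ignore_index=True)
print(len(scaling_file))           # 309 cases in the stacked scaling file
```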

The coding was conducted with dichotomous items, so a 1‑PL Rasch model (IRT) was used (Rasch 1960). The data analyses were conducted using ConQuest (Wu et al. 1997), which assigns a difficulty parameter to each item based on its solution rate and an ability parameter to each subject according to demonstrated performance using a maximum likelihood procedure.
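ConQuest implements its own (marginal maximum likelihood) estimation; purely as an illustration of the model, the following sketch fits a dichotomous 1‑PL (Rasch) model by joint maximum likelihood on an invented response matrix. It is a simplified stand-in, not the procedure used in the study:

```python
# Illustrative Rasch (1-PL) fit: P(X_ni = 1) = logistic(theta_n - beta_i).
# Joint maximum likelihood over person abilities (theta) and item
# difficulties (beta); the response matrix X is randomly generated.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
X = (rng.random((309, 23)) < 0.6).astype(float)   # persons x items, 0/1
n_p, n_i = X.shape

def neg_log_lik(params):
    theta, beta = params[:n_p], params[n_p:]
    beta = beta - beta.mean()                     # identification constraint
    p = expit(theta[:, None] - beta[None, :])
    p = np.clip(p, 1e-9, 1 - 1e-9)                # numerical safety
    return -np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))

res = minimize(neg_log_lik, np.zeros(n_p + n_i), method="L-BFGS-B")
beta_hat = res.x[n_p:] - res.x[n_p:].mean()       # item difficulties
theta_hat = res.x[:n_p]                           # person abilities
```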

A major advantage of IRT scaling is that it allows for mapping of person ability scores and item difficulties onto a common scale. Another advantage is that ConQuest provides information about the scale reliability and the global quality of the models examined. Empirical reliability was calculated using multidimensional expected a posteriori estimation (EAP; de Ayala et al. 1995), which allows an unbiased description of population parameters (Adams et al. 1997). It can be interpreted similarly to Cronbach’s alpha (Rost 2004). Item fit statistics provide detailed insights into the quality of items and how they work in the assumed measurement model (Wright and Linacre 1994).
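A common way to express this EAP reliability (our notation, not a formula quoted from the sources above) relates the variance of the EAP person estimates to the average posterior variance:

```latex
\mathrm{Rel}_{\mathrm{EAP}}
  = \frac{\operatorname{Var}\left(\hat{\theta}_{\mathrm{EAP}}\right)}
         {\operatorname{Var}\left(\hat{\theta}_{\mathrm{EAP}}\right)
          + \overline{\sigma^{2}_{\mathrm{post}}}}
```

Reliability thus approaches 1 when the posterior uncertainty about individual persons is small relative to the spread of the person estimates.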

Furthermore, the deviance index (Wu and Adams 2006) quantifies each investigated model’s global fit and thereby provides information on which model fits the data better (degree of goodness of fit). Lower deviance indicates a better fit. To find dimensionality evidence in the data, one-dimensional, two-dimensional (the contextualization and phasing subscales), and three-dimensional (the explication levels naming, reasoning, and linking) IRT scaling analyses were conducted. Moreover, the models’ goodness of fit (chi-square difference test) was examined to determine which model fit the data better.
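The model comparison amounts to a likelihood-ratio test on the deviance difference, which is chi-square distributed with degrees of freedom equal to the difference in the number of estimated parameters. A sketch with deviance values invented to match the difference reported in Sect. 4.2:

```python
# Chi-square (likelihood-ratio) test on the deviance difference between
# a more restrictive and a more general IRT model. Deviances are invented.
from scipy.stats import chi2

deviance_1d = 7000.0       # hypothetical one-dimensional model
deviance_3d = 6965.3       # hypothetical three-dimensional model
df_diff = 4                # additional (co)variance parameters

delta = deviance_1d - deviance_3d              # 34.7
p_value = chi2.sf(delta, df_diff)
print(f"chi2({df_diff}) = {delta:.1f}, p = {p_value:.2e}")  # p < 0.001
```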

To check the differences between explication levels, we used a single-factor analysis of variance (ANOVA), then applied a t-test for dependent samples and calculated the effect size for the differences between measurement points 1 and 2 to answer our third RQ.
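These steps can be sketched with standard SciPy routines on simulated data; the numbers below and the paired-d variant used (mean difference divided by the standard deviation of the differences) are illustrative assumptions, not the study’s exact computations:

```python
# Sketch of the RQ 3 analysis steps: one-way ANOVA across explication
# levels, dependent-samples t-test between T1 and T2, and an effect size
# for the paired difference. All data are simulated.
import numpy as np
from scipy.stats import f_oneway, ttest_rel

rng = np.random.default_rng(1)
naming, reasoning, linking = rng.normal([-2.8, -0.2, 1.4], 1.5, (100, 3)).T
F, p_anova = f_oneway(naming, reasoning, linking)

score_t1 = rng.normal(10, 3, 116)
score_t2 = score_t1 + rng.normal(2.5, 2, 116)   # simulated growth
t, p_ttest = ttest_rel(score_t1, score_t2)

diff = score_t2 - score_t1
cohens_d = diff.mean() / diff.std(ddof=1)       # one common paired-d variant
```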

IRT scaling allows data from two measurement time points to be related to each other so that the weighted likelihood estimates (WLEs; see Warm 1989) obtained in the virtual cases approach can be exported as person parameters. Furthermore, we examined whether separate scaling was possible for both measurement time points. Separate scaling for each measurement time point allows for a comparison of the item difficulty parameters obtained. For this purpose, the item difficulty parameters were correlated with each other (T1 and T2). High correlations then can be interpreted as indicating the presence of measurement invariance (see Bond and Fox 2007). We viewed the prerequisite for longitudinal scaling as met if a high correlation (above 0.7) was present. Moreover, we conducted a regression analysis for the two measurement time points, with planning competence as the dependent variable and teaching type as the independent variable.
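The invariance check itself reduces to correlating the item difficulty parameters from the two separate scalings; a minimal sketch with invented parameters:

```python
# Sketch of the measurement-invariance check: item difficulties from
# separate T1 and T2 scalings are correlated; values above 0.7 were
# taken as supporting longitudinal scaling. Parameters are invented.
import numpy as np
from scipy.stats import pearsonr

beta_t1 = np.array([-2.1, -1.4, -0.5, 0.2, 0.9, 1.6])
beta_t2 = np.array([-1.9, -1.5, -0.3, 0.1, 1.1, 1.4])

r, p = pearsonr(beta_t1, beta_t2)
print(f"r = {r:.2f}, shared variance = {r**2:.0%}")
```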

4 Results

4.1 RQ 1

First, a one-dimensional scaling was conducted within the framework of the virtual cases approach, resulting in acceptable values of the EAP reliability for the total scale (EAP = 0.81; Theta = 1.71); that is, the items scatter sufficiently across the ability spectrum. Table S1 (see Electronic Supplementary Material, ESM) provides the item statistics. The modelling reveals that different solution frequencies exist in the one-dimensional model: Some items were solved very often (the naming of the phases), while others were solved very rarely (the linking of the individual phases with each other).

During the scaling procedure, a difficulty parameter was assigned to each item based on its solution rate using a maximum likelihood method. Items with an estimate of less than 0 are classified as “too easy” because many pre-service teachers “solved” these items. The weighted mean squares (WMNSQ; expected range 0.8 to 1.2; see, e.g., Adams and Wu 2002), as well as the t-values (between −1.96 and 1.96; see Bond and Fox 2007), largely fell within the expected ranges. Negative t-values indicated little variation in response patterns, which could be interpreted as an indication that a “minimal consensus” exists on the requirements for lesson plans.
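For the dichotomous Rasch model, the weighted mean square (infit) of an item i is commonly defined as the ratio of the summed squared residuals to their model-implied variances (our notation, not taken from the cited sources):

```latex
\mathrm{WMNSQ}_i
  = \frac{\sum_{n} \left(x_{ni} - P_{ni}\right)^{2}}
         {\sum_{n} P_{ni}\left(1 - P_{ni}\right)},
\qquad
P_{ni} = \frac{\exp(\theta_n - \beta_i)}{1 + \exp(\theta_n - \beta_i)}
```

Values near 1 indicate that the observed residual variation matches the model’s expectation; the range 0.8 to 1.2 used above treats moderate deviations as acceptable.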

Significant correlations were found between the total score (T1 + T2) and the two subscales, indicating a homogeneous construct (Table S2 in the ESM). The correlations also were calculated for the two measurement time points separately. The intercorrelation of the subscales at T1 was 0.32, which is significant at the p < 0.01 level. No significant intercorrelation was found at T2 (0.11).

We then performed a two-dimensional scaling in the virtual cases approach. The EAP reliabilities for the two subscales lay within an acceptable range (subscale contextualization EAP = 0.82; Theta = 5.79; subscale phasing EAP = 0.78; Theta = 1.15). Next, we compared the one-dimensional and two-dimensional models, with a chi-square test performed to determine the better model. Table 3 provides the chi-square test results, which were significant for df = 3, implying that the two-dimensional model is more suitable.

Table 3 Findings from the IRT scaling analysis on content dimensions

4.2 RQ 2

RQ 2 tests whether the differentiation between the explication levels (naming, reasoning, and linking) can be replicated. For scaling, we used the approach of virtual cases (see above) and performed a three-dimensional scaling. The reliabilities of the three explication levels fell within an appropriate range (naming subscale: EAP = 0.8; Theta = 0.99; reasoning subscale: EAP = 0.75; Theta = 2.01; linking subscale: EAP = 0.81; Theta = 2.65). Next, we compared the one-dimensional scaling model’s deviance with that of the three-dimensional model that assumes the three explication levels. The difference between the two deviance values (34.7) was significant for df = 4, suggesting that the three-dimensional model provided a better fit than the one-dimensional model (see Table 4). Thus, the model comparison favors the three-dimensional model.

Table 4 Findings from an IRT scaling analysis of explication

Fig. 1 provides the distribution of item threshold parameters from the one-dimensional IRT scaling, with each item indicated by a circle. The distribution of items was divided according to the three explication levels. The rectangles in Fig. 1 represent the mean values of the item parameter estimates for the three explication levels (naming: M = −2.78, SE = 0.92, SD = 2.43; reasoning: M = −0.15, SE = 0.68, SD = 1.79; linking: M = 1.39, SE = 0.51, SD = 1.53). The results indicate that different explication levels can be differentiated in the data. Using a single-factor ANOVA, the overall mean difference between the three explication levels was found to be significant (p < 0.001). A post-hoc test (Bonferroni) indicated a highly significant difference (p < 0.001) between naming and linking, and a marginally non-significant difference between naming and reasoning (p = 0.055). Therefore, we view our assumptions underlying RQ 2 as supported.

Fig. 1

Item threshold parameters (circles) and means (rectangles) from one-dimensional IRT scaling split into the explication levels naming (left side), reasoning (middle), and linking (right side)

4.3 RQ 3

RQ 3 aimed to examine whether structuring as a part of planning competence increased during the induction. Table 5 indicates that the solution frequencies of all items increased during the induction phase. Almost all increases were significant (Wilcoxon test). Some items had very high solution frequencies. Based on our coding, one could speak of an emerging consensus regarding certain requirements for written lesson plans in NRW and Berlin.

Table 5 Items’ solution frequencies at the beginning (T1) and end (T2) of the induction phase

Subsequently, one-dimensional Rasch scaling was conducted with respect to T1 and T2 (T1 n = 172; T2 n = 138). Due to the extremely high solution frequency (100%) at T2, Item PN4P3 (checking results and feedback) was removed because it could not be used in the ConQuest analysis.

At both measurement time points, the scale proved to be reliable, with the items spreading sufficiently across the ability range (T1: EAP = 0.75; Theta = 1.25). However, at the second time point, EAP reliability and theta variance were considerably lower (T2: EAP = 0.41; Theta = 0.31), possibly because the empirical variance determined at the second measurement time point was not very large. The low variances indicate a greater homogeneity in the data, which could be interpreted as showing the effect of the training on the teachers’ lesson-planning competence.

A high correlation (r = 0.77) exists between the item difficulties of the two measurement time points (59% common variance). The correlation illustrates that the item difficulties at both measurement time points were in a sufficiently invariant order for the two scales. We view this as an important indication of the measurement invariance (see Bond and Fox 2007). Fig. 2 provides the item difficulty parameter estimates in a bivariate scatter plot. Considering that the item parameters are quite close to the diagonal, we assumed sufficient sample independence in the tests.

Fig. 2

Item difficulty parameters of the scales at both measurement time points

The mean score (total score) increased significantly (p < 0.001) from the beginning (T1: M = 9.36, SE = 0.25, SD = 3.3) to the end (T2: M = 15.28, SE = 0.19, SD = 2.2) of the induction. To determine whether an increase in measured planning competence existed in relation to structuring, the corresponding mean differences were calculated. The second column of Table 6 contains the differences in the means at the two measurement time points. The mean differences of the total score and the two subscales are significant. The two columns on the right side of Table 6 indicate that these significant mean differences also were reflected in the corresponding effect sizes (Cohen’s d for groups of different sizes; nT1 = 171, nT2 = 138). Using the panel sample of pre-service teachers who submitted their lesson plans at both T1 and T2 (n = 116), the mean difference was significant (t(115) = −14.71, p < 0.001) and practically relevant (d = 2.05). Furthermore, significant increases were found in all other areas for the panel sample.
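For groups of different sizes, a standard form of Cohen’s d pools the two variances weighted by their degrees of freedom (our notation; the paper does not spell out its exact computation):

```latex
d = \frac{M_{2} - M_{1}}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}}
  = \sqrt{\frac{(n_{1}-1)\,s_{1}^{2} + (n_{2}-1)\,s_{2}^{2}}{n_{1}+n_{2}-2}}
```

with, in this case, n1 = 171 (T1) and n2 = 138 (T2).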

Table 6 Mean value differences and effect strengths
Table 7 Findings from regression analysis predicting planning skill scores at each time point

The negative correlations could be interpreted as an indication of the induction’s efficacy because they demonstrate that lesson plans, with regard to structuring as a planning aspect, are significantly worse at the beginning of the induction than at the end.

In the next step, we examined the extent to which a change occurred in the different explication levels during the induction phase. Fig. 3 provides the distribution of explication levels at each time point. We used ability estimates from the three-dimensional IRT scaling in the virtual cases approach to indicate the three explication levels in the written plans at T1 and T2. We used the item parameter estimates’ mean values (Fig. 1) to create an ability summary at both time points. The black bars in Fig. 3 contain the results at T1 and the light gray bars those at T2. Naming was at a high level at both T1 and T2. The pre-service teachers made significant progress in reasoning and linking between T1 and T2. At the beginning of the induction, less than 30% of the lesson plans reached the justification level, and less than 5% showed signs of linking decisions. By the end of induction, more than 80% of the lesson plans demonstrated reasoned decision making, and more than 30% demonstrated linking of these decisions.

Fig. 3

Distribution of explication levels at each time point (with a 95% confidence interval)

This increase also is reflected in the mean differences and effect sizes, as provided in Table 6 (see above). Medium effect sizes were found for the two lower levels. The results indicate that the ability to structure a lesson improves during the induction phase of teacher education. Consequently, pre-service teachers can better justify and link their planning decisions at the end of the induction phase (T2).

4.4 RQ 4

Furthermore, we examined whether the empirically found structure of structuring as an aspect of planning competence can be confirmed in different sub-samples (NRW and Berlin). The previous sections demonstrated that a statistically and practically significant increase takes place over time. However, a question arises: To what extent does this growth occur to a similar extent at the different locations (NRW and Berlin), depending on the pre-service teachers’ teaching type? In view of the assumption that teacher education does have an impact, this may be expected, but it cannot be taken for granted due to the repeatedly mentioned problem of arbitrariness in teacher education curricula and assessment (Strietholt and Terhart 2009).

Using a single-factor ANOVA for the panel sample, the overall mean difference between the three teaching types was found to be significant at the first time point (p < 0.001). A post-hoc test (Bonferroni) found a significant difference between Secondary Level I and Secondary Levels I and II in NRW (p < 0.001) and a significant difference between Secondary Levels I and II (NRW) and the pre-service teachers in Berlin (p < 0.001). At the second measurement time point, no significant differences were found between the three teaching types.

To obtain better insight into the development of structuring as part of planning competence, we conducted a regression analysis for each time point, with planning skills as the dependent variable and teaching type as the independent variable. Teaching type was specified using dummy variables: The first dichotomous variable (Type 1) captured the type “lower secondary level” (coded as 1); the second dichotomous variable (Type 2) captured the type “Secondary Levels I and II.” The pre-service teachers in Berlin served as the reference group. At the first time point, significant differences were found between teaching types, and 16% (corrected R²) of the differences in planning performance (total score) could be explained by teaching type. At the second time point, the training locations did not differ, indicating a high degree of agreement in the lesson plans (approximately 1% of the differences explained; corrected R²).
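Such a dummy-coded regression can be sketched as follows with a statsmodels formula; the variable names, group labels, and data are hypothetical:

```python
# Sketch of the per-time-point regression: planning score regressed on
# teaching type, with the Berlin group as the reference category.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "score": [9.1, 8.4, 11.0, 12.3, 7.9, 10.5],   # invented total scores
    "ttype": ["SekI_NRW", "SekI_NRW", "SekI_II_NRW",
              "SekI_II_NRW", "Berlin", "Berlin"],
})

# C(..., Treatment('Berlin')) makes Berlin the reference group, so the
# two resulting dummies correspond to Type 1 and Type 2 in the text.
model = smf.ols("score ~ C(ttype, Treatment('Berlin'))", data=df).fit()
print(model.rsquared_adj)   # corrected R^2, as reported per time point
```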

The findings reveal that at T1, the quality of the lesson plans depended on the teaching type, whereas at T2, hardly any differences existed between the teaching types and the teacher training at the two locations. All three teaching types contributed to the significant growth regarding lesson planning under the aspect of structuring.

5 Summary and discussion

Only a few studies on modelling and measuring planning competence have been conducted so far. We used a standardized method for assessing written lesson plans that was developed in a previous study. To verify and replicate the findings, we conducted the present scaling-up study and examined 310 written lesson plans at two measurement time points from pre-service teachers during induction in NRW and Berlin.

The findings from the previous study were confirmed. At the content level, two subscales (contextualization and phasing) could be identified. The chi-square test indicated a better fit for the two-dimensional modelling than for the one-dimensional modelling (RQ 1). Furthermore, three different explication levels of planning decisions (naming, reasoning, and linking) were theoretically derived and empirically confirmed, which also showed a better fit compared with the one-dimensional modelling (RQ 2). At the end of the induction, the pre-service teachers showed a higher degree of planning competence. Measures of lesson-structure planning increased significantly during induction (RQ 3) in all three teaching types (RQ 4). However, the increase was higher in NRW, although performance values at the second time point did not differ significantly between the three groups.

Overall, the results indicate that the present test instrument reflects existing content in teacher education across federal state boundaries. This indicates that the construct has high curricular validity, is oriented closely toward the induction context, and provides evidence that lesson planning in general and structuring specifically are central elements of teacher training. The significant increase in instructional planning competence also demonstrates the effectiveness of teacher induction. The pre-service teachers succeeded in better justifying and linking their instructional planning decisions. Therefore, it can be assumed that the category inventory used captures content that is curricularly relevant during induction.

However, this also raises the question: To what extent do institutional requirements determine pre-service teachers’ planning decisions? The evaluation procedure could be criticized because the lesson plans are an essential part of the examination and certification processes and, thus, may reflect teacher educators’ individual preferences regarding demonstration lessons. Therefore, to gather further evidence on content validity, it would be necessary to ask teacher educators whether the system of categories established here actually covers what is taught in everyday educational contexts and what requirements they view as important in the context of initial teacher education. This also would provide evidence of the transparency of the assessment criteria applied in teacher induction, as Strietholt and Terhart (2009) found that only about 50% of teacher educators in Germany make their assessment criteria sufficiently transparent to pre-service teachers.

A possible interpretation is that these increases relate to the examination procedure during induction. The increase could simply be the consequence of the intensive preparation for the final examination. Certainly, a deficit exists in (quasi-)experimental studies that deal with the (causal) analysis of the development of lesson-planning competence during induction. So far, learning opportunities have hardly been investigated (König et al. 2017). It is unclear how pre-service teachers receive content input and/or gain practical experience during induction that helps them develop their planning competences.

The findings from the previous study and the replication study can be used to develop quality indicators for assessment practices in teacher education. However, the category system is very extensive. It could be examined whether this category system can be made more efficient and whether individual categories might need to be combined. It would be conceivable that categories S4 and S5, which record the linking of the phase decisions both in the text and in the articulation scheme, represent one category. A teacher who has linked the individual phases argumentatively in the text could do without such explanations in the articulation scheme so as not to “inflate” the lesson plan.

Furthermore, the question arises as to whether the “contextualization” and “phasing” subscales together form one scale (under the aspect of structuring) or whether they represent two separate constructs that are part of pre-service teachers’ planning competence. Information concerning the complexity of the construct of planning competence was provided by König et al. (2021). In their CODE-PLAN model (cognitive demands of lesson planning), six factors (content transformation, task creation, adaptation to student learning dispositions, clarity of learning objectives, unit contextualization, and phasing) were combined as indicators of planning competence. The two subscales used here were included in that analysis as separate constructs. Taking the model into account, the six different cognitive demands of lesson planning can be combined into an overall construct of lesson-planning competence. This model would be empirically founded and could be used as a criterion variable at the end of the induction phase.

Lesson planning is viewed as being important for the teacher’s performance in the classroom, presumably influencing the procedural flow of lesson design and students’ learning outcomes. These assumptions are plausible, but so far hardly supported by evidence from empirical research (Besa et al. 2020; Rothland 2021). Currently, there is a need for research on the relationship between lesson planning measures and the characteristics of process and outcome quality in teaching. From a practical perspective, the question is whether a good plan can predict performance in the classroom (Shavelson 1987). The existence of a good plan does not guarantee that instruction actually will proceed in a well-structured way; the teaching process is too complex and unpredictable for that (Doyle 2006). Therefore, further research is needed to apply the aspect of structuring not only to the planning situation, but also to the actual teaching process. For example, an important question is whether the planned structuring measures contribute to smoother teaching. This also would relate to effective classroom management. Previous studies on classroom management have focused on planning teacher behavior in advance (Doyle 2006; Evertson and Weinstein 2006). The present construct could be used to test whether well-structured instruction positively impacts effective classroom management and positively influences student learning. To gather evidence for prognostic validity, written plans for demonstration lessons need to be collected from pre-service teachers, and the corresponding lessons also need to be assessed. Students could conduct these assessments, or they could be conducted through video recordings of lessons or by trained observers.