1 Introduction

Over the last two decades, research on teacher competence has focused primarily on the acquisition of knowledge, understood as a prerequisite for successful teaching and measured with standardized knowledge tests (e.g., Baumert and Kunter 2013; Kunina-Habenicht et al. 2013; Voss et al. 2015). However, reforms in teacher education in many parts of the world, including Germany, have included a stronger orientation toward professional practice, as shown in the implementation of extensive practical learning opportunities in schools (Ulrich and Gröschner 2020) and in the standards specified by the Standing Conference of the Ministers of Education and Cultural Affairs (KMK), which recommended that theoretical concepts be illustrated through the use of practical examples, simulations of teaching situations, and analysis of videotaped instructional practice (KMK 2019). Consequently, to evaluate the effectiveness of university teacher education, test instruments are needed that assess not only knowledge but also its application in practice.

One promising measure of pre-service teachers’ competence as an outcome of teacher education is the contextualized assessment of competence, which embeds test items into a practical context, commonly using videotaped instructional practice (Gold and Holodynski 2017; Seidel and Stürmer 2014; Wiens et al. 2013). Contextualized assessment aims at providing a measure of competence that is related more closely to performance and reflects implicit instead of inert knowledge (Neuweg 2015). A central framework underlying contextualized assessment is teacher noticing, broadly defined as “specialized ways in which teachers observe and make sense of classroom events and instructional details” (Choy and Dindyal 2020). In the current discourse, a construct similar to teacher noticing has been established using the term professional vision. Following Santagata et al. (2021), professional vision does not necessarily indicate a different theoretical perspective on teacher noticing. For this reason, the terms noticing and professional vision are used synonymously in the present paper, with both representing a set of mental processes that teachers engage in (see also Sect. 2.1).

Especially in the domain of mathematics teaching, noticing has become widely accepted as a component of teachers’ professional competence (Jacobs et al. 2010; Santagata et al. 2021; Sherin et al. 2011a). However, the standardized measurement of noticing is challenging, and only a few high-quality test instruments have been developed and implemented for pre-service teachers.

Against this background, we draw on a standardized noticing test instrument developed in a German follow-up study to the international comparative study Teacher Education and Development Study in Mathematics (TEDS-M), the Teacher Education and Development Study in Mathematics Follow-Up (TEDS-FU; Blömeke et al. 2014). The TEDS-FU Video Test captures secondary mathematics teachers’ noticing, which is conceptualized as perception, interpretation, and decision-making skills. While the instrument was successfully implemented for in-service teachers within the projects TEDS-Validate and TEDS-Instruct (Kaiser and König 2020), its use for pre-service teachers has not been explored yet.

Our study builds on the concept of transfer, which generally describes the process of a phenomenon or construct being conveyed to another context; more specifically for education science, it denotes the dissemination of innovations from research into educational practice (Gräsel 2010). The TEDS-FU Video Test is transferred to a new context, namely initial teacher education, and, with regard to the more specific understanding of transfer, its use to measure a learning outcome of university teacher education is investigated. For this purpose, specific validity evidence needs to be provided with respect to the particular group of pre-service teachers (American Educational Research Association [AERA] et al. 2014). This procedure is crucial, since pre-service teachers who have received little explicit training in teaching are not necessarily capable of analyzing videotaped instruction by connecting theoretical concepts and pedagogical practice (the “theory-practice gap”; see Korthagen 2010).

2 Theoretical background

2.1 Teacher noticing as part of professional competence

Research on teacher noticing is framed by heterogeneous conceptualizations and terminologies. In a systematic literature review, Santagata et al. (2021) identified four perspectives on teacher noticing: (1) a cognitive-psychological perspective that conceptualizes noticing as a set of mental processes that teachers engage in during instruction (e.g., van Es and Sherin 2002), (2) a socio-cultural perspective, often associated with the term “professional vision,” that points out the role of social interaction within groups of professionals in shaping a common perception and understanding of meaningful events (Goodwin 1994), (3) a discipline-specific perspective that conceptualizes noticing as a set of practices teachers engage in to support their own sensitivity (Mason 2002), and (4) an expertise-related perspective that highlights differences between novice teachers and experts with respect to their ways of seeing and making sense of observed instructional practice (Berliner 1988). The practice of measuring teacher noticing with standardized test instruments was especially influenced by the cognitive-psychological perspective described in detail below.

Seen from the cognitive-psychological perspective, noticing is conceived of as a set of closely interrelated mental processes, called noticing facets, that teachers engage in during instruction and “through which teachers manage the ‘blooming, buzzing confusion of sensory data’ with which they are faced” (Sherin et al. 2011b, p. 7). Three noticing facets were differentiated by van Es and Sherin (2002): “(a) identifying what is important or noteworthy about a classroom situation; (b) making connections between the specifics of classroom interactions and the broader principles of teaching and learning they represent; and (c) using what one knows about the context to reason about classroom interactions” (p. 573). Focusing on noticing children’s mathematical thinking, this approach was restructured and expanded by Jacobs et al. (2010), who differentiated the noticing facets as (a) “attending to children’s strategies,” (b) “interpreting children’s mathematical understandings,” and (c) “deciding how to respond on the basis of children’s understanding” (pp. 172–173). However, no consensus has been reached so far with respect to how many and what kind of facets are relevant to conceptualize and investigate teacher noticing (Dindyal et al. 2021).

Given the heterogeneity of perspectives and conceptualizations of teacher noticing, the development of a consistent theoretical framework serving as a basis for test development is challenging. The present study builds upon the widely accepted theoretical framework by Blömeke et al. (2015a) of competence as a continuum, in which a set of situation-specific skills, that is, perception, interpretation, and decision-making, is conceptualized as a mediator between dispositions (e.g., knowledge or beliefs) and performance. This model can be seen as an extension of cognitive approaches to competence, which primarily focus on professional knowledge (e.g., Baumert and Kunter 2013). Noticing, in our framework, is thus seen as part of professional competence and conceptualized as a set of situation-specific skills, which is comparable to the mental processes focused on within the psychological perspective on teacher noticing (e.g., Jacobs et al. 2010). For our own framework, we use the more neutral term “noticing facet” when referring to the different qualities of skills/processes.

The competence as a continuum model was transferred to mathematics teaching by Kaiser et al. (2015, p. 374), who conceptualized teacher noticing as “(a) Perceiving particular events in an instructional setting, (b) Interpreting the perceived activities in the instructional setting and (c) Decision-making, either as anticipating responses to students’ activities or as proposing alternative instructional strategies”. Although this model’s focus is on mathematics teaching, the scope of noticing is broadened by considering subject-specific as well as generic pedagogical issues in whole lessons, including noticing of students’ and teachers’ actions. For this framework, the first facet is termed “perception” instead of “attending” with reference to the research on teacher expertise. Within this research strand, perception denotes teachers’ processing of relevant sensory information (e.g., Carter et al. 1988). While attending emphasizes the selectivity of information processing (e.g., attending to a relevant detail within a complex perceptual field), the term perception implies a stronger focus on perceiving (and remembering) clearly discernible events. Consequently, the accurate perception of classroom events does not necessarily require professional knowledge (or experience), even though knowledge inevitably shapes perception.

The second facet, interpretation, refers to the teacher’s thinking about what they have observed using their knowledge and experience, thereby “relating observed events to abstract categories and characterizing what they see in terms of familiar instructional episodes” (Sherin et al. 2011b, p. 5). Based on their perceptions and interpretations, teachers must determine appropriate instructional responses to classroom events; this is the third facet, decision-making.

2.2 Standardized testing of teacher noticing: Test design and validation

Standardized noticing tests commonly include the presentation of classroom artifacts—usually short videos of classroom practice—combined with rating items or open questions to capture the various noticing facets (Jacobs et al. 2010; Kaiser et al. 2015; Star and Strickland 2008). Since noticing does not represent one homogeneous construct, measurement instruments commonly target a specific focused domain, which is related to subject-specific aspects, such as children’s mathematical thinking (Jacobs et al. 2010) or instructional support in primary science teaching (Todorova et al. 2017), or generic pedagogical aspects (Seidel and Stürmer 2014; Wiens et al. 2020).

Given that this measurement approach is comparatively new, the investigation of validity is of particular relevance. Following AERA et al. (2014), validity is understood as a unitary concept, which “refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (p. 11). Consequently, researchers are required to specify the intended interpretation(s) of test scores and the intended test use, including a precise definition of the underlying construct, and to collect theoretical and/or empirical validity evidence to support these interpretation(s).

For valid interpretations of noticing tests, it should be investigated, using factor analysis or item response theory (IRT), whether the theoretical conceptualization of noticing, especially regarding the differentiation of noticing facets, corresponds to the measurement. However, previous findings vary. For example, Seidel and Stürmer (2014) found that a three-dimensional model, distinguishing the facets description, explanation, and prediction, fitted the data better than a one-dimensional model even though the intercorrelations were large (0.77 ≤ r ≤ 0.89). Measuring perception, interpretation, and decision-making, Bastian et al. (2021) favored a three-dimensional over a one-dimensional model with high latent correlations between perception and interpretation (r = 0.814) and between interpretation and decision-making (r = 0.815), but a lower correlation between perception and decision-making (r = 0.462). By contrast, other studies’ findings are interpreted in favor of a unidimensional structure (Gold and Holodynski 2017; Meschede et al. 2015). For example, Meschede et al. (2015) report that describing and interpreting are almost inseparable (r = 0.99).

2.3 Noticing as a learning outcome of teacher education

Using a noticing test, among other instruments, for the evaluation of teacher education programs requires evidence that a substantial proportion of the variance in test scores can be explained by relevant factors in teacher education. In line with the educational concept of learning opportunities in teacher education (Floden 2002; Schmidt et al. 2011), the acquisition of professional competence is conceptualized as an interplay of (1) pre-service teachers’ individual prerequisites and (2) their perception of having been exposed to formal opportunity to learn (OTL). Considering both aspects, the following subsections give reasons for the variables selected for the present validation study and summarize existing evidence.

2.3.1 Individual prerequisites

The average grade in the final secondary school examinations is commonly used as a distal indicator of cognitive abilities in research on teacher competence (e.g., Kunina-Habenicht et al. 2013). However, the average grade is also related to knowledge, (academic) motivation, and learning strategies (Mayr 2010) and is predictive of future academic achievement (Trapmann et al. 2007).

As a broad indicator of academic ability, the average grade in final secondary school examinations has also been shown to predict teachers’ professional knowledge regarding general pedagogy (Kunina-Habenicht et al. 2013) as well as teachers’ subject-specific content knowledge and pedagogical content knowledge for several domains (König et al. 2018; Lindl and Krauss 2017). As noticing has been conceptualized as knowledge-based, the average grade should therefore also be related to the acquisition of noticing. However, existing findings vary: An effect of the average grade on noticing was found by Wiens and Gromlich (2018) (β = 0.17) as well as by Todorova et al. (2017) (β = −0.25/−0.31; lower grades indicate better performance). By contrast, other studies did not find such effects (Stürmer et al. 2015; Wiens et al. 2013), suggesting that the relationship depends on the investigated sample and the specific operationalization of noticing.

Before and during their studies, pre-service teachers can gain pedagogical experience in such contexts as private tutoring and coaching sports teams. These activities can be conceptualized both as informal learning opportunities and as individual prerequisites that facilitate the acquisition of professional knowledge (König et al. 2012; Kunina-Habenicht et al. 2013). Pedagogical experience in the context of teaching might also promote pre-service teachers’ noticing by providing opportunities for using the acquired knowledge in situations of pedagogical action.

However, some cross-sectional studies have not found a correlation between noticing and pedagogical experience, internship experience, or teaching experience (Jamil et al. 2015; Stürmer et al. 2015; Todorova et al. 2017). Nevertheless, teaching practice can generally support pre-service teachers’ noticing, as has been shown for long-term teaching internships (e.g., Mertens and Gräsel 2018, d = 0.79).

2.3.2 Use of opportunity to learn

OTL can be broadly defined as experiences that aim to achieve a learning outcome (Tatto et al. 2008). There is substantial variation in the perceived amount of OTL for mathematics pedagogy and teaching mathematics among pre-service teachers (Christiansen and Erixon 2021). However, only a few studies have investigated the influence of program features within teacher education on the acquisition of noticing (Stürmer et al. 2015; Todorova et al. 2017; Wiens et al. 2013). For example, Stürmer et al. (2015) showed that noticing conceptualized as professional vision was associated with the number of generic pedagogical courses (β = 0.31). Furthermore, Todorova et al. (2017) found that pre-service teachers with a study focus on science teaching outperformed their colleagues with respect to noticing science-specific aspects (β = 0.30/0.33).

In line with international studies on school achievement and teacher competence, OTL can be operationalized by the specific content a learner has dealt with up to a certain time (Kunina-Habenicht et al. 2013; Schmidt et al. 2011). The amount of OTL experienced is related to the acquisition of professional knowledge during teacher education; thus, OTL is an appropriate variable to use for validation purposes with respect to measures of learning outcomes (König et al. 2018; Kunina-Habenicht et al. 2013). Because differentiated assessments of OTL have not yet been linked to teacher noticing, they are an interesting measure for validation purposes and for exploring the effects of teacher education on noticing.

3 Research questions and background of the study

In recent years, considerable efforts have been made to develop video-based instruments that enable the contextualized assessment of teachers’ competence (e.g., Gold and Holodynski 2017; Jamil et al. 2015; Seidel and Stürmer 2014). Our study draws on an instrument developed within the study TEDS-FU, namely the TEDS-FU Video Test, which targeted early career secondary mathematics teachers’ noticing skills.

Within the study TEDS‑M, teachers’ competence was addressed using standardized knowledge tests. To evaluate teachers’ competence more closely connected to teaching practice, the conceptual framework of TEDS‑M was extended within TEDS-FU by considering teachers’ situation-specific skills—that is, perception, interpretation, and decision-making skills—and assessing them using video-based test instruments (Kaiser et al. 2015). In TEDS-FU, the original participants, who had been at the end of their teacher education when participating in TEDS‑M, were approached another time after 2.5–3 years of work as early-career teachers. The test development was accompanied by curricular analyses to ensure the accuracy of the mathematical content and expert workshops to discuss the suitability of the test items and instructional events presented in the videos (Kaiser et al. 2015; Hoth et al. 2016).

Test performance was empirically correlated with professional knowledge (Blömeke et al. 2015b). In further studies, namely TEDS-Instruct and TEDS-Validate, the TEDS-FU Video Test was used with practicing teachers with different lengths of teaching experience (Bastian et al. 2021). These findings suggest that the TEDS-FU Video Test can be validly interpreted as a measure of in-service teachers’ noticing skills. However, it remains an open question whether the instrument can be used with pre-service teachers who have little teaching experience.

Our study aims to provide specific validity evidence that the TEDS-FU Video Test can be used with pre-service teachers and measures their noticing skills as one learning outcome of teacher education. For this purpose, we focus on (1) validity evidence based on the internal test structure and (2) associations with relevant factors within university teacher education.

RQ1

Does the TEDS-FU Video Test reliably measure pre-service teachers’ noticing skills—perception, interpretation, and decision-making skills—as the three interrelated facets of noticing?

Evidence based on internal test structure is needed to ensure that the test provides a reliable measure of the differentiated facets for pre-service teachers and that measurement is not affected by limited variance. Therefore, a one-dimensional scaling model (noticing as one holistic facet) is compared to a three-dimensional model that distinguishes pre-service teachers’ perception, interpretation, and decision-making skills. We hypothesize that the three-dimensional model is superior to the one-dimensional model.

To provide validity evidence, the pattern of intercorrelations should correspond to theoretical presumptions (see AERA et al. 2014). Perception and interpretation are discussed as closely related (Sherin et al. 2011b), both being informed by professional knowledge (e.g., Wolff et al. 2021). We thus predict a high correlation between these noticing facets. Similarly, since teachers’ decision-making should be based on a sound interpretation (Bastian et al. 2021; Jacobs et al. 2010), we also expect a high correlation between interpretation and decision-making. By contrast, pre-service teachers may perceive and remember discernible features of the classroom without being able to propose an adequate response. We therefore expect only a moderate correlation between perception and decision-making.

RQ2

Can pre-service teachers’ noticing scores be explained by (a) the participants’ individual prerequisites, namely, the average grade in the final secondary school examinations and pedagogical experience, or (b) the participants’ use of formal OTL?

We expect pre-service teachers with higher academic ability (indicated by the average grade) to score higher in noticing, as their better prerequisites help them to acquire and apply knowledge. We further expect teaching experiences (e.g., private tutoring) but not nurturing experiences (e.g., caring for younger brothers and sisters) to correlate with noticing, as only teaching experiences provide opportunities to use the theoretical knowledge acquired for reflecting on teaching situations. As formal OTL provide situations for acquiring and possibly applying professional knowledge, we expect OTL in general pedagogy and mathematics education to predict noticing. With reference to previous findings, we expect small effect sizes for all factors considered. However, the explained variance should be taken into account for the evaluation of validity evidence.

4 Methodology

4.1 Sample

A sample of 313 pre-service mathematics teachers was surveyed between spring 2019 and fall 2020 at six German universities. Table 1 shows the demographic statistics of the present sample. Pre-service teachers were recruited before they entered their first long-term school internship and so had little teaching experience in the context of university teacher education. For all universities except Würzburg, the internships took place during the master’s degree phase. At the University of Würzburg, the study program is organized as a state examination and not divided into bachelor’s and master’s degrees. Therefore, participants from this university were recruited before entering their study-related teaching internships in the fourth to sixth study semesters.

Table 1 Demographic statistics

The participants were contacted by the lecturers of their courses, which focused on mathematics teaching in preparation for the long-term teaching internships. They received an internet link via e-mail that led to an online platform hosting the survey, including noticing tests and questions on supplemental information. Completing the questionnaire took approximately 90 min, and participants received 15 euros as compensation. Data collection and processing were in accordance with the requirements of the General Data Protection Regulation.

4.2 Measures

Teacher noticing

Pre-service teachers’ noticing skills were assessed using the TEDS-FU Video Test (Kaiser et al. 2015), which includes three scripted video vignettes, each about 3.5 min long: (1) Frog King, based on a German fairy tale, (2) Box, and (3) Solids. Scripted vignettes were used, rather than videos of authentic instruction, to ensure a sufficient density of events relevant to mathematics teaching and to general pedagogy. The vignettes show compilations of ninth-grade mathematics lessons in different school types that cover a wide range of mathematical topics (e.g., volume calculations, functions, and surfaces) and different instructional phases. Before watching each vignette, the participants received some information about the students, the learning context, and the mathematical topics. Participants were permitted to watch each vignette only once.

Since a detailed description of the three video vignettes can be found in previous publications (see Kaiser et al. 2015), we restrict ourselves to describing one vignette only. Box refers to a secondary mathematics classroom of academic-track ninth-grade students who are asked to compute the volume of an open box made from a rectangular sheet with four congruent squares cut off the corners. The volume of the box can be determined based on a function of the size of the cut-off squares. Three pairs of students are shown solving the task in diverse ways. The results are then collected in the whole-class discussion.

After each vignette, rating items and open response items were administered to assess the participants’ perception, interpretation, and decision-making skills (see Table 2), focusing on both subject-specific and general pedagogical aspects of mathematics teaching. For rating items, the participants indicated the extent to which they (dis)agreed with statements on the observed practice on four-point Likert scales (fully correct to not correct at all).

Table 2 Item number and distribution of items across noticing facets

The items with a focus on perception mainly consisted of rating scales including descriptive statements (e.g., “Most students take an active part in the lesson”). Working on these items required the participants to carefully watch the video clips, but not to draw on their professional knowledge. By contrast, items focusing on interpretation, which included both item formats, required the participants to link the observed practice to broader principles of teaching and learning. For the example item in Fig. 1, the participants had to connect the approaches of three pairs of students shown in the video to different modes of representation (enactive, iconic, and symbolic). Working on such items also requires a certain degree of perceptual processing. However, the items were constructed to explicitly focus on interpretative processes (e.g., by addressing contrasting descriptions or the application of concepts), and the participants’ perception was supported using pictures of the teaching situations and short introductory texts. Items focusing on decision-making solely comprised open response items and required the participants to propose possible continuations to the instructional practice observed or to create alternatives to the teacher’s actions in the video (see Fig. 2). To create unambiguous items with a clear focus on decision-making, the item texts suggested an interpretation of the relevant situation in the video (e.g., the specification of a learning goal for the class).

Fig. 1 Open response item targeting interpretation with a focus on mathematics teaching (a) and general pedagogy (b)

Fig. 2 Open response item focusing on decision-making with respect to mathematics-related aspects of teaching

The scoring procedure was based on an expert survey. For the rating scales, the participants’ answers were coded as “correct” if they matched the expert master rating. Scoring of open response items was conducted with an extensive coding manual based on the experts’ solutions; it resulted in good interrater reliability (κmean = 0.80; κmin = 0.47; κmax = 1.0).
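The interrater reliability reported above is a chance-corrected agreement coefficient. As a minimal sketch, assuming that Cohen's kappa for two raters was computed per item, the calculation could look as follows; the function name and the example codes are hypothetical and not taken from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same responses."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of responses")
    n = len(rater_a)
    # Observed agreement: share of responses with identical codes.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal code frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes (0 = incorrect, 1 = correct) for ten open responses.
codes_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
codes_b = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0]
kappa = cohen_kappa(codes_a, codes_b)
```

Averaging such item-wise coefficients would yield a mean kappa of the kind reported; whether the reported mean was obtained in exactly this way is an assumption.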

Individual prerequisites

In addition to the average grade in the final secondary school examinations (minimum: 4.0; maximum: 1.0), the participants’ pedagogical experience was assessed using a measure by König et al. (2013). On five dichotomous items (yes/no), the participants indicated whether they had had a specific pedagogical experience or not. The items address nurturing experience on the one hand and teaching experience outside of formal studies on the other. Example items and descriptive statistics are shown in Table 3.

Table 3 Overview of measures for OTL and pedagogical experience

Opportunity to learn

Pre-service teachers’ formal OTL was assessed with respect to (1) teaching mathematics (Doll et al. 2018) and (2) general pedagogy (König et al. 2017). The participants had to indicate (yes/no) whether specific content had been treated within their previous teacher training. The content represented central topics of German teacher training within the two areas focused on. Subscales, example items, and descriptive statistics can be found in Table 3. Internal consistency was at least acceptable for all subscales.
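The coefficient used for internal consistency is not named here. As an illustrative sketch only, assuming Cronbach's alpha on the dichotomous (yes/no) OTL items, a subscale's consistency could be computed as follows; the function name and the response matrix are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores holds one list of item scores per respondent."""
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("alpha requires at least two items")
    # Variance of each item across respondents.
    item_variances = [pvariance([row[i] for row in item_scores]) for i in range(n_items)]
    # Variance of the sum score across respondents.
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("sum scores do not vary")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical yes/no answers (1 = content was treated) of six respondents to four items.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
alpha = cronbach_alpha(answers)
```

For dichotomous items, this coefficient coincides with the Kuder-Richardson formula 20.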

4.3 Data analysis

Test data were scaled based on item response theory (IRT) with ConQuest software (Wu et al. 1997) using Rasch models. To investigate the internal test structure of the TEDS-FU Video Test (RQ1), an IRT scaling model was initially estimated with one latent variable. Then, a multidimensional IRT model with three latent variables, (1) perception, (2) interpretation, and (3) decision-making, was specified. The two scaling models (see Fig. 3) were compared with respect to the expected a posteriori/plausible values (EAP/PV) reliability, the weighted likelihood estimates (WLE) reliability, the theta variance, the model deviance, and the sample-size-adjusted Bayesian information criterion (BIC).
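Information criteria of this kind penalize the model deviance by the number of estimated parameters. The following sketch shows the standard BIC and the sample-size-adjusted BIC in Sclove's form; all deviances and parameter counts are hypothetical placeholders, and only the sample size of 313 is taken from the study.

```python
import math


def bic(deviance, n_params, n_persons):
    """Bayesian information criterion: deviance plus k * ln(n)."""
    return deviance + n_params * math.log(n_persons)


def adjusted_bic(deviance, n_params, n_persons):
    """Sample-size-adjusted BIC: n is replaced by (n + 2) / 24."""
    return deviance + n_params * math.log((n_persons + 2) / 24)


# Hypothetical deviances and parameter counts for two competing scaling models.
N_PERSONS = 313
one_dim = adjusted_bic(deviance=20000.0, n_params=60, n_persons=N_PERSONS)
three_dim = adjusted_bic(deviance=19900.0, n_params=65, n_persons=N_PERSONS)
preferred = "three-dimensional" if three_dim < one_dim else "one-dimensional"
```

The model with the lower criterion value is preferred. Because the adjusted penalty per parameter is smaller than ln(n), the adjusted BIC penalizes additional dimensions less strongly than the standard BIC.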

Fig. 3 Scaling models of teacher noticing

To explore the relationship between test performance (WLE estimates of person ability) and factors within teacher education (RQ2), multiple regression models were estimated using Mplus (Muthén and Muthén 1998–2006). The stratified structure of the sample was considered by using the option “type = complex” and specifying a combined variable of university and teacher education program as a stratum.

5 Results

5.1 RQ1: Internal test structure and reliability

Item analysis was conducted for the one-dimensional and three-dimensional partial credit models. Seven items were removed from analysis since they exceeded a weighted mean square (WMSQ) of 1.25 or showed poor item discrimination (< 0.15). Two further items with critical fit statistics were kept for theoretical reasons. For the remaining items, item discrimination was, on average, good (M = 0.30, min. = 0.15, max. = 0.54) and WMSQs were in an appropriate range (0.88 < WMSQ < 1.13; Bond and Fox 2015).
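To illustrate the fit statistic used for item screening, the sketch below computes the weighted mean square (infit) for a single dichotomous Rasch item and applies the two cut-offs stated above. It is a simplification: the study used partial credit models estimated in ConQuest, and all function names are hypothetical.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def weighted_mean_square(responses, thetas, difficulty):
    """Infit: summed squared residuals divided by summed response variances."""
    probs = [rasch_probability(theta, difficulty) for theta in thetas]
    squared_residuals = sum((x - p) ** 2 for x, p in zip(responses, probs))
    information = sum(p * (1 - p) for p in probs)
    return squared_residuals / information


def keep_item(wmsq, discrimination, max_wmsq=1.25, min_discrimination=0.15):
    """Apply the exclusion rule: drop if WMSQ > 1.25 or discrimination < 0.15."""
    return wmsq <= max_wmsq and discrimination >= min_discrimination


# Hypothetical responses of four persons with given ability estimates to one item.
example_wmsq = weighted_mean_square([1, 0, 1, 0], [1.0, -1.0, 0.5, -0.5], difficulty=0.0)
```

Values near 1 indicate that the observed responses vary about as much as the model predicts. The rule is purely statistical; as noted above, two items with critical fit statistics were nevertheless kept for theoretical reasons.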

The results of both scaling models are depicted in Table 4. The model deviance and the corresponding likelihood ratio test revealed that the three-dimensional model fitted the data significantly better than the one-dimensional model. The WLE and EAP/PV reliability can be considered as very good for the one-dimensional model. Regarding the three-dimensional model, the reliability was still acceptable or good for perception and interpretation, but the WLE reliability for decision-making was very low, which is in part due to the smaller number of items used for this dimension (see Table 2).
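The likelihood ratio test compares nested models through the difference in their deviances, which is referred to a chi-square distribution with as many degrees of freedom as additional parameters are estimated. The sketch below uses hypothetical deviances and assumes five additional parameters for the three-dimensional model (two further variances and three covariances); neither is taken from Table 4.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df (closed form)."""
    if df < 1 or x < 0:
        raise ValueError("df must be a positive integer and x non-negative")
    if x == 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= x / (2 * r)
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite series in chi = sqrt(x).
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return math.erfc(chi / math.sqrt(2)) + 2 * density * total


def likelihood_ratio_test(deviance_restricted, deviance_general, extra_params):
    """Return the deviance difference and its chi-square p value."""
    statistic = deviance_restricted - deviance_general
    return statistic, chi2_sf(max(statistic, 0.0), extra_params)


# Hypothetical deviances of the one- and three-dimensional models.
statistic, p_value = likelihood_ratio_test(20000.0, 19900.0, extra_params=5)
```

A small p value indicates that the more general model fits significantly better. The test applies only to nested models, so non-nested comparisons have to rely on information criteria instead.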

Table 4 Comparison of the one-dimensional and the three-dimensional scaling model

With respect to the latent intercorrelations, the correlation was high between perception and interpretation (rPI = 0.704) and between interpretation and decision-making (rID = 0.730); it was lower between perception and decision-making (rPD = 0.292). This latter correlation (rPD) was significantly lower than rPI (z = −11.928; p < 0.001) and rID (z = −12.361; p < 0.001), which is in line with our hypotheses. In sum, scaling analysis and intercorrelations support the superiority of the three-dimensional model.

In exploratory analyses, two further two-dimensional models were tested: (1) perception and interpretation vs. decision-making (PI-D; rPI-D = 0.681) and (2) perception vs. interpretation and decision-making (P-ID; rP-ID = 0.671). Both models showed better model fit than the one-dimensional model, and the P-ID model even fitted better than the three-dimensional model (see Table 4). This result suggests that combining interpretation and decision-making may provide a more efficient approach to measuring noticing when using this instrument. However, the lower deviance of the P-ID model is partly explained by the low reliability of the decision-making dimension and the low number of items focusing on it. To account for possible differences between interpretation and decision-making regarding their relationship with other variables, the three-dimensional model was used for the subsequent analyses.

It should be noted that the items measuring perception mainly have a focus on general pedagogy, while decision-making items predominantly address mathematics teaching. Consequently, the correlation between decision-making and perception may be underestimated. However, further analyses revealed that the low correlation between perception and decision-making is likely not a result of different domains focused on by the items (see Online Resource 1).

5.2 RQ2: Effects of individual prerequisites and opportunity to learn

The (manifest) correlations between noticing facets and factors within teacher education including individual prerequisites and OTL can be found in Online Resource 2. Study semester, average grade in the final secondary school examinations, and a dichotomous indicator of the teacher education program (0 = lower and upper secondary school/vocational school, 1 = lower secondary school/special needs education) were included as control variables for all models. Other demographic variables were not included since they did not correlate with noticing. For each facet, four regression models were specified: one model focusing on individual prerequisites, two separate models including OTL in mathematics education or general pedagogy, and one model with all predictors being included. Separate models were specified for the two domains of OTL to avoid a loss of statistical power owing to the high correlation between OTL in mathematics education and OTL in general pedagogy (r = 0.48).

The results of the multiple regression analysis can be seen in Table 5. Against our expectations, for the perception facet, only a small proportion of the variance was explained. Only the average grade in the final secondary school examinations showed a significant but small effect on perception, with better test performance being associated with a better average grade. Interpretation and decision-making were also significantly predicted by the average grade in the final secondary school examinations. However, against our assumptions, the only effect that could be found for OTL was a small effect of OTL in mathematics education on decision-making. Another very small effect of OTL in mathematics education on interpretation was not significant when controlling for OTL in general pedagogy. For all facets, no effect of pedagogical experience or OTL in general pedagogy was found.

Table 5 Multiple regression models for perception, interpretation, and decision-making

Using a combination of university and education program as a cluster variable, a considerable proportion of variance was found to be at the program level for interpretation (ICC = 0.11) and decision-making (ICC = 0.09), but not for perception (ICC < 0.01). Therefore, additional multilevel regression models were specified to explore the effects of OTL on interpretation and decision-making when distinguishing between individual use of OTL and the influence of the teacher education program (i.e., the context effect). The results can be found in Online Resource 3, Table 1, and are comparable to the results of the regression models reported above, except for a very small effect (β = 0.12) of OTL in general pedagogy predicting interpretation at level 1. However, it should be noted that a substantial proportion of the program-level variance in interpretation (around 45%) was explained when OTL in mathematics education was included in the model. Even though this effect was not significant (the small number of clusters reduces the statistical power), it can serve as a starting point for further analyses.
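For readers less familiar with the ICC, the following minimal sketch shows one common way to estimate the share of variance located at the cluster level, namely the one-way ANOVA estimator ICC(1) for equally sized clusters. This is an illustration only and not necessarily the estimator used for the analyses reported here; the function name `icc1` and the example scores are hypothetical.

```python
from statistics import mean


def icc1(groups):
    """One-way ANOVA estimate of ICC(1) for equally sized clusters.

    `groups` is a list of clusters (e.g., programs), each a list of scores.
    This balanced-design estimator is an assumption made for illustration.
    """
    n_groups = len(groups)
    k = len(groups[0])
    if n_groups < 2 or k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need at least 2 clusters of equal size >= 2")

    grand_mean = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]

    # Mean square between clusters and mean square within clusters.
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n_groups - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (n_groups * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("all scores are identical; ICC is undefined")
    return (ms_between - ms_within) / denominator


# Hypothetical scores for three programs (not data from the present study).
example = [[0.2, 0.5, 0.1], [0.9, 1.1, 0.7], [0.4, 0.3, 0.6]]
print(round(icc1(example), 3))
```

Values close to zero indicate that cluster membership explains almost none of the variance, as reported for perception, whereas larger values indicate that a multilevel specification is worth considering.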

To examine the effect of the OTL subscales, a multiple regression model for each subscale was specified, including study semester, teacher education program, and average grade in the final secondary school examinations as control variables (see Table 6). Using separate models accounts for possibly reduced power caused by moderate correlations between the OTL subscales (see Online Resource 2, Tables 2 and 3). Given the increased alpha risk due to the number of models estimated, the significance criterion was reduced by a factor of 0.1 (from p < 0.05 to p < 0.005), which is equivalent to a Bonferroni correction considering ten significance tests per noticing facet. While no significant effect was found for any of the OTL scales related to general pedagogy, both interpretation and decision-making were significantly predicted by the subscale “research on mathematics education,” and decision-making was further predicted by the subscale “curricular aspects/assessment.” The effect sizes for all significant coefficients were small. With respect to perception, no subscale showed any significant effect.
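The adjusted significance criterion described above can be reproduced with a few lines of arithmetic. The sketch below assumes the conventional nominal level of 0.05 and the ten tests per facet stated in the text; the function names and the example p-value are hypothetical and serve only as illustration.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def is_significant(p_value: float, alpha: float = 0.05, n_tests: int = 10) -> bool:
    """Check a single p-value against the Bonferroni-adjusted criterion."""
    return p_value < bonferroni_alpha(alpha, n_tests)


# Ten significance tests per noticing facet, nominal alpha of 0.05.
threshold = bonferroni_alpha(0.05, 10)  # approximately 0.005
print(threshold)

# A hypothetical coefficient with p = 0.02 would pass the nominal
# criterion but not the adjusted one.
print(is_significant(0.02))
```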

Table 6 Summary of beta coefficients of OTL subscales

To report a final estimation of how much variance in the participants’ noticing skills can be explained by individual prerequisites and use of OTL in the present study, a summarizing structural equation model was specified (see Fig. 4). This procedure is exploratory, and its results therefore cannot be transferred to other samples. For the model, perception, interpretation, and decision-making were used as indicators of noticing, and all variables that showed significant effects in the previous analyses were added as predictors, that is, the OTL subscales “research on mathematics education” and “curricular aspects/assessment,” the average grade, the study semester, and the teacher education program. The OTL subscales were modeled as indicators of OTL in mathematics education as a latent variable. The resulting model showed acceptable model fit (CFI = 0.943; RMSEA = 0.063; SRMR = 0.055; χ² = 33.665, df = 16, p < 0.01) and explained a considerable proportion of variance in pre-service teachers’ noticing (R² = 0.226).
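As a plausibility check on the reported chi-square test, the p-value can be recomputed from the test statistic and the degrees of freedom. The sketch below uses the closed-form upper-tail probability of the chi-square distribution that holds for even degrees of freedom, so that only the standard library is needed; the function name is hypothetical, and the check assumes an unscaled chi-square statistic.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variate.

    Uses the closed form exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!,
    which is valid for even degrees of freedom only.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x < 0:
        raise ValueError("x must be non-negative")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


# Values as reported for the structural equation model.
p_value = chi2_sf_even_df(33.665, 16)
print(f"p = {p_value:.4f}")
```

The resulting value lies between 0.005 and 0.01, which is consistent with the reported p < 0.01.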

Fig. 4 Factors within teacher education (individual prerequisites and use of OTL) predicting noticing skills (*p < 0.05, **p < 0.01, ***p < 0.001). The correlation between noticing and the average grade is negative, since in Germany lower average grades indicate better performance

6 Discussion

We investigated whether an established video-based test instrument, the TEDS-FU Video Test, originally developed to capture in-service mathematics teachers’ noticing skills, could be used to measure pre-service teachers’ noticing as a learning outcome of teacher education. The test was implemented in a survey of 313 pre-service teachers from different universities. We aimed to provide group-specific validity evidence by examining the internal test structure (RQ1) and the test’s association with influential factors within teacher education including individual prerequisites and OTL (RQ2).

6.1 Measurement of perception, interpretation, and decision-making

With respect to our first research question, IRT scaling analysis revealed that the TEDS-FU Video Test provided a reliable measurement of the three noticing facets of perception, interpretation, and decision-making among the new target group. High correlations were found between perception and interpretation as well as between interpretation and decision-making, while perception and decision-making were only weakly correlated. This pattern corresponds to the theoretical assumptions on the structure of teacher noticing (Jacobs et al. 2010; Sherin et al. 2011b) and can thus be seen as validity evidence. Moreover, the correlations described are in line with the findings by Bastian et al. (2021), who investigated the TEDS-FU Video Test in a concurrent scaling analysis including pre-service teachers and in-service teachers.

Even when used for a pre-service teacher sample with limited formal access to practical teaching, the TEDS-FU Video Test provided a reliable measurement of the three noticing facets including decision-making skills. As novices, our target group can be assumed to have severe difficulties in quick decision-making (e.g., Carter et al. 1988; Stigler and Miller 2018) as they lack well-organized cognitive schemata and are not able to anticipate potential further courses of classroom events. However, they do not seem to be unfamiliar with classroom situations; their noticing abilities vary to a substantial degree, resulting in differentiated reliable measures. We conclude from this that the transfer of this video-based noticing instrument as part of competence assessment into teacher education is possible.

6.2 Associations with factors within university teacher education

To interpret the test scores as a learning outcome of teacher education, evidence is required that the test scores are associated with relevant factors in teacher education including individual prerequisites and use of OTL. In our study, the average grade in the final secondary school examinations was the strongest predictor of noticing skills, with small to moderate effect sizes, which is in line with a previous meta-analysis highlighting (average) school grades as predictors of academic achievement (Trapmann et al. 2007). This association could be further explained by both the average grade in the final secondary school examinations and noticing being connected with information processing. This corresponds to the finding that the average grade showed a higher correlation with interpretation than with perception (z = 3.44; p < 0.001) and decision-making (z = 2.32; p = 0.02); this is possibly explained by interpretation being cognitively demanding and requiring knowledge when applying theories and concepts to observed instructional events. Overall, the effects of the average school leaving grade found in our study contradict the hypothesis by Stürmer et al. (2015), who assume that the average school grade in the final secondary school examinations is suitable to predict knowledge acquisition but not the application of knowledge into practice. The relationship between noticing and the average grade may, however, depend strongly on how noticing is operationalized.

Although, on the theoretical level, noticing skills should be developed among pre-service teachers when engaging in teaching practice, no relationship between pre-service teachers’ noticing and pedagogical experience was found in our study; this is in line with previous findings on the construct noticing (conceptualized as professional vision) measured by video-based tests (e.g., Stürmer et al. 2015; Todorova et al. 2017). Without explicit training, pre-service teachers may not automatically draw on their knowledge acquired during teacher education when they teach. Our findings suggest that targeted interventions are needed to support pre-service teachers in linking theory (e.g., principles of teaching and learning) and teaching practice (Stürmer et al. 2013; Weber et al. 2018).

For OTL in university teacher education, only a few effects were found. OTL in mathematics education significantly predicted pre-service teachers’ decision-making with a small effect size. This finding cannot be regarded as strong validity evidence. However, König et al. (2018) found no effect of overall OTL in mathematics education on pre-service teachers’ pedagogical content knowledge. The authors assume that mathematics teacher education courses provide highly structured curricular requirements and therefore do not allow pre-service teachers a completely free choice of content during their study, thus reducing the variance of OTL measures and limiting effect sizes. Against this background and considering the small effect sizes in previous studies on the acquisition of teacher noticing (e.g., Wiens et al. 2013), it is encouraging that significant correlations between OTL and the TEDS-FU Video Test could be found at all.

König et al. (2018) found that only the subdimension “research on teaching mathematics” significantly predicted pedagogical content knowledge. In the present study, this subdimension predicted interpretation and decision-making, suggesting that this domain has diagnostic value. However, the significant effect of specific OTL in mathematics education on teacher noticing, including “research on teaching mathematics” and “curricular aspects,” can also be explained by a high degree of constructive alignment. In particular, the content related to research includes, among other things, theories on the development of mathematical competence (e.g., Bruner’s modes of representation) and the role of applying mathematics to real-world problems, both covered by the TEDS-FU Video Test. Moreover, the correlation between test scores and the subscale focusing on research might reflect that the development of this instrument, which was mainly conducted by researchers from mathematics education, was highly influenced by prominent mathematics educational theories and findings within this research area and reflects a common core of courses in mathematics education.

In contrast, the absence of general pedagogy OTL effects in our analysis may be attributed to the strong focus of the test on teaching mathematics, even though general pedagogical issues are also included. However, with respect to the discourse on the effectiveness of teacher education, it is also possible that for OTL in general pedagogy, the theoretical contents are not sufficiently related to practice, for example, by using video clips or other forms of practical examples.

To sum up, the TEDS-FU Video Test allows for a reliable measurement of pre-service mathematics teachers’ noticing skills and is significantly associated with factors within teacher education, that is, individual prerequisites and OTL. These factors explained nearly 23% of the variance in noticing, which is comparable to previous studies (Stürmer et al. 2015; Todorova et al. 2017). However, the variance explained is mainly due to the average grade as an indicator of academic ability, while only a few effects of OTL were shown, suggesting that test scores should not be interpreted as learning outcomes at the individual level (e.g., for individual diagnostics), but rather as measures of program effects. This conclusion is also supported by the multilevel regression models reported, as OTL in mathematics education explained a considerable proportion of variance in interpretation test scores at the program level.

6.3 Limitations and directions for future research

The following limitations should be considered. First, the analysis is based on a convenience sample, so the variance may be restricted due to selection effects. Also, it should be noted that the analyses are based on cross-sectional data. So, the effects of OTL and study semester cannot be interpreted as effects of development.

Exploratory analyses of the internal structure revealed that the two-dimensional model merging interpretation and decision-making shows better model fit than the three-dimensional model. This result is partly due to the comparatively low reliability of decision-making, suggesting that further efforts in test development would be helpful. It should be emphasized that the effect sizes for interpretation and decision-making were slightly different, suggesting that the facets are separable. Moreover, the multilevel analysis conducted suggested that the two facets differ regarding their relationships to OTL at the individual and the program level.

Although a substantial amount of variance in noticing was explained by the factors considered, there is still a high proportion of unexplained variance, especially regarding the perception facet. Future studies should therefore identify further influencing factors, such as motivational aspects (e.g., interest), the actual extent and perceived quality of specific learning opportunities, and teaching experience. In addition, including measures of cognitive abilities could help clarify whether the effect of the average grade on noticing skills can be traced back to cognitive abilities.

The absence of effects for many OTL scales in the present study might be due to operationalizing OTL as a list of topics within a domain. Future studies should develop and implement more specific questionnaires to explore the extent to which representations of practice (e.g., video clips) are used for supporting pre-service teachers’ skills in analyzing classroom situations. Furthermore, previous studies using multilevel modeling to explore the acquisition of professional knowledge during teacher education found that the influence of OTL was greater at the program level than at the individual level (e.g., König et al. 2017). In the present study, the influence of OTL might be underestimated, especially for interpretation, since for this facet, OTL appears to be a relevant predictor at the program level. Future studies should aim at acquiring samples that are appropriate for multilevel modeling.