Introduction

Training in children and youth sports aims at achieving multiple goals: the optimal improvement of learning and performance, the support of motivation, as well as the promotion of positive social and emotional growth. In order to reconcile these goals, it is important to examine effective coaching behaviours and thereby identify environments that are most healthy for children’s and youths’ social, emotional, athletic and cognitive development.

Coach–athlete interactions as the most dominant part of coaching behaviour can be measured with different methods—e.g. by questionnaires for coaches, athletes and parents or by external observations. With these measures, one can assess the instructional quality and simultaneously gain information about strengths and areas of growth of individual coaches. Observations can be highly selective and unreliable, as prominent experiments like the “Invisible Gorilla” show. However, the advantages of observations can be achieved with qualified observers who are specially trained to make objective, reliable and valid assessments. They usually have plenty of experience, having already observed multiple groups, and they are not personally related to the people involved in the teaching–learning scenario (Praetorius, Rogh, Bell, & Klieme, 2019). Helmke (2010) showed that only in some parts are learners’ judgements in good agreement with expert judgements of teaching quality. Moreover, Richartz & Sallen (2013) were able to demonstrate considerable differences between the assessments of pedagogical quality by expert observations and the assessments by children and their parents. This discrepancy is an important argument that addressees’ reports cannot be considered satisfactory indicators of the pedagogical coaching quality in themselves. Observation instruments may be superior to them in terms of differentiation and quality standards.

Although observational instruments have received increased attention in educational research in recent years (Praetorius & Charalambous, 2018), they are still sparsely used in sports coaching research. A review from 2017 identified 26 studies using systematic observation between 1997 and 2016 (Cope, Partington, & Harvey, 2017). The authors identified a range of instruments used in these studies including the Arizona State University Observation Instrument (ASUOI), the Coach Behaviour Recording Form (CBRF), and the Coach Analysis Intervention System (CAIS). However, in almost all cases, these instruments were not used in their original form, but were modified to address specific research questions. Cope and colleagues suggest that this could be explained by the mostly dated nature of these instruments. Only the computerised CAIS was developed more recently (Cushion, Harvey, Muir, & Nelson, 2012). Although this is supposed to be a more sophisticated systematic observation instrument, it only focusses on context-sensitive descriptions of coaching behaviour, answering the question “What is going on here?” (Cushion et al., 2012). Therefore, it fails to provide evidence regarding coaching effectiveness. Overall, in their review, Cope et al. (2017) recommend establishing one common language when defining coaching behaviour. This can also be taken as an argument for connecting coaching effectiveness research, in a construct-equivalent way, to the broad stream of educational research (Richartz & Kohake, 2021). To our knowledge, this task is still to be worked on.

Systematic observation instruments provide a selection of behaviour indicators that can be observed. It is therefore necessary to provide theoretical justifications and to critically examine the instrument’s reliability and validity. Observations seem particularly susceptible to rater error (Westergård, Ertesvåg, & Rafaelsen, 2019). Therefore, the more complex the observation system is, the greater the demands placed on rater training. Divergent assessments by external observers are particularly likely when observation instruments do not measure individual features on the surface of the behaviour (like low-inferent counting), but rather ascertain more complex constructs (high-inferent). As a result, it is important to investigate to what extent trained observers agree in their ratings.

In the current study, we used a well-researched observation instrument for teacher–student interactions, the Classroom Assessment Scoring System (CLASS; Pianta, La Paro, & Hamre, 2008). The present study investigates whether the CLASS can be transferred to a pedagogical context other than the classroom, namely sports training. For this purpose, trained and certified CLASS observers scored videotaped coaching sessions of gymnastics, rhythmic gymnastics, handball and judo. Based on these data, interrater reliability was assessed. Moreover, referring to the “Teaching Through Interactions Framework” (Hamre et al., 2013), factorial validity was assessed. The three-dimensional structure of the CLASS is in good agreement with German (physical) education research (Herrmann & Gerlach, 2020; Praetorius, Klieme, Herbert, & Pinger, 2018). To our knowledge, the CLASS has not yet been applied to the context of extracurricular sports training. Although the CLASS is an American observation instrument, the number of studies investigating its psychometric properties in European samples has increased in recent years (e.g. Finland, Norway, Great Britain, Germany). However, all of these studies examined classroom settings. Accordingly, this study aims at providing the first psychometric evidence for the use of the CLASS in sports training.

Pedagogical quality in sports training

Pedagogical quality always refers to the pedagogical—in contrast to the exercise-scientific—aspects of teaching behaviour. An assessment of pedagogical quality based on learning effectiveness needs clear definitions of learning objectives. In children’s sports training, these can be multidimensional: improvement of performance, motivational goals as well as personal, social and cognitive development. In the school context, in the sense of the “Angebot-Nutzungs-Modell” (supply–use model) (Helmke, 2010), teachers can make students’ learning success more likely by realizing high pedagogical quality. However, studies examining coaches’ effectiveness have great challenges to overcome since numerous variables (e.g. talent, training environment, group size, coach qualification) can influence athletes’ performance and motivation (Horn, 2008). Thus, it is difficult to detect effects clearly attributed to coaches’ behaviour (Richartz, Maier, & Kohake, 2021). It would require large longitudinal studies with large sample sizes, which are especially difficult to realize due to small coaching groups in the extracurricular sports setting in leisure as well as in elite sports. Accordingly, it seems reasonable to draw on the broad stream of classroom research.

The aforementioned goals of sports training also apply in a similar way to teaching and learning in schools (performance gains, social and emotional growth, motivation). Particularly in the case of quality characteristics that have proven to be effective over many different teaching/learning scenarios (different subjects, different age groups) (Hattie, 2009), one can assume that those are likewise effective in sports training. For this reason, the CLASS, an instrument for the assessment of interdisciplinary aspects of pedagogical teaching quality in classrooms, was transferred to the context of sports training. The transfer of this classroom observation instrument to teaching–learning scenarios in sports training seems to be justified not only by the aforementioned parallels. In a narrative review, Becker (2013) compiled quality characteristics of coaches’ behaviour from 148 empirical studies and condensed them into seven main qualities: positive, supportive, individualized, fair, appropriate, clear and consistent. There is also considerable overlap with the dimensions of other observation instruments, like the ASUOI, CBRF and CAIS mentioned earlier. Most of them contain dimensions that refer to instructions, questions, physical assistance, modelling, hustle, praise, humour, scolding, punishment, management and feedback. All of these markers can be recognized in the CLASS.

The Classroom Assessment Scoring System

The Classroom Assessment Scoring System (CLASS) is a teaching–subject-independent observation instrument to assess the process quality of teacher–student interactions. The developers of the CLASS provide slightly different specifications for different age groups (Infant, Toddler, Pre‑K, K–3, Upper Elementary and Secondary). However, across age groups, the CLASS follows a hierarchical structure with three overarching domains of teaching quality that are in accordance with the German-speaking educational research literature (Kunter et al., 2013): Emotional Support, Classroom Organisation and Instructional Support.

The Emotional Support domain emphasizes relationships between teachers and children and among children. Emotional warmth and feelings of security in a supportive social environment are central. The theoretical basis is grounded in attachment theory (Ainsworth, 2003 [1974]) and self-determination theory (Deci & Ryan, 1993).

The Classroom Organisation domain deals with how effective the teacher is in leading and directing the group towards learning goals. Time for students’ learning activities should be maximized and problem behaviour should be managed effectively. The theoretical foundation is built by research on self-regulation as well as classroom management (Evertson & Emmer, 2000; Kounin, 1976).

The Instructional Support domain includes the teachers’ support for students’ learning and language development. The focus lies on instructions, clarity, visualizations and feedback that promotes learning. The domain is based on research regarding children’s cognitive development (Donovan & Bransford, 2005), the concept of the zone of proximal development (Vygotsky, 1978) and its subsequent scaffolding-research (van de Pol, Volman, & Beishuizen, 2010).

The three domains are differentiated in ten dimensions, which consist of four to five indicators that are specified in behavioural descriptions of quality markers (Table 1). However, in the context of sports training, maintaining the substance of core concepts will only allow for eight dimensions to be transferred without major distortions while keeping the same name. “Concept Development” refers to an overarching cognitive learning concept from the Teaching for Understanding approach. It is fundamentally based on the difference between two cognitive levels of learning processes: rote learning and higher-order thinking (Darling-Hammond, & Richardson, 2009; Pianta et al., 2008, p. 5). Higher-order thinking aims at cognitive processes, e.g. understanding, building connecting concepts, transferring knowledge and critical thinking. These kinds of learning processes are targeted in sports training as well. However, in our view, these moments do not have a major or critical role in defining pedagogical quality when compared to the importance of motor learning and tactical learning. Modern concepts of motor learning and tactical learning even champion structures and sequences of learning processes which are hardly compatible with core ideas of higher-order thinking (Hossner & Künzell, 2022). In fact, in most cases motor learning and tactical learning do not seem to benefit much from or are maybe even complicated by teaching strategies founded in higher-order thinking concepts. Indicators and behavioural markers of the “Concept Development” dimension therefore seem only suitable for a minor part of teaching and learning in sports (see Richartz & Kohake, 2021 for further clarifications).

“Language Modelling” does not seem particularly relevant for an extracurricular sport setting in general. Therefore, in the following, only eight dimensions of the CLASS will be considered. The CLASS manual provides detailed examples for low, medium and high-quality levels for each indicator. The CLASS is a high-inference measurement aiming at in-depth evaluations of teaching–learning processes. This allows for a better assessment of situation and process-specific teaching behaviours (Lotz, Gabriel, & Lipowsky, 2013). The psychometric properties of the CLASS are well documented, e.g. in the Measures of Effective Teaching Study (Kane, McCaffrey, Miller, & Staiger, 2013). However, since the CLASS is a complex system with multiple levels and dimensions, it may be vulnerable to error between different raters. Therefore, interrater reliabilities need to be examined in depth.

Table 1 Structure and content of the Classroom Assessment Scoring System (CLASS) K–3

Current state of research

Various studies explicitly focus on examining interrater reliability within the CLASS. To assess interrater reliability two or more observers have to observe the same event. Some studies only included double coding for parts of their observations in order to control interrater reliability (e.g. Westergård et al., 2019). Others used at least two raters for all observation cycles (e.g. Pakarinen et al., 2010). Studies report evaluations based on live observations as well as videotaped observations.

As further psychometric evidence, structural validity of the CLASS has been assessed in different studies via confirmatory factor analyses (CFA). Some of the studies took the nested data (students in classrooms) into account by calculating multilevel analyses. Most of the studies compared a three-factor structure, typically referred to as the “Teaching Through Interactions Framework” (Hamre et al., 2013), with one overall “Teaching Quality” factor. Others also considered a two-factor solution (e.g. “Social Support” and “Instructional Support”) and included multiple modifications. The most prominent latent three-factor structure of the CLASS is expected to be invariant across age groups and contexts. Since the CLASS is a generic measure, which is often applied in the preschool age, most of the studies do not specify subjects or lesson contents. Besides structural validity and interrater reliability, most reporting of CLASS data includes means and standard deviations for teachers on dimension as well as domain level regarding several sequences of a lesson or school day.

In the following, results of studies using the CLASS regarding interrater reliability and structural validity are summarised. Hamre et al. (2013) brought together results from ten US studies in the preschool and elementary age utilizing the CLASS Pre‑K, K–3 and Upper Elementary. Even though absolute model fit was not excellent, results of CFA showed the best model fit for the three-factor solution compared to the one- or two-factor solution across multiple data sets. Interrater reliabilities were reported by means of weighted kappa (κ), ranging from 0.42 to 0.82, as well as Percent Within One (PWO), which was higher than 0.80 in five studies, higher than 0.70 in three studies and lower than 0.60 in two studies. Results from four additional US studies working with the CLASS-Secondary are reported by Hafen et al. (2015). In CFA, the three-factor structure displayed a better fit to the data than alternative one- or two-factor models. Interrater reliabilities measured by weighted κ (0.14–0.41), intraclass correlation (ICC) (0.15–0.43) and PWO (77.3–81.9) showed varying degrees of reliability in the different studies included. In recent years, the CLASS has increasingly been used in European studies as well. Pakarinen et al. (2010), for example, could confirm the “Teaching Through Interactions Framework” (three factors) for a Finnish sample. Following minor modifications (one dimension was excluded and the residuals of other dimensions were allowed to correlate), the three-factor model fit the data well and was superior to a one-factor model. With two exceptions, ICCs demonstrated high interrater reliabilities (above 0.80). Similar results regarding structural validity are reported for the CLASS-Secondary in Finland (Virtanen et al., 2018). With slight exceptions, they reported acceptable or good interrater reliabilities indicated by PWO and ICCs. These results have successfully been replicated in a Norwegian secondary sample with acceptable to high interrater reliabilities (PWO between 0.51 and 0.94; 20% of the ratings double coded) (Westergård et al., 2019). Moreover, results of CFA supported the three domains as described in the CLASS manual.

Two studies validated the CLASS in a German context, both conducted in the preschool age. Von Suchodoletz, Fäsche, Gunzenhauser, and Hamre (2014) carried out live observations, of which 15% were double coded. Thereby, they achieved interrater reliabilities of 86% within one. The assumed three-factor model did not fit the data well according to the CFA. However, its model fit indices were superior to those of a one- or two-factor model. After considering modification indices, satisfactory model fit was likewise reported for another German Pre‑K sample evaluating videotaped observations (Stuck, Kammermeyer, & Roux, 2016). Interrater reliabilities were categorized as good, ranging from 0.87 to 0.98 for PWO and from 0.61 to 0.90 for ICCs. As a result, one can assume that the CLASS can be transferred into a German context. However, to the authors’ knowledge, no previous study (neither nationally nor internationally) has examined the psychometric properties of the CLASS in extracurricular training or physical education.

Present study

The aim of the present study is to transfer the CLASS to a context other than classrooms and to examine its psychometric properties. This is to be achieved by testing whether the CLASS provides a useful framework to assess coach–athlete interactions in German extracurricular sports groups with acceptable validity and reliability.

  1.

    In order to examine whether the pedagogical quality in sports training is similar to the quality in classrooms, we calculate means and standard deviations for different types of sports on item and scale level and compare them to previous studies in the school context.

  2.

    To test the structural validity of the CLASS, confirmatory factor analyses are performed. In line with previous research in classrooms, we hypothesize a three-factor structure of the CLASS for the context of sports training: “Emotional Support”, “Organisational Support”, and “Instructional Support”. We assume that this model will show a better fit to the data than a G-factor model with one overall “Pedagogical Quality” factor. However, in contrast to previous research in school settings and the K–3 manual, the dimensions “Language Modelling” and “Concept Development” are excluded from the assessments because, as previously explained, they do not seem to apply as main quality criteria for sports training. Moreover, the “Instructional Learning Formats” dimension is expected to belong to the Instructional Support domain instead of the Classroom Organisation domain. This dimension comprises teaching strategies that aim at maximizing interest and engagement of students by clear learning targets, variety of modalities and materials and effective facilitating behavioural strategies. Based on findings of Kounin (1976), Yair (2000) and others, in the K–3 version of the CLASS this dimension is incorporated in the Classroom Organisation domain. A supporting argument here is that students in the early school age still have to learn self-regulation in the classroom and engaging teaching strategies aim at learning gains in content knowledge and in self-regulation. In the Upper-Elementary and the Secondary versions of the CLASS, however, this dimension is incorporated in the Instructional Support domain, while the leading concept as well as the indicators and behavioural markers are kept exactly the same. The main focus now shifts from supporting self-regulation to facilitating students’ engagement in instructional content. The teaching strategies subsumed in the Instructional Learning Formats dimension are essential to achieve two categories of goals: they keep students engaged in instructional content and therefore facilitate learning, and simultaneously prevent students from being bored and engaged in unwanted activities. The latter function falls under the umbrella concept of Classroom Organisation, while the former is without doubt a critical aspect of Instructional Support.

  3.

    To determine interrater reliabilities, multiple measures are computed, i.e. PWO, exact agreement, Cohen’s κ, weighted κ and ICCs. Dimension scores of the CLASS are expected to demonstrate acceptable interrater reliability in a sports setting, comparable to those reported from previous studies in the classroom (Pianta et al., 2008).

Methods

Sample

A total of 26 sports coaches from cities all over Germany participated in the study. The types of sports included gymnastics (N = 13), rhythmic gymnastics (N = 4), handball (N = 5) and judo (N = 4). The 14 female and 12 male coaches were between 20 and 59 years old (M = 38.0; SD = 12.7). Their coaching experience ranged from 1 to 45 years (M = 14.8; SD = 12.0) and they reported diverse levels of coaching licences (7 highest licence [A-Licence], 10 medium level [B-Licence], 7 low level [C-Licence] and 2 with no licence). Their coaching groups consisted of 6–22 children, most of whom were aged 8–12 years. The faculty’s ethics committee granted permission to conduct this study.

Training

To ensure initial reliability, all observers attended a CLASS certification training for the K–3 level led by a certified CLASS trainer. In this training, the different dimensions combined with their theoretical backgrounds were presented and their observation was practiced using video clips. Since this training followed the original CLASS certification procedure, it only contained video footage from US classrooms and was not specified for the sports context. After the workshop, the participants had to rate five 12–20 min video clips and achieve at least 80% within-one agreement with the master codes set by the developers of the CLASS. To ensure a successful transfer of the CLASS to the sports context, additional calibration sessions were conducted with video material from sports training. As a result of these sessions, further specifications of the indicators and behavioural markers of the CLASS for the sports context were written down as a supplement to the initial manual, e.g. what a nonverbal feedback cycle with movements as children’s responses could look like (more information can be obtained from the authors upon request). However, it is important to note that these specifications did not result in any modifications of the instrument.

Procedure

This study is part of a larger longitudinal research project. Coaches were asked for their written permission to participate in the study. Moreover, the athletes were informed and parental permission was obtained. Members of the research team recorded two or three training sessions of each participating coach over a period of approximately 6–12 months. Three 20-minute sequences were selected from each training session, covering different training phases (warm-up, technique/tactics training, game/routines); the only exception was one session that lasted only 40 min, from which two sequences were selected. Overall, 221 sequences were rated independently by two of three certified observers in 8 dimensions, resulting in a total of 3536 ratings. Observers were blinded for point of measurement (first, second or third recorded training session) since some coaches were part of an intervention between the points of measurement. Therefore, the sequences were rated in a random order after all recordings were completed.
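The rating counts above follow directly from the design. The following minimal sketch only reproduces that arithmetic; the function and constant names are hypothetical and not part of the study materials.

```python
# Counts taken from the text: 221 sequences, two raters each, eight dimensions.
N_SEQUENCES = 221
RATERS_PER_SEQUENCE = 2
N_DIMENSIONS = 8


def rating_counts(n_sequences, raters_per_sequence, n_dimensions):
    """Return (total single ratings, number of double-coded rating pairs)."""
    pairs = n_sequences * n_dimensions
    return pairs * raters_per_sequence, pairs


total_ratings, rating_pairs = rating_counts(
    N_SEQUENCES, RATERS_PER_SEQUENCE, N_DIMENSIONS
)
print(total_ratings, rating_pairs)  # 3536 1768
```

The second number, 1768, is the count of rating pairs (one per sequence and dimension) on which pairwise agreement statistics are based.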

Measures

CLASS scores were assigned on the dimension level on a 7-point scale (1–2 = low; 3–5 = moderate; 6–7 = high). The manual provides detailed examples of what teacher–student interactions falling in the low, moderate or high range look like. As mentioned earlier, the CLASS dimensions “Language Modelling” and “Concept Development” were excluded from the assessments due to sport-specific considerations. Therefore, eight dimensions were included in our analyses (Positive Climate, Negative Climate, Teacher Sensitivity, Regard for Student Perspectives, Behaviour Management, Productivity, Instructional Learning Formats and Quality of Feedback).

Data analysis

Analyses were conducted with SPSS 26 (IBM Corp, Armonk, NY, USA) as well as Mplus 8.1 (Muthén & Muthén, 2017). The sample did not have missing data that could have affected the results.

Descriptive statistics

Firstly, mean scores and standard deviations of the eight CLASS dimensions were examined and compared with those from other European and US K–3 classroom samples. Subsequently, mean scores of the two raters of each sequence in each dimension were calculated and coaches’ scores of different sports were contrasted. To avoid the absolute ratings being influenced by the intervention in which some of the coaches participated, only the ratings of the first measurement point were included in the mean scores (N = 78 sequences).
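The two-step averaging described above (first across the two raters of a sequence, then across sequences of a sport) can be sketched as follows. This is an illustration only: the records are invented, the record layout and function name are assumptions, and the study itself ran these analyses in SPSS.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, NOT study data:
# (sport, sequence id, dimension, rater A score, rater B score)
ratings = [
    ("judo", "s1", "Positive Climate", 5, 6),
    ("judo", "s2", "Positive Climate", 4, 4),
    ("handball", "s3", "Positive Climate", 6, 7),
    ("handball", "s4", "Positive Climate", 5, 5),
]


def mean_scores_by_sport(records):
    """Average the two raters per sequence, then average the
    sequence means per sport and dimension."""
    per_group = defaultdict(list)
    for sport, _sequence, dimension, score_a, score_b in records:
        per_group[(sport, dimension)].append(mean([score_a, score_b]))
    return {key: mean(values) for key, values in per_group.items()}


result = mean_scores_by_sport(ratings)
print(result[("judo", "Positive Climate")])      # 4.75
print(result[("handball", "Positive Climate")])  # 5.75
```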

Confirmatory factor analysis

To assess the structural validity of the CLASS in the sports setting, confirmatory factor analyses (CFA) were conducted (N = 221 sequences). We compared a three-factor model (Emotional Support, Classroom Organisation, Instructional Support) to a G-factor model (Pedagogical Quality). This approach has already been used in comparable studies using the CLASS instrument (e.g. Hamre et al., 2013).

We used maximum likelihood estimation, which is relatively robust for violations of normal distribution and data independence (Muthén & Muthén, 2017). According to recommendations by Hu and Bentler (1999), the overall model fit was evaluated by means of multiple goodness-of-fit indices, including the χ2 test. The p-value associated with the χ2 test is supposed to be nonsignificant. However, it is highly oversensitive to sample size (Schermelleh-Engel, Moosbrugger, & Müller, 2003). Therefore, we report alternative fit indices, including the standardized root mean square residual (SRMR), the comparative fit index (CFI), the Tucker–Lewis index (TLI), and the root mean square error of approximation (RMSEA). A model that fits the data well is indicated when values for the CFI as well as the TLI are greater than 0.95 (good fit) or 0.97 (excellent fit) and for the RMSEA as well as the SRMR are less than 0.10 (good fit) or 0.05 (excellent fit) (Marsh, 2007; Schermelleh-Engel et al., 2003).
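The cut-offs named above can be written down as a small lookup. The sketch below is only an illustration of these thresholds as stated in the text; the function names and the example values are hypothetical, and the models themselves were estimated in Mplus.

```python
def rate_index(value, good, excellent, higher_is_better):
    """Classify one fit index against a 'good' and an 'excellent' cut-off."""
    if higher_is_better:
        if value > excellent:
            return "excellent"
        if value > good:
            return "good"
    else:
        if value < excellent:
            return "excellent"
        if value < good:
            return "good"
    return "below cut-off"


def rate_model(cfi, tli, rmsea, srmr):
    """Apply the cut-offs from the text: CFI/TLI > 0.95 (good) or > 0.97
    (excellent); RMSEA/SRMR < 0.10 (good) or < 0.05 (excellent)."""
    return {
        "CFI": rate_index(cfi, 0.95, 0.97, higher_is_better=True),
        "TLI": rate_index(tli, 0.95, 0.97, higher_is_better=True),
        "RMSEA": rate_index(rmsea, 0.10, 0.05, higher_is_better=False),
        "SRMR": rate_index(srmr, 0.10, 0.05, higher_is_better=False),
    }


# Invented example values, not results from this study.
example = rate_model(cfi=0.98, tli=0.96, rmsea=0.04, srmr=0.06)
print(example)
```

Strict inequalities are used here because the text speaks of values "greater than" and "less than" the cut-offs; values that sit exactly on a threshold are therefore assigned to the lower category.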

Interrater reliability

Two or more raters are assumed to code reliably when they produce parallel results observing the same event (Suen, 1988). In accordance with the technical appendix of the original CLASS manual as well as other previously published research (e.g. Hamre et al., 2013), Percent Within One Agreement (PWO) was calculated as a measure of interrater reliability. For PWO, scores are considered to be in agreement if they fall within ± one point of each other. PWO values greater than 0.80 are considered good (Pianta et al., 2008). However, PWO is a fairly broad indicator of interrater agreement, which is why percentages of exact agreement were additionally calculated in order to assess the precise agreement between observers. Since PWO and exact agreement do not take chance agreement into account, reliability might be overestimated. Therefore, more conservative estimates seemed necessary: Cohen’s κ, weighted κ and intraclass correlations (ICC). Cohen’s κ assesses the proportion of agreement after adjusting for the expected percentage of agreement that may occur through random chance (Fleiss & Cohen, 1973). Weighted κ is a further modification of Cohen’s κ and allows the differential weighting of nonagreement scores (Cohen, 1968; Fleiss & Cohen, 1973; Wirtz & Caspar, 2002). This seems especially interesting for ordinal and metric scales as given in the CLASS. The differences have a greater effect the further apart the judgements of the observers are. Both Cohen’s κ and weighted κ range between −1 and 1, with values of 0.40–0.60 displaying acceptable rater agreement, 0.60–0.75 demonstrating good rater agreement, and > 0.75 presenting excellent agreement (Fleiss & Cohen, 1973; Wirtz & Caspar, 2002). Furthermore, ICCs were calculated to determine the strength of the relationships between ratings. The two-way mixed effect model with unadjusted estimates was chosen (absolute agreement). Values above 0.70 are deemed acceptable and values above 0.80 are considered good (Wirtz & Caspar, 2002).
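To make the measures described above concrete, the following sketch computes them for two raters on a 7-point scale. It is an illustration under stated assumptions, not the authors' SPSS procedure: all function names are hypothetical, the example ratings are invented, the weighting scheme for weighted κ (linear or quadratic) is not specified in the text and is therefore left as a parameter, and the ICC is assumed to be the single-measure, absolute-agreement form.

```python
from collections import Counter


def percent_within_one(r1, r2):
    """Share of rating pairs that differ by at most one scale point."""
    return sum(abs(a - b) <= 1 for a, b in zip(r1, r2)) / len(r1)


def exact_agreement(r1, r2):
    """Share of rating pairs that are identical."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def weighted_kappa(r1, r2, categories=range(1, 8), power=0):
    """Kappa with disagreement weights |i - j| ** power.

    power=0: unweighted Cohen's kappa (every disagreement counts 1);
    power=1: linear weights; power=2: quadratic weights.
    """
    n = len(r1)
    cats = list(categories)
    observed = Counter(zip(r1, r2))
    marg1, marg2 = Counter(r1), Counter(r2)

    def weight(i, j):
        return 0.0 if i == j else float(abs(i - j) ** power)

    d_obs = sum(weight(i, j) * observed[(i, j)] / n for i in cats for j in cats)
    d_exp = sum(
        weight(i, j) * marg1[i] * marg2[j] / n**2 for i in cats for j in cats
    )
    if d_exp == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - d_obs / d_exp


def icc_absolute_single(r1, r2):
    """Two-way, absolute-agreement, single-measure ICC for two raters."""
    n, k = len(r1), 2
    grand = (sum(r1) + sum(r2)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(r1, r2)]
    col_means = [sum(r1) / n, sum(r2) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for x in list(r1) + list(r2))
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-sequence mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k / n * (msc - mse))


# Invented ratings for eight sequences, NOT study data.
rater_a = [5, 6, 4, 7, 3, 5, 6, 2]
rater_b = [5, 5, 4, 6, 5, 5, 7, 2]

print(percent_within_one(rater_a, rater_b))  # 0.875
print(exact_agreement(rater_a, rater_b))     # 0.5
print(round(weighted_kappa(rater_a, rater_b, power=0), 3))  # 0.373
print(round(weighted_kappa(rater_a, rater_b, power=1), 3))  # 0.615
print(round(icc_absolute_single(rater_a, rater_b), 2))      # 0.82
```

In this toy example, the ordering of the coefficients mirrors the reasoning in the text: PWO is the most lenient measure, exact agreement and unweighted κ are the strictest, and weighted κ lies in between because near misses are penalised less than large discrepancies.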

Results

Descriptive statistics

The mean scores and standard deviations of the present sample in the context of sports and comparison samples from K–3 classrooms are shown in Table 2. High scores are considered positive with the exception of the dimension Negative Climate. The scores for the dimension Negative Climate were therefore recoded. German sports mean scores range from 3.52 (Quality of Feedback) to 6.76 (Negative Climate, revised). The mean dimension scores of the Emotional Support domain are in the mid- to high-range (4–7); those for the Classroom Organisation dimensions are mostly higher, but likewise in the mid- to high-range (5–7). Instructional Support is in the mid-range (3–5). All variables except Negative Climate are considered normally distributed, with kurtosis and skewness values within ±1.20. Moreover, the wide variation of scores between cycles within the dimensions supports a reasonable differentiation (Table 2).
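The recoding of Negative Climate and the verbal score ranges of the 7-point scale can be illustrated as follows. This is a sketch under an assumption: the text does not state the recoding rule, so the usual reversal on a 1–7 scale (8 minus the score) is assumed here, and both function names are hypothetical.

```python
def reverse_code(score, low=1, high=7):
    """Reverse a score on a low..high scale (assumed rule: low + high - score)."""
    if not low <= score <= high:
        raise ValueError("score outside the scale range")
    return low + high - score


def quality_range(score):
    """Map an integer CLASS score to the ranges given in the manual:
    1-2 = low, 3-5 = moderate, 6-7 = high."""
    if score not in range(1, 8):
        raise ValueError("CLASS scores are integers from 1 to 7")
    if score <= 2:
        return "low"
    if score <= 5:
        return "moderate"
    return "high"


# A raw Negative Climate score of 1 (hardly any negativity) becomes 7.
print(reverse_code(1), quality_range(reverse_code(1)))  # 7 high
```

After such a reversal, high values are desirable in all eight dimensions, which makes the dimension means directly comparable.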

Table 2 Mean scores and standard deviations for different types of sports

Comparisons with other studies in school contexts show similar patterns with the lowest scores in the dimension Quality of Feedback and mid- to high-range scores in the dimensions of the Emotional Support (Fig. 1). All dimensions tend to score higher in the sports training groups, especially in Teacher Sensitivity, Behaviour Management and Productivity.

Fig. 1 Dimension mean scores from different samples

Structural validity—confirmatory factor analysis

With regard to the factor structure of the eight CLASS dimensions in a sports context, the G‑factor model (Model 1) with one overall “Pedagogical Quality” factor shows a poor model fit (χ2 (20) = 160.81, p < 0.01; RMSEA < 0.01; CFI = 0.77; TLI = 0.67; SRMR = 0.10). The model fit improves markedly when three factors are specified (Model 2: χ2 (17) = 55.12, p < 0.01; RMSEA < 0.01; CFI = 0.94; TLI = 0.90; SRMR = 0.06). Modification indices indicate that the fit of the model would further improve if the residual of the Positive Climate item (i.e. the first item loading on the Emotional Support factor) were allowed to correlate with the residual of the Instructional Learning Formats item (i.e. the first item loading on the Instructional Support factor). Because both dimensions share the common aspect of teacher enthusiasm as part of a joyful learning interaction, we allow their residuals to correlate. The fit indices of the resulting model show almost excellent model fit (Model 3: χ2 (16) = 34.89, p < 0.01; RMSEA = 0.12; CFI = 0.97; TLI = 0.95; SRMR = 0.05). All standardized item-factor loadings are significant (p < 0.01) and, with one exception (0.33), at least moderate in size, ranging from 0.56 to 0.95.

Interrater reliability

Results on the percentage of agreement are presented first, starting with the PWO followed by the percentage of exact agreement. Next, the results on Cohen’s κ and weighted κ are laid out. Lastly, the ICCs are displayed.

Percent Within One and exact agreement

The Percent Within One interrater agreement is calculated separately for all types of sports as well as observation cycles (Tables 3 and 4). The mean PWO on the dimensional level across cycles and types of sports ranges between 88% (Regard for Student Perspectives) and 98% (Negative Climate). Therefore, interrater agreement on dimensional level can be considered very good to excellent for all subgroups. Regarding each sequence individually, PWO ranges from 63 to 100%. Only 23 sequences show less than 80% agreement, which equals 10% of the sequences. In only 5 out of 1768 occasions (0.3%), the observers’ ratings differ by more than 2 points from each other, but never by more than 3. The three observer teams show similarly good agreement, with PWO ranging between 90 and 95%.

The percentage of exact agreement ranges from 38% (Regard for Student Perspectives and Quality of Feedback) to 78% (Negative Climate) regarding the eight dimensions. Results for all other dimensions are reported in Table 3. Mean scores for each of the three observer teams reveal percentages of exact agreement between 23 and 42%.
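
Both agreement percentages follow directly from the absolute differences between the two observers' scores. The following is a minimal sketch of that calculation, not the authors' analysis script; the function name `agreement_rates` and the example ratings are hypothetical and are not data from the study.

```python
from typing import Sequence


def agreement_rates(rater_a: Sequence[int], rater_b: Sequence[int]) -> tuple[float, float]:
    """Return (exact agreement, Percent Within One) as proportions in [0, 1]."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty rating lists")
    # Absolute score difference for every double-coded rating.
    diffs = [abs(a - b) for a, b in zip(rater_a, rater_b)]
    exact = sum(d == 0 for d in diffs) / len(diffs)
    within_one = sum(d <= 1 for d in diffs) / len(diffs)
    return exact, within_one


# Hypothetical ratings on the 7-point CLASS scale (illustration only).
observer_1 = [5, 6, 4, 3, 7, 5, 2, 6]
observer_2 = [5, 5, 4, 5, 6, 5, 3, 6]
exact, pwo = agreement_rates(observer_1, observer_2)
print(f"exact agreement: {exact:.1%}, PWO: {pwo:.1%}")
```

Because every exact match also counts as a within-one match, PWO can never be lower than the percentage of exact agreement.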

Table 3 Mean Percent Within One interrater agreement for types of sport
Table 4 Percent Within One interrater agreement for all segments

Cohen’s κ and weighted κ

Values for Cohen’s κ as well as weighted κ for all dimensions are reported in Table 5. For the stricter measure, Cohen’s κ, all values are below 0.40 and therefore fall short of the acceptable range. Weighted κ scores are noticeably higher: Positive Climate shows excellent interrater agreement; Productivity, Behaviour Management and Quality of Feedback show good interrater agreement; and Negative Climate and Teacher Sensitivity display acceptable interrater agreement. Only the scores for Regard for Student Perspectives and Instructional Learning Formats slightly miss the acceptable range. Overall, weighted κ coefficients vary widely, ranging from 0.36 to 0.77.
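
To illustrate how the two coefficients differ, the sketch below computes κ as one minus the ratio of observed to chance-expected disagreement. With a 0/1 disagreement function this is the unweighted Cohen's κ; with a graded function it is a weighted κ. The linear weights used here are an assumption chosen for illustration, and the function name `cohen_kappa` and the ratings are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def cohen_kappa(
    rater_a: Sequence[int],
    rater_b: Sequence[int],
    categories: Sequence[int],
    disagreement: Callable[[int, int], float] = lambda i, j: float(i != j),
) -> float:
    """Kappa = 1 - observed disagreement / chance-expected disagreement."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need two equally long, non-empty rating lists")
    observed = sum(disagreement(a, b) for a, b in zip(rater_a, rater_b)) / n
    # Chance expectation from the product of both raters' marginal counts.
    marg_a, marg_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j]
        for i in categories
        for j in categories
    ) / n**2
    if expected == 0:
        raise ValueError("kappa is undefined when no disagreement is expected by chance")
    return 1.0 - observed / expected


# Hypothetical ratings on the 7-point CLASS scale (illustration only).
scale = range(1, 8)
observer_1 = [5, 6, 4, 3, 7, 5, 2, 6]
observer_2 = [5, 5, 4, 5, 6, 5, 3, 6]

kappa = cohen_kappa(observer_1, observer_2, scale)
# Assumed linear weights: a two-point gap counts twice as much as a one-point gap.
kappa_linear = cohen_kappa(observer_1, observer_2, scale, lambda i, j: abs(i - j))
print(f"kappa: {kappa:.2f}, linear-weighted kappa: {kappa_linear:.2f}")
```

On these made-up ratings the weighted coefficient is higher than the unweighted one, because most disagreements are only one scale point apart and are therefore penalised less.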

Intraclass correlations

Intraclass correlations (ICC) are shown in Table 5. ICCs range from 0.54–0.87. All scores are statistically significant (p < 0.01). Except for the scores of Instructional Learning Formats and Regard for Student Perspectives, all scores are above 0.70 and therefore deemed acceptable (Negative Climate, Teacher Sensitivity, Productivity) or good (Positive Climate, Behaviour Management, Quality of Feedback).
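
The text does not state which ICC form was used. As a hedged illustration only, the sketch below computes ICC(2,1), the two-way random-effects, absolute-agreement, single-rater form, from the mean squares of a sequences-by-raters table. That choice of form, the function name `icc_2_1` and the ratings are assumptions, not details taken from the study.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) for a complete table: one row per sequence, one column per rater."""
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with at least 2 rows and 2 raters")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares: sequences (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical ratings: each row is one video sequence scored by two observers.
table = [[5, 5], [6, 5], [4, 4], [3, 5], [7, 6], [5, 5], [2, 3], [6, 6]]
print(f"ICC(2,1): {icc_2_1(table):.2f}")
```

Unlike the consistency-type forms, this absolute-agreement form also penalises systematic differences in the level of the two observers' ratings.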

Table 5 Interrater agreement for Classroom Assessment Scoring System (CLASS) dimensions across observations

Discussion

The aim of the present study was to examine psychometric properties of the Classroom Assessment Scoring System (CLASS) in sports settings. To our knowledge, this was the first study to use the CLASS as an observation tool not for teacher–student but for coach–athlete interactions. For this purpose, we video recorded 26 sports coaches in two to three training sessions and blindly double-coded all of the 221 sequences.

Observational data from our study show moderate to high average scores for all dimensions. However, considerable variation between coaches was noticeable. We can therefore conclude that the CLASS provides a useful framework to illustrate differences between coaches. Minimum and maximum scores for coaches across three cycles showed that a wide range of variance can be well represented with the CLASS. From a content point of view, little variance is of course desirable in the dimension Negative Climate, but from a measurement theory point of view the instrument should be sufficiently sensitive.

Comparisons of descriptive statistics showed a similar distribution of ratings across the eight dimensions for extracurricular sports as for studies in school contexts. Accordingly, coaches scored mainly in the mid to high range in the domains of Emotional Support and Classroom Organisation. These results are gratifying. In accordance with the Measures of Effective Teaching Study, coaches, like teachers, scored the highest in the Behaviour Management and Productivity dimensions (Kane & Staiger, 2012). Moreover, the majority of coaches as well as teachers did not create a Negative Climate in their learning groups, as shown by low CLASS scores in this dimension. One could assume that this dimension can best be manipulated by the coaches in the short term. At the same time, what counts as socially desirable is probably least controversial in this area. Coaches may therefore have tried to adjust their behaviour accordingly for the period of the video recordings.

In the dimension Quality of Feedback, coaches as well as teachers showed weaker results (Kane & Staiger, 2012; Sandilos, Shervey, DiPerna, Lei, & Cheng, 2017). This is the most complex and probably the most demanding dimension, which is why this result is not surprising. Instructional quality, especially with regard to feedback, is therefore the area where the greatest potential for improvement can be found among the broad range of coaches.

It is also striking that the dimension Regard for Student Perspectives seems to be particularly challenging for coaches as well as for teachers: on average, coaches in the present study achieved scores in the mid-range, comparable to those of teachers in previous studies (Kane & Staiger, 2012).

Comparing different types of sports, no major differences can be observed in the Emotional Support dimensions. However, coaches of the aesthetic sports (gymnastics and rhythmic gymnastics) show better results in the Classroom Organisation as well as the Instructional Support dimensions. One possible reason for this can be found in the different group sizes. While gymnastics and rhythmic gymnastics usually operate in groups of about 5–10 children and youths, groups of up to 20 children are common in handball and judo. Moreover, the coaching groups we observed differed in their focus on performance. The gymnastics groups were more performance-oriented, while the handball and judo groups focused on nonelite sports. This could have led to a stronger focus on feedback and instructions in the gymnastics groups. CLASS scores for different sports levels need further investigation.

Results of confirmatory factor analysis confirmed a three-factor structure and therefore the “Teaching Through Interactions Framework” (Hamre et al., 2013). Quantitative studies in the field of physical education also support the three-factor structure (Herrmann & Gerlach, 2020). In comparison, a unidimensional G‑factor model with one “Pedagogical Quality” factor showed a poor model fit. Other authors have additionally tested a two-factor structure. However, its theoretical rationale was unsatisfactory, which led to the model being discarded (Stuck et al., 2016). Our results are in line with those of a recent meta-analysis by Li, Liu, and Hunter (2020). Including 26 correlation matrices across age groups, the authors confirm the three-factor structure initially outlined by the CLASS developers. Consequently, it would be inappropriate to represent pedagogical coaching quality in a single value. Instead, an assessment at dimension or domain level seems adequate. The confirmation of the three-factor structure makes it possible to maintain international comparability when using the CLASS. Therefore, CLASS data generated in the context of sports training can be related to data from other teaching–learning contexts, providing important links for the discussion of general pedagogical quality issues. This way, we can also connect the results to more elaborate longitudinal studies funded in other pedagogical fields.

Observer reliability was assessed using several measures and showed overall good interrater reliability. Not surprisingly, PWO had the highest scores, as it was the least stringent measure. On average, the targeted value of at least 80% within-one agreement was achieved for all eight dimensions. The total mean PWO across dimensions is 93% and thus higher than the values reported by Hafen et al. (2015) for several studies (77–82%) and as high as values reported by Stuck et al. (2016). Since PWO is a very broad indicator of rater agreement, we calculated additional indicators.

In line with expectations, percentages of exact agreement are considerably lower. However, with values from 38–78% they still exceed those given in the CLASS manual (Pianta et al., 2008). When interpreting these results, one should bear in mind that the high values in the dimension Negative Climate are also influenced by the low variance in this dimension. However, more divergent dimensions still achieve the targeted reliability level.

Cohen’s κ points to weaknesses in the dimensions Teacher Sensitivity, Regard for Student Perspectives, Instructional Learning Formats and Quality of Feedback (< 0.30). More meaningful, however, seem to be the values for weighted κ, which take the size of the disagreement between scores into account. Since the CLASS aims at achieving agreement with a maximum of one-point deviation, it makes sense to weight only deviations above this as unreliable. Regarding weighted κ, only Instructional Learning Formats and Regard for Student Perspectives slightly miss the acceptable range. Future rater trainings and calibration meetings should pay particular attention to these dimensions. It is possible that these dimensions require a special sport-specific differentiation and explanation, which is not sufficiently provided by the CLASS manual alone.

ICCs support this finding. Except for the dimensions Instructional Learning Formats and Regard for Student Perspectives, all scores are above 0.70 and therefore acceptable to good. Stuck et al. (2016) report similar ICC values of 0.61–0.90, while Hafen et al. (2015) report substantially lower average results across four different studies (0.15–0.42).

Our results show the necessity of presenting different reliability indicators, since the widespread PWO is a very broad indicator. Only more complex measures allow a deeper insight into areas in need of further development.

Strengths and limitations

In this study, we included coaches of gymnastics, rhythmic gymnastics, handball and judo. The sample therefore covers different types of sports, such as individual and team sports, as well as elite and nonelite training groups. It is still questionable whether the results are representative of all types of sports. Further studies with additional types of sports seem necessary. Moreover, comparisons between elite and nonelite groups should be considered. However, we present first reference values that can provide a basis for further between-sports comparisons.

Moreover, our sample was a convenience sample and could therefore be selective: coaches volunteered to participate in the project knowing that their pedagogical coaching quality would be evaluated. This could have caused distortions in the absolute values, since we may have observed those coaches who are already particularly interested in the topic of pedagogical quality.

In this study, we did not test the construct validity of the CLASS in a sports context by evaluating its convergent validity with other validated measures. Comparisons of CLASS scores with data from other instruments could further support the suitability of the CLASS.

It is a strength of our study that all sequences were double-coded. Other studies have only double-coded a random percentage of sequences, e.g. Stuck et al. (2016) double-coded only 32% of their sequences. This is probably because coding is very time-consuming. For this elaborate evaluation procedure, the present sample size is very large. Although the number of coaches observed is low, the high number of video sequences per coach means that the total number of ratings is exceptionally high (N > 3500 scores).

Implications and further research

For future research, it will be interesting to investigate the stability of coaching behaviour within one training session as well as across several training sessions. Similar studies in the school context are already available (Praetorius, Pauli, Reusser, Rakoczy, & Klieme, 2014). Evaluations of coaching stability are in preparation. To date, there is little evidence on comparisons between live ratings and video ratings (Praetorius et al., 2019). Future research projects could also take this distinction into account.

Referring to the five stages for establishing valid systematic observation systems for sports coaching by Brewer & Jones (2002), the present study covered the observer training (1), the amendment of an existing instrument (2), the face validity of the instrument (3) as well as the interobserver reliability (4). To complete the five steps, intraobserver reliability (5) could be additionally tested in the future.

The information obtained through the CLASS could be used to provide constructive feedback to coaches. The ultimate goal of measurement work in the field of education as well as sports coaching lies in promoting positive changes in teaching/coaching practice (Hafen et al., 2015). In the future, this will also include the investigation of predictive validity in the context of sports coaching. Many studies have examined the association between CLASS scores and children’s developmental outcomes in the school context (Downer, Sabol, & Hamre, 2010). The present study opened the possibility for similar examinations in the sports context.