Theoretical framework

Based on the paradigm of reflective practice (Schön, 1983), teachers’ self-reflection on teaching is considered an important prerequisite for professional development (Bengtsson, 2003; Ross & Bruce, 2007). Within the paradigm, reflection-on-practice focuses on how practitioners can change their methods or how they should move in new directions (Schön, 1983) and, with respect to teachers, this means becoming more-reasoned actors by questioning routines (Cruickshank, 1987) as well as by confronting personal assumptions and values as a consequence of experiencing perturbations in practice (Smyth, 1992). Consequently, reflective practice involves obtaining evidence about one’s impact on how students benefit from teaching and how this impact could be improved (Benade, 2015; Hattie, 2009). When teachers rely on their own perceptions of instructional quality, different mechanisms of bias can lead to miscalibrated perceptions and, consequently, to a learning environment that does not fit students’ specific needs. Reflective practice therefore involves comparing self-perceptions with external data (e.g. students’ perceptions). For characteristics of instructional quality, such comparisons are possible by obtaining both self and other perspectives with the help of standardized feedback questionnaires.

Previous research shows that teachers’ self-perceptions and others’ (students’ or external observers’) perceptions differ considerably from one another, with most studies reporting only small to moderate correlations between measurements of self and others’ perspectives (Clausen, 2002; Fauth et al., 2014; Kunter & Baumert, 2006; Maulana et al., 2012; Wagner et al., 2016; Wisniewski et al., 2020). This can be accounted for by perspective-specific validities (Fauth et al., 2014; Wettstein et al., 2016), meaning that different points of view “tap different aspects of the classroom environment, rather than the same underlying construct” (Kunter & Baumert, 2006, p. 234). However, for some measures, the construct structure of instructional quality based on teachers’ and students’ perceptions is similar (Kunter et al., 2008; Wisniewski et al., 2020), making a comparison of the two perspectives possible.

In general, as discussed by Clausen (2002), the correlation between teachers’ and students’ perceptions of instructional quality is lower for high-inference characteristics that are partly influenced by students’ preconditions (e.g. motivation, interest or prior knowledge). Additionally, correlations are lower when characteristics are more difficult to observe. The three basic dimensions of instructional quality are based on items with relatively low inference; on the other hand, they are not directly observable but result from classroom interaction (Wisniewski et al., 2020). This means that teachers and students both judge the instruction provided, but divergence arises from subjective assessments of how effective this instruction is.

When reflective practice is considered as the questioning of one’s own assumptions about teaching, comparisons of perceptions can be used to adapt teaching to students’ needs as well as to identify miscalibrations of one’s own perceptions. The measurement of instructional quality characteristics from different perspectives allows such comparisons based on relevant criteria. Comparing perceptions on the basis of questionnaire data is considered a common way to identify blind spots in one’s own perception and to adapt teaching to students’ needs (Helmke & Lenske, 2013). Such comparisons help to question one’s own assumptions and prevent reflection based merely on validating and perpetuating one’s own view (Larrivee, 2006). Therefore, Hattie (2009) suggests that critical self-reflection as proposed by Schön (1983) needs to be enriched by evidence in the form of external data (i.e., feedback). Antoniou and Kyriakides (2011) state that critical reflective practice by teachers must utilize a combination of feedback from others, educational research, and their own perceptions. Instead of criteria that are only subjectively perceived as relevant, different models of instructional quality provide characteristics that correlate with high achievement gains (Klieme et al., 2006; Slavin, 1994; Wisniewski et al., 2020).

Estimating the degree of disagreement between teachers and students is also important for understanding how teachers can benefit from feedback. Accordingly, studies of differences between student and teacher ratings of instructional quality can not only advance the current state of research, but also benefit educational practice. Existing research shows that inaccurate self-perceptions are relatively resistant to feedback (Hacker et al., 2000; Helzer & Dunning, 2012; Simons, 2013) and that low performers have difficulties in calibrating their self-perceptions based on feedback (Brett & Atwater, 2001; Kruger & Dunning, 1999). These individuals do improve the accuracy of their self-assessment through feedback when it is continuous, concrete, and specific (Miller & Geraci, 2011), which is consistent with more-general findings showing that feedback is ineffective when it focuses on very general observations (Braga et al., 2014; Kornell & Hausman, 2016), or when it is related to personality characteristics instead of actual behavior (Hattie & Timperley, 2007; Kluger & DeNisi, 1996). This means that people who over-estimate their performance need different feedback from those who assess their own performance realistically or tend to under-estimate themselves.

In this article, after a brief outline of the state of research on differences between students’ and teachers’ perceptions of instructional quality, we report how teachers’ self-perceptions and students’ perceptions are related, how perception differences vary between teachers who receive favorable and less-favorable student assessments, and how these differences are moderated by person and context variables.

Assessing instructional quality

A common method for assessing characteristics of instructional quality is the use of questionnaires comprising items that are associated with high learning outcomes (Kunter & Voss, 2013). These can usually be answered from different outside perspectives (observers, students) and from a teacher self-perspective (Helmke & Lenske, 2013). Although students provide the least-expensive way to obtain formative evaluation feedback, it is disputed whether and to what extent students’ knowledge and experience enable them to provide reliable and valid information on these characteristics (Lamb, 2017). Primary students have been shown to distinguish teaching quality and popularity insufficiently (Fauth et al., 2014) and it has been shown that, even for college students, discriminant validity can be compromised by trivial interventions such as giving students chocolate before they evaluate the teaching quality (Hessler et al., 2018; Youmans & Jee, 2007).

Notwithstanding, many studies have shown that, when questionnaires based on sound theory are applied, students’ perceptions of instructional quality exhibit a high degree of discriminant validity (Balch, 2012; Ferguson, 2012; Gaertner, 2014; Kane & Staiger, 2012; van Petegem et al., 2008; Wagner et al., 2013; Wisniewski et al., 2020), and relevant outcomes of teaching (e.g., student achievement) are predicted more accurately by students’ perceptions than by teachers’ self-perceptions (De Jong & Westerhof, 2001; Fraser, 1991; Clausen, 2002; Kunter & Baumert, 2006; Pham et al., 2012; Seidel & Shavelson, 2007). Additionally, as aggregated data, students’ perceptions are more reliable than teachers’ self-perceptions (Kyriakides et al., 2014). Taking these findings into account, comparing teachers’ self-perceptions with students’ perceptions of instructional quality is still a comparison of subjective data with other subjective data and does not allow a decision to be made about which perspective is more accurate. However, students’ perceptions can be considered an important source of information for identifying teachers’ miscalibrated self-assessments that can adversely affect the effectiveness of teaching.

A comparison can be based on different quality criteria. There are several frameworks of instructional quality (see Wisniewski et al., 2020 for an overview). One of the most prominent models, the so-called three basic dimensions, includes the generic factors of classroom management, cognitive activation, and student support (Klieme et al., 2006; Praetorius et al., 2018). Classroom management includes rules and procedures, measures of coping with disruption, and smooth transitions, which directly influence time on task (Seidel & Shavelson, 2007). Cognitive activation involves challenging tasks, connecting newly-introduced concepts to prior knowledge, and encouraging students to engage in elaborate thinking and classroom discussion (Lipowsky et al., 2009), stimulating deep forms of thinking and conceptual understanding during learning (Klieme et al., 2006). Supportive climate refers to aspects of the teacher–student relationship, including caring behavior, a productive way of dealing with errors, and constructive feedback (Klieme et al., 2006). These three dimensions have been shown to predict students’ cognitive, motivational and affective learning outcomes (Praetorius et al., 2018) and can be assessed independently of students’ age, school subject or school type (Wisniewski et al., 2020). With its very strong theoretical foundation and multiple verifications by confirmatory factor analyses (Fauth et al., 2014; Kunter & Voss, 2013), the model offers a parsimonious structure for operationalizing the construct of instructional quality. Previous research has shown mixed results regarding the comparability of the three basic dimensions across perspectives: Kunter and Baumert (2006) found different factor structures for the two perspectives and concluded that they are indeed different constructs.
They also reported that only one of the three factors, classroom management, was comparable between teachers and students and that there was significant agreement between the two groups when this factor was assessed. Other research has demonstrated that the assessment of the three basic dimensions is indeed invariant across teachers and students (Kunter et al., 2008; Wisniewski et al., 2020). Measurement invariance across the two perspectives is a basic prerequisite for making claims about the agreement or divergence of teachers and students when assessing instructional quality.

In discussing the differences between generic aspects and subject-specific aspects of instructional quality, Lipowsky et al. (2018) show that the basic dimensions of instructional quality can be supplemented by subject-specific instructional quality, whereby the generic and subject-specific factors are largely independent of each other.

Considerable divergences between student and teacher perceptions have also been found with differently-designed questionnaires for surveying the learning environment (Fraser, 2007). Findings from research on teacher–student interpersonal relationships and behavior revealed that a large proportion of teachers tend to overestimate, compared with their students’ perceptions, aspects of their behavior that are positively related to students’ motivation and achievement. Additionally, a tendency among teachers to underestimate aspects of teaching that are perceived as negative is widespread (Den Brok et al., 2006; Maulana et al., 2012).

Explanations for the divergence of perceptions

Personality and social psychology offer several explanations for self–other perception differences across different tasks. While self-perceptions are characterized by privileged access to thoughts and are not dependent on interpretation of behavior, the perceptions of others rely on indirect behavioral indicators, drawing inferences, and interpreting behavior (Fauth et al., 2020). On the other hand, people’s difficulty in detecting their own stable (positive and negative) behavioral tendencies stems from a lack of awareness, similar to the phenomenon that fish are said to find it difficult to detect water (Kolar et al., 1996; Leising et al., 2006). Self-perceptions are less associated with actual behavior than the perceptions of others are (Kolar et al., 1996). Summarizing differences between self and other ratings, the SOKA model (self–other knowledge asymmetry) by Vazire (2010) assumes that there are differences in the information that is available to a rater, as well as differences in the processing of that information. Regarding the latter, the model accounts for the degree of ego involvement, which differs between self and other ratings. In turn, this can lead to miscalibrated self-perceptions: “Judges have a lot more at stake when they are also the target than when they are judging someone else” (Vazire, 2010, p. 284).

Another reason for self-perceptions differing from others’ perceptions is a specific cognitive bias phenomenon described by Kruger and Dunning (1999). Less-skilled people usually overestimate their performance because they are less able to reflect accurately on what they do, whereas highly-skilled performers underestimate their skills because they overestimate other people’s skills (and therefore underestimate their own relative competence) and out of modesty (Dunning et al., 2003). Although there is a low to moderate correlation (r = 0.39) between self-assessment and actual ability (e.g. with respect to examination taking), people who perform particularly poorly are unaware of their incompetence, while those who perform particularly well tend to underestimate themselves and are not fully aware of their good performance compared with peer group members (Kruger & Dunning, 1999). This effect, often called the Dunning–Kruger effect, was originally attributed to metacognition: the same skills that are necessary to solve a cognitive task are assumed to be necessary to recognize whether the processing of that task is successful. Because people who are less competent in a task cannot produce a correct result, they also cannot recognize a correct result, which leads them to overestimate their own abilities. When Kruger and Dunning (1999) split subjects into quartiles based on their ability in different tasks, those in the bottom quartile overestimated their performance most strongly, while those in the top quartile slightly underestimated their performance. Various alternative explanations of Kruger and Dunning’s (1999) findings have since been presented. Krueger and Mueller (2002) argue that the observable effect is a distortion arising from regression to the mean in self-assessment: both subjects with above-average competence and subjects with below-average competence tend to assess their abilities as average.
This could also explain why the best performers in Kruger and Dunning’s (1999) surveys underestimated their abilities. However, recent research with statistical control for this tendency of regression to the mean shows that the Dunning–Kruger effect can be reduced somewhat but cannot be fully explained by regression to the mean (McIntosh et al., 2019).

One moderator of the association between self-assessment and objective performance is the specificity of the items that are used to measure perceived ability (Ackerman et al., 2002; Zell & Krizan, 2014). Self-assessments are generally more precise for narrowly-defined areas of behavior that are based on clear criteria (e.g., “I intervened quickly and consistently when students ignored classroom rules”) than for broader areas that are based on an overall impression (e.g., “I’m good at classroom management”). These findings are relevant for the present study because they suggest a specific and criterion-based operationalization of instructional quality.

Gender influences

Generally, misjudgments of people’s own abilities or performances are moderated by different variables, with gender being one of the most influential ones (Lindeman et al., 1995; Lundeberg et al., 1994, 2000). Significant gender differences are found for many kinds of tasks, with men tending to assess their performances more positively, while women’s self-evaluations are generally inaccurately low. Existing research has also shown that women especially underestimate their achievements in masculine-stereotyped tasks or domains (Beyer & Bowden, 1997). Ehrlinger and Dunning (2003) found no actual differences between female and male college students’ performance on a science test, but female students underestimated their performances because they thought less of their general scientific reasoning abilities. Similarly, female managerial students assess their own abilities that qualify them for a leadership position significantly lower than their male counterparts do (Bosak & Sczesny, 2008). These findings indicate a pervasive gender bias in self-concepts related to performance. However, to the best of our knowledge, the question of how gender can affect the self-perceptions of teachers (especially in comparison to their students’ perceptions) is still unresolved.

The present study

Previous research shows that teachers generally perceive instructional quality characteristics differently from students. However, it is unclear how teachers differ from each other in perceiving these characteristics compared with their students’ perceptions. Previous findings do not address the question of how the differences vary between those teachers whose instructional characteristics are perceived favorably by students and those whose instructional characteristics are perceived less favorably. The purpose of the present study was to further explore the relationship between teachers’ and students’ perceptions of instructional quality by investigating patterns of teachers’ over- and under-estimation of characteristics of generic instructional quality compared with students’ perceptions. On the one hand, this study aimed to generate further findings about the different perception perspectives on teaching and explain these differences.

On the other hand, these findings can be used to provide teachers with important information for reflecting on their own teaching based on student feedback and self-perceptions. Knowledge about possible explanations for differences in perception could, for example, lead to a more self-critical attitude toward one's own teaching among teachers who overestimate themselves to a particularly high degree. At the same time, it might encourage teachers who under-estimate their teaching (especially female teachers) to think somewhat more positively about their own teaching. Therefore, the research design encompassed the testing of hypotheses about differences in perception and the investigation of differential perception profiles.

After testing whether teachers’ and students’ perceptions obtained with the instrument used were comparable or, in other words, whether the measurement of instructional quality is invariant across the two (RQ1), we investigated how self-assessment and external assessment by students of instructional quality are correlated and whether and to what extent this correlation differs depending on the external assessment (RQ2). Following up on previous research dealing with self-perceptions that has shown different patterns for women and men when assessing their own behavior, we investigated whether gender effects on self-assessment known from other contexts can also be transferred to teachers' assessments of teaching (RQ3). To this end, we put forward four specific hypotheses:

  • The perception of instructional quality can be obtained by the same measurement model for students and teachers (H1).

  • Teachers’ self-perceptions and students’ perceptions are moderately correlated (H2.1).

  • Teachers’ self-perceptions differ in a systematic way from students’ perceptions regarding the dimensions of classroom management (H2.2.1), cognitive activation (H2.2.2) and student support (H2.2.3):

    • (a) Overestimation of instructional quality characteristics is largest among those whose lessons are perceived unfavorably by their students.

    • (b) Correct estimation or underestimation can be found among those assessed favorably by their students.

  • Male teachers show a tendency to significantly overestimate their own instructional quality (H3).

Hypotheses H2.2.1–H2.2.3 concern perception differences with respect to three separate characteristics of instructional quality. Given the multidimensionality of instructional quality, if patterns of over- and under-estimation of teachers’ perceptions compared with students’ perceptions can be shown for these three dimensions, it must be clarified whether the simultaneous consideration of all three dimensions allows the identification of teacher profiles that reflect their perception of instructional quality in general relative to their students’ perceptions. Therefore, in a next step, we used a more-exploratory method to investigate typical inter-personal patterns in the deviation of self-assessment from an external assessment with regard to instructional quality in general. We analyzed the extent to which different profiles occur in the combination of teacher and student assessments of the three dimensions of instructional quality (RQ4). Finally, we explored if and to what extent personal and context variables (grade, school type, school subject, teacher gender) are associated with the perception profiles to which teachers belong (RQ5).

Because student assessments of teaching in primary and secondary schools are unsuitable or only suitable to a very limited extent for accountability purposes of teachers to supervisors (Röhl & Rollett, 2021), we limit interpretation of our findings to their relevance for teachers' self-reflection on their instruction.

Sample

The sample consisted of 171 teachers (51% female) teaching classes in grades 5–12 at eight German schools from three different school types. The corresponding student sample consisted of 4108 students. These three school types are university preparatory high schools (Gymnasium), intermediate secondary schools (Realschule) and vocational schools (Berufliche Schule). Within the German school context, these are three types of secondary schools, with the first two starting from grade 5 and both requiring certain entry grades from grade 4 (primary school), and vocational schools starting from grade 10 and requiring the completion of junior high school.

Data collection

Students’ perceptions of instructional quality (with an average cluster size of 22.3) were surveyed during the period from September 2017 to October 2019 via an online portal. The data stem from the everyday school context rather than being obtained for research purposes: teachers collected feedback for their professional development and then provided the results for scientific analysis, which is why a period of this duration was chosen. No extra incentive was provided for teachers or students. For every survey, teachers assessed themselves on the same items as their students before they had received the students’ assessments. Because of the technical nature of the online portal and privacy restrictions, no personalized student data were obtained. Personalized teacher data were anonymized before being transferred to us for analysis. The heterogeneous database reflects the variety of different school types in the German school system, but it does not constitute a representative sample of the school system in a narrower sense. Teachers decided to use the online feedback portal voluntarily, which means that the sample was restricted to those teachers who were willing to reflect on their teaching based on student feedback or simply wanted to try out this instrument.

Measure

Students’ and teachers’ perceptions of instructional quality were surveyed using the teaCh questionnaire (Wisniewski & Zierer, 2020), consisting of 29 items that refer to the seven categories of care, control, conferment, clarity, challenge, consolidation, and captivation (see Table 1). The items for teachers are identical to the student version but are formulated from the teacher’s perspective. As most of the item formulations focus on the teacher (e.g., “The teacher had high expectations of me”; “I had high expectations of the students”), the questionnaire measures comparable self and other perceptions of the teacher’s behavior in class.

Table 1 Dimensions of teaCh questionnaire

As latent second-order factors, these categories load on the known three basic dimensions of instructional quality, namely, classroom management, cognitive activation, and student support (Praetorius et al., 2018), which were used for the analyses in this study. Both versions were rated on a four-point Likert-type scale, ranging from 1 (I don’t agree) to 4 (I agree). The instrument has been shown to measure general instructional quality in a valid way, and the measurement is generalizable across school types, school subjects, and grade levels in a secondary school context (Wisniewski et al., 2020). It also allows a valid comparison of student and teacher perspectives.

Statistical analyses

Using the present sample, we conducted a two-level confirmatory factor analysis of the assumed factor structure of seven first-order factors and three second-order factors referring to the basic dimensions of instructional quality. To compare values from the student and teacher perspectives on instructional quality in our analyses, at least metric invariance between the two perspectives is necessary. To test this, each model with additional restrictions was compared with its less-restrictive precursor. χ2 statistics have proven to be an unreliable indicator of measurement invariance for large samples because significant p-values can be obtained almost irrespective of actual differences in model fit (Cheung & Rensvold, 2002; Kline, 2016). As an alternative, goodness-of-fit indices can be used as a more-reliable source of information to test for measurement invariance. As proposed by Meade et al. (2008), ΔCFI ≤ 0.002 was chosen as the comparative indicator. To ensure even better comparability of the two perspectives, equal item loadings for students and teachers were specified for the subsequent analyses.

To show systematic miscalibrations of self-perceptions, most research on the Dunning–Kruger effect uses the percentile ranks of actual performance and self-assessments (Kruger & Dunning, 1999). Therefore, factor scores were scaled in the same way to test the relevant hypotheses.
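The percentile-rank scaling mentioned above might look like the following sketch; the factor scores are invented for illustration, and the mid-rank treatment of ties is an assumption, since the article does not specify it:

```python
# Minimal sketch: converting factor scores to percentile ranks within
# the sample, as is common in Dunning-Kruger-style analyses.
# The input scores below are made up.

def percentile_ranks(scores):
    """Percentile rank of each score (0-100): the share of values
    strictly below it plus half of the tied values (mid-rank)."""
    n = len(scores)
    ranks = []
    for s in scores:
        below = sum(1 for x in scores if x < s)
        ties = sum(1 for x in scores if x == s)
        ranks.append(100 * (below + 0.5 * ties) / n)
    return ranks

self_factor_scores = [0.4, -1.2, 0.9, 0.1]   # hypothetical values
print(percentile_ranks(self_factor_scores))
```

Scaling both the self-perception and the student-perception factor scores this way puts the two perspectives on a common 0-100 metric before their difference is examined.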

For the in-depth regression analyses, we used the factor scores for the three basic dimensions of instructional quality provided by the program Mplus 8.2 (Muthén & Muthén, 2012–2019). A latent profile analysis (LPA) based on teachers’ self-perceptions compared with the perceptions of their classes was conducted, also using the three basic dimensions of instructional quality. Because teachers were nested in schools, we accounted for possible dependencies in our data by correcting standard errors and the chi-square test of model fit (TYPE = COMPLEX), with schools used as clusters. To identify the best-fitting profile solution, we estimated fit indices (BIC, aBIC, AIC), likelihood ratio tests (the Vuong and the Lo-Mendell-Rubin tests), entropy, and the number of subjects per assumed class for different solutions.

After identifying perception profiles, we tested if person and context variables were associated with the assignment of teachers to these profiles. We used the grade that was taught, the school type, the school subject, and the teacher’s gender for this analysis.

Results

Descriptive results

All observed item means were slightly above the theoretical mean (ranging from 2.99 to 3.54 for teachers and from 2.98 to 3.35 for students), with standard deviations ranging from 0.60 to 0.92 for students and from 0.62 to 0.88 for teachers. Responses were approximately normally distributed, with skewness ranging from −1.44 to −0.33 for teachers and from −1.43 to −0.49 for students. Kurtosis values ranged from −0.61 to 1.79 for teachers and from −0.61 to 1.77 for students. More specific descriptive data can be found in Table 2 in the Appendix.

Measurement model

The test for measurement invariance across teachers and their classes pointed to an acceptable fit for the sample. Using ΔCFI ≤ 0.002 as comparative indicator (Meade et al., 2008), results pointed to strong measurement invariance between the groups (see Table 3). The measurement model with equal item loadings for teachers and students used in the following pointed to an acceptable fit for the sample (CFI = 0.93, TLI = 0.92, RMSEA = 0.02, SRMRwithin = 0.03, SRMRbetween = 0.10). Intraclass correlations for the 29 items on the student level were substantial, with ICC1 ranging from 0.08 to 0.24 (median 0.17) and ICC2 ranging from 0.68 to 0.88 (median 0.82).
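The intraclass correlations reported above can be derived from a one-way ANOVA decomposition of the ratings by class. The following sketch uses the common definitions of ICC(1) and ICC(2) (e.g. Bliese, 2000); this is an assumption about the computation, since the article does not spell it out, and the per-class ratings are invented:

```python
# Sketch of ICC(1) and ICC(2) from a one-way ANOVA decomposition.
# groups: one list of item ratings per class; all values are invented.

def iccs(groups):
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    k = len(all_vals) / len(groups)          # average cluster size
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (len(all_vals) - len(groups))
    icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc2 = (ms_between - ms_within) / ms_between   # reliability of class means
    return icc1, icc2

classes = [[3, 4, 3, 4], [2, 2, 3, 2], [4, 4, 3, 4]]   # hypothetical ratings
icc1, icc2 = iccs(classes)
print(round(icc1, 2), round(icc2, 2))
```

ICC(2) exceeds ICC(1) by construction for clusters larger than one, which mirrors the pattern in the reported values (0.08–0.24 versus 0.68–0.88).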

Deviations between teachers’ self-perceptions and students’ perceptions

The correlations between self-perception and aggregated students’ perception were r = 0.52 (p < 0.001) for classroom management, r = 0.35 (p < 0.001) for cognitive activation, and r = 0.40 (p < 0.001) for student support. For all three basic dimensions, teachers in the bottom quartile overestimated their performance based on student perceptions, whereas teachers in the three other quartiles either agreed with their students’ perceptions or underestimated their performance (Fig. 1).

Fig. 1 Self-perception and aggregated student perceptions of instructional quality
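The quartile comparison described above can be illustrated with a small sketch: teachers are ordered by the aggregated student rating, split into quartiles, and the mean gap between self- and student percentile ranks is computed per quartile. The paired ranks below are invented, not the study's data:

```python
# Sketch of a quartile-based over-/under-estimation analysis.
# pairs: (student_percentile, self_percentile) per teacher; values invented.

def quartile_gaps(pairs):
    """Mean (self - student) percentile gap per student-rating quartile.
    A positive gap indicates overestimation relative to students."""
    ordered = sorted(pairs, key=lambda p: p[0])
    n = len(ordered)
    gaps = []
    for q in range(4):
        chunk = ordered[q * n // 4:(q + 1) * n // 4]
        gaps.append(sum(s - o for o, s in chunk) / len(chunk))
    return gaps

pairs = [(10, 55), (20, 60), (45, 50), (55, 50),
         (70, 65), (80, 70), (90, 80), (95, 85)]
print(quartile_gaps(pairs))   # positive gap = overestimation
```

With data shaped like the study's findings, the first (bottom) quartile would show a large positive gap and the upper quartiles gaps near zero or negative.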

Influence of gender

Female teachers received better ratings than male teachers, with small but significant differences for all three basic dimensions (tCM[127] = 2.52, p < 0.05; tCA[129] = 3.08, p < 0.01; tSS[132] = 3.34, p < 0.01). Self-perceptions differed significantly between female and male teachers (with more-favorable self-perceptions for men) for classroom management and cognitive activation, but not for student support (tCM[166] = −2.57, p < 0.05; tCA[169] = −2.04, p < 0.05; tSS[169] = −1.77, p > 0.05); see Table 4.

In predicting teachers’ self-perceptions using student perceptions and teachers’ gender, both variables showed significant effects. Together, student perceptions and teachers’ gender explained between 18 and 36% of the variance in self-perceptions for the three dimensions of classroom management, cognitive activation, and student support (Table 5).

Identification of different perception profiles

The fit indices, likelihood ratio tests and number of subjects per assumed class for the 1- to 12-class solutions are shown in Table 6. Both the Vuong and the Lo-Mendell-Rubin likelihood ratio tests supported the four-class model. In addition, the aBIC decreased markedly from the 3- to the 4-class solution (ΔaBIC = 148), with a clearly smaller decrease from 4 to 5 classes (ΔaBIC = 98). Therefore, the four-class model was selected for further analysis.

Class 1 (10.53%) was characterized by the lowest student assessments of instructional quality and significant differences between students’ and self-perceptions for classroom management and cognitive activation (p < 0.001), whereas no perception difference was found for student support (p = 0.10) for this class. Class 2 (21.05%) was characterized by the second-lowest student assessments and significant perception differences for classroom management (p < 0.05), cognitive activation (p < 0.001) and student support (p < 0.01). Class 3 (40.35%) was characterized by teachers’ underestimation of all three dimensions of classroom management (p < 0.01), cognitive activation (p < 0.01) and student support (p < 0.05) compared with their students’ perceptions. Class 4 (28.07%) was characterized by the most-positive student assessments, agreement of self with students’ perceptions for classroom management (p = 0.40) and student support (p = 0.80), and a significant underestimation of cognitive activation compared with students’ perceptions (p < 0.001). Figure 2 shows the average factor scores for students’ and self-perceptions of the latent variables.

Fig. 2 Average factor scores for students’ and self-perceptions of the latent variables

Perception profiles and their association with grade, school type, school subject, and teacher gender

No significant differences in grade (χ2 = 4.81, df = 6, p = 0.57), school type (χ2 = 1.59, df = 6, p = 0.95) or taught school subject (χ2 = 22, df = 18, p = 0.23) were found among individuals assigned to the four classes. However, a chi-squared test revealed significant differences in teacher gender (χ2 = 28, df = 3, p < 0.001, see Table 7), with a higher proportion of men in the overestimating classes 1 and 2.
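
A chi-squared test of independence of this kind can be computed directly from the gender-by-class contingency table. The counts below are invented for illustration (the study’s cell counts are in Table 7); the sketch shows only the mechanics of the Pearson statistic:

```python
# Hypothetical 2x4 contingency table: teacher gender (rows: female, male)
# by perception profile (columns: classes 1-4). Counts are illustrative only.
obs = [[4, 10, 30, 18],
       [8, 14, 16, 14]]

rows = [sum(r) for r in obs]          # row marginals
cols = [sum(c) for c in zip(*obs)]    # column marginals
total = sum(rows)

# Pearson chi-squared statistic: sum over cells of (O - E)^2 / E,
# where E = row total * column total / grand total
chi2 = sum((obs[i][j] - rows[i] * cols[j] / total) ** 2 / (rows[i] * cols[j] / total)
           for i in range(len(obs)) for j in range(len(obs[0])))
dof = (len(obs) - 1) * (len(obs[0]) - 1)  # (2 - 1) * (4 - 1) = 3
print(round(chi2, 2), dof)
```

The statistic is then compared against the chi-squared distribution with df = 3 to obtain the p-value (in practice via a statistics package rather than by hand).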

Table 8 shows the assignment to the four classes differentiated by taught subject, whereas Table 9 shows the assignment to the four classes differentiated by teacher gender and taught subject (χ2 = 51, df = 18, p < 0.001).

Discussion

The aim of the present study was to investigate the divergence between teachers’ self-perceptions and students’ perceptions of instructional quality. We argue that this divergence is the basis for meaningful reflective practice because, when teachers misjudge instructional quality characteristics in comparison to their students, they cannot adapt the learning environment to their students’ needs. A regression analysis and latent profile analysis were applied to explore disagreement between teachers and students on characteristics of instructional quality.

Comparing perceptions across different groups requires that the measured constructs have the same meaning across these groups, and that comparisons of sample estimates are not distorted by group-specific attributes. As a prerequisite for our investigation, we therefore confirmed the assumed factor structure of the instrument used to obtain self and students’ perceptions for our sample, and that the three superordinate dimensions of instructional quality were invariant across teachers and students, allowing a comparison of the two perspectives (H1).

In line with previous research (Clausen, 2002; Fauth et al., 2014; Kunter & Baumert, 2006; Maulana et al., 2012; Wagner et al., 2016), the overall correlation between teachers’ self-perceptions and students’ perceptions was only low to moderate (H2.1).

However, correlations based on the whole sample offer no information about how individual teachers differ in the accuracy of their self-perceptions. A classification of teachers into quartiles according to their students’ perceptions produced a pattern similar to that found by Kruger and Dunning (1999) for other tasks: teachers with less favorable student perceptions overestimated their performance. The results of the latent profile analysis can hardly be explained by a regression to the mean of self-assessments, as proposed by Krueger and Mueller (2002), because there was substantial overestimation among teachers with unfavorable student assessments but little underestimation among teachers with favorable student assessments. Teachers’ self-perceptions were more accurate when the external perception was more favorable. Consequently, our data support Kruger and Dunning’s (1999) hypothesis that over-estimation results from an inability to assess the very characteristics of instructional quality that one implements inadequately.
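
The quartile comparison referred to here can be sketched with synthetic data. The parameters below (correlation strength, sample size, seed) are our own assumptions; note that weakly correlated symmetric data like these produce mirror-image over- and underestimation, whereas the study’s data showed the asymmetric pattern described above:

```python
import numpy as np

# Synthetic standardized ratings: student-perceived quality and a weakly
# calibrated self-perception (all parameters are illustrative assumptions)
rng = np.random.default_rng(1)
n = 400
students = rng.normal(0.0, 1.0, n)
self_perc = 0.3 * students + rng.normal(0.0, 0.9, n)

# Assign each teacher to a quartile of the student ratings
cuts = np.quantile(students, [0.25, 0.5, 0.75])
quart = np.searchsorted(cuts, students)

# Mean self-minus-student difference per quartile: positive = overestimation
for q in range(4):
    mask = quart == q
    diff = self_perc[mask].mean() - students[mask].mean()
    print(f"quartile {q + 1}: {diff:+.2f}")
```

In this simulation the bottom quartile overestimates and the top quartile underestimates by roughly the same amount, which is exactly the regression-to-the-mean signature; the study’s finding of strong overestimation at the bottom but little underestimation at the top is what favors the Kruger–Dunning account instead.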

Teachers who overestimated their performance (classes 1 and 2) accounted for about one third of the sample. The 11% of teachers who received the lowest ratings for all three basic dimensions of instructional quality overestimated their performance regarding classroom management and cognitive activation the most. The two classes with the most-favorable student assessments accounted for more than two-thirds of the sample and were characterized by more-precise self-perceptions and underestimations.

While the assignment to the four identified classes was independent of grade level, school type, and taught subjects, significant gender differences were found, with female teachers being under-represented and male teachers being over-represented in the profile defined by the highest overestimation of one’s own performance. These associations did not depend on the school subjects taught. The previously reported tendency of women to underestimate their achievements particularly in masculine-stereotyped tasks or domains (Beyer & Bowden, 1997) was not replicated here: female teachers did not assess their instructional quality more negatively in subjects such as mathematics, physics or IT, which are traditionally stereotyped as male domains (Makarova et al., 2019).

There are certain limitations of the study that need to be borne in mind when interpreting the results. Firstly, our sample included teachers from two German federal states and three school types and is therefore not representative of the whole German school system. Secondly, we could not test for effects of some personal variables. We could not consider students’ gender because no individual-related student data were collected for data-protection reasons. Previous research has shown interaction effects of teacher and student gender on evaluations of teaching in higher education (Boring, 2017; Mitchell & Martin, 2018), but we could not expand on this research. We were also unable to use information on teachers’ age or professional experience and could therefore not check whether these variables influence patterns of self-perception. Finally, teachers’ perceptions of instructional quality were only compared with (also subjective) students’ perceptions. Future research should expand on this by using actual teacher performance data (e.g. in the form of value added) so that claims of miscalibration can be made based on objective data.

Despite these limitations, practical implications can still be drawn: although a favorable self-assessment can have beneficial effects (Bandura, 1977, 1997; Fox et al., 2009; Mosing et al., 2009), under- and over-estimations can also cause detrimental motivational consequences (Dunlosky & Rawson, 2012). Most importantly, over-estimations of one’s own performance hinder people from developing their professional skills because they are unaware of reasons for improvement. This is especially disadvantageous when those who require professional development the most are least aware of its necessity.

Our findings point to the importance of adaptive feedback for teachers, feedback that considers different patterns of disagreement and counteracts under- and over-estimation. Because inaccurate self-perceptions are relatively stable in the face of feedback (Hacker et al., 2000; Helzer & Dunning, 2012; Simons, 2013) and low-performers find it difficult to calibrate their self-perceptions based on feedback (Brett & Atwater, 2001; Kruger & Dunning, 1999), teachers who over-estimated the instructional quality of their lessons relative to their students’ perceptions are unlikely to benefit from merely being informed of this divergence and therefore need support to improve (Röhl & Gärtner, 2021). Feedback that is based on specific observations of actual behavior is more likely to provide opportunities for improvement (Braga et al., 2014; Kluger & DeNisi, 1996; Kornell & Hausman, 2016; Miller & Geraci, 2011).

In his seminal study, Goodlad (1984, quoted from Lamb, 2017) claimed that, by not using students’ feedback, an essential part of reflective practice is neglected. While research on the effectiveness of feedback to teachers from peers or traditional supervisors has produced conflicting results (Scheeler et al., 2004), student feedback seems to lead to improved instructional quality (Röhl & Gärtner, 2021). Consequently, questionnaire data from student feedback, as used in our analysis, can be one important feedback source for teachers to reflect on and improve their instructional practices. Our results point toward the necessity for this feedback to be based on clearly defined dimensions of instructional quality and for those who overestimate their performance to be supported with counselling and coaching.