Abstract
In educational research, characteristics of the learning environment are generally assessed by asking students to evaluate features of their lessons. The student ratings produced by this simple and efficient research strategy can be analysed from two different perspectives. At the individual level, they represent the individual student’s perception of the learning environment. Scores aggregated to the classroom level reflect perceptions of the shared learning environment, corrected for individual idiosyncrasies. This second approach is often pursued in studies of teaching quality and effectiveness, where student-level ratings are aggregated to the class level to obtain general information about the learning environment. Although this strategy is widely applied in educational research, neither the reliability of aggregated student ratings nor the within-group agreement among the students in a class has been subject to much investigation. The present study introduces and discusses procedures, proposed in the field of organisational psychology, that can be used to assess the reliability of and agreement among students’ ratings of their instruction. The application of the proposed indexes is demonstrated by a reanalysis of student ratings of mathematics instruction obtained in the Third International Mathematics and Science Study (N = 2,064 students in 100 classes).
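To make the aggregation step and the reliability question concrete, the following minimal Python sketch aggregates single-item student ratings to class means and estimates the reliability of those means from a one-way random-effects ANOVA, an index often labelled ICC(2) in the organisational literature (e.g. Bliese, 2000). The data, function name and the specific estimator are illustrative assumptions, not the analysis code used in the study.

```python
import numpy as np

def class_mean_reliability(ratings_by_class):
    """Estimate the reliability of class-aggregated ratings (an ICC(2)-type
    index) from the mean squares of a one-way random-effects ANOVA.
    Illustrative sketch only; name and data are hypothetical."""
    groups = [np.asarray(g, dtype=float) for g in ratings_by_class]
    k = len(groups)                                   # number of classes
    sizes = np.array([len(g) for g in groups])        # students per class
    grand_mean = np.concatenate(groups).mean()
    ms_between = sum(n * (g.mean() - grand_mean) ** 2
                     for n, g in zip(sizes, groups)) / (k - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum()
                    for g in groups) / (sizes.sum() - k)
    # (MS_between - MS_within) / MS_between: reliability of the class means
    return (ms_between - ms_within) / ms_between

# Hypothetical ratings of one instruction item (1-4 scale) in three classes
classes = [[3, 4, 3, 4, 3], [2, 2, 3, 1, 2], [4, 3, 4, 4, 3]]
class_means = [float(np.mean(c)) for c in classes]    # aggregated, class-level scores
print(class_means, round(class_mean_reliability(classes), 2))
```

Because such an index grows with the number of raters per class, the reliability of class means depends not only on how much students agree but also on class size, which is one reason the diagnostics discussed in the article are worth reporting.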
Notes
The literature on observer agreement also refers to the ICC(1) as the unadjusted intraclass correlation, because mean differences between raters contribute to the error variance; the index thus requires absolute agreement between the raters (McGraw & Wong, 1996). In organisational psychology (e.g. Bliese, 2000; Cohen et al., 2001), in contrast, the ICC(1) is seen as a measure of reliability. This is the approach taken in the present article.
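For reference, the ICC(1) is typically estimated from the mean squares of a one-way random-effects ANOVA of the individual ratings on class membership (e.g. Shrout & Fleiss, 1979; McGraw & Wong, 1996):

```latex
\mathrm{ICC}(1) \;=\; \frac{MS_{B} - MS_{W}}{MS_{B} + (\bar{n} - 1)\,MS_{W}}
```

where MS_B and MS_W denote the between- and within-class mean squares and n̄ the (average) number of students per class; the index estimates the proportion of variance in individual ratings that is attributable to class membership.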
In their original article, James et al. (1984) presented the r_WG as a measure of interrater reliability among ratings of a single target. Following critical commentary by Schmidt and Hunter (1989), it became standard practice to interpret the r_WG as a measure of interrater agreement (James et al., 1993; Kozlowski & Hattrup, 1992). The main criticism voiced by Schmidt and Hunter was that, in psychometric test theory, the concept of reliability presupposes variance between true values, and that such variance exists only when more than one stimulus is rated.
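As a reminder of its form, the single-item r_WG of James et al. (1984) compares the observed within-group variance of the ratings with the variance expected if raters responded purely at random:

```latex
r_{WG} \;=\; 1 - \frac{s_{x}^{2}}{\sigma_{E}^{2}},
\qquad \text{e.g. } \sigma_{E}^{2} = \frac{A^{2} - 1}{12}
\text{ for a uniform null distribution over } A \text{ response categories.}
```

Values near 1 indicate strong within-group agreement, whereas values near 0 indicate that the students’ ratings are no more similar than random responding would produce.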
References
Anderson, C. S. (1982). The search for school climate: A review of the research. Review of Educational Research, 52, 368–420.
Baumert, J., Lehmann, R. H., Lehrke, M., Schmitz, B., Clausen, M., Hosenfeld, I., et al. (1997). TIMSS: Mathematisch-Naturwissenschaftlicher Unterricht im internationalen Vergleich [TIMSS: Mathematics and science instruction in an international comparison]. Opladen, Germany: Leske and Budrich.
Beaton, A. E., Mullis, I. V. S., Martin, M. O., Gonzales, E. J., Kelly, D. L., & Smith, T. A. (1996). Mathematics achievement in the middle school years: IEA’s Third International Mathematics and Science Study. Chestnut Hill, MA: Boston College.
Bliese, P. D. (1998). Group size, ICC values, and group-level correlations: A simulation. Organizational Research Methods, 1, 355–373.
Bliese, P. D. (2000). Within-group agreement, non-independence, and reliability: Implications for data aggregation and analysis. In K. J. Klein & S. W. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations (pp. 349–381). San Francisco: Jossey-Bass.
Brown, R. D., & Hauenstein, N. M. A. (2005). Interrater agreement reconsidered: An alternative to the r_WG indices. Organizational Research Methods, 8, 165–184.
Burke, M. J., & Dunlap, W. P. (2002). Estimating interrater agreement with the average deviation index: A user’s guide. Organizational Research Methods, 5, 159–172.
Burke, M. J., Finkelstein, L. M., & Dusig, M. S. (1999). On average deviation indices for estimating interrater agreement. Organizational Research Methods, 2, 49–68.
Chan, D. (1998). Functional relations among constructs in the same content domain at different levels of analysis: A typology of compositional models. Journal of Applied Psychology, 83, 234–246.
Clausen, M. (2002). Unterrichtsqualität: Eine Frage der Perspektive? [Quality of instruction: A matter of perspective?] Münster, Germany: Waxmann.
Cohen, A., Doveh, E., & Eick, U. (2001). Statistical properties of the r_WG(J) index of agreement. Psychological Methods, 6, 297–310.
Dunlap, W. P., Burke, M. J., & Smith-Crowe, K. (2003). Accurate tests of statistical significance for r_WG and average deviation interrater agreement indexes. Journal of Applied Psychology, 88, 356–362.
Finn, R. H. (1970). A note on estimating the reliability of categorical data. Educational and Psychological Measurement, 30, 71–76.
Fraser, B. (1991). Two decades of classroom environment research. In H. J. Walberg (Ed.), Educational environments: Evaluation, antecedents and consequences (pp. 3–27). Elmsford, NY: Pergamon.
Grawitch, M. J., & Munz, D. C. (2004). Are your data nonindependent? A practical guide to evaluating nonindependence and within-group agreement. Understanding Statistics, 3, 231–257.
Griffith, J. (2002). Is quality/effectiveness an empirically demonstrable school attribute? Statistical aids for determining appropriate levels of analysis. School Effectiveness and School Improvement, 13, 91–122.
Gruehn, S. (2000). Unterricht und schulisches Lernen: Schüler als Quellen der Unterrichtsbeschreibung [Instruction and learning in school: Students as sources of information]. Münster, Germany: Waxmann.
James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85–98.
James, L. R., Demaree, R. G., & Wolf, G. (1993). r_WG: An assessment of within-group interrater agreement. Journal of Applied Psychology, 78, 306–309.
Kane, M. T., & Brennan, R. L. (1977). The generalizability of class means. Review of Educational Research, 47, 267–292.
Klein, K. J., Conn, A. B., Smith, B. D., & Sorra, J. S. (2001). Is everyone in agreement? An exploration of within-group agreement in employee perceptions of the work environment. Journal of Applied Psychology, 86, 3–16.
Kozlowski, S. W., & Hattrup, K. (1992). A disagreement about within-group agreement: Disentangling issues of consistency versus consensus. Journal of Applied Psychology, 77, 161–167.
Kunter, M. (2005). Multiple Ziele im Mathematikunterricht [Multiple objectives in mathematics instruction]. Münster, Germany: Waxmann.
Kunter, M., Baumert, J., & Köller, O. (2005). Effective classroom management and the development of subject-related interest. Manuscript submitted for publication.
LeBreton, J. M., James, L. R., & Lindell, M. K. (2005). Recent issues regarding r_WG, r*_WG, r_WG(J), and r*_WG(J). Organizational Research Methods, 8, 128–138.
Lindell, M. K., & Brandt, C. J. (1997). Measuring interrater agreement for ratings of a single target. Applied Psychological Measurement, 21, 271–278.
Lindell, M. K., & Brandt, C. J. (1999). Assessing interrater agreement on the job relevance of a test: A comparison of CVI, T, r_WG(J), and r*_WG(J) indexes. Journal of Applied Psychology, 84, 640–647.
Lindell, M. K., & Brandt, C. J. (2000). Climate quality and climate consensus as mediators of the relationship between organizational antecedents and outcomes. Journal of Applied Psychology, 85, 331–348.
Lindell, M. K., Brandt, C. J., & Whitney, D. J. (1999). A revised index of interrater agreement for multi-item ratings of a single target. Applied Psychological Measurement, 23, 127–135.
Lüdtke, O., Köller, O., Marsh, H. W., & Trautwein, U. (2005). Teacher frame of reference and the big-fish-little-pond effect. Contemporary Educational Psychology, 30, 263–285.
Lüdtke, O., Robitzsch, A., & Köller, O. (2002). Statistische Artefakte bei Kontexteffekten in der pädagogisch-psychologischen Forschung [Statistical artifacts in educational studies on context effects]. Zeitschrift für Pädagogische Psychologie, 16, 217–231.
Lüdtke, O., Trautwein, U., Kunter, M., & Baumert, J. (2006). Analyse von Lernumwelten: Ansätze zur Bestimmung der Reliabilität und Übereinstimmung von Schülerwahrnehmungen [Analysis of learning environments: Approaches to determining the reliability and agreement of student ratings]. Zeitschrift für Pädagogische Psychologie, 20, 85–96.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models (2nd ed.). Thousand Oaks, CA: Sage.
Schmidt, F. L., & Hunter, J. E. (1989). Interrater reliability coefficients cannot be computed when only one stimulus is rated. Journal of Applied Psychology, 75, 322–327.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
Snijders, T. A. B., & Bosker, R. J. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. London: Sage.
Urdan, T., Midgley, C., & Anderman, E. M. (1998). The role of classroom goal structure in students’ use of self-handicapping strategies. American Educational Research Journal, 35, 101–122.
Wong, A. F., Young, D. J., & Fraser, B. J. (1997). A multilevel analysis of learning environments and student attitudes. Educational Psychology, 17, 449–468.
Cite this article
Lüdtke, O., Trautwein, U., Kunter, M. et al. Reliability and agreement of student ratings of the classroom environment: A reanalysis of TIMSS data. Learning Environ Res 9, 215–230 (2006). https://doi.org/10.1007/s10984-006-9014-8