Beurteilereffekte bei der Messung von Unterrichtsqualität

Pietsch, Marcus; Tosana, Simone

doi:10.1007/s11618-008-0021-7

Das Multifacetten-Rasch-Modell und die Generalisierbarkeitstheorie als Methoden der Qualitätssicherung in der externen Evaluation von Schulen

Rater effects in the measurement of quality of classroom teaching

The many-facet Rasch model and the generalizability theory as methods of quality assurance in the external evaluation of schools.

General section
Published: 19 November 2008

Volume 11, pages 430–452 (2008)

Zeitschrift für Erziehungswissenschaft

Abstract

Classroom observation has become part of the standard methodological repertoire in the external evaluation of schools. Nevertheless, measuring the quality of classroom teaching on the basis of a few selected lesson sequences, which as a rule are observed only briefly, is fraught with methodological problems. Sound quality assurance therefore requires appropriate empirical methods that make problems in the measurement of teaching quality visible. Generalizability theory and the many-facet Rasch model make this possible. Analyses of data from the Hamburg school inspection show that, with an appropriate data collection design, rater effects in classroom observations are comparatively small, at about nine percent of total variance. The analyses also show, however, that it is not sufficient to quantify only the global agreement among raters by means of reliability measures; the rating consistency of each individual rater must also be checked in order to obtain results from classroom observations that are valid and therefore usable in practice.
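
To make the idea of a rater share of total variance concrete, the following minimal sketch estimates variance components for a simplified, fully crossed lessons × raters design with one rating per cell, using the usual expected-mean-square equations of generalizability theory. It is an illustration under stated assumptions, not the design or software used in the article: the function names and the rating matrix are hypothetical, and the article's design may include further facets.

```python
from statistics import mean


def variance_components(ratings):
    """Estimate variance components for a crossed lessons x raters design.

    ratings[p][r] is the score rater r gave to lesson p (one score per cell).
    Assumes at least two lessons and two raters.
    """
    n_p, n_r = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    lesson_means = [mean(row) for row in ratings]
    rater_means = [mean(row[r] for row in ratings) for r in range(n_r)]

    # Mean squares for the two main effects and the residual
    ms_lesson = n_r * sum((m - grand) ** 2 for m in lesson_means) / (n_p - 1)
    ms_rater = n_p * sum((m - grand) ** 2 for m in rater_means) / (n_r - 1)
    ss_res = sum(
        (ratings[p][r] - lesson_means[p] - rater_means[r] + grand) ** 2
        for p in range(n_p)
        for r in range(n_r)
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Solve the expected-mean-square equations; negative estimates are set to 0
    return {
        "lesson": max((ms_lesson - ms_res) / n_r, 0.0),
        "rater": max((ms_rater - ms_res) / n_p, 0.0),
        "residual": ms_res,  # lesson x rater interaction confounded with error
    }


def rater_share(components):
    """Proportion of total estimated variance attributable to the rater main effect."""
    total = sum(components.values())
    return components["rater"] / total if total > 0 else 0.0


# Hypothetical ratings: 4 lessons (rows) scored by 3 raters (columns)
example = [
    [3, 3, 4],
    [2, 2, 3],
    [4, 3, 4],
    [1, 2, 2],
]
components = variance_components(example)
print(components)
print(f"rater share of total variance: {rater_share(components):.1%}")
```

In this simplified design the residual mixes the lesson × rater interaction with random error, so only the rater main effect can be separated out. A design with more facets would allow a finer decomposition.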

Notes

The relevance of combining these two methods is evident, among other things, from the fact that the recent research literature contains first proposals for combining the two approaches formally as well (cf. Bock/Brennan/Muraki 2002; Marcoulides/Drezner 2000; Briggs/Wilson 2007). Briggs and Wilson (2007), for example, propose an integrative method that estimates model parameters and variance components simultaneously, so that the influence of different sources of measurement error in comparative tests can be adequately taken into account and controlled.
Further analyses, which are not part of this article, show that when the 'rater' facet is omitted from the model, the item positions on the metric shown remain almost constant for 28 of the 30 items. The measures of items 21 and 23, however, are strongly influenced by the classroom observers; that is, the variation contributed by these items is largely generated by the observers. When a classical 1-PL Rasch model is estimated, item 21 lies at the level of item 15 and item 23 at the level of item 14. For this model, the item separation reliability is 0.995, the item separation ratio is 8.269, and the item class separation index is 11.
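
As a small worked check on the last two figures, the sketch below assumes that the item class separation index corresponds to the strata statistic H = (4G + 1) / 3 from the Rasch measurement literature, where G is the separation ratio. That correspondence is an assumption, since the note does not state the formula, and the function name is hypothetical.

```python
def strata(separation_ratio):
    """Number of statistically distinct levels, H = (4G + 1) / 3.

    Assumption: this is the formula behind the reported class separation index.
    """
    return (4 * separation_ratio + 1) / 3


G = 8.269  # item separation ratio reported in the note
H = strata(G)
print(f"H = {H:.2f}, rounded: {round(H)}")
```

Under this assumption, the reported separation ratio of 8.269 gives an H of about 11.4, which is in line with the reported index of 11.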

References

Andrich, D. (1978): Application of a psychometric model to ordered categories which are scored with successive integers. In: Applied Psychological Measurement, Vol. 2, pp. 581–594.
Bartz, A./Müller, S. (2005): Schulinspektion – Ziele, Funktionen und Qualitätsbereiche. In: Bartz, A./Fabian, J./Huber, S. G./Kloft, C./Rosenbusch, H./Sassenscheidt, H. (Eds.): Praxiswissen Schulleitung (as of 15 November 2005). – Munich.
Behörde für Bildung und Sport (2006): Orientierungsrahmen: Qualitätsentwicklung an Hamburger Schulen. – Hamburg.
Berry, K. J./Mielke, P. W. (1988): A generalization of Cohen’s kappa agreement to interval measurement and multiple raters. In: Educational and Psychological Measurement, Vol. 48(4), pp. 921–933.
Bock, R. D./Brennan, R. L./Muraki, E. (2002): The information in multiple ratings. In: Applied Psychological Measurement, Vol. 26(4), pp. 364–375.
Bortz, J./Döring, N. (2002): Forschungsmethoden und Evaluation für Human- und Sozialwissenschaftler. – Berlin.
Bos, W./Holtappels, H.-G./Rösner, E. (2006): Schulinspektionen in den deutschen Bundesländern – eine Baustellenbeschreibung. In: Bos, W./Holtappels, H.-G./Pfeiffer, H./Rolff, H.-G./Schulz-Zander, R. (Eds.): Jahrbuch der Schulentwicklung Band 14. Daten, Beispiele und Perspektiven. – Weinheim, pp. 81–124.
Brennan, R. L. (2001a): Generalizability Theory. – New York.
Brennan, R. L. (2001b): Manual for urGENOVA Version 2.1. Iowa Testing Programs Occasional Papers 49. – Iowa City.
Briggs, D. C./Wilson, M. R. (2007): Generalizability in item response modeling. In: Journal of Educational Measurement, Vol. 44(2), pp. 131–155.
Clauser, B./Linacre, J. M. (1999): Relating Cronbach and Rasch reliabilities. In: Rasch Measurement Transactions, Vol. 13(2), p. 696.
Cohen, J. (1960): A coefficient of agreement for nominal scales. In: Educational and Psychological Measurement, Vol. 20, pp. 37–46.
Cronbach, L. J./Rajaratnam, N./Gleser, G. C. (1963): Theory of generalizability: A liberalization of reliability theory. In: British Journal of Statistical Psychology, Vol. 16, pp. 137–163.
Cronbach, L. J./Gleser, G. C./Nanda, H./Rajaratnam, N. (1972): The Dependability of Behavioral Measurements: Theory of generalizability for scores and profiles. – New York.
Dobbelstein, P. (in press): Die Suche nach dem guten Unterricht. Qualitätsmaßstäbe in der Diskussion. In: Bos, W./Dedering, K./Holtappels, H.-G./Müller, S./Rösner, E. (Eds.): Schulische Qualitätsanalyse in NRW. – Cologne.
Eckes, T. (2004): Facetten des Sprachtestens. Strenge und Konsistenz in der Beurteilung sprachlicher Leistung. In: Wolff, A./Ostermann, A./Closta, C. (Eds.): Integration durch Sprache. – Regensburg, pp. 485–518.
Fisher, W. (1992): Reliability statistics. In: Rasch Measurement Transactions, Vol. 6(3), p. 238.
Gleser, G. C./Cronbach, L. J./Rajaratnam, N. (1965): Generalizability of scores influenced by multiple sources of variance. In: Psychometrika, Vol. 30, pp. 395–418.
Helmke, A. (2003): Unterrichtsqualität erfassen, bewerten, verbessern. – Seelze.
Helmke, A. (2006): Was wissen wir über guten Unterricht? In: Pädagogik, Vol. 58(2), pp. 42–45.
Helmke, A./Helmke, T./Schrader, F.-W. (2007): Unterrichtsqualität: Brennpunkte und Perspektiven der Forschung. In: Arnold, K.-H. (Ed.): Unterrichtsqualität und Fachdidaktik. – Bad Heilbrunn, pp. 51–72.
Hoyt, W. T. (2000): Rater bias in psychological research: When is it a problem and what can we do about it? In: Psychological Methods, Vol. 5, pp. 64–86.
Hoyt, W. T./Kerns, M.-D. (1999): Magnitude and moderators of bias in observer ratings: A meta-analysis. In: Psychological Methods, Vol. 4, pp. 403–424.
Linacre, J. M. (1989): Many-faceted Rasch Measurement. – Chicago.
Linacre, J. M. (1997): KR-20/Cronbach alpha or Rasch reliabilities: Which tells the truth? In: Rasch Measurement Transactions, Vol. 11(3), pp. 580–581.
Linacre, J. M. (1998): Linking constants with common items and judges. In: Rasch Measurement Transactions, Vol. 12(1), p. 621.
Linacre, J. M. (2001): Generalizability theory and Rasch measurement. In: Rasch Measurement Transactions, Vol. 15(1), pp. 806–807.
Linacre, J. M. (2002): What do infit and outfit, mean-square and standardized mean? In: Rasch Measurement Transactions, Vol. 16(2), p. 878.
Linacre, J. M. (2007): A User’s Guide to Facets: Rasch model computer programs. – Chicago.
Lunz, M. E./Stahl, J. A./Wright, B. D. (1994): Interjudge reliability and decision reproducibility. In: Educational and Psychological Measurement, Vol. 54(4), pp. 913–925.
Lunz, M. E./Wright, B. D./Linacre, J. M. (1990): Measuring the impact of judge severity on examination scores. In: Applied Measurement in Education, Vol. 3(4), pp. 331–345.
Marcoulides, G. A./Drezner, Z. (2000): A procedure for detecting pattern clustering in measurement designs. In: Wilson, M. R./Engelhard, G. (Eds.): Objective Measurement: Theory into practice. – Stamford, pp. 287–303.
Maritzen, N. (2006): Eine Trendanalyse – Schulinspektion zwischen Aufsicht und Draufsicht. In: Buchen, H./Horster, L./Rolff, H.-G. (Eds.): Schulinspektion und Schulleitung. – Stuttgart, pp. 7–26.
Masters, G. N. (1982): A Rasch model for partial credit scoring. In: Psychometrika, Vol. 47, pp. 149–174.
Meyer, H. (2004): Was ist guter Unterricht? – Berlin.
Myford, C. M./Wolfe, E. W. (2000): Strengthening the Ties that Bind: Improving the linking network in sparsely connected rater designs. – Princeton.
Myford, C. M./Wolfe, E. W. (2003): Detecting and measuring rater effects using many-facet Rasch measurement: Part 1. In: Journal of Applied Measurement, Vol. 4(4), pp. 386–422.
Myford, C. M./Wolfe, E. W. (2004): Detecting and measuring rater effects using many-facet Rasch measurement: Part 2. In: Journal of Applied Measurement, Vol. 5(2), pp. 189–227.
Myford, C. M./Marr, D. B./Linacre, J. M. (1996): Reader Calibration and its Potential Role in Equating for the Test of Written English. – Princeton.
Programme for educational research and development (2007): G-String II. A Windows Wrapper for urGENOVA. – Hamilton.
Rasch, G. (1960): Probabilistic Models for Some Intelligence and Attainment Tests. – Copenhagen.
Rost, J. (2004): Lehrbuch Testtheorie – Testkonstruktion. – Bern.
Saal, F. E./Downey, R. G./Lahey, M. A. (1980): Rating the ratings: Assessing the psychometric quality of rating data. In: Psychological Bulletin, Vol. 88, pp. 413–428.
Shavelson, R. J./Webb, N. M. (1991): Generalizability Theory: A primer. – Newbury Park.
Smith, E. V./Kulikowich, J. M. (2004): An application of generalizability theory and many-facet Rasch measurement using a complex problem-solving skills assessment. In: Educational and Psychological Measurement, Vol. 64, pp. 617–639.
Smith, R. M. (2004): Fit analysis in latent trait measurement models. In: Smith, E. V./Smith, R. M. (Eds.): Introduction to Rasch Measurement: Theory, models and applications. – Maple Grove, pp. 73–92.
Smith, R. M./Schumacker, R. E./Bush, M. J. (1998): Using item mean squares to evaluate fit to the Rasch model. In: Journal of Outcome Measurement, Vol. 2(1), pp. 66–78.
Stahl, J. A. (1994): What does generalizability theory offer that many-facet Rasch measurement cannot duplicate? In: Rasch Measurement Transactions, Vol. 8(1), pp. 342–343.
Webb, N. M./Shavelson, R. J./Haertel, E. H. (2006): Reliability coefficients and generalizability theory. In: Rao, C. R./Sinharay, S. (Eds.): Handbook of Statistics, Vol. 26: Psychometrics. – Amsterdam, pp. 81–124.
Wolfe, E. W./Chiu, C. W. T./Myford, C. M. (1999): The Manifestation of Common Rater Errors in Multi-faceted Rasch Analyses. – Princeton.
Wright, B. D. (1993): Logits? In: Rasch Measurement Transactions, Vol. 7(2), p. 288.
Wright, B. D./Linacre, J. M. (1994): Reasonable mean-square fit values. In: Rasch Measurement Transactions, Vol. 8(3), p. 370.
Wright, B. D./Masters, G. N. (1982): Rating Scale Analysis: Rasch measurement. – Chicago.
Wright, B. D./Masters, G. N. (2002): Number of person or item strata. In: Rasch Measurement Transactions, Vol. 16(3), p. 888.
Wright, B. D./Stone, M. (1979): Best Test Design. – Chicago.
Wright, B. D./Stone, M. (1999): Measurement Essentials. – Wilmington.
Wu, M. L./Adams, R. J./Wilson, M. R. (1998): ACER ConQuest. Generalised item response modelling software. – Melbourne.

Author information

Authors and Affiliations

Freie und Hansestadt Hamburg, Institut für Bildungsmonitoring, Schulinspektion, Beltgens Garten 25, D-20537 Hamburg, Germany
Marcus Pietsch & Simone Tosana

Corresponding author

Correspondence to Marcus Pietsch.

About this article

Cite this article

Pietsch, M., Tosana, S. Beurteilereffekte bei der Messung von Unterrichtsqualität. ZfE 11, 430–452 (2008). https://doi.org/10.1007/s11618-008-0021-7

Published: 19 November 2008
Issue Date: November 2008
DOI: https://doi.org/10.1007/s11618-008-0021-7
