Uncovering predictors of disagreement: ensuring the quality of expert ratings

  • Original Article

Abstract

Rating scales are a popular item format used in many types of assessments. Yet, defining which rating is correct often represents a challenge. Using expert ratings as benchmarks is one approach to ensuring the quality of a rating instrument. In this paper, such expert ratings are analyzed in detail, taking as an example a video-based test instrument of teachers’ professional competencies from a follow-up study to TEDS-M (the so-called TEDS-FU study). The paper focuses on those items that did not reach sufficient consensus among the experts and analyzes their features in depth by coding the experts’ comments on these items and additionally considering their rating outcomes. The results reveal that the wording of the items and the composition of the group of experts strengthened or weakened agreement among the experts.

Notes

  1. A cut-off of at least 60% expert agreement was chosen by the research team, considering the empirical outcome of the first expert rating and the impact of the experts’ comments (an illustrative computation of this agreement measure follows these notes).

  2. The number of explanatory notes varied per item. Some items were not commented on at all, while other items received up to seven comments (1.3 comments on average).

  3. For this specific hypothesis, the Jonckheere-Terpstra test (Jonckheere 1954) was used to test for significant differences between the two expert groups. No significant difference was found; however, given the very small sample of experts, a qualitative approach to interpreting the present results may be beneficial and give first indications about the assumption presented above (a sketch of the test follows these notes).

  4. Significant difference, p < 0.05, using a two-tailed t-test (Rasch et al. 2010); all experts rated this item (22 experts). A t-test sketch follows these notes.

  5. Significant difference, p < 0.05, using a two-tailed t-test; all experts rated this item (22 experts).

  6. Significant difference, p < 0.05, using a two-tailed t-test; all experts rated this item (22 experts).

  7. Significant difference, p < 0.05, using a two-tailed t-test; all experts rated this item (22 experts).

  8. Significant difference, p < 0.05, using a two-tailed t-test; all school-based teacher educators and all but one university expert rated this item (21 experts).

  9. An example of a low-inference item is the first rating-scale item in Fig. 2; the second item in Fig. 2 displays a high-inference item example.
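
Illustrative sketch for note 1. The code below computes, for each item, the share of experts who chose the modal rating and compares it against the 60% cut-off. It is a minimal sketch in Python; the item names and ratings are invented for illustration and are not the study's data.

    from collections import Counter

    def expert_agreement(ratings):
        """Share of experts who chose the most frequent (modal) rating."""
        counts = Counter(ratings)
        return counts.most_common(1)[0][1] / len(ratings)

    # Hypothetical ratings by 22 experts on a 4-point scale for two items.
    item_ratings = {
        "item_01": [3, 3, 3, 4, 3, 3, 2, 3, 3, 3, 4, 3, 3, 3, 3, 2, 3, 3, 3, 3, 4, 3],
        "item_02": [1, 2, 3, 4, 2, 1, 3, 2, 4, 1, 2, 3, 1, 4, 2, 3, 1, 2, 4, 3, 2, 1],
    }

    CUTOFF = 0.60  # at least 60% of experts must agree (note 1)
    for item, ratings in item_ratings.items():
        share = expert_agreement(ratings)
        verdict = "sufficient" if share >= CUTOFF else "insufficient"
        print(f"{item}: {share:.0%} agreement -> {verdict} consensus")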
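
Illustrative sketch for note 3. The Jonckheere-Terpstra statistic is implemented here from scratch: over every ordered pair of groups, it counts observation pairs in which the later group's value is larger (ties count one half), and a permutation test supplies the p-value. With only two groups, as in the study, the statistic coincides with the Mann-Whitney U; the ratings below are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def jt_statistic(groups):
        # Sum over ordered group pairs (i < j) of the number of observation
        # pairs (a in group i, b in group j) with b > a; ties contribute 1/2.
        jt = 0.0
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                for a in groups[i]:
                    for b in groups[j]:
                        jt += (b > a) + 0.5 * (b == a)
        return jt

    def jt_permutation_test(groups, n_perm=2000):
        # Two-sided permutation p-value: reshuffle the pooled ratings and
        # compare deviations from the null expectation (N^2 - sum n_i^2) / 4.
        observed = jt_statistic(groups)
        pooled = np.concatenate(groups)
        sizes = [len(g) for g in groups]
        expected = (len(pooled) ** 2 - sum(s * s for s in sizes)) / 4
        extreme = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)
            parts, start = [], 0
            for s in sizes:
                parts.append(pooled[start:start + s])
                start += s
            if abs(jt_statistic(parts) - expected) >= abs(observed - expected):
                extreme += 1
        return observed, extreme / n_perm

    # Hypothetical ratings from the two expert groups of the study design:
    # school-based teacher educators and university-based experts.
    school_based = np.array([2, 3, 3, 2, 4, 3, 2, 3, 3, 2, 3])
    university = np.array([3, 3, 4, 3, 4, 4, 3, 2, 4, 3, 4])

    jt, p = jt_permutation_test([school_based, university])
    print(f"JT = {jt:.1f}, permutation p = {p:.3f}")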
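
Illustrative sketch for notes 4-8. The two-tailed t-test comparing the mean ratings of the two expert groups on a single item can be run with scipy.stats.ttest_ind; the ratings are again invented for illustration.

    from scipy.stats import ttest_ind

    # Hypothetical ratings for one item from the two expert groups (22 experts).
    school_based = [2, 2, 3, 2, 3, 2, 2, 3, 2, 2, 3]
    university = [3, 4, 3, 3, 4, 3, 4, 3, 3, 4, 3]

    t, p = ttest_ind(school_based, university)  # two-tailed by default
    print(f"t = {t:.2f}, p = {p:.3f},",
          "significant at the 5% level" if p < 0.05 else "not significant")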

References

  • Blömeke, S. (2013). Validierung als Aufgabe im Forschungsprogramm „Kompetenzmodellierung und Kompetenzerfassung im Hochschulsektor“. KoKoHs Working Papers, 2. Berlin and Mainz: Humboldt-Universität and Johannes Gutenberg-Universität.

  • Blömeke, S., Gustafsson, J.-E., & Shavelson, R. (2015). Beyond dichotomies: competence viewed as a continuum. Zeitschrift für Psychologie, 223, 3–13.

  • Blömeke, S., Hsieh, F.-J., Kaiser, G., & Schmidt, W. (Eds.). (2014a). International perspectives on teacher knowledge, beliefs and opportunities to learn. Dordrecht: Springer.

  • Blömeke, S., König, J., Busse, A., Suhl, U., Benthien, J., Döhrmann, M., & Kaiser, G. (2014b). Von der Lehrerausbildung in den Beruf–Fachbezogenes Wissen als Voraussetzung für Wahrnehmung, Interpretation und Handeln im Unterricht. Zeitschrift für Erziehungswissenschaft, 17(3), 509–542. doi:10.1007/s11618-014-0564-8.

  • Blömeke, S., Suhl, U., & Döhrmann, M. (2013). Assessing strengths and weaknesses of teacher knowledge in Asia, Eastern Europe and Western countries: differential item functioning in TEDS-M. International Journal of Science and Mathematics Education, 11, 795–817.

  • Brown, M. B., & Forsythe, A. B. (1974). Robust tests for the equality of variances. Journal of the American Statistical Association, 69(346), 364–367.

  • Bühner, M. (2006). Einführung in die Test- und Fragebogenkonstruktion. München: Pearson Studium.

  • Clausen, M., Reusser, K., & Klieme, E. (2003). Unterrichtsqualität auf der Basis hoch-inferenter Unterrichtsbeurteilungen. Unterrichtswissenschaft, 31(2), 122–141.

  • Fabrigar, L. R., & Krosnick, J. A. (1995). Attitude measurement and questionnaire design. In A. S. R. Manstead & M. Hewstone (Eds.), Blackwell encyclopedia of social psychology. Oxford: Blackwell Publishers.

  • Fowler, F. J. (1992). How unclear terms can affect survey data. Public Opinion Quarterly, 56, 218–231.

  • Häder, M. (2009). Delphi-Befragungen–Ein Arbeitsbuch. Wiesbaden: VS Verlag für Sozialwissenschaften.

  • Hartig, J., Frey, A., & Jude, N. (2012). Validität. In H. Moosbrugger & A. Kelava (Eds.), Testtheorie und Fragebogenkonstruktion (pp. 143–171). Heidelberg: Springer-Verlag.

  • Helmke, A. (2009). Unterrichtsqualität und Lehrerprofessionalität. Diagnose, Evaluation und Verbesserung des Unterrichts. Seelze-Velber: Klett-Kallmeyer.

  • Holleman, B. (1999). Wording effects in survey research: using meta-analysis to explain the forbid/allow asymmetry. Journal of Quantitative Linguistics, 6, 29–40.

  • Jenßen, L., Dunekacke, S., & Blömeke, S. (2015). Qualitätssicherung in der Kompetenzforschung: Standards für den Nachweis von Validität in Testentwicklung und Veröffentlichungspraxis. Zeitschrift für Pädagogik, 61(supplementary issue), 11–31.

  • Jonckheere, A. R. (1954). A test of significance for the relation between m rankings and k ranked categories. British Journal of Statistical Psychology, 7(2), 93–100.

  • Kaiser, G., Busse, A., Hoth, J., König, J., & Blömeke, S. (2015). About the complexities of video-based assessments: theoretical and methodological approaches to overcoming shortcomings of research on teachers’ competence. International Journal of Science and Mathematics Education, 13(2), 369–387.

  • Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.

  • Klieme, E., Pauli, C., & Reusser, K. (Eds.). (2005). Dokumentation der Erhebungs- und Auswertungsinstrumente zur schweizerisch-deutschen Videostudie Unterrichtsqualität, Lernverhalten und mathematisches Verständnis. Frankfurt: GFPF.

  • [KMK] Sekretariat der Ständigen Konferenz der Kultusminister der Länder in der Bundesrepublik Deutschland (2004). Bildungsstandards im Fach Mathematik für den Primarbereich: Beschluss der Kultusministerkonferenz vom 15.10.2004. http://www.kmk.org/fileadmin/veroeffentlichungen_beschluesse/2004/2004_10_15-Bildungsstandards-Mathe-Primar.pdf. Accessed 22 July 2015.

  • König, J., Blömeke, S., Klein, P., Suhl, U., Busse, A., & Kaiser, G. (2014). Is teachers’ general pedagogical knowledge a premise for noticing and interpreting classroom situations? A video-based assessment approach. Teaching and Teacher Education, 38, 76–88.

  • Krosnick, J. A., & Fabrigar, L. R. (1997). Designing rating scales for effective measurement in surveys. In L. Lyberg, P. Biemer, M. Collins, L. Decker, E. DeLeeuw, C. Dippo, N. Schwarz, & D. Trewin (Eds.), Survey measurement and process quality. New York: Wiley-Interscience.

  • Lam, C. T., & Kolic, M. (2008). Effects of semantic incompatibility on rating response. Applied Psychological Measurement, 32(3), 248–260.

  • Markus, K. A., & Smith, K. M. (2010). Content validity. In N. Salkind (Ed.), Encyclopedia of research design (pp. 239–244). Thousand Oaks, CA: SAGE Publications.

  • Messick, S. (1989). Meaning and values in test validation: the science and ethics of assessment. Educational Researcher, 18(2), 5–11.

  • Moosbrugger, H., & Kelava, A. (2007). Testtheorie und Fragebogenkonstruktion. Berlin, Heidelberg: Springer.

  • Pauli, C., & Reusser, K. (2006). Von international vergleichenden Video Surveys zur videobasierten Unterrichtsforschung und -entwicklung. Zeitschrift für Pädagogik, 52(6), 774–797.

  • Rasch, B., Hofmann, W., Friese, M., & Naumann, E. (2010). Quantitative Methoden: Band 1: Einführung in die Statistik für Psychologen und Sozialwissenschaftler. Berlin, Heidelberg: Springer-Verlag.

  • Rheinberg, F. (2006). Motivation. Stuttgart: Kohlhammer.

  • Robinson, J. P., Shaver, P. R., & Wrightsman, L. S. (1999). Measures of political attitudes. San Diego, CA: Academic Press.

  • Rohrmann, B. (1987). Empirische Studien zur Entwicklung von Antwortskalen für die sozialwissenschaftliche Forschung. Zeitschrift für Sozialpsychologie, 9(3), 222–245.

  • Schwarz, N. (1999). Self-reports: how the questions shape the answers. American Psychologist, 54, 93–105.

  • Seidel, T., & Prenzel, M. (2007). Wie Lehrpersonen Unterricht wahrnehmen und einschätzen–Erfassung pädagogisch-psychologischer Kompetenzen mit Videosequenzen. In M. Prenzel, I. Gogolin, & H.-H. Krüger (Eds.), Kompetenzdiagnostik-Zeitschrift für Erziehungswissenschaft, special issue 8 (pp. 201–216). Wiesbaden: VS Verlag für Sozialwissenschaften.

  • Spector, P. E. (1992). Summated rating scale construction: An introduction. Newbury Park, CA: Sage.

  • Strauss, A., & Corbin, J. (1991). Basics of qualitative research–Grounded theory procedures and techniques. Newbury Park: Sage Publications.

Author information

Correspondence to Jessica Hoth.

About this article

Cite this article

Hoth, J., Schwarz, B., Kaiser, G. et al. Uncovering predictors of disagreement: ensuring the quality of expert ratings. ZDM Mathematics Education 48, 83–95 (2016). https://doi.org/10.1007/s11858-016-0758-z
