Uncovering predictors of disagreement: ensuring the quality of expert ratings

  • Original Article

Abstract

Rating scales are a popular item format used in many types of assessments. Yet, defining which rating is correct often represents a challenge. Using expert ratings as benchmarks is one approach to ensuring the quality of a rating instrument. In this paper, such expert ratings are analyzed in detail, taking as an example a video-based test instrument of teachers’ professional competencies from a follow-up study to TEDS-M (the so-called TEDS-FU study). The paper focuses on those items that did not reach sufficient consensus among the experts and analyzes their features in depth by coding the experts’ comments on these items and additionally considering their rating outcomes. The results reveal that the wording of the items and the composition of the group of experts strengthened or weakened agreement among the experts.

Notes

  1. A cut-off of at least 60% expert agreement was chosen by the research team, considering the empirical outcome of the first expert rating and the impact of the experts’ comments (an illustrative computation of this agreement measure follows these notes).

  2. The number of explanatory notes varied per item. Some items were not commented on at all, while other items received up to seven comments (1.3 comments on average).

  3. For this specific hypothesis, the Jonckheere-Terpstra test (Jonckheere 1954) was used to test for significant differences between the two expert groups. No significant difference was found; however, given the very small sample of experts, a qualitative approach to interpreting the present results may be beneficial and give first indications about the assumption presented above (a sketch of the test follows these notes).

  4. Significant difference, p < 0.05, using a two-tailed t-test (Rasch et al. 2010); all experts rated this item (22 experts). A t-test sketch follows these notes.

  5. Significant difference, p < 0.05, using a two-tailed t-test; all experts rated this item (22 experts).

  6. Significant difference, p < 0.05, using a two-tailed t-test; all experts rated this item (22 experts).

  7. Significant difference, p < 0.05, using a two-tailed t-test; all experts rated this item (22 experts).

  8. Significant difference, p < 0.05, using a two-tailed t-test; all school-based teacher educators and all but one university expert rated this item (21 experts).

  9. An example of a low-inference item is the first rating-scale item in Fig. 2; the second item in Fig. 2 displays a high-inference item example.
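
Illustrative sketch for note 1. The code below computes, for each item, the share of experts who chose the modal rating and compares it against the 60% cut-off. It is a minimal sketch in Python; the item names and ratings are invented for illustration and are not the study's data.

    from collections import Counter

    def expert_agreement(ratings):
        """Share of experts who chose the most frequent (modal) rating."""
        counts = Counter(ratings)
        return counts.most_common(1)[0][1] / len(ratings)

    # Hypothetical ratings by 22 experts on a 4-point scale for two items.
    item_ratings = {
        "item_01": [3, 3, 3, 4, 3, 3, 2, 3, 3, 3, 4, 3, 3, 3, 3, 2, 3, 3, 3, 3, 4, 3],
        "item_02": [1, 2, 3, 4, 2, 1, 3, 2, 4, 1, 2, 3, 1, 4, 2, 3, 1, 2, 4, 3, 2, 1],
    }

    CUTOFF = 0.60  # at least 60% of experts must agree (note 1)
    for item, ratings in item_ratings.items():
        share = expert_agreement(ratings)
        verdict = "sufficient" if share >= CUTOFF else "insufficient"
        print(f"{item}: {share:.0%} agreement -> {verdict} consensus")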
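
Illustrative sketch for note 3. The Jonckheere-Terpstra statistic is implemented here from scratch: over every ordered pair of groups, it counts observation pairs in which the later group's value is larger (ties count one half), and a permutation test supplies the p-value. With only two groups, as in the study, the statistic coincides with the Mann-Whitney U; the ratings below are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def jt_statistic(groups):
        # Sum over ordered group pairs (i < j) of the number of observation
        # pairs (a in group i, b in group j) with b > a; ties contribute 1/2.
        jt = 0.0
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                for a in groups[i]:
                    for b in groups[j]:
                        jt += (b > a) + 0.5 * (b == a)
        return jt

    def jt_permutation_test(groups, n_perm=2000):
        # Two-sided permutation p-value: reshuffle the pooled ratings and
        # compare deviations from the null expectation (N^2 - sum n_i^2) / 4.
        observed = jt_statistic(groups)
        pooled = np.concatenate(groups)
        sizes = [len(g) for g in groups]
        expected = (len(pooled) ** 2 - sum(s * s for s in sizes)) / 4
        extreme = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)
            parts, start = [], 0
            for s in sizes:
                parts.append(pooled[start:start + s])
                start += s
            if abs(jt_statistic(parts) - expected) >= abs(observed - expected):
                extreme += 1
        return observed, extreme / n_perm

    # Hypothetical ratings from the two expert groups of the study design:
    # school-based teacher educators and university-based experts.
    school_based = np.array([2, 3, 3, 2, 4, 3, 2, 3, 3, 2, 3])
    university = np.array([3, 3, 4, 3, 4, 4, 3, 2, 4, 3, 4])

    jt, p = jt_permutation_test([school_based, university])
    print(f"JT = {jt:.1f}, permutation p = {p:.3f}")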
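
Illustrative sketch for notes 4-8. The two-tailed t-test comparing the mean ratings of the two expert groups on a single item can be run with scipy.stats.ttest_ind; the ratings are again invented for illustration.

    from scipy.stats import ttest_ind

    # Hypothetical ratings for one item from the two expert groups (22 experts).
    school_based = [2, 2, 3, 2, 3, 2, 2, 3, 2, 2, 3]
    university = [3, 4, 3, 3, 4, 3, 4, 3, 3, 4, 3]

    t, p = ttest_ind(school_based, university)  # two-tailed by default
    print(f"t = {t:.2f}, p = {p:.3f},",
          "significant at the 5% level" if p < 0.05 else "not significant")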

References

  • Blömeke, S. (2013). Validierung als Aufgabe im Forschungsprogramm „Kompetenzmodellierung und Kompetenzerfassung im Hochschulsektor“. KoKoHs Working Papers, 2. Berlin and Mainz: Humboldt-Universität and Johannes Gutenberg-Universität.

  • Blömeke, S., Gustafsson, J.-E., & Shavelson, R. (2015). Beyond dichotomies: competence viewed as a continuum. Zeitschrift für Psychologie, 223, 3–13.

  • Blömeke, S., Hsieh, F.-J., Kaiser, G., & Schmidt, W. (Eds.). (2014a). International perspectives on teacher knowledge, beliefs and opportunities to learn. Dordrecht: Springer.

  • Blömeke, S., König, J., Busse, A., Suhl, U., Benthien, J., Döhrmann, M., & Kaiser, G. (2014b). Von der Lehrerausbildung in den Beruf–Fachbezogenes Wissen als Voraussetzung für Wahrnehmung, Interpretation und Handeln im Unterricht. Zeitschrift für Erziehungswissenschaft, 17(3), 509–542. doi:10.1007/s11618-014-0564-8.

  • Blömeke, S., Suhl, U., & Döhrmann, M. (2013). Assessing strengths and weaknesses of teacher knowledge in Asia, Eastern Europe and Western countries: differential item functioning in TEDS-M. International Journal of Science and Mathematics Education, 11, 795–817.

  • Brown, M. B., & Forsythe, A. B. (1974). Robust tests for the equality of variances. Journal of the American Statistical Association, 69(346), 364–367.

  • Bühner, M. (2006). Einführung in die Test- und Fragebogenkonstruktion. München: Pearson Studium.

  • Clausen, M., Reusser, K., & Klieme, E. (2003). Unterrichtsqualität auf der Basis hoch-inferenter Unterrichtsbeurteilungen. Unterrichtswissenschaft, 31(2), 122–141.

  • Fabrigar, L. R., & Krosnick, J. A. (1995). Attitude measurement and questionnaire design. In A. S. R. Manstead & M. Hewstone (Eds.), Blackwell encyclopedia of social psychology. Oxford: Blackwell Publishers.

  • Fowler, F. J. (1992). How unclear terms can affect survey data. Public Opinion Quarterly, 56, 218–231.

  • Häder, M. (2009). Delphi-Befragungen–Ein Arbeitsbuch. Wiesbaden: VS Verlag für Sozialwissenschaften.

  • Hartig, J., Frey, A., & Jude, N. (2012). Validität. In H. Moosbrugger & A. Kelava (Eds.), Testtheorie und Fragebogenkonstruktion (pp. 143–171). Heidelberg: Springer-Verlag.

  • Helmke, A. (2009). Unterrichtsqualität und Lehrerprofessionalität. Diagnose, Evaluation und Verbesserung des Unterrichts. Seelze-Velber: Klett-Kallmeyer.

  • Holleman, B. (1999). Wording effects in survey research: using meta-analysis to explain the forbid/allow asymmetry. Journal of Quantitative Linguistics, 6, 29–40.

  • Jenßen, L., Dunekacke, S., & Blömeke, S. (2015). Qualitätssicherung in der Kompetenzforschung: Standards für den Nachweis von Validität in Testentwicklung und Veröffentlichungspraxis. Zeitschrift für Pädagogik, 61(supplementary issue), 11–31.

  • Jonckheere, A. R. (1954). A test of significance for the relation between m rankings and k ranked categories. British Journal of Statistical Psychology, 7(2), 93–100.

  • Kaiser, G., Busse, A., Hoth, J., König, J., & Blömeke, S. (2015). About the complexities of video-based assessments: theoretical and methodological approaches to overcoming shortcomings of research on teachers’ competence. International Journal of Science and Mathematics Education, 13(2), 369–387.

  • Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.

  • Klieme, E., Pauli, C., & Reusser, K. (Eds.). (2005). Dokumentation der Erhebungs- und Auswertungsinstrumente zur schweizerisch-deutschen Videostudie Unterrichtsqualität, Lernverhalten und mathematisches Verständnis. Frankfurt: GFPF.

  • [KMK] Sekretariat der Ständigen Konferenz der Kultusminister der Länder in der Bundesrepublik Deutschland (2004). Bildungsstandards im Fach Mathematik für den Primarbereich: Beschluss der Kultusministerkonferenz vom 15.10.2004. http://www.kmk.org/fileadmin/veroeffentlichungen_beschluesse/2004/2004_10_15-Bildungsstandards-Mathe-Primar.pdf. Accessed 22 July 2015.

  • König, J., Blömeke, S., Klein, P., Suhl, U., Busse, A., & Kaiser, G. (2014). Is teachers’ general pedagogical knowledge a premise for noticing and interpreting classroom situations? A video-based assessment approach. Teaching and Teacher Education, 38, 76–88.

  • Krosnick, J. A., & Fabrigar, L. R. (1997). Designing rating scales for effective measurement in surveys. In L. Lyberg, P. Biemer, M. Collins, L. Decker, E. DeLeeuw, C. Dippo, N. Schwarz, & D. Trewin (Eds.), Survey measurement and process quality. New York: Wiley-Interscience.

  • Lam, C. T., & Kolic, M. (2008). Effects of semantic incompatibility on rating response. Applied Psychological Measurement, 32(3), 248–260.

  • Markus, K. A., & Smith, K. M. (2010). Content validity. In N. Salkind (Ed.), Encyclopedia of research design (pp. 239–244). Thousand Oaks, CA: SAGE Publications.

  • Messick, S. (1989). Meaning and values in test validation: the science and ethics of assessment. Educational Researcher, 18(2), 5–11.

  • Moosbrugger, H., & Kelava, A. (2007). Testtheorie und Fragebogenkonstruktion. Berlin, Heidelberg: Springer.

  • Pauli, C., & Reusser, K. (2006). Von international vergleichenden Video Surveys zur videobasierten Unterrichtsforschung und -entwicklung. Zeitschrift für Pädagogik, 52(6), 774–797.

  • Rasch, B., Hofmann, W., Friese, M., & Naumann, E. (2010). Quantitative Methoden: Band 1: Einführung in die Statistik für Psychologen und Sozialwissenschaftler. Berlin, Heidelberg: Springer-Verlag.

  • Rheinberg, F. (2006). Motivation. Stuttgart: Kohlhammer.

  • Robinson, J. P., Shaver, P. R., & Wrightsman, L. S. (1999). Measures of political attitudes. San Diego, CA: Academic Press.

  • Rohrmann, B. (1987). Empirische Studien zur Entwicklung von Antwortskalen für die sozialwissenschaftliche Forschung. Zeitschrift für Sozialpsychologie, 9(3), 222–245.

  • Schwarz, N. (1999). Self-reports: how the questions shape the answers. American Psychologist, 54, 93–105.

  • Seidel, T., & Prenzel, M. (2007). Wie Lehrpersonen Unterricht wahrnehmen und einschätzen–Erfassung pädagogisch-psychologischer Kompetenzen mit Videosequenzen. In M. Prenzel, I. Gogolin, & H.-H. Krüger (Eds.), Kompetenzdiagnostik-Zeitschrift für Erziehungswissenschaft, special issue 8 (pp. 201–216). Wiesbaden: VS Verlag für Sozialwissenschaften.

  • Spector, P. E. (1992). Summated rating scale construction: An introduction. Newbury Park, CA: Sage.

  • Strauss, A., & Corbin, J. (1991). Basics of qualitative research–Grounded theory procedures and techniques. Newbury Park: Sage Publications.

Author information

Correspondence to Jessica Hoth.

About this article

Cite this article

Hoth, J., Schwarz, B., Kaiser, G. et al. Uncovering predictors of disagreement: ensuring the quality of expert ratings. ZDM Mathematics Education 48, 83–95 (2016). https://doi.org/10.1007/s11858-016-0758-z
