Abstract
Rating scales are a popular item format in many types of assessments. Yet defining which rating is correct often represents a challenge. Using expert ratings as benchmarks is one approach to ensuring the quality of a rating instrument. In this paper, such expert ratings are analyzed in detail, taking a video-based test instrument of teachers’ professional competencies from a follow-up study to TEDS-M (the so-called TEDS-FU study) as an example. The paper focuses on those items that did not reach sufficient consensus among the experts and analyzes their features in depth by coding the experts’ comments on these items and additionally considering their rating outcomes. The results reveal that the item wording and the composition of the group of experts either strengthened or weakened agreement among the experts.
Notes
A threshold of at least 60% expert agreement was chosen as a cut-off by the research team, considering the empirical outcome of the first expert rating and the impact of the experts’ comments.
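The cut-off decision described above can be illustrated with a minimal sketch: for each item, the share of experts choosing the modal rating is compared against the 60% threshold. All item IDs and ratings below are invented for illustration; the study’s actual rating data and decision procedure are not reproduced here.

```python
# Hypothetical sketch of the 60% cut-off check; data are invented.
from collections import Counter

CUTOFF = 0.60  # minimum share of experts agreeing on the modal rating

ratings = {
    "item_1": ["A", "A", "A", "B", "A"],   # clear majority
    "item_2": ["A", "B", "C", "B", "A"],   # insufficient consensus
}

results = {}
for item, votes in ratings.items():
    # share of experts who chose the most frequent rating
    share = Counter(votes).most_common(1)[0][1] / len(votes)
    results[item] = (share, "keep" if share >= CUTOFF else "review")

for item, (share, status) in results.items():
    print(f"{item}: {share:.0%} agreement -> {status}")
```

Items below the threshold would, as in the study, be flagged for in-depth analysis of the experts’ comments rather than discarded outright.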
The number of explanatory notes varied per item. Some items were not commented on at all, while others received up to seven comments (1.3 comments on average).
For this specific hypothesis, the Jonckheere-Terpstra test (Jonckheere 1954) was used to test for significant differences between the two expert groups. No significant difference was found, but given the very small sample of experts, a qualitative approach to and interpretation of the present results may be beneficial and provide first indications regarding the assumption presented above.
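The Jonckheere-Terpstra statistic referenced in this note can be sketched as follows. It sums, over every ordered pair of groups, the number of cross-group value pairs consistent with the hypothesized ordering (ties counting 0.5). This is only the statistic; the significance test would additionally need its normal approximation or a permutation scheme. The group data below are invented.

```python
# Hedged sketch of the Jonckheere-Terpstra statistic for k ordered groups
# (Jonckheere 1954); sample values are invented for illustration.
from itertools import combinations

def jt_statistic(groups):
    """Sum of pairwise Mann-Whitney counts over ordered group pairs;
    each tie contributes 0.5."""
    jt = 0.0
    for g1, g2 in combinations(groups, 2):  # pairs follow the given order
        for x in g1:
            for y in g2:
                jt += 1.0 if x < y else (0.5 if x == y else 0.0)
    return jt

# e.g. ratings of two expert groups on one item (invented):
print(jt_statistic([[1, 2, 3], [2, 3, 4]]))  # -> 7.0
```

With only two groups, as in the note above, the statistic reduces to the Mann-Whitney U count between the two expert groups.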
Significant difference, p < 0.05, using a two-tailed t-test (Rasch et al. 2010); all experts rated this item (22 experts).
Significant difference, p < 0.05, using a two-tailed t-test; all experts rated this item (22 experts).
Significant difference, p < 0.05, using a two-tailed t-test; all experts rated this item (22 experts).
Significant difference, p < 0.05, using a two-tailed t-test; all experts rated this item (22 experts).
Significant difference, p < 0.05, using a two-tailed t-test; all school-based teacher educators and all but one university expert rated this item (21 experts).
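The two-tailed t-tests in the notes above rest on the pooled two-sample t statistic, which can be sketched as follows. The sample values are invented; a two-tailed p-value would additionally require the t distribution with len(a) + len(b) - 2 degrees of freedom, which the standard library does not provide.

```python
# Minimal sketch of the pooled two-sample t statistic underlying a
# two-tailed t-test between two expert groups; data are invented.
from math import sqrt
from statistics import mean, variance

def t_statistic(a, b):
    """Student's t for two independent samples with pooled variance."""
    na, nb = len(a), len(b)
    # statistics.variance is the sample variance (n - 1 denominator)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))

# e.g. one item's ratings by university experts vs. school-based educators:
t = t_statistic([1, 2, 3], [4, 5, 6])
```

Whether the resulting |t| exceeds the critical value at p < 0.05 would then be checked against a t table or a statistics library.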
References
Blömeke, S. (2013). Validierung als Aufgabe im Forschungsprogramm „Kompetenzmodellierung und Kompetenzerfassung im Hochschulsektor“. KokoHs Working Papers, 2. Berlin and Mainz: Humboldt-Universität and Johannes Gutenberg-Universität.
Blömeke, S., Gustafsson, J.-E., & Shavelson, R. (2015). Beyond dichotomies: competence viewed as a continuum. Zeitschrift für Psychologie, 223, 3–13.
Blömeke, S., Hsieh, F.-J., Kaiser, G., & Schmidt, W. (Eds.). (2014a). International perspectives on teacher knowledge, beliefs and opportunities to learn. Dordrecht: Springer.
Blömeke, S., König, J., Busse, A., Suhl, U., Benthien, J., Döhrmann, M., & Kaiser, G. (2014b). Von der Lehrerausbildung in den Beruf–Fachbezogenes Wissen als Voraussetzung für Wahrnehmung, Interpretation und Handeln im Unterricht. Zeitschrift für Erziehungswissenschaft, 17(3), 509–542. doi:10.1007/s11618-014-0564-8.
Blömeke, S., Suhl, U., & Döhrmann, M. (2013). Assessing strengths and weaknesses of teacher knowledge in Asia, Eastern Europe and Western countries: differential item functioning in TEDS-M. International Journal of Science and Mathematics Education, 11, 795–817.
Brown, M. B., & Forsythe, A. B. (1974). Robust tests for the equality of variances. Journal of the American Statistical Association, 69(346), 364–367.
Bühner, M. (2006). Einführung in die Test- und Fragebogenkonstruktion. München: Pearson Studium.
Clausen, M., Reusser, K., & Klieme, E. (2003). Unterrichtsqualität auf der Basis hoch-inferenter Unterrichtsbeurteilungen. Unterrichtswissenschaft, 31(2), 122–141.
Fabrigar, L. R., & Krosnick, J. A. (1995). Attitude measurement and questionnaire design. In A. S. R. Manstead & M. Hewstone (Eds.), Blackwell encyclopedia of social psychology. Oxford: Blackwell Publishers.
Fowler, F. J. (1992). How unclear terms can affect survey data. Public Opinion Quarterly, 56, 218–231.
Häder, M. (2009). Delphi-Befragungen–Ein Arbeitsbuch. Wiesbaden: VS Verlag für Sozialwissenschaften.
Hartig, J., Frey, A., & Jude, N. (2012). Validität. In H. Moosbrugger & A. Kelava (Eds.), Testtheorie und Fragebogenkonstruktion (pp. 143–171). Heidelberg: Springer-Verlag.
Helmke, A. (2009). Unterrichtsqualität und Lehrerprofessionalität. Diagnose, Evaluation und Verbesserung des Unterrichts. Seelze-Velber: Klett-Kallmeyer.
Holleman, B. (1999). Wording effects in survey research: using meta-analysis to explain the forbid/allow asymmetry. Journal of Quantitative Linguistics, 6, 29–40.
Jenßen, L., Dunekacke, S., & Blömeke, S. (2015). Qualitätssicherung in der Kompetenzforschung: Standards für den Nachweis von Validität in Testentwicklung und Veröffentlichungspraxis. Zeitschrift für Pädagogik, 61(supplementary issue), 11–31.
Jonckheere, A. R. (1954). A test of significance for the relation between m rankings and k ranked categories. British Journal of Statistical Psychology, 7(2), 93–100.
Kaiser, G., Busse, A., Hoth, J., König, J., & Blömeke, S. (2015). About the complexities of video-based assessments: theoretical and methodological approaches to overcoming shortcomings of research on teachers’ competence. International Journal of Science and Mathematics Education, 13(2), 369–387.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.
Klieme, E., Pauli, C., & Reusser, K. (Eds.). (2005). Dokumentation der Erhebungs- und Auswertungsinstrumente zur schweizerisch-deutschen Videostudie Unterrichtsqualität, Lernverhalten und mathematisches Verständnis. Frankfurt: GFPF.
[KMK] Sekretariat der Ständigen Konferenz der Kultusminister der Länder in der Bundesrepublik Deutschland (2004). Bildungsstandards im Fach Mathematik für den Primarbereich: Beschluss der Kultusministerkonferenz vom 15.10.2004. http://www.kmk.org/fileadmin/veroeffentlichungen_beschluesse/2004/2004_10_15-Bildungsstandards-Mathe-Primar.pdf. Accessed 22 July 2015.
König, J., Blömeke, S., Klein, P., Suhl, U., Busse, A., & Kaiser, G. (2014). Is teachers’ general pedagogical knowledge a premise for noticing and interpreting classroom situations? A video-based assessment approach. Teaching and Teacher Education, 38, 76–88.
Krosnick, J. A., & Fabrigar, L. R. (1997). Designing rating scales for effective measurement in surveys. In L. Lyberg, P. Biemer, M. Collins, L. Decker, E. DeLeeuw, C. Dippo, N. Schwarz, & D. Trewin (Eds.), Survey measurement and process quality. New York: Wiley-Interscience.
Lam, C. T., & Kolic, M. (2008). Effects of semantic incompatibility on rating response. Applied Psychological Measurement, 32(3), 248–260.
Markus, K. A., & Smith, K. M. (2010). Content validity. In N. Salkind (Ed.), Encyclopedia of research design (pp. 239–244). Thousand Oaks, CA: SAGE Publications.
Messick, S. (1989). Meaning and values in test validation: the science and ethics of assessment. Educational Researcher, 18(2), 5–11.
Moosbrugger, H., & Kelava, A. (2007). Testtheorie und Fragebogenkonstruktion. Berlin, Heidelberg: Springer.
Pauli, C., & Reusser, K. (2006). Von international vergleichenden Video Surveys zur videobasierten Unterrichtsforschung und -entwicklung. Zeitschrift für Pädagogik, 52(6), 774–797.
Rasch, B., Hofmann, W., Friese, M., & Naumann, E. (2010). Quantitative Methoden: Band 1: Einführung in die Statistik für Psychologen und Sozialwissenschaftler. Berlin, Heidelberg: Springer-Verlag.
Rheinberg, F. (2006). Motivation. Stuttgart: Kohlhammer.
Robinson, J. P., Shaver, P. R., & Wrightsman, L. S. (1999). Measures of political attitudes. San Diego, CA: Academic Press.
Rohrmann, B. (1987). Empirische Studien zur Entwicklung von Antwortskalen für die sozialwissenschaftliche Forschung. Zeitschrift für Sozialpsychologie, 9(3), 222–245.
Schwarz, N. (1999). Self-reports: how the questions shape the answers. American Psychologist, 54, 93–105.
Seidel, T., & Prenzel, M. (2007). Wie Lehrpersonen Unterricht wahrnehmen und einschätzen–Erfassung pädagogisch-psychologischer Kompetenzen mit Videosequenzen. In M. Prenzel, I. Gogolin, & H.-H. Krüger (Eds.), Kompetenzdiagnostik-Zeitschrift für Erziehungswissenschaft, special issue 8 (pp. 201–216). Wiesbaden: VS Verlag für Sozialwissenschaften.
Spector, P. E. (1992). Summated rating scale construction: An introduction. Newbury Park, CA: Sage.
Strauss, A., & Corbin, J. (1991). Basics of qualitative research–Grounded theory procedures and techniques. Newbury Park: Sage Publications.
Cite this article
Hoth, J., Schwarz, B., Kaiser, G. et al. Uncovering predictors of disagreement: ensuring the quality of expert ratings. ZDM Mathematics Education 48, 83–95 (2016). https://doi.org/10.1007/s11858-016-0758-z