Quality & Quantity, Volume 43, Issue 2, pp 197–209

A meta-validation model for assessing the score-validity of student teaching evaluations

  • Anthony J. Onwuegbuzie
  • Larry G. Daniel
  • Kathleen M. T. Collins
Original Paper


Virtually every institution of higher education in the US uses some type of student teaching evaluation (STE) instrument as a means of assessing instructors’ instructional performance in courses. Unfortunately, many administrators and faculty misinterpret STE ratings. Therefore, the present article provides a comprehensive critique of STE instruments. In particular, we build on Messick’s (1989, 1995) conceptualization of validity to yield what we refer to as a meta-validity model that subdivides content-, criterion-, and construct-related validity into several areas of evidence. We use our meta-validity model to conduct a meta-validity analysis of STEs. Specifically, we assess the score-validity of STEs based on findings from the extant literature. We conclude that strong evidence has been provided with respect to areas of criterion-related validity; however, for the most part, weak or inadequate evidence has been provided with regard to areas of both content-related and construct-related validity. This seriously calls into question both the score-validity and utility of STEs.


Teaching evaluations; Formative evaluation; Summative evaluation; Meta-validity model; Meta-validity analysis




  1. Aleamoni L.M. (1981). The use of student evaluations in the improvement of instruction. NACTA. J. 20: 16
  2. Alreck P.L. and Settle R.B. (1985). The Survey Research Handbook. Irwin, Homewood, IL
  3. Ambady N. and Rosenthal R. (1992). Half a minute. Predicting teacher evaluations from thin slices of nonverbal behavior and physical attractiveness. J. Pers. Soc. Psychol. 64: 431–441
  4. American Educational Research Association, American Psychological Association, National Council on Measurement and Evaluation (1985). Standards for Educational and Psychological Testing. American Psychological Association, Washington
  5. American Educational Research Association, American Psychological Association, National Council on Measurement in Education (1999). Standards for Educational and Psychological Testing (Rev edn). American Educational Research Association, Washington
  6. Babad E. (2001). Students’ course selection: differential considerations for first and last course. Res. High. Educ. 42: 469–492
  7. Blackburn R.T. and Clark M.J. (1975). An assessment of faculty performance. Some correlates between administrators, colleagues, students and self-ratings. Sociol. Educ. 48: 242–256
  8. Braskamp L.A., Brandenberg D.C. and Ory J.C. (1984). Evaluating teaching effectiveness: a practical guide. Sage, Beverly Hills, CA
  9. Campbell D.T. and Fiske D.W. (1959). Convergent and discriminant validation by the multitrait–multimethod matrix. Psychol. Bull. 56: 81–105
  10. Centra J.A. (1974). The relationship between student and alumni ratings of teachers. Educ. Psychol. Meas. 34: 321–326
  11. Centra J.A. (1976). The influence of different directions on student ratings of instruction. J. Educ. Meas. 13: 277–282
  12. Centra J.A. (1993). Reflective faculty evaluation. Jossey-Bass, San Francisco, CA
  13. Centra J.A. and Creech F.R. (1976). The relationship between student teachers and course characteristics and student ratings of teacher effectiveness. Project Report 76–1. Educational Testing Service, Princeton, NJ
  14. Cohen P.A. (1980). Effectiveness of student-rating feedback for improving college instruction: a meta-analysis of findings. Res. High. Educ. 13: 321–341
  15. Cohen P.A. (1981). Student ratings of instruction and student achievement: a meta-analysis of multisection validity studies. Rev. Educ. Res. 51: 281–309
  16. Crocker L. and Algina J. (1986). Introduction to classical and modern test theory. Holt, Rinehart, & Winston, Orlando, FL
  17. D’Apollonia S. and Abrami P.C. (1997). Navigating student ratings of instruction. Am. Psychol. 52(11): 1198–1208
  18. Dommeyer C.J., Baum P., Chapman K.S. and Hanna R.W. (2002). Attitudes of business faculty towards two methods of collecting teaching evaluations. Paper vs. online. Assess. Eval. High. Educ. 27: 455–462
  19. Doyle K.O. and Crichton L.A. (1978). Student, peer and self-evaluations of college instruction. J. Educ. Psychol. 70: 815–826
  20. Feldman K.A. (1978). Course characteristics and college students’ ratings of their teachers and courses: what we know and what we don’t. Res. High. Educ. 9: 199–242
  21. Feldman K.A. (1979). The significance of circumstances for college students’ ratings of their teachers and courses: a review and analysis. Res. High. Educ. 10: 149–172
  22. Feldman K.A. (1984). Class size and college students’ evaluations of teachers and courses: a closer look. Res. High. Educ. 21: 45–116
  23. Feldman K.A. (1989). The association between student ratings of specific instructional dimensions and student achievement: refining and extending the synthesis of data from multisection validity studies. Res. High. Educ. 30: 583–645
  24. Fernald P.S. (1990). Students’ ratings of instruction: standardized and customized. Teach. Psychol. 17: 105–109
  25. Gray M. and Bergmann B.R. (2003). Student teaching evaluations: inaccurate, demeaning, misused. Academe 89(5): 44–46
  26. Greenwald A.G. and Gillmore G.M. (1997). Grading leniency is a removable contaminant of student ratings. Am. Psychol. 52: 1209–1217
  27. Guthrie E.R. (1954). The evaluation of teaching: a progress report. University of Washington, Seattle, WA
  28. Haskell R.E. (1997). Academic freedom, tenure, and student evaluation of faculty: galloping polls in the 21st century. Educ. Policy Anal. Arch. 5(6). Retrieved February 3, 2005, from
  29. Howard G.S., Conway C.G. and Maxwell S.E. (1985). Construct validity of measures of college teaching effectiveness. J. Educ. Psychol. 77: 187–196
  30. Kierstead D.P., D’Agostino P. and Dill H. (1988). Sex role stereotyping of college professors: bias in students’ ratings of instructors. J. Educ. Psychol. 80: 342–344
  31. Kulik J.A. (2001). Student ratings: validity, utility and controversy. New Dir. Inst. Res. 109: 9–25
  32. Lombardo J.P. and Tocci M.E. (1979). Attribution of positive and negative characteristics of instructors as a function of attractiveness and sex of instructor and sex of subject. Percept. Mot. Skills 48: 491–494
  33. Marsh H.W. (1984). Students’ evaluations of university teaching: dimensionality, reliability, validity, potential biases and utility. J. Educ. Psychol. 76: 707–754
  34. Marsh H.W. (1987). Students’ evaluations of university teaching: research findings, methodological issues and directions for future research. Int. J. Educ. Res. 11: 253–388
  35. Marsh H.W. and Bailey M. (1993). Multidimensional students’ evaluations of teaching effectiveness. A profile analysis. J. High. Educ. 64: 1–18
  36. Marsh H.W., Overall J.U. and Kessler S.P. (1979). Validity of student evaluations of instructional effectiveness: a comparison of faculty self-evaluations and evaluations by their students. J. Educ. Psychol. 71: 149–160
  37. Marsh H.W. and Roche L.A. (1993). The use of students’ evaluations and an individually structured intervention to enhance university teaching. Am. Educ. Res. J. 30: 217–251
  38. Marsh H.W. and Roche L.A. (2000). Effects of grading leniency and low workload in students’ evaluations of teaching: popular myth, bias, validity, or innocent bystanders?. J. Educ. Psychol. 92: 202–228
  39. McCallum L.W. (1984). A meta-analysis of course evaluation data and its use in the tenure decision. Res. High. Educ. 21: 150–158
  40. Messick S. (1989). Validity. In: Linn, R.L. (eds) Educational Measurement, 3rd edn., pp 13–103. Macmillan, Old Tappan, NJ
  41. Messick S. (1995). Validity of psychological assessment: validation of inferences from persons responses and performances as scientific inquiry into score meaning. Am. Psychol. 50: 741–749
  42. Murray H.G. (1983). Low-inference classroom teaching behaviors and student ratings of college teaching effectiveness. J. Educ. Psychol. 71: 856–865
  43. Naftulin D.H., Ware J.E. and Donnelly F.A. (1973). The doctor fox lecture: a paradigm of educational seduction. J. Med. Educ. 48: 630–635
  44. Newport F.J. (1996). Rating teaching in the USA: probing the qualifications of student raters and novice teachers. Assess. Educ. High. Educ. 21(1): 17–21
  45. Onwuegbuzie A.J. and Daniel L.G. (2002). Uses and misuses of the correlation coefficient. Res. Sch. 9(1): 73–90
  46. Onwuegbuzie A.J. and Daniel L.G. (2003). Typology of analytical and interpretational errors in quantitative and qualitative educational research. Curr. Issues Educ. [On-line], 6(2). Available:
  47. Onwuegbuzie A.J. and Weems G.H. (2004). Response categories on rating scales: characteristics of item respondents who frequently utilize midpoint. Res. Sch. 11(1): 51–60
  48. Onwuegbuzie A.J., Witcher W.A.E., Collins K.M.T., Filer J.D., Wiedmaier C.D. and Moore C.W. (2007). Students’ perceptions of characteristics of effective college teachers: a validity study of a teaching evaluation form using a mixed-methods analysis. Am. Educ. Res. J. 44: 113–160
  49. Ory J.C. (2000). Teaching evaluation: past, present and future. New Dir. Teach. Learn. 83: 13–18
  50. Ory J.C., Braskamp L.A. and Pieper D.M. (1980). The congruency of student evaluative information collected by three methods. J. Educ. Psychol. 72: 181–185
  51. Ory J.C. and Ryan K. (2001). How do student ratings measure up to a new validity framework?. New Dir. Ins. Res. 109: 27–44
  52. Overall J.U. and Marsh H.W. (1979). Midterm feedback from students: its relationship to instructional improvement and students’ cognitive and affective outcomes. J. Educ. Psychol. 71: 856–865
  53. Overall J.U. and Marsh H.W. (1980). Students’ evaluations of instruction: a longitudinal study of their stability. J. Educ. Psychol. 72: 321–325
  54. Penny A.R. (2003). Changing the agenda for research into students’ views about university teaching: four shortcomings of SRT research. Teach. High. Educ. 8(3): 399–411
  55. Peterson K. and Kauchak D. (1982). Teacher evaluation: perspectives, practices and promises. Utah University Center for Educational Practice, Salt Lake City, Utah
  56. Rodin M. and Rodin B. (1972). Student evaluations of teachers. Sci. 177: 1164–1166
  57. Schmelkin L.P., Spencer K.J. and Gellman E.S. (1997). Faculty perspectives on course and teacher evaluations. Res. High. Educ. 38(5): 575–592
  58. Seldin P. (1984). Changing practices in faculty evaluation. Jossey-Bass, San Francisco, CA
  59. Seldin P. (1993). The use and abuse of student ratings of professors. Chron. High. Educ. 21: A40
  60. Shapiro E.G. (1990). Effects of instructor and class characteristics on students’ class evaluations. Res. High. Educ. 31: 135–148
  61. Simmons T.L. (1996). Student evaluation of teachers: professional practice or punitive policy?. JALT Test. Eval. N-SIG Newsl. 1(1): 12–16
  62. Spencer K.J. and Schmelkin L.P. (2002). Student perspectives on teaching and its evaluation. Assess. High. Educ. 27: 397–409
  63. Sun A. and Valiga M.J. (1997). Using generalizability theory to assess the reliability of student ratings of academic advising. J. Exp. Educ. 65: 367–379
  64. Theall M. and Franklin J. (2001). Looking for bias in all the wrong places: a search for truth or a witch hunt in student ratings of instruction?. New Dir. Teach. Learn. 109: 45–56
  65. Washburn K. and Thornton J.F. (eds.) (1996). Dumbing down: essays on the strip mining of American culture. W.W. Norton & Company, New York
  66. Weems G.H. and Onwuegbuzie A.J. (2001). The impact of midpoint responses and reverse coding on survey data. Meas. Eval. Couns. Dev. 34: 166–176
  67. Williams W.M. and Ceci S.J. (1997). How’m I doing? Problems with student ratings of instructors and courses. Change 29(5): 13–23

Copyright information

© Springer Science + Business Media B.V. 2007

Authors and Affiliations

  • Anthony J. Onwuegbuzie (1, corresponding author)
  • Larry G. Daniel (2)
  • Kathleen M. T. Collins (3)

  1. Department of Educational Leadership and Counseling, Sam Houston State University, Huntsville, USA
  2. College of Education and Human Services, University of North Florida, Jacksonville, USA
  3. College of Education and Health Professions, Department of Curriculum and Instruction, University of Arkansas at Fayetteville, Fayetteville, USA
