Adams, R. J., Wu, M. L., & Wilson, M. R. (2020). ConQuest: Generalised item response modelling software (4.5.2) [Computer Software]. Australian Council for Educational Research.
Alhija, F. (2017). Guest editor introduction to the special issue “contemporary evaluation of teaching: Challenges and promises”. Studies in Educational Evaluation, 54(Supplement C), 1–3. https://doi.org/10.1016/j.stueduc.2017.02.002.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
American Sociological Association. (2019). Reconsidering student evaluations of teaching. American Sociological Association. Retrieved November 6, 2019, from https://www.asanet.org/press-center/press-releases/reconsidering-student-evaluations-teaching
Ames, A. J., & Penfield, R. D. (2015). An NCME instructional module on item-fit statistics for item response theory models. Educational Measurement: Issues and Practice, 34(3), 39–48. https://doi.org/10.1111/emip.12067.
Andersen, K., & Miller, E. D. (1997). Gender and Student Evaluations of Teaching. PS: Political Science and Politics, 30(2), 216. https://doi.org/10.2307/420499.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. https://doi.org/10.1007/BF02293814.
Arbuckle, J., & Williams, B. D. (2003). Students’ perceptions of expressiveness: Age and gender effects on teacher evaluations. Sex Roles, 49(9–10), 507–516. https://doi.org/10.1023/A:1025832707002.
Basow, S. A., & Martin, J. L. (2012). Bias in student evaluations. In Effective evaluation of teaching: A guide for faculty and administrators (pp. 40–49). Society for the Teaching of Psychology.
Basow, S. A., & Montgomery, S. (2005). Student ratings and professor self-ratings of college teaching: Effects of gender and divisional affiliation. Journal of Personnel Evaluation in Education, 18(2), 91–106. https://doi.org/10.1007/s11092-006-9001-8.
Bassett, J., Cleveland, A., Acorn, D., Nix, M., & Snyder, T. (2017). Are they paying attention? Students’ lack of motivation and attention potentially threaten the utility of course evaluations. Assessment & Evaluation in Higher Education, 42(3), 431–442. https://doi.org/10.1080/02602938.2015.1119801.
Bavishi, A., Madera, J. M., & Hebl, M. R. (2010). The effect of professor ethnicity and gender on student evaluations: Judged before met. Journal of Diversity in Higher Education, 3(4), 245–256. https://doi.org/10.1037/a0020763.
Bertrand, M. (2017). The glass ceiling. Becker Friedman Institute for Research in Economics Working Paper No. 2018-38, https://doi.org/10.2139/ssrn.3191467
Bond, T. G., & Fox, C. M. (2015). Applying the Rasch Model: Fundamental measurement in the human sciences, third edition (3rd ed.). Routledge.
Bonitz, V. S. (2011). Student evaluation of teaching: Individual differences and bias effects. Graduate Theses and Dissertations. 12211. Retrieved November 6, 2019, from https://lib.dr.iastate.edu/etd/12211
Boring, A., Ottoboni, K., & Stark, P. B. (2016). Student evaluations of teaching (mostly) do not measure teaching effectiveness (pp. 1–11). ScienceOpen Research. https://doi.org/10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1.
Boysen, G. A. (2015). Significant interpretation of small mean differences in student evaluations of teaching despite explicit warning to avoid overinterpretation. Scholarship of Teaching and Learning in Psychology, 1(2), 150–162. https://doi.org/10.1037/stl0000017.
Boysen, G. A., Kelly, T. J., Raesly, H. N., & Casner, R. W. (2014). The (mis)interpretation of teaching evaluations by college faculty and administrators. Assessment & Evaluation in Higher Education, 39(6), 641–656. https://doi.org/10.1080/02602938.2013.860950.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach. Springer-Verlag http://www.springer.com/gp/book/9780387953649.
Camilli, G. (2006). Test fairness. In R. L. Brennan (Ed.), Educational measurement (Fourth ed., pp. 221–256). Praeger Publishers.
Camilli, G. (2013). Ongoing issues in test fairness. Educational Research and Evaluation, 19(2–3), 104–120. https://doi.org/10.1080/13803611.2013.767602.
Centra, J. A. (2003). Will teachers receive higher student evaluations by giving higher grades and less course work? Research in Higher Education, 44(5), 495–518. https://doi.org/10.1023/A:1025492407752.
Centra, J. A., & Gaubatz, N. B. (2000). Is there gender bias in student evaluations of teaching? Journal of Higher Education, 71, 17–33.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Routledge.
Cundiff, J. L., Danube, C. L., Zawadzki, M. J., & Shields, S. A. (2018). Testing an intervention for recognizing and reporting subtle gender bias in promotion and tenure decisions. The Journal of Higher Education, 89(5), 611–636. https://doi.org/10.1080/00221546.2018.1437665.
de Ayala, R. J. (2009). The theory and practice of item response theory. Guilford Press.
Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge University Press.
Feldman, K. A. (1993). College students’ views of male and female college teachers: Part II—Evidence from students’ evaluations of their classroom teachers. Research in Higher Education, 34(2), 151–211. https://doi.org/10.1007/BF00992161
Gómez Cama, M., Larrán, M. J., & Andrades Peña, F. J. (2016). Gender differences between faculty members in higher education: A literature review of selected higher education journals. Educational Research Review, 18, 58–69. https://doi.org/10.1016/j.edurev.2016.03.001.
Haladyna, T., & Hess, R. K. (1994). The detection and correction of bias in student ratings of instruction. Research in Higher Education, 35(6), 669–687. https://doi.org/10.1007/BF02497081.
Kline, T. J. B. (2005). Psychological testing: A practical approach to design and evaluation. Sage Publications.
Laird, T. F., Garver, A. K., & Niskodé-Dossett, A. S. (2011). Gender gaps in collegiate teaching style: Variations by course characteristics. Research in Higher Education, 52(3), 261–277. https://doi.org/10.1007/s11162-010-9193-0.
MacNell, L., Driscoll, A., & Hunt, A. N. (2015). What’s in a name: Exposing gender bias in student ratings of teaching. Innovative Higher Education, 40(4), 291–303. https://doi.org/10.1007/s10755-014-9313-4.
Malisch, J. L., Harris, B. N., Sherrer, S. M., Lewis, K. A., Shepherd, S. L., McCarthy, P. C., Spott, J. L., Karam, E. P., Moustaid-Moussa, N., Calarco, J. M., Ramalingam, L., Talley, A. E., Cañas-Carrell, J. E., Ardon-Dryer, K., Weiser, D. A., Bernal, X. E., & Deitloff, J. (2020). Opinion: In the wake of COVID-19, academia needs new solutions to ensure gender equity. Proceedings of the National Academy of Sciences, 117(27), 15378–15381. https://doi.org/10.1073/pnas.2010636117.
Marsh, H. W. (1987). Students’ evaluations of university teaching: Research findings, methodological issues, and directions for future research. International Journal of Educational Research, 11(3), 253–388. https://doi.org/10.1016/0883-0355(87)90001-2.
McClain, L., Gulbis, A., & Hays, D. (2017). Honesty on student evaluations of teaching: Effectiveness, purpose, and timing matter! Assessment & Evaluation in Higher Education, 43, 1–17. https://doi.org/10.1080/02602938.2017.1350828.
McPherson, M. A., & Jewell, R. T. (2007). Leveling the playing field: should student evaluation scores be adjusted? Social Science Quarterly, 88(3), 868–881. https://doi.org/10.1111/j.1540-6237.2007.00487.x.
McPherson, M. A., Jewell, R. T., & Kim, M. (2009). What determines student evaluation scores? A random effects analysis of undergraduate economics classes. Eastern Economic Journal, 35(1), 37–51.
Meyer, J. P., Doromal, J. B., Wei, X., & Zhu, S. (2017). A criterion-referenced approach to student ratings of instruction. Research in Higher Education, 58(5), 545–567. https://doi.org/10.1007/s11162-016-9437-8.
Onwuegbuzie, A. J., Daniel, L. G., & Collins, K. M. T. (2009). A meta-validation model for assessing the score-validity of student teaching evaluations. Quality & Quantity, 43(2), 197–209. https://doi.org/10.1007/s11135-007-9112-4.
Osteen, P. (2010). An introduction to using multidimensional item response theory to assess latent factor structures. Journal of the Society for Social Work and Research, 1(2), 66–82. https://doi.org/10.5243/jsswr.2010.6.
Rivera, L. A., & Tilcsik, A. (2019). Scaling down inequality: Rating scales, gender bias, and the architecture of evaluation. American Sociological Review, 84(2), 248–274. https://doi.org/10.1177/0003122419833601.
Schneider, B., Carnoy, M., Kilpatrick, J., Schmidt, W. H., & Shavelson, R. J. (2007). Estimating causal effects using experimental and observational designs. American Educational Research Association.
Setari, A. P., Lee, J., & Bradley, K. D. (2016). A psychometric approach to the validation of a student evaluation of teaching instrument. Studies in Educational Evaluation, 51, 77–87. https://doi.org/10.1016/j.stueduc.2016.09.006.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin.
Smith, S. W., Yoo, J. H., Farr, A. C., Salmon, C. T., & Miller, V. D. (2007). The influence of student sex and instructor sex on student ratings of instructors: Results from a college of communication. Women's Studies in Communication, 30(1), 64–77. https://doi.org/10.1080/07491409.2007.10162505.
Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation of teaching the state of the art. Review of Educational Research, 83(4), 598–642. https://doi.org/10.3102/0034654313496870.
Stark, P., & Freishtat, R. (2014). An evaluation of course evaluations. ScienceOpen Research https://www.scienceopen.com/document/id/ad8a9ac9-8c60-432a-ba20-4402a2a38df4.
Theall, M., & Franklin, J. (2001). Looking for bias in all the wrong places: A search for truth or a witch hunt in student ratings of instruction? New Directions for Institutional Research, 2001(109), 45–56. https://doi.org/10.1002/ir.3.
Uttl, B., White, C. A., & Gonzalez, D. W. (2017). Meta-analysis of faculty’s teaching effectiveness: Student evaluation of teaching ratings and student learning are not related. Studies in Educational Evaluation, 54, 22–42. https://doi.org/10.1016/j.stueduc.2016.08.007.
Van Zile-Tamsen, C. (2017). Using Rasch analysis to inform rating scale development. Research in Higher Education, 58(8), 922–933. https://doi.org/10.1007/s11162-017-9448-0.
Valencia, E. (2020). Acquiescence, instructor’s gender bias and validity of student evaluation of teaching. Assessment & Evaluation in Higher Education, 45(4), 483–495. https://doi.org/10.1080/02602938.2019.1666085.
Viswanathan, M. (2005). Measurement Error and Research Design. SAGE Publications Inc.
Wachtel, H. K. (1998). Student evaluation of college teaching effectiveness: A brief review. Assessment & Evaluation in Higher Education, 23(2), 191–212. https://doi.org/10.1080/0260293980230207.
Weisshaar, K. (2017). Publish and perish? An assessment of gender gaps in promotion to tenure in academia. Social Forces, 96(2), 529–560. https://doi.org/10.1093/sf/sox052.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Mesa Press.
Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ACER ConQuest Version 2.0: Generalised Item Response Modelling Software. ACER Press.
Zipser, N., & Mincieli, L. (2018). Administrative and structural changes in student evaluations of teaching and their effects on overall instructor scores. Assessment & Evaluation in Higher Education, 43(6), 995–1008. https://doi.org/10.1080/02602938.2018.1425368.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Directorate of Human Resources Research and Evaluation, Department of National Defense. Retrieved November 6, 2019, from http://faculty.educ.ubc.ca/zumbo/DIF/handbook.pdf
Anderson, K. J., & Smith, G. (2005). Students’ Preconceptions of Professors: Benefits and Barriers According to Ethnicity and Gender. Hispanic Journal of Behavioral Sciences, 27(2), 184–201. https://doi.org/10.1177/0739986304273707.
Aleamoni, L. M., & Hexner, P. Z. (1980). A review of the research on student evaluation and a report on the effect of different sets of instructions on student course and instructor evaluation. Instructional Science, 9(1), 67–84. https://doi.org/10.1007/BF00118969.
Educational Testing Service. (2016). ETS international principles for the fairness of assessments. Princeton, NJ: Author; Berliner, 2005.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage Publications, Inc.
Marsh, H. W., & Roche, L. A. (1997). Making students’ evaluations of teaching effectiveness effective: The critical issues of validity,bias, and utility. American Psychologist, 52(11), 1187–1197. https://doi.org/10.1037/0003-066X.52.11.1187.
Mengel, F., Sauermann, J., & Zölitz, U. (2019). Gender Bias in Teaching Evaluations. Journal of the European Economic Association, 17(2), 535–566. https://doi.org/10.1093/jeea/jvx057.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741.
Ory, J. C. (2001). Faculty Thoughts and Concerns About Student Ratings. New Directions for Teaching and Learning, 2001(87), 3–15. https://doi.org/10.1002/tl.23; American Sociological Association. (2019, September 9). Reconsidering Student Evaluations of Teaching. American Sociological Association. https://www.asanet.org/presscenter/press-releases/reconsidering-student-evaluations-teaching
Penny, A. R. (2003). Changing the Agenda for Research into Students’ Views about University Teaching: Four shortcomings of SRT research. Teaching in Higher Education, 8(3), 399–411. https://doi.org/10.1080/13562510309396.
Spector, P. E. (1992). Summated rating scale construction: An introduction. Newbury Park, CA: Sage Publications.
Traub, R. E. (1997). Classical Test Theory in Historical Perspective. Educational Measurement: Issues and Practice, 16(4), 8–14. https://doi.org/10.1111/j.1745-3992.1997.tb00603.x.
Wagner, N., Rieger, M., & Voorvelt, K. (2016). Gender, ethnicity and teaching evaluations: Evidence from mixed teaching teams. Economics of Education Review, 54, 79–94. https://doi.org/10.1016/j.econedurev.2016.06.004.
Shavelson, R. J., & Noreen, W. (2006). Generalizability Theory. In J. L. Green, G. Camilli, & P. B. Elmore (Eds.), Handbook of Complementary Methods in Education Research (pp. 309–322). Washington DC: Lawrence Elbraum Associates, Inc.
Mitchell, K. M. W., & Martin, J. (2018). Gender Bias in Student Evaluations. PS: Political Science & Politics, 51(03), 648–652. https://doi.org/10.1017/S104909651800001X.