Adams, R. J., Wu, M. L., & Wilson, M. R. (2020). ConQuest: Generalised item response modelling software (4.5.2) [Computer Software]. Australian Council for Educational Research.
Alhija, F. (2017). Guest editor introduction to the special issue “contemporary evaluation of teaching: Challenges and promises”. Studies in Educational Evaluation, 54(Supplement C), 1–3. https://doi.org/10.1016/j.stueduc.2017.02.002.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
American Sociological Association. (2019). Reconsidering student evaluations of teaching. American Sociological Association. Retrieved November 6, 2019, from https://www.asanet.org/press-center/press-releases/reconsidering-student-evaluations-teaching
Ames, A. J., & Penfield, R. D. (2015). An NCME instructional module on item-fit statistics for item response theory models. Educational Measurement: Issues and Practice, 34(3), 39–48. https://doi.org/10.1111/emip.12067.
Andersen, K., & Miller, E. D. (1997). Gender and Student Evaluations of Teaching. PS: Political Science and Politics, 30(2), 216. https://doi.org/10.2307/420499.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. https://doi.org/10.1007/BF02293814.
Arbuckle, J., & Williams, B. D. (2003). Students’ perceptions of expressiveness: Age and gender effects on teacher evaluations. Sex Roles, 49(9–10), 507–516. https://doi.org/10.1023/A:1025832707002.
Basow, S. A., & Martin, J. L. (2012). Bias in student evaluations. In Effective evaluation of teaching: A guide for faculty and administrators (pp. 40–49). Society for the Teaching of Psychology.
Basow, S. A., & Montgomery, S. (2005). Student ratings and professor self-ratings of college teaching: Effects of gender and divisional affiliation. Journal of Personnel Evaluation in Education, 18(2), 91–106. https://doi.org/10.1007/s11092-006-9001-8.
Bassett, J., Cleveland, A., Acorn, D., Nix, M., & Snyder, T. (2017). Are they paying attention? Students’ lack of motivation and attention potentially threaten the utility of course evaluations. Assessment & Evaluation in Higher Education, 42(3), 431–442. https://doi.org/10.1080/02602938.2015.1119801.
Bavishi, A., Madera, J. M., & Hebl, M. R. (2010). The effect of professor ethnicity and gender on student evaluations: Judged before met. Journal of Diversity in Higher Education, 3(4), 245–256. https://doi.org/10.1037/a0020763.
Bertrand, M. (2017). The glass ceiling. Becker Friedman Institute for Research in Economics Working Paper No. 2018-38. https://doi.org/10.2139/ssrn.3191467
Bond, T. G., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences (3rd ed.). Routledge.
Bonitz, V. S. (2011). Student evaluation of teaching: Individual differences and bias effects. Graduate Theses and Dissertations. 12211. Retrieved November 6, 2019, from https://lib.dr.iastate.edu/etd/12211
Boring, A., Ottoboni, K., & Stark, P. B. (2016). Student evaluations of teaching (mostly) do not measure teaching effectiveness (pp. 1–11). ScienceOpen Research. https://doi.org/10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1.
Boysen, G. A. (2015). Significant interpretation of small mean differences in student evaluations of teaching despite explicit warning to avoid overinterpretation. Scholarship of Teaching and Learning in Psychology, 1(2), 150–162. https://doi.org/10.1037/stl0000017.
Boysen, G. A., Kelly, T. J., Raesly, H. N., & Casner, R. W. (2014). The (mis)interpretation of teaching evaluations by college faculty and administrators. Assessment & Evaluation in Higher Education, 39(6), 641–656. https://doi.org/10.1080/02602938.2013.860950.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach. Springer-Verlag. http://www.springer.com/gp/book/9780387953649
Camilli, G. (2006). Test fairness. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 221–256). Praeger Publishers.
Camilli, G. (2013). Ongoing issues in test fairness. Educational Research and Evaluation, 19(2–3), 104–120. https://doi.org/10.1080/13803611.2013.767602.
Centra, J. A. (2003). Will teachers receive higher student evaluations by giving higher grades and less course work? Research in Higher Education, 44(5), 495–518. https://doi.org/10.1023/A:1025492407752.
Centra, J. A., & Gaubatz, N. B. (2000). Is there gender bias in student evaluations of teaching? Journal of Higher Education, 71, 17–33.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Routledge.
Cundiff, J. L., Danube, C. L., Zawadzki, M. J., & Shields, S. A. (2018). Testing an intervention for recognizing and reporting subtle gender bias in promotion and tenure decisions. The Journal of Higher Education, 89(5), 611–636. https://doi.org/10.1080/00221546.2018.1437665.
de Ayala, R. J. (2009). The theory and practice of item response theory. Guilford Press.
Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge University Press.
Feldman, K. A. (1993). College students’ views of male and female college teachers: Part II—Evidence from students’ evaluations of their classroom teachers. Research in Higher Education, 34(2), 151–211. https://doi.org/10.1007/BF00992161
Gómez Cama, M., Larrán, M. J., & Andrades Peña, F. J. (2016). Gender differences between faculty members in higher education: A literature review of selected higher education journals. Educational Research Review, 18, 58–69. https://doi.org/10.1016/j.edurev.2016.03.001.
Haladyna, T., & Hess, R. K. (1994). The detection and correction of bias in student ratings of instruction. Research in Higher Education, 35(6), 669–687. https://doi.org/10.1007/BF02497081.
Kline, T. J. B. (2005). Psychological testing: A practical approach to design and evaluation. Sage Publications.
Laird, T. F., Garver, A. K., & Niskodé-Dossett, A. S. (2011). Gender gaps in collegiate teaching style: Variations by course characteristics. Research in Higher Education, 52(3), 261–277. https://doi.org/10.1007/s11162-010-9193-0.
MacNell, L., Driscoll, A., & Hunt, A. N. (2015). What’s in a name: Exposing gender bias in student ratings of teaching. Innovative Higher Education, 40(4), 291–303. https://doi.org/10.1007/s10755-014-9313-4.
Malisch, J. L., Harris, B. N., Sherrer, S. M., Lewis, K. A., Shepherd, S. L., McCarthy, P. C., Spott, J. L., Karam, E. P., Moustaid-Moussa, N., Calarco, J. M., Ramalingam, L., Talley, A. E., Cañas-Carrell, J. E., Ardon-Dryer, K., Weiser, D. A., Bernal, X. E., & Deitloff, J. (2020). Opinion: In the wake of COVID-19, academia needs new solutions to ensure gender equity. Proceedings of the National Academy of Sciences, 117(27), 15378–15381. https://doi.org/10.1073/pnas.2010636117.
Marsh, H. W. (1987). Students’ evaluations of university teaching: Research findings, methodological issues, and directions for future research. International Journal of Educational Research, 11(3), 253–388. https://doi.org/10.1016/0883-0355(87)90001-2.
McClain, L., Gulbis, A., & Hays, D. (2017). Honesty on student evaluations of teaching: Effectiveness, purpose, and timing matter! Assessment & Evaluation in Higher Education, 43, 1–17. https://doi.org/10.1080/02602938.2017.1350828.
McPherson, M. A., & Jewell, R. T. (2007). Leveling the playing field: should student evaluation scores be adjusted? Social Science Quarterly, 88(3), 868–881. https://doi.org/10.1111/j.1540-6237.2007.00487.x.
McPherson, M. A., Jewell, R. T., & Kim, M. (2009). What determines student evaluation scores? A random effects analysis of undergraduate economics classes. Eastern Economic Journal, 35(1), 37–51.
Meyer, J. P., Doromal, J. B., Wei, X., & Zhu, S. (2017). A criterion-referenced approach to student ratings of instruction. Research in Higher Education, 58(5), 545–567. https://doi.org/10.1007/s11162-016-9437-8.
Onwuegbuzie, A. J., Daniel, L. G., & Collins, K. M. T. (2009). A meta-validation model for assessing the score-validity of student teaching evaluations. Quality & Quantity, 43(2), 197–209. https://doi.org/10.1007/s11135-007-9112-4.
Osteen, P. (2010). An introduction to using multidimensional item response theory to assess latent factor structures. Journal of the Society for Social Work and Research, 1(2), 66–82. https://doi.org/10.5243/jsswr.2010.6.
Rivera, L. A., & Tilcsik, A. (2019). Scaling down inequality: Rating scales, gender bias, and the architecture of evaluation. American Sociological Review, 84(2), 248–274. https://doi.org/10.1177/0003122419833601.
Schneider, B., Carnoy, M., Kilpatrick, J., Schmidt, W. H., & Shavelson, R. J. (2007). Estimating causal effects using experimental and observational designs. American Educational Research Association.
Setari, A. P., Lee, J., & Bradley, K. D. (2016). A psychometric approach to the validation of a student evaluation of teaching instrument. Studies in Educational Evaluation, 51, 77–87. https://doi.org/10.1016/j.stueduc.2016.09.006.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin.
Smith, S. W., Yoo, J. H., Farr, A. C., Salmon, C. T., & Miller, V. D. (2007). The influence of student sex and instructor sex on student ratings of instructors: Results from a college of communication. Women's Studies in Communication, 30(1), 64–77. https://doi.org/10.1080/07491409.2007.10162505.
Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation of teaching: The state of the art. Review of Educational Research, 83(4), 598–642. https://doi.org/10.3102/0034654313496870.
Stark, P., & Freishtat, R. (2014). An evaluation of course evaluations. ScienceOpen Research. https://www.scienceopen.com/document/id/ad8a9ac9-8c60-432a-ba20-4402a2a38df4
Theall, M., & Franklin, J. (2001). Looking for bias in all the wrong places: A search for truth or a witch hunt in student ratings of instruction? New Directions for Institutional Research, 2001(109), 45–56. https://doi.org/10.1002/ir.3.
Uttl, B., White, C. A., & Gonzalez, D. W. (2017). Meta-analysis of faculty’s teaching effectiveness: Student evaluation of teaching ratings and student learning are not related. Studies in Educational Evaluation, 54, 22–42. https://doi.org/10.1016/j.stueduc.2016.08.007.
Van Zile-Tamsen, C. (2017). Using Rasch analysis to inform rating scale development. Research in Higher Education, 58(8), 922–933. https://doi.org/10.1007/s11162-017-9448-0.
Valencia, E. (2020). Acquiescence, instructor’s gender bias and validity of student evaluation of teaching. Assessment & Evaluation in Higher Education, 45(4), 483–495. https://doi.org/10.1080/02602938.2019.1666085.
Viswanathan, M. (2005). Measurement Error and Research Design. SAGE Publications Inc.
Wachtel, H. K. (1998). Student evaluation of college teaching effectiveness: A brief review. Assessment & Evaluation in Higher Education, 23(2), 191–212. https://doi.org/10.1080/0260293980230207.
Weisshaar, K. (2017). Publish and perish? An assessment of gender gaps in promotion to tenure in academia. Social Forces, 96(2), 529–560. https://doi.org/10.1093/sf/sox052.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Mesa Press.
Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ACER ConQuest Version 2.0: Generalised Item Response Modelling Software. ACER Press.
Zipser, N., & Mincieli, L. (2018). Administrative and structural changes in student evaluations of teaching and their effects on overall instructor scores. Assessment & Evaluation in Higher Education, 43(6), 995–1008. https://doi.org/10.1080/02602938.2018.1425368.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Directorate of Human Resources Research and Evaluation, Department of National Defense. Retrieved November 6, 2019, from http://faculty.educ.ubc.ca/zumbo/DIF/handbook.pdf
Anderson, K. J., & Smith, G. (2005). Students’ Preconceptions of Professors: Benefits and Barriers According to Ethnicity and Gender. Hispanic Journal of Behavioral Sciences, 27(2), 184–201. https://doi.org/10.1177/0739986304273707.
Aleamoni, L. M., & Hexner, P. Z. (1980). A review of the research on student evaluation and a report on the effect of different sets of instructions on student course and instructor evaluation. Instructional Science, 9(1), 67–84. https://doi.org/10.1007/BF00118969.
Educational Testing Service. (2016). ETS international principles for the fairness of assessments. Princeton, NJ: Author.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage Publications, Inc.
Marsh, H. W., & Roche, L. A. (1997). Making students’ evaluations of teaching effectiveness effective: The critical issues of validity, bias, and utility. American Psychologist, 52(11), 1187–1197. https://doi.org/10.1037/0003-066X.52.11.1187.
Mengel, F., Sauermann, J., & Zölitz, U. (2019). Gender Bias in Teaching Evaluations. Journal of the European Economic Association, 17(2), 535–566. https://doi.org/10.1093/jeea/jvx057.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741.
Ory, J. C. (2001). Faculty Thoughts and Concerns About Student Ratings. New Directions for Teaching and Learning, 2001(87), 3–15. https://doi.org/10.1002/tl.23.
Penny, A. R. (2003). Changing the Agenda for Research into Students’ Views about University Teaching: Four shortcomings of SRT research. Teaching in Higher Education, 8(3), 399–411. https://doi.org/10.1080/13562510309396.
Spector, P. E. (1992). Summated rating scale construction: An introduction. Newbury Park, CA: Sage Publications.
Traub, R. E. (1997). Classical Test Theory in Historical Perspective. Educational Measurement: Issues and Practice, 16(4), 8–14. https://doi.org/10.1111/j.1745-3992.1997.tb00603.x.
Wagner, N., Rieger, M., & Voorvelt, K. (2016). Gender, ethnicity and teaching evaluations: Evidence from mixed teaching teams. Economics of Education Review, 54, 79–94. https://doi.org/10.1016/j.econedurev.2016.06.004.
Shavelson, R. J., & Webb, N. M. (2006). Generalizability Theory. In J. L. Green, G. Camilli, & P. B. Elmore (Eds.), Handbook of Complementary Methods in Education Research (pp. 309–322). Washington DC: Lawrence Erlbaum Associates, Inc.
Mitchell, K. M. W., & Martin, J. (2018). Gender Bias in Student Evaluations. PS: Political Science & Politics, 51(03), 648–652. https://doi.org/10.1017/S104909651800001X.