Social Indicators Research, Volume 142, Issue 3, pp 1305–1331

On the Use of Student Evaluation of Teaching: A Longitudinal Analysis Combining Measurement Issues and Implications of the Exercise

  • Isabella Sulis
  • Mariano Porcu
  • Vincenza Capursi


Multi-item questionnaires are widely used to collect students’ evaluations of teaching at university. This article analyses students’ evaluations from a broad perspective. Its main aim is to adjust the evaluations for a wide range of factors that may jointly influence the teaching process: academic year peculiarities, course characteristics, students’ characteristics and item dimensionality. Setting the analysis in a generalised mixed models framework introduces considerable flexibility in measuring the quality of university teaching as perceived by students. In this way we consider (1) the effects of potential confounding factors that are external to the process under evaluation; (2) the dependency structure across units in the same clusters; (3) the assessment of real improvement in lecturers’ performance over time; and (4) the uncertainty related to the use of an overall indicator to assess the global level of teaching quality as assessed by the students. The implications of misusing the evaluation results when implementing university policies are discussed by comparing point versus interval estimates and adjusted versus unadjusted indicators.
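The contrast between point and interval estimates can be illustrated with a minimal sketch. It is not the generalised mixed model used in the article: it treats ordinal ratings as numeric and uses a simple normal-approximation interval around each course mean. The course names, the ratings and the helper functions `mean_with_interval` and `intervals_overlap` are all hypothetical, introduced only for illustration.

```python
import math
import statistics

# Hypothetical ratings on a 1-4 scale for three courses (invented data).
ratings = {
    "course_A": [3, 4, 3, 4, 3, 3, 4, 3],
    "course_B": [3, 3, 4, 3, 3, 4, 3, 3],
    "course_C": [1, 2, 1, 2, 2, 1, 2, 1],
}


def mean_with_interval(values, z=1.96):
    """Return (mean, lower, upper) from a normal-approximation interval."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return m, m - z * se, m + z * se


def intervals_overlap(a, b):
    """True when the intervals of two (mean, lower, upper) tuples overlap."""
    return a[1] <= b[2] and b[1] <= a[2]


summary = {name: mean_with_interval(v) for name, v in ratings.items()}

# Ranking by point estimate alone ignores the uncertainty of each mean.
ranking = sorted(summary, key=lambda name: summary[name][0], reverse=True)

for name in ranking:
    m, lo, hi = summary[name]
    print(f"{name}: mean={m:.2f}, interval=({lo:.2f}, {hi:.2f})")
```

In this toy data the point estimates rank course_A above course_B, yet their intervals overlap, so the ordering of the two is not well supported; course_C is separated from both. A ranking built on point estimates alone would hide that distinction.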


Keywords: Measurement models; Adjusted indicators; Multilevel models; Teaching evaluation; Mokken analysis



The authors would like to thank the anonymous reviewers for their helpful suggestions and Zija Li and Michal Toland for their careful review of early versions of this manuscript.



Copyright information

© Springer Nature B.V. 2018

Authors and Affiliations

  1. Dipartimento di Scienze Sociali e delle Istituzioni, Cagliari, Italy
  2. Dipartimento di Scienze Economiche, Aziendali e Statistiche, Palermo, Italy
