Reliability Analysis of Instruments and Data Coding

  • Kirby C. Grabowski
  • Saerhim Oh


The purpose of this chapter is to orient readers to reliability considerations specific to instruments and data-coding practices in applied linguistics (AL) research. To that end, the chapter begins with a general discussion of the different types of reliability (both internal and external to an instrument itself), including the indices and models used to estimate reliability and how each is interpreted. It then discusses methods for improving the reliability of data coding and instrument scoring, followed by a summary of best practices in coder/rater training and norming. Throughout, the chapter outlines guidelines for addressing common limitations in reliability analysis and reporting in AL research, including suggestions for how to address these issues in operational contexts.


Keywords: Reliability · Instruments · Raters · Coders



Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. Teachers College, Columbia University, New York, USA
