Abstract
The purpose of the chapter is to orient readers to reliability considerations specific to instruments and data coding practices in applied linguistics (AL) research. To that end, the chapter begins with a general discussion of different types of reliability (both internal and external to an instrument itself), including the different indices and models used to estimate reliability and their respective interpretations. Methods for improving the reliability of data coding and instrument scoring practices will then be discussed, followed by a summary of best practices in coder/rater training and norming. Throughout, guidelines for addressing common limitations with respect to reliability analysis and reporting in AL research will be outlined, including suggestions for how to address these issues in operational contexts.
Notes
1. In this chapter, we use the general term instrument to encompass a variety of tools that may be employed in empirical studies, such as tests, surveys, performance assessments, questionnaires, and others. We will also use more specific terms where relevant and when a more precise illustration is helpful or required.
2. It should also be noted that in some areas of AL research (e.g., large-scale language assessment), certain approaches to reliability analysis, such as parallel-forms reliability or test-retest reliability, are becoming increasingly obsolete as item- and task-banking and computer-adaptive testing replace former assessment delivery methods, such as traditional paper-and-pencil tests and even some first-generation computer-based tests. Most of the data collected from these newer large-scale test-delivery systems have properties that make traditional approaches to reliability estimation inefficient, if not impossible. In most cases, psychometricians and other measurement professionals charged with analyzing the data use item response theory (IRT)/Rasch models in their analysis, as each test-taker may, in theory, receive a unique set of items or tasks on any given occasion. As access to this type of technology is rare in most AL research contexts, we have chosen to present approaches to reliability analysis here that are accessible to most, if not all, AL researchers.
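One of the traditional approaches mentioned above, test-retest reliability, is typically estimated as the Pearson correlation between scores from two administrations of the same instrument. The sketch below is illustrative only; the toy scores are hypothetical and not drawn from the chapter.

```python
# Test-retest reliability as the Pearson correlation between two
# administrations of the same instrument (hypothetical toy data).

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

time1 = [12, 15, 9, 18, 14, 11]  # scores at first administration
time2 = [13, 14, 10, 17, 15, 10]  # scores at second administration
r = pearson(time1, time2)  # close to 1 indicates stable scores over time
```

A high coefficient (here roughly .94) suggests the instrument yields stable scores across occasions, which is exactly the property that item-banked, adaptive delivery makes difficult to assess this way, since no two test-takers need see the same items.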
3. Most statistics discussed in the chapter are easily obtained using statistical software, such as SPSS (SPSS, Inc.), and thus do not require hand calculations. However, for information on formulae related to different types of reliability, or how to calculate reliability statistics by hand, please see Resources for Further Reading at the end of the chapter.
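As an illustration of such a hand calculation, the sketch below computes Cronbach's alpha, a common internal-consistency index, from its standard formula. The item scores are hypothetical and serve only to show the arithmetic; this is not an example from the chapter.

```python
# Cronbach's alpha by hand: alpha = (k / (k - 1)) * (1 - sum(item variances) / variance(totals))

def cronbach_alpha(items):
    """items: one list of scores per item, all of equal length (one score per test-taker)."""
    k = len(items)            # number of items
    n = len(items[0])         # number of test-takers

    def var(xs):              # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(var(col) for col in items)
    totals = [sum(col[i] for col in items) for i in range(n)]  # total score per test-taker
    return (k / (k - 1)) * (1 - item_vars / var(totals))

scores = [  # 4 items x 5 test-takers (hypothetical)
    [3, 4, 2, 5, 4],
    [3, 5, 2, 4, 4],
    [2, 4, 3, 5, 3],
    [3, 4, 2, 5, 5],
]
alpha = cronbach_alpha(scores)  # roughly .91 for this toy data
```

In practice the same value is obtained from SPSS's reliability analysis procedure (Cronbach's alpha model), so the hand calculation is mainly useful for understanding what the coefficient responds to: alpha rises as the items covary relative to their individual variances.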
© 2018 The Author(s)
About this chapter
Cite this chapter
Grabowski, K.C., Oh, S. (2018). Reliability Analysis of Instruments and Data Coding. In: Phakiti, A., De Costa, P., Plonsky, L., Starfield, S. (eds) The Palgrave Handbook of Applied Linguistics Research Methodology. Palgrave Macmillan, London. https://doi.org/10.1057/978-1-137-59900-1_24
Publisher Name: Palgrave Macmillan, London
Print ISBN: 978-1-137-59899-8
Online ISBN: 978-1-137-59900-1
eBook Packages: Social Sciences (R0)