Abstract
The purpose of the chapter is to orient readers to reliability considerations specific to instruments and data coding practices in applied linguistics (AL) research. To that end, the chapter begins with a general discussion of different types of reliability (both internal and external to an instrument itself), including the different indices and models used to estimate reliability and their respective interpretations. Methods for improving the reliability of data coding and instrument scoring practices will then be discussed, followed by a summary of best practices in coder/rater training and norming. Throughout, guidelines for addressing common limitations with respect to reliability analysis and reporting in AL research will be outlined, including suggestions for how to address these issues in operational contexts.
Notes
1. In this chapter, we use the general term instrument to encompass a variety of tools that may be employed in empirical studies, such as tests, surveys, performance assessments, questionnaires, and others. We will also use more specific terms where relevant and when a more precise illustration is helpful or required.
2. It should also be noted that in some areas of AL research (e.g., large-scale language assessment), certain approaches to reliability analysis, such as parallel-forms reliability or test-retest reliability, are becoming increasingly obsolete as item- and task-banking and computer-adaptive testing replace former assessment delivery methods, such as traditional paper-and-pencil tests and even some first-generation computer-based tests. Most of the data collected from these newer large-scale test-delivery systems have properties that make traditional approaches to reliability estimation inefficient, if not impossible. In most cases, psychometricians and other measurement professionals charged with analyzing the data use item response theory (IRT)/Rasch models in their analysis, as each test-taker may, in theory, receive a unique set of items or tasks on any given occasion. As access to this type of technology is rare in most AL research contexts, we have chosen to present approaches to reliability analysis here that are accessible to most, if not all, AL researchers.
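One of the traditional approaches mentioned above, test-retest reliability, is typically estimated as the Pearson correlation between scores from two administrations of the same instrument. The sketch below is illustrative only; the toy scores are hypothetical and not drawn from the chapter.

```python
# Test-retest reliability as the Pearson correlation between two
# administrations of the same instrument (hypothetical toy data).

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

time1 = [12, 15, 9, 18, 14, 11]  # scores at first administration
time2 = [13, 14, 10, 17, 15, 10]  # scores at second administration
r = pearson(time1, time2)  # close to 1 indicates stable scores over time
```

A high coefficient (here roughly .94) suggests the instrument yields stable scores across occasions, which is exactly the property that item-banked, adaptive delivery makes difficult to assess this way, since no two test-takers need see the same items.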
3. Most statistics discussed in the chapter are easily obtained using statistical software, such as SPSS (SPSS, Inc.), and thus do not require hand calculations. However, for information on formulae related to different types of reliability, or how to calculate reliability statistics by hand, please see Resources for Further Reading at the end of the chapter.
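As an illustration of such a hand calculation, the sketch below computes Cronbach's alpha, a common internal-consistency index, from its standard formula. The item scores are hypothetical and serve only to show the arithmetic; this is not an example from the chapter.

```python
# Cronbach's alpha by hand: alpha = (k / (k - 1)) * (1 - sum(item variances) / variance(totals))

def cronbach_alpha(items):
    """items: one list of scores per item, all of equal length (one score per test-taker)."""
    k = len(items)            # number of items
    n = len(items[0])         # number of test-takers

    def var(xs):              # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(var(col) for col in items)
    totals = [sum(col[i] for col in items) for i in range(n)]  # total score per test-taker
    return (k / (k - 1)) * (1 - item_vars / var(totals))

scores = [  # 4 items x 5 test-takers (hypothetical)
    [3, 4, 2, 5, 4],
    [3, 5, 2, 4, 4],
    [2, 4, 3, 5, 3],
    [3, 4, 2, 5, 5],
]
alpha = cronbach_alpha(scores)  # roughly .91 for this toy data
```

In practice the same value is obtained from SPSS's reliability analysis procedure (Cronbach's alpha model), so the hand calculation is mainly useful for understanding what the coefficient responds to: alpha rises as the items covary relative to their individual variances.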
© 2018 The Author(s)
About this chapter
Cite this chapter
Grabowski, K.C., Oh, S. (2018). Reliability Analysis of Instruments and Data Coding. In: Phakiti, A., De Costa, P., Plonsky, L., Starfield, S. (eds) The Palgrave Handbook of Applied Linguistics Research Methodology. Palgrave Macmillan, London. https://doi.org/10.1057/978-1-137-59900-1_24
Publisher Name: Palgrave Macmillan, London
Print ISBN: 978-1-137-59899-8
Online ISBN: 978-1-137-59900-1
eBook Packages: Social Sciences (R0)