Abstract
Minimal measurement error (reliability) during the collection of interval- and ratio-type data is critically important to sports medicine research. The main components of measurement error are systematic bias (e.g. general learning or fatigue effects on the tests) and random error due to biological or mechanical variation. Both error components should be meaningfully quantified for the sports physician to relate the described error to judgements regarding ‘analytical goals’ (the requirements of the measurement tool for effective practical use) rather than the statistical significance of any reliability indicators.
Methods based on correlation coefficients and regression provide an indication of ‘relative reliability’. Since these methods are highly influenced by the range of measured values, researchers should be cautious in: (i) concluding acceptable relative reliability even if a correlation is above 0.9; (ii) extrapolating the results of a test-retest correlation to a new sample of individuals involved in an experiment; and (iii) comparing test-retest correlations between different reliability studies.
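The dependence of a test-retest correlation on the range of measured values can be illustrated with a small sketch. The scores and differences below are hypothetical, not data from the article: the same eight test-retest differences are applied to a homogeneous and a heterogeneous sample, so the absolute error is identical while the correlation is not.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical test-retest differences, reused for both samples so that
# the absolute measurement error is exactly the same in each.
diffs = [3, -4, 2, -3, 4, -2, 1, -1]

narrow_test = [48, 49, 50, 51, 52, 53, 54, 55]  # homogeneous sample
wide_test = [20, 30, 40, 50, 60, 70, 80, 90]    # heterogeneous sample

narrow_retest = [t + d for t, d in zip(narrow_test, diffs)]
wide_retest = [t + d for t, d in zip(wide_test, diffs)]

r_narrow = pearson_r(narrow_test, narrow_retest)
r_wide = pearson_r(wide_test, wide_retest)

print(f"r (narrow range) = {r_narrow:.2f}")
print(f"r (wide range)   = {r_wide:.2f}")
```

The wide-range sample yields a much higher correlation even though each individual's retest error is unchanged, which is why a high correlation in one sample says little about the error to expect in another.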
Methods used to describe ‘absolute reliability’ include the standard error of measurement (SEM), coefficient of variation (CV) and limits of agreement (LOA). These statistics are more appropriate for comparing reliability between different measurement tools in different studies. They can be calculated from ANOVA procedures in multiple-retest studies, help predict the magnitude of a ‘real’ change in individual athletes and be employed to estimate statistical power for a repeated-measures experiment.
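As a rough sketch of how these three statistics relate for a simple two-trial design, the example below uses hypothetical scores. It assumes one common convention: SEM taken as the standard deviation of the test-retest differences divided by √2, and CV as SEM expressed as a percentage of the grand mean. Other calculation methods exist and can give different values.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical test and retest scores for eight individuals.
test = [60, 55, 70, 65, 80, 75, 50, 85]
retest = [62, 54, 73, 64, 83, 77, 51, 84]

diffs = [b - a for a, b in zip(test, retest)]

# Systematic bias: mean test-retest difference.
bias = mean(diffs)

# Random error: sample standard deviation of the differences.
sd_diff = stdev(diffs)

# 95% limits of agreement (LOA): bias +/- 1.96 x SD of the differences.
loa_lower = bias - 1.96 * sd_diff
loa_upper = bias + 1.96 * sd_diff

# SEM for two trials (assumed convention): SD of differences / sqrt(2).
sem = sd_diff / sqrt(2)

# CV (assumed convention): SEM as a percentage of the grand mean.
cv_percent = 100 * sem / mean(test + retest)

print(f"bias = {bias:.2f}")
print(f"LOA  = {loa_lower:.2f} to {loa_upper:.2f}")
print(f"SEM  = {sem:.2f}")
print(f"CV   = {cv_percent:.2f}%")
```

Under these conventions the random-error half-width of the LOA equals 1.96 × √2 × SEM, so the SEM and the LOA describe the same error at different coverage levels.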
These methods vary considerably in the way they are calculated, and their use assumes either the presence (CV) or the absence (SEM) of heteroscedasticity, i.e. random error that changes with the magnitude of the measured value. Most methods of calculating SEM and CV represent approximately 68% of the error that is actually present in the repeated measurements for the ‘average’ individual in the sample. LOA represent the test-retest differences for 95% of a population. The associated Bland-Altman plot shows the measurement error schematically and helps to identify the presence of heteroscedasticity. If there is evidence of heteroscedasticity or non-normality, one should logarithmically transform the data and quote the bias and random error as ratios. This allows simple comparisons of reliability across different measurement tools.
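The log-transformation step can be sketched as follows, again with hypothetical data in which the error grows with the size of the score. The heteroscedasticity check used here (correlating the absolute differences with the individual means) is an assumption for illustration; the abstract itself only refers to inspecting the Bland-Altman plot.

```python
from math import exp, log, sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores whose test-retest error is roughly proportional
# to the magnitude of the measurement.
test = [100, 200, 300, 400, 500, 600]
retest = [104, 192, 315, 380, 530, 570]

# Assumed heteroscedasticity check: do absolute differences rise with
# the individual means (the two axes of a Bland-Altman plot)?
abs_diffs = [abs(b - a) for a, b in zip(test, retest)]
pair_means = [(a + b) / 2 for a, b in zip(test, retest)]
r_hetero = pearson_r(pair_means, abs_diffs)

# Work on natural-log-transformed data, then antilog the results so
# that bias and random error are expressed as dimensionless ratios.
log_diffs = [log(b) - log(a) for a, b in zip(test, retest)]
ratio_bias = exp(mean(log_diffs))
ratio_random = exp(1.96 * stdev(log_diffs))

# 95% ratio limits of agreement: bias multiplied or divided by the
# random-error ratio.
ratio_loa_lower = ratio_bias / ratio_random
ratio_loa_upper = ratio_bias * ratio_random

print(f"heteroscedasticity r = {r_hetero:.2f}")
print(f"ratio LOA = {ratio_bias:.3f} x/÷ {ratio_random:.3f}")
```

Because the ratio limits are dimensionless, they can be compared directly between tools that report in different units.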
We recommend that sports clinicians and researchers cite and interpret a number of statistical methods for assessing reliability. We encourage the inclusion of the LOA method, especially the exploration of heteroscedasticity that is inherent in this analysis. We also stress the importance of relating the results of any reliability statistic to ‘analytical goals’ in sports medicine.
Cite this article
Atkinson, G., Nevill, A.M. Statistical Methods For Assessing Measurement Error (Reliability) in Variables Relevant to Sports Medicine. Sports Med 26, 217–238 (1998). https://doi.org/10.2165/00007256-199826040-00002
DOI: https://doi.org/10.2165/00007256-199826040-00002
Keywords
- Adis International Limited
- Reliability Study
- Sport Science
- Human Kinetic
- Relative Reliability