Sports Medicine, Volume 26, Issue 4, pp 217–238

Statistical Methods For Assessing Measurement Error (Reliability) in Variables Relevant to Sports Medicine

  • Greg Atkinson
  • Alan M. Nevill
Review Article


Minimal measurement error (reliability) during the collection of interval- and ratio-type data is critically important to sports medicine research. The main components of measurement error are systematic bias (e.g. general learning or fatigue effects on the tests) and random error due to biological or mechanical variation. Both error components should be meaningfully quantified for the sports physician to relate the described error to judgements regarding ‘analytical goals’ (the requirements of the measurement tool for effective practical use) rather than the statistical significance of any reliability indicators.

Methods based on correlation coefficients and regression provide an indication of ‘relative reliability’. Since these methods are highly influenced by the range of measured values, researchers should be cautious in: (i) concluding acceptable relative reliability even if a correlation is above 0.9; (ii) extrapolating the results of a test-retest correlation to a new sample of individuals involved in an experiment; and (iii) comparing test-retest correlations between different reliability studies.
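The range dependence described above can be illustrated with a minimal Python sketch. The scores below are hypothetical (not taken from the article): both samples are given exactly the same test-retest differences, yet the sample with the wider spread of scores yields a much higher correlation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical test-retest differences, identical for both samples.
errors = [2, -2, 1, -1, 2, -2]

# Homogeneous sample: narrow range of test scores.
narrow_test = [50, 51, 52, 53, 54, 55]
narrow_retest = [t + e for t, e in zip(narrow_test, errors)]

# Heterogeneous sample: wide range of test scores.
wide_test = [30, 40, 50, 60, 70, 80]
wide_retest = [t + e for t, e in zip(wide_test, errors)]

r_narrow = pearson_r(narrow_test, narrow_retest)
r_wide = pearson_r(wide_test, wide_retest)

print(f"narrow range: r = {r_narrow:.2f}")
print(f"wide range:   r = {r_wide:.2f}")
```

Because the absolute error is the same in both samples, the difference between the two coefficients reflects only the between-subject spread, which is why a correlation from one sample should not be extrapolated to another.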

Methods used to describe ‘absolute reliability’ include the standard error of measurement (SEM), coefficient of variation (CV) and limits of agreement (LOA). These statistics are more appropriate for comparing reliability between different measurement tools in different studies. They can be derived from ANOVA procedures in multiple-retest studies, help predict the magnitude of a ‘real’ change in individual athletes and be employed to estimate statistical power for a repeated-measures experiment.
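As a rough illustration of two of these statistics, the Python sketch below computes an SEM and a CV for a two-trial design. It assumes one common calculation (the SD of the test-retest differences divided by √2, and that value expressed as a percentage of the grand mean); other calculation methods exist and may give different values. The data are hypothetical.

```python
import math
import statistics

def sem_from_differences(test, retest):
    """SEM for two trials: SD of the differences divided by sqrt(2).

    This is one of several ways of calculating the SEM and is an
    assumption of this sketch.
    """
    diffs = [b - a for a, b in zip(test, retest)]
    return statistics.stdev(diffs) / math.sqrt(2)

def cv_percent(test, retest):
    """CV as the SEM expressed as a percentage of the grand mean."""
    grand_mean = statistics.mean(list(test) + list(retest))
    return 100 * sem_from_differences(test, retest) / grand_mean

# Hypothetical test-retest scores for five athletes.
test = [100, 110, 120, 130, 140]
retest = [102, 108, 123, 129, 143]

sem = sem_from_differences(test, retest)
cv = cv_percent(test, retest)

print(f"SEM = {sem:.2f} (same units as the measurement)")
print(f"CV  = {cv:.2f}%")
```

The SEM is in the units of the measurement, whereas the CV is dimensionless, which is what makes the CV convenient for comparing tools that measure on different scales.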

These methods vary considerably in the way they are calculated and their use also assumes the presence (CV) or absence (SEM) of heteroscedasticity. Most methods of calculating SEM and CV represent approximately 68% of the error that is actually present in the repeated measurements for the ‘average’ individual in the sample. LOA represent the test-retest differences for 95% of a population. The associated Bland-Altman plot shows the measurement error schematically and helps to identify the presence of heteroscedasticity. If there is evidence of heteroscedasticity or non-normality, one should logarithmically transform the data and quote the bias and random error as ratios. This allows simple comparisons of reliability across different measurement tools.
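The LOA calculation and its log-transformed (ratio) form can be sketched in Python as follows. This is a minimal illustration with hypothetical data, assuming the usual form of bias ± 1.96 × SD of the differences; the function names are illustrative only.

```python
import math
import statistics

def limits_of_agreement(test, retest):
    """Return (bias, lower, upper) for the retest - test differences."""
    diffs = [b - a for a, b in zip(test, retest)]
    bias = statistics.mean(diffs)
    half_width = 1.96 * statistics.stdev(diffs)
    return bias, bias - half_width, bias + half_width

def ratio_limits_of_agreement(test, retest):
    """Return (ratio bias, lower, upper) after log transformation.

    Differences of natural logs are antilogged, so the bias and the
    random error are expressed as dimensionless ratios.
    """
    log_diffs = [math.log(b) - math.log(a) for a, b in zip(test, retest)]
    ratio_bias = math.exp(statistics.mean(log_diffs))
    ratio_error = math.exp(1.96 * statistics.stdev(log_diffs))
    return ratio_bias, ratio_bias / ratio_error, ratio_bias * ratio_error

# Hypothetical test-retest scores for five athletes.
test = [100, 110, 120, 130, 140]
retest = [102, 108, 123, 129, 143]

bias, lower, upper = limits_of_agreement(test, retest)
ratio_bias, ratio_lower, ratio_upper = ratio_limits_of_agreement(test, retest)

print(f"LOA: bias {bias:.2f}, limits {lower:.2f} to {upper:.2f}")
print(f"Ratio LOA: bias {ratio_bias:.3f}, "
      f"limits {ratio_lower:.3f} to {ratio_upper:.3f}")
```

Because the ratio limits have no units, they can be compared directly across measurement tools, in line with the point made above.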

We recommend that sports clinicians and researchers cite and interpret a number of statistical methods for assessing reliability. We encourage the inclusion of the LOA method, especially the exploration of heteroscedasticity that is inherent in this analysis. We also stress the importance of relating the results of any reliability statistic to ‘analytical goals’ in sports medicine.





Copyright information

© Adis International Limited. All rights reserved 1998

Authors and Affiliations

  1. Research Institute for Sport and Exercise Sciences, Liverpool John Moores University, Liverpool, England
