Minimal measurement error (reliability) during the collection of interval- and ratio-type data is critically important to sports medicine research. The main components of measurement error are systematic bias (e.g. general learning or fatigue effects on the tests) and random error due to biological or mechanical variation. Both error components should be quantified meaningfully so that the sports physician can relate the described error to judgements about ‘analytical goals’ (the requirements of the measurement tool for effective practical use) rather than to the statistical significance of any reliability indicators.
Methods based on correlation coefficients and regression provide an indication of ‘relative reliability’. Since these methods are highly influenced by the range of measured values, researchers should be cautious in: (i) concluding acceptable relative reliability even if a correlation is above 0.9; (ii) extrapolating the results of a test-retest correlation to a new sample of individuals involved in an experiment; and (iii) comparing test-retest correlations between different reliability studies.
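The dependence of a test-retest correlation on the range of measured values can be illustrated with a small sketch. The data below are invented purely for illustration (they are not from any study): two hypothetical samples receive exactly the same test-retest errors, but one sample is far more heterogeneous than the other.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: identical retest errors applied to both samples.
errors = [2, -2, 2, -2, 2, -2]
wide_test = [40, 50, 60, 70, 80, 90]      # heterogeneous sample
narrow_test = [60, 62, 64, 66, 68, 70]    # homogeneous sample
wide_retest = [t + e for t, e in zip(wide_test, errors)]
narrow_retest = [t + e for t, e in zip(narrow_test, errors)]

r_wide = pearson_r(wide_test, wide_retest)
r_narrow = pearson_r(narrow_test, narrow_retest)

# The absolute error (SD of the test-retest differences) is the same in both.
sd_diff_wide = statistics.stdev(r - t for t, r in zip(wide_test, wide_retest))
sd_diff_narrow = statistics.stdev(r - t for t, r in zip(narrow_test, narrow_retest))

print(f"wide sample:   r = {r_wide:.3f}, SD of differences = {sd_diff_wide:.2f}")
print(f"narrow sample: r = {r_narrow:.3f}, SD of differences = {sd_diff_narrow:.2f}")
```

In this sketch the heterogeneous sample yields a correlation above 0.9 and the homogeneous sample one below 0.9, although the absolute measurement error is identical. This is why a correlation from one sample may not transfer to another.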
Methods used to describe ‘absolute reliability’ include the standard error of measurement (SEM), coefficient of variation (CV) and limits of agreement (LOA). These statistics are more appropriate for comparing reliability between different measurement tools in different studies. They can be derived from ANOVA procedures in multiple-retest studies, can help predict the magnitude of a ‘real’ change in individual athletes and can be employed to estimate statistical power for a repeated-measures experiment.
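As a minimal sketch of two of these statistics for a simple two-trial design, the code below computes the SEM as the standard deviation of the test-retest differences divided by √2, and a CV as the SEM expressed as a percentage of the grand mean. This is only one of several ways these statistics are calculated, and the function names and data are hypothetical.

```python
import math
import statistics

def sem_two_trials(test, retest):
    """SEM for two trials: SD of the paired differences divided by sqrt(2).

    Assumption: this variant removes any systematic bias between trials,
    because the SD is taken about the mean difference.
    """
    diffs = [a - b for a, b in zip(test, retest)]
    return statistics.stdev(diffs) / math.sqrt(2)

def cv_percent(test, retest):
    """CV (%) taken here as SEM relative to the grand mean of all scores."""
    grand_mean = statistics.mean(list(test) + list(retest))
    return 100 * sem_two_trials(test, retest) / grand_mean

# Hypothetical test-retest scores for four participants.
test = [100, 110, 120, 130]
retest = [102, 108, 124, 130]

sem = sem_two_trials(test, retest)
cv = cv_percent(test, retest)
print(f"SEM = {sem:.2f} units, CV = {cv:.2f}%")
```

Both values describe roughly one standard deviation of error, so they should not be read as bounds that contain nearly all test-retest differences.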
These methods vary considerably in the way they are calculated and their use also assumes the presence (CV) or absence (SEM) of heteroscedasticity. Most methods of calculating SEM and CV represent approximately 68% of the error that is actually present in the repeated measurements for the ‘average’ individual in the sample. LOA represent the test-retest differences for 95% of a population. The associated Bland-Altman plot shows the measurement error schematically and helps to identify the presence of heteroscedasticity. If there is evidence of heteroscedasticity or non-normality, one should logarithmically transform the data and quote the bias and random error as ratios. This allows simple comparisons of reliability across different measurement tools.
We recommend that sports clinicians and researchers cite and interpret a number of statistical methods for assessing reliability. We encourage the inclusion of the LOA method, especially the exploration of heteroscedasticity that is inherent in this analysis. We also stress the importance of relating the results of any reliability statistic to ‘analytical goals’ in sports medicine.