Journal of Neurology, Volume 259, Issue 12, pp 2681–2694

What sample sizes for reliability and validity studies in neurology?

  • Jeremy C. Hobart
  • Stefan J. Cano
  • Thomas T. Warner
  • Alan J. Thompson
Original Communication


Abstract

Rating scales are increasingly used in neurological research and clinical trials. A key question for their use across the range of neurological diseases, both common and rare, is what sample sizes provide meaningful estimates of reliability and validity. Here, we address two questions: (1) to what extent does sample size influence the stability of reliability and validity estimates; and (2) to what extent does sample size influence the inferences made from reliability and validity testing? We examined data from two studies. In Study 1, we retrospectively reduced the total sample, both randomly and non-randomly, by decrements of approximately 50 % to generate sub-samples ranging from n = 713 to n = 20. In Study 2, we prospectively generated sub-samples from n = 20 to n = 320, by time of entry into the study. In all samples we estimated reliability (internal consistency, item–total correlations, test–retest) and validity (within-scale correlations, convergent and discriminant construct validity). Reliability estimates were stable in magnitude and interpretation across all sub-samples of both studies. Validity estimates were stable in samples of n ≥ 80, for 75 % of scales in samples of n = 40, and for 50 % of scales in samples of n = 20. In this study, minimum sample sizes of 20 for reliability and 80 for validity provided estimates highly representative of the main study samples. These findings should be considered provisional, and more work is needed to determine whether these estimates are generalisable, consistent, and useful.
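The subsampling procedure of Study 1 can be illustrated with a minimal Python sketch: estimate internal-consistency reliability (Cronbach's alpha) on a full sample, then re-estimate it on successive ~50 % random sub-samples and observe how stable it remains. The data here are synthetic (a single latent trait plus noise, with the paper's full-sample size of n = 713 and a hypothetical 10-item scale); the authors' actual MSIS-29/CDIP-58 data are not reproduced.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_subjects, n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)

# Synthetic data: 713 respondents, 10-item scale driven by one latent trait.
n, k = 713, 10
trait = rng.normal(size=(n, 1))
scores = trait + rng.normal(scale=1.0, size=(n, k))

# Halve the sample repeatedly (mirroring the paper's ~50 % random decrements,
# n = 713 down to ~20) and track the reliability estimate at each step.
idx = rng.permutation(n)
m = n
while m >= 20:
    sub = scores[idx[:m]]
    print(f"n = {m:4d}  alpha = {cronbach_alpha(sub):.3f}")
    m //= 2
```

With data of this kind, alpha typically varies only in the second decimal place across the sub-samples, consistent with the paper's finding that reliability estimates remain stable down to small n; validity coefficients (correlations between scales) are noisier and were the quantities that required n ≥ 80 in the authors' analyses.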


Keywords: Multiple sclerosis, Cervical dystonia, Reliability, Validity, Sample size


Conflicts of interest

The authors declare that they have no conflict of interest related to this research.



Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Jeremy C. Hobart (1, 4)
  • Stefan J. Cano (1)
  • Thomas T. Warner (2)
  • Alan J. Thompson (3)
  1. Clinical Neurology Research Group, Peninsula College of Medicine and Dentistry, Tamar Science Park, Plymouth, UK
  2. Department of Clinical Neurosciences, UCL Institute of Neurology, London, UK
  3. Department of Brain Repair and Rehabilitation, UCL Institute of Neurology, London, UK
  4. Department of Clinical Neuroscience, Peninsula College of Medicine and Dentistry, Tamar Science Park, Plymouth, UK