Statistics in Brief: The Importance of Sample Size in the Planning and Interpretation of Medical Research

  • Residents' Corner
Clinical Orthopaedics and Related Research

Abstract

The increasing volume of research by the medical community often leads to increasing numbers of contradictory findings and conclusions. Although the differences observed may represent true differences, the results also may differ because of sampling variability as all studies are performed on a limited number of specimens or patients. When planning a study reporting differences among groups of patients or describing some variable in a single group, sample size should be considered because it allows the researcher to control for the risk of reporting a false-negative finding (Type II error) or to estimate the precision his or her experiment will yield. Equally important, readers of medical journals should understand sample size because such understanding is essential to interpret the relevance of a finding with regard to their own patients. At the time of planning, the investigator must establish (1) a justifiable level of statistical significance, (2) the chances of detecting a difference of given magnitude between the groups compared, ie, the power, (3) this targeted difference (ie, effect size), and (4) the variability of the data (for quantitative data). We believe correct planning of experiments is an ethical issue of concern to the entire community.

Fig. 1A–C
Fig. 2

References

  1. Altman DG. Practical Statistics for Medical Research. London, UK: Chapman & Hall; 1991.

  2. Bacchetti P, Wolf LE, Segal MR, McCulloch CE. Ethics and sample size. Am J Epidemiol. 2005;161:105–110.

  3. Bailey CS, Fisher CG, Dvorak MF. Type II error in the spine surgical literature. Spine. 2004;29:1146–1149.

  4. Bauer P, Brannath W. The advantages and disadvantages of adaptive designs for clinical trials. Drug Discov Today. 2004;9:351–357.

  5. Bijur PE, Latimer CT, Gallagher EJ. Validation of a verbally administered numerical rating scale of acute pain for use in the emergency department. Acad Emerg Med. 2003;10:390–392.

  6. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.

  7. Ellenberg JH. Selection bias in observational and experimental studies. Stat Med. 1994;13:557–567.

  8. Freedman KB, Back S, Bernstein J. Sample size and statistical power of randomised, controlled trials in orthopaedics. J Bone Joint Surg Br. 2001;83:397–402.

  9. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121:200–206.

  10. Halpern SD, Karlawish JH, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA. 2002;288:358–362.

  11. Handl M, Drzik M, Cerulli G, Povysil C, Chlpik J, Varga F, Amler E, Trc T. Reconstruction of the anterior cruciate ligament: dynamic strain evaluation of the graft. Knee Surg Sports Traumatol Arthrosc. 2007;15:233–241.

  12. Harris WH. Traumatic arthritis of the hip after dislocation and acetabular fractures: treatment by mold arthroplasty: an end-result study using a new method of result evaluation. J Bone Joint Surg Am. 1969;51:737–755.

  13. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician. 2001;55:19–24.

  14. Hsieh FY, Bloch DA, Larsen MD. A simple method of sample size calculation for linear and logistic regression. Stat Med. 1998;17:1623–1634.

  15. Jaeschke R, Singer J, Guyatt GH. Measurement of health status: ascertaining the minimal clinically important difference. Control Clin Trials. 1989;10:407–415.

  16. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC; 2000.

  17. Kapoor B, Clement DJ, Kirkley A, Maffulli N. Current practice in the management of anterior cruciate ligament injuries in the United Kingdom. Br J Sports Med. 2004;38:542–544.

  18. Kjaergard LL, Villumsen J, Gluud C. Reported methodologic quality and discrepancies between large and small randomized trials in meta-analyses. Ann Intern Med. 2001;135:982–989.

  19. Machin D, Campbell MJ. Statistical Tables for the Design of Clinical Trials. Oxford, UK: Blackwell Scientific Publications; 1987.

  20. Marx RG, Jones EC, Angel M, Wickiewicz TL, Warren RF. Beliefs and attitudes of members of the American Academy of Orthopaedic Surgeons regarding the treatment of anterior cruciate ligament injury. Arthroscopy. 2003;19:762–770.

  21. Michener LA, McClure PW, Sennett BJ. American Shoulder and Elbow Surgeons Standardized Shoulder Assessment Form, patient self-report section: reliability, validity, and responsiveness. J Shoulder Elbow Surg. 2002;11:587–594.

  22. Mirza F, Mai DD, Kirkley A, Fowler PJ, Amendola A. Management of injuries to the anterior cruciate ligament: results of a survey of orthopaedic surgeons in Canada. Clin J Sport Med. 2000;10:85–88.

  23. Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, Tugwell P, Klassen TP. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet. 1998;352:609–613.

  24. Pearson ES, Neyman J. On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika. 1928;20A:175–240.

  25. Pearson ES, Neyman J. On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika. 1928;20A:263–294.

  26. Rubinstein LV, Korn EL, Freidlin B, Hunsberger S, Ivy SP, Smith MA. Design issues of randomized phase II trials and a proposal for phase II screening trials. J Clin Oncol. 2005;23:7199–7206.

  27. Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA. 1995;273:408–412.

  28. Stenning SP, Parmar MK. Designing randomised trials: both large and small trials are needed. Ann Oncol. 2002;13(suppl 4):131–138.

  29. Sterne JA, Davey Smith G. Sifting the evidence: what’s wrong with significance tests? BMJ. 2001;322:226–231.

  30. Vaeth M, Skovlund E. A simple approach to power and sample size calculations in logistic regression and Cox regression models. Stat Med. 2004;23:1781–1792.

Acknowledgments

We thank the editor, whose thorough readings of, and accurate comments on, drafts of the manuscript helped clarify it.

Author information

Corresponding author

Correspondence to David Jean Biau MD.

Additional information

Each author certifies that he or she has no commercial associations (eg, consultancies, stock ownership, equity interest, patent/licensing arrangements, etc) that might pose a conflict of interest in connection with the submitted article.

Appendices

Appendix 1

The sample size (n) per group for comparing two means with a two-sided two-sample t test is

$$ n = \frac{2 \times (z_{1-\alpha/2} + z_{1-\beta})^{2}}{d_{t}^{2}} + 0.25 \times z_{1-\alpha/2}^{2} $$

where z1−α/2 and z1−β are standard normal deviates for the probability of 1 − α/2 and 1 − β, respectively, and dt = (μ0 − μ1)/σ is the targeted standardized difference between the two means.

The following values correspond to the example:

  • α = 0.05 (statistical significance level)

  • β = 0.10 (power of 90%)

  • |μ0 − μ1| = 10 (difference in the mean score between the two groups)

  • σ = 20 (standard deviation of the score in each group)

  • z1−α/2 = 1.96

  • z1−β = 1.28

Therefore:

$$ n = \frac{2 \times (1.96 + 1.28)^{2}}{(10/20)^{2}} + 0.25 \times 1.96^{2} = 85. $$

Two-sided tests, which do not assume the direction of the difference (ie, that the mean value in one group would always be greater than that in the other), are generally preferred. The null hypothesis assumes there is no difference between the treatments compared, so a departure from it may lie on either side.
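The calculation in this appendix can be reproduced with a short script. The following is a minimal sketch, not the authors' code: the function name `sample_size_two_means` and its parameter names are illustrative assumptions, and the standard normal deviates are taken from Python's standard library rather than from the rounded values 1.96 and 1.28.

```python
from statistics import NormalDist


def sample_size_two_means(alpha, power, delta, sigma):
    """Per-group n for a two-sided two-sample t test (Appendix 1 formula).

    alpha: significance level; power: 1 - beta;
    delta: targeted difference between means; sigma: common standard deviation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1-alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_{1-beta}
    d_t = abs(delta) / sigma                       # standardized difference
    return 2 * (z_alpha + z_beta) ** 2 / d_t ** 2 + 0.25 * z_alpha ** 2


# Values from the worked example: alpha = 0.05, power = 90%, difference 10, SD 20.
n = sample_size_two_means(alpha=0.05, power=0.90, delta=10, sigma=20)
print(round(n, 2))  # about 85.02 with unrounded deviates
```

With the rounded deviates used in the text the formula gives about 84.94, and with unrounded deviates about 85.02; both round to 85 per group, although a planner who always rounds up would obtain 86 in the second case.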

Appendix 2

Computation of Confidence Interval

To determine the precision of the estimate of a parameter, or equivalently its confidence interval, we use the distribution of the parameter estimate in repeated samples of the same size. For instance, consider a parameter with observed mean, m, and standard deviation, sd, in a given sample. If we assume that the distribution of the parameter in the sample is close to a normal distribution, the means, xn, of repeated samples of the same size have mean μ, the population mean, and estimated standard deviation,

$$ \text{se} = \text{sd}/\sqrt{n}, $$
(1)

also known as standard error of the mean, and

$$ (x_{n} - \mu)/\text{se} $$
(2)

follows a t distribution. For a large sample, the t distribution becomes close to the normal distribution; however, for a smaller sample size the difference is not negligible and the t distribution is preferred. The precision of the estimation is

$$ 2 \times t_{(1-\alpha/2),\;n-1} \times \text{se}, $$
(3)

and the confidence interval for μ is the range of values extending either side of the sample mean m by

$$ t_{(1-\alpha/2),\;n-1} \times \text{se}. $$
(4)

For example, Handl et al. [11], in a biomechanical study of 21 fresh-frozen cadavers, reported a mean ultimate load to failure of 4546 N for 4-strand hamstring tendon constructs under dynamic loading, with a standard deviation of 1500 N. If we were to plan an experiment based on these values, the anticipated precision of the estimation at the 95% level would be

$$ 2 \times (2.78 \times 1500/\sqrt{5}) = 3725 $$
(5)

for five specimens,

$$ 2 \times (2.26 \times 1500/\sqrt{10}) = 2146\;\text{for 10 specimens}, $$
(6)
$$ 2 \times (2.06 \times 1500/\sqrt{25}) = 1238\;\text{for 25 specimens}, $$
(7)
$$ 2 \times (2.01 \times 1500/\sqrt{50}) = 853\;\text{for 50 specimens}, $$
(8)
$$ \text{and}\;2 \times (1.98 \times 1500/\sqrt{100}) = 595\;\text{for 100 specimens.} $$
(9)

The values 2.78, 2.26, 2.06, 2.01, and 1.98 are the t distribution deviates for the probability of 1 − α/2 with 4, 9, 24, 49, and 99 (ie, n − 1) degrees of freedom; the corresponding well-known standard normal deviate is 1.96. Given an estimated mean of 4546 N, the corresponding 95% confidence intervals are 2683 N to 6408 N for five specimens, 3473 N to 5619 N for 10 specimens, 3927 N to 5165 N for 25 specimens, 4120 N to 4972 N for 50 specimens, and 4248 N to 4844 N for 100 specimens (Fig. 2).
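The precision figures above can be recomputed for any planned number of specimens. The sketch below is one possible way to do so using only Python's standard library; the helper names (`t_pdf`, `t_cdf`, `t_quantile`, `mean_ci`) are illustrative assumptions, and the t deviate is obtained by numerical integration and bisection rather than from printed tables, so results may differ from the text by rounding.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0: Simpson's rule on [0, x] plus the symmetric half."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(p, df):
    """t deviate for a probability p > 0.5, found by bisection on the CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def mean_ci(mean, sd, n, level=0.95):
    """Return (lower, upper, precision) for a mean, using se = sd / sqrt(n)."""
    t = t_quantile(1 - (1 - level) / 2, n - 1)
    half_width = t * sd / math.sqrt(n)
    return mean - half_width, mean + half_width, 2 * half_width


# Mean 4546 N and standard deviation 1500 N, as in the example from the text.
results = {n: mean_ci(4546, 1500, n) for n in (5, 10, 25, 50, 100)}
for n, (low, high, width) in results.items():
    print(f"n={n}: {low:.0f} N to {high:.0f} N (precision {width:.0f} N)")
```

Because the precision shrinks with the square root of n, going from 25 to 100 specimens roughly halves the width of the interval.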

Similarly, for a proportion p in a given sample with sufficient sample size to assume a nearly normal distribution, the confidence interval extends either side of the proportion p by

$$ z_{(1-\alpha/2)} \times \text{se}\;\text{with}\;\text{se} = \sqrt{p(1-p)/n}. $$
(10)

For a small sample size, an exact confidence interval for the proportion should be used.
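As an illustration of the two approaches for a proportion, the sketch below computes the normal-approximation interval from the formula above and, for comparison, an exact binomial interval. The choice of the Clopper-Pearson construction for the exact interval is an assumption (the text does not name a method), the function names are illustrative, and the counts used are hypothetical rather than taken from any study.

```python
from math import comb, sqrt
from statistics import NormalDist


def normal_approx_ci(x, n, level=0.95):
    """p +/- z * sqrt(p(1-p)/n), for x events among n subjects."""
    p = x / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f, target):
    """Bisection on [0, 1] for f(p) = target, where f decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x, n, level=0.95):
    """Clopper-Pearson interval (assumed here as the 'exact' method)."""
    a = (1 - level) / 2
    lower = 0.0 if x == 0 else _solve_decreasing(lambda p: binom_cdf(x - 1, n, p), 1 - a)
    upper = 1.0 if x == n else _solve_decreasing(lambda p: binom_cdf(x, n, p), a)
    return lower, upper


# Hypothetical counts: 2 events among 10 subjects (a small sample).
approx = normal_approx_ci(2, 10)
exact = exact_ci(2, 10)
print("normal approximation:", approx)
print("exact:", exact)
```

With these hypothetical counts the normal approximation yields a lower bound below zero, which is impossible for a proportion and illustrates why an exact interval is preferable for small samples.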

About this article

Cite this article

Biau, D.J., Kernéis, S. & Porcher, R. Statistics in Brief: The Importance of Sample Size in the Planning and Interpretation of Medical Research. Clin Orthop Relat Res 466, 2282–2288 (2008). https://doi.org/10.1007/s11999-008-0346-9

  • DOI: https://doi.org/10.1007/s11999-008-0346-9
