Journal of Cancer Education, Volume 27, Issue 4, pp 664–669

Understanding the Limits of Large Datasets

  • Catherine M. Sanders
  • Sidney L. Saltzstein (Email author)
  • Matthew M. Schultzel
  • Duy H. Nguyen
  • Helen Shi Stafford
  • Georgia Robins Sadler
Article

Abstract

Many health professionals use large datasets to answer behavioral, translational, or clinical questions. Understanding the impact of missing data in large databases, such as disease registries, can help researchers avoid erroneous interpretations of these data. Using the California Cancer Registry, the authors selected seven common cancers, seven sociodemographic and clinical variables, and the top three reporting sources as examples of the type of data that would be deemed critical to most studies. Gender had no missing data, followed by age (<0.1 % missing), ethnicity (1.7 %), stage (9.8 %), differentiation (39.1 %), and birthplace (41.1 %). Reports from hospitals and clinics had the lowest percentages of missing data. Users of large datasets should anticipate the limitations imposed by missing data to prevent methodological flaws and misinterpretations of research findings. Knowing which data may be missing from large datasets, and how much, can help prevent errors in research conclusions while better guiding treatment modalities and public health policies and programs.

Keywords

Missing data; Disease registries; Clinical variables; Data limitations; Tumor variables; Reporting

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Catherine M. Sanders (1, 4, 6)
  • Sidney L. Saltzstein (1, 3), Email author
  • Matthew M. Schultzel (1, 6, 7)
  • Duy H. Nguyen (1, 5)
  • Helen Shi Stafford (1, 5, 6)
  • Georgia Robins Sadler (1, 2)
  1. Rebecca and John Moores UCSD Cancer Center, University of California San Diego, La Jolla, CA, USA
  2. Division of General Surgery, Department of Surgery, University of California San Diego, La Jolla, USA
  3. Department of Pathology and Department of Family and Preventive Medicine (Epidemiology), University of California San Diego, La Jolla, USA
  4. Ohio State University College of Medicine, Columbus, USA
  5. School of Medicine, University of California San Diego, La Jolla, USA
  6. Department of Surgery and the Sam and Rose Stein Institute for Research on Aging, University of California, La Jolla, USA
  7. University of California San Diego, San Diego, USA
