Many health professionals use large datasets to answer behavioral, translational, or clinical questions. Understanding the impact of missing data in large databases, such as disease registries, can help avoid erroneous interpretations of these data. Using the California Cancer Registry, the authors selected seven common cancers, seven sociodemographic and clinical variables, and the top three reporting sources as examples of the type of data that would be deemed critical to most studies. Gender had no missing data; the proportion missing then rose from age (<0.1 %) through ethnicity (1.7 %), stage (9.8 %), and differentiation (39.1 %) to birthplace (41.1 %). Reports from hospitals and clinics had the lowest percentages of missing data. Users of large datasets should anticipate the limitations of missing data to prevent methodological flaws and misinterpretations of research findings. Knowing which data are missing from a large dataset, and how much, can help prevent errors in research conclusions and better guide treatment modalities, public health policies, and programs.
Keywords: Missing data; Disease registries; Clinical variables; Data limitations; Tumor variables; Reporting
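The kind of tabulation described in the abstract, percent missing per variable and per reporting source, can be sketched as follows. This is a minimal illustration rather than the authors' method: the record layout, the field names (`source`, `gender`, `stage`, `birthplace`), the set of values treated as missing, and the toy records are all assumptions made for the example, and a real registry defines its own codes for unknown values.

```python
from collections import defaultdict

# Assumption: values counted as "missing". A real registry has its own codes.
MISSING_CODES = {None, "", "unknown"}


def percent_missing(records, variables):
    """Return {variable: percent of records whose value is missing}."""
    total = len(records)
    if total == 0:
        return {var: 0.0 for var in variables}
    result = {}
    for var in variables:
        n_missing = sum(1 for rec in records if rec.get(var) in MISSING_CODES)
        result[var] = round(100.0 * n_missing / total, 1)
    return result


def percent_missing_by_source(records, variables, source_key="source"):
    """Group records by reporting source, then tabulate each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get(source_key)].append(rec)
    return {src: percent_missing(recs, variables) for src, recs in groups.items()}


# Toy records, invented for illustration only (not registry data).
records = [
    {"source": "hospital", "gender": "F", "stage": "II", "birthplace": "CA"},
    {"source": "hospital", "gender": "M", "stage": "I", "birthplace": None},
    {"source": "laboratory", "gender": "F", "stage": None, "birthplace": ""},
    {"source": "laboratory", "gender": "M", "stage": "unknown", "birthplace": None},
]
variables = ["gender", "stage", "birthplace"]

overall = percent_missing(records, variables)
by_source = percent_missing_by_source(records, variables)

for var, pct in overall.items():
    print(f"{var}: {pct} % missing")
for src, table in by_source.items():
    print(src, table)
```

Running such a tabulation before any analysis shows which variables and which reporting sources are complete enough to support the intended comparison.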