Despite advances in knowledge concerning risk factor reduction and improvements in early detection and treatment for several cancers, socioeconomic inequalities persist in cancer incidence, morbidity, mortality, and survival [1–3]. In some instances, such inequalities may even be widening [4]. The disparities in cancer burden among racial and ethnic minorities and other disadvantaged groups prompted congressional legislation (Public Law 104-208 in 1997) mandating a review of the research programs at the National Institutes of Health (NIH) by the Institute of Medicine (IOM). The IOM report [5] was published in 1999 and was followed by congressional legislation in 2000 (Public Law 106-525) requesting the establishment of the NIH National Center for Minority Health and Health Disparities and a strategic plan for health disparities research. In its 2006 review [6] of the Strategic Plan, the IOM study committee recommended that NIH research priority areas “should include, first, the development and refinement of valid measures of exposure relevant to understanding and evaluating health disparities.” As an example, it specifically called for “the inclusion of information on racial and ethnic subpopulations and other relevant characteristics, such as immigrant status, language preference, and detailed socioeconomic data” in population-based studies.

Population-based cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) Program at the National Cancer Institute (NCI) are generally the authoritative source of data for describing disparities in cancer burden among racial/ethnic groups. However, these data are mainly based on medical records and administrative information, and thus lack individual-level data on socioeconomic status (SES). Socio-demographic information on individual cancer patients in the NCI’s SEER database is limited to age, sex, race/ethnicity [7], marital status, and place of birth and residence. Key measures of individual SES, such as educational attainment, occupation, income, and employment status, are not available. Data on current health status, co-morbidity, health care access, and health-risk behaviors, such as cigarette smoking, are also lacking. Consequently, socioeconomic analyses of surveillance data on cancer incidence, disease stage, treatment, and patient survival in the U.S. have generally relied on more readily available aggregate ecological data [8, 9]. To overcome the absence of individual-level SES data in cancer registries, and to provide a unique research resource for describing disparities in cancer burden, the NCI initiated the SEER-NLMS project in 1999, linking population-based SEER cancer registry data to data from the nationally representative National Longitudinal Mortality Study (NLMS). The NLMS provides self-reported, detailed demographic and socioeconomic data from the Social and Economic Supplement to the Census Bureau’s Current Population Survey (CPS).
The objective of this record linkage project was to supplement the socioeconomic information on SEER cancer patients and to assess differentials in cancer incidence, tumor characteristics, and patient survival, based on self-reported race/ethnicity, marital status, educational attainment, income, occupation, industry, employment status, nativity/immigrant status, smoking status, health status, and availability of health insurance [10, 11].

This paper presents some initial findings that pertain to the identification of health disparities from this unique database, including cancer disparities according to individual-level socioeconomic status and demographic characteristics for all cancers combined and for cancers of the lung, breast, prostate, and cervix and melanoma of the skin. In addition, the linked database itself is described, including an overview of its structure, the record linkage methodology used to create it, data confidentiality issues, the representativeness of the cancer data, and its analytic potential for research.

Materials and methods

The Surveillance, Epidemiology, and End Results Program

Begun in 1973, the NCI SEER Program is a population-based cancer registration program, which identifies all primary cancers occurring in residents of defined geographic regions. Cancer registries of the SEER Program currently cover approximately 26% of the U.S. population. SEER collects detailed data on patient demographics, tumor characteristics, and initial therapy, and maintains follow-up of all registered patients for vital status in order to provide statistics on cancer patient survival [12]. The primary sources of SEER data are hospital medical records, pathology and radiotherapy reports, outpatient surgical center records, death certificates, and other routinely collected administrative and health records available to each registry. Quality control has been an integral part of the SEER Program since its inception [13]. Annual studies are conducted in SEER registries to evaluate the quality and completeness of the data being reported.

The Current Population Survey and National Longitudinal Mortality Study

The CPS is a monthly survey of about 50,000 households conducted by the U.S. Bureau of the Census for the Bureau of Labor Statistics. It is the primary source of information on the labor force and demographic characteristics of the U.S. population between decennial censuses. CPS samples are selected to represent the U.S. civilian non-institutional population. Respondents are interviewed either by telephone or in-person to obtain information about the employment status of each member of the household who is 15 years of age or older [14]. In March, the Annual Social and Economic Supplement (named the Annual Demographic Survey Supplement before 2003) of CPS collects in-depth information on income and a variety of demographic characteristics. Response is higher in CPS than in many other surveys. For example, the non-response rate for the March 2002 basic CPS was 8.3% and the non-response rate for the March supplement was an additional 8.6%, which amounted to a total 2002 supplement response rate of 83.8% [15].
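
The combined response rate quoted above can be checked with a short calculation. The sketch below assumes that the supplement non-response rate applies only to households that responded to the basic survey, which is the reading that reproduces the reported 83.8%; the function name is illustrative.

```python
def combined_response_rate(basic_nonresponse: float,
                           supplement_nonresponse: float) -> float:
    """Overall response rate as a fraction between 0 and 1.

    Assumes supplement non-response is conditional on having responded
    to the basic survey, so the two response rates multiply.
    """
    return (1.0 - basic_nonresponse) * (1.0 - supplement_nonresponse)


# Non-response rates for March 2002 as quoted in the text.
rate = combined_response_rate(0.083, 0.086)
print(f"{rate:.1%}")  # prints 83.8%
```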

The NLMS is an on-going mortality follow-up study of selected cohorts of CPS respondents and the 1980 E sample (a post-enumeration sample used to measure the undercount of the 1980 Decennial Census). Currently, it contains 26 cohorts: one from the 1980 E sample and 25 from CPS, totaling approximately 2.4 million people. The 25 CPS cohorts in the NLMS were sampled between 1973 and 1998, and their surveys were conducted in March 1973, February 1978, March 1979, April 1980, August 1980, December 1980, September 1985, and for each March in the period 1981–1998. The NLMS study combined the self-reported data with death certificate information to identify mortality status and cause of death for its 26 cohorts, for the purpose of studying the effects of demographic and socioeconomic characteristics on U.S. mortality rates [16].

The SEER-NLMS study

The SEER-NLMS study consists of identifying and matching SEER cancer patient records to NLMS records. Records for cancer patients diagnosed between 1973 and 2001 and reported to 11 SEER registries were matched to the 26 NLMS cohorts. The 11 participating SEER registries included the states of Connecticut (1973–2001 data), Hawaii (1973–2001), Iowa (1973–2001), Kentucky (1995–2001), Louisiana (1988–2001), and Utah (1973–2001); the metropolitan areas of Detroit (1973–2000), Los Angeles (1988–2001), Northern California (1973–2001 data that include the top 20 primary cancer sites for Greater Bay Area including San Francisco, Oakland, San Jose, and Monterey regions), and Seattle (1974–2001); and Greater California (the state of California excluding Los Angeles and Northern California; 1988–2001 data). Each participating SEER registry obtained approval from the appropriate institutional review board prior to the linkage.

The algorithm used to match SEER records to the CPS self-reports in the NLMS was derived directly from the two-step process to identify mortality in the NLMS [17] using personal identifiers: social security number (SSN), name (first and last), and date of birth (month and year). The first step consisted of the application of a computer-scoring algorithm to identify clearly true and clearly false matches by comparing a SEER patient’s record with an NLMS record. A pair agreeing on SSN was identified as a deterministic match and considered as a true match if name and birth date also agreed. Pairs that did not agree on SSN were identified as a probabilistic match if the pair agreed on name and birth date. Probabilistic matches were scored for agreement on name, year of birth, as well as variations of demographic variables such as sex, race, and place of residence. If the agreement score exceeded an upper cut-off value, the match was considered to be true. If the agreement score was below the lower cut-off value the pair was not a match. Upper and lower cut-off values of the computer algorithm were derived empirically using two databases for which manual decisions were made in advance for all pairs. The questionable matched-pairs consisted of those deterministic matches that disagreed in either sex or birth date or those probabilistic matches with a score in the middle range. In the second step, all questionable matched-pairs were judged in a manual review by a panel of three judges operating independently to decide the final outcome of true match or false match where all information on the SEER and the NLMS records was compared for agreement. An independent verification of the validity of the NLMS matching algorithm has been conducted [18] on an American Cancer Society database.
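
The first, computer-scored step of this process can be sketched as a simple decision rule. The example below is illustrative only: the cut-off values, the 0 to 1 score scale, and all names are hypothetical, and the treatment of deterministic pairs that disagree on name is an assumption, since the text specifies only disagreement on sex or birth date. The manual panel review of questionable pairs is not modeled.

```python
from dataclasses import dataclass

# Hypothetical cut-offs on a 0-1 scale; the study derived its own values
# empirically and does not report them.
UPPER_CUTOFF = 0.90
LOWER_CUTOFF = 0.50


@dataclass
class CandidatePair:
    """Agreement flags for one SEER record compared with one NLMS record."""
    ssn_agrees: bool
    name_agrees: bool
    birth_agrees: bool  # month and year of birth
    sex_agrees: bool
    score: float = 0.0  # agreement score, used only for probabilistic pairs


def classify(pair: CandidatePair) -> str:
    """Return 'true', 'false', or 'review' (referred to the manual panel)."""
    if pair.ssn_agrees:
        # Deterministic match: accepted when the other identifiers agree.
        if pair.name_agrees and pair.birth_agrees and pair.sex_agrees:
            return "true"
        # Assumption: any remaining disagreement makes the pair questionable.
        return "review"
    if not (pair.name_agrees and pair.birth_agrees):
        # Without SSN agreement, a pair must agree on name and birth date
        # to be considered at all.
        return "false"
    # Probabilistic match: decide by score against the two cut-offs.
    if pair.score > UPPER_CUTOFF:
        return "true"
    if pair.score < LOWER_CUTOFF:
        return "false"
    return "review"


pairs = [
    CandidatePair(True, True, True, True),
    CandidatePair(True, True, False, True),
    CandidatePair(False, True, True, True, score=0.95),
    CandidatePair(False, True, True, False, score=0.70),
    CandidatePair(False, False, True, True, score=0.99),
]
print([classify(p) for p in pairs])
# prints ['true', 'review', 'true', 'review', 'false']
```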

The SEER-NLMS record matching was conducted by the Census Bureau on its premises. The matched SEER-NLMS data are kept on the premises of the Census Bureau and are protected by the statutory confidentiality authority of the Census Bureau, Sect. 9 of Title 13 [19]. In all, 2.4 million NLMS records from the 25 CPS cohorts and the 1980 Census E sample were compared with 4,172,139 cancer patient records in 11 SEER registries, generating 26,844 patient matches. Of these matched patients, 2,663 were diagnosed with more than one primary cancer, resulting in a total of 29,883 primary cancers diagnosed during the period 1973–2001.

Of the 26,844 matched patients, we excluded 146 patients whose CPS survey data were incomplete and would not have been eligible for inclusion in the NLMS study. A small number of cancer patients were identified in records from more than one SEER registry (n = 106) and were excluded from the study. Because the 1980 Census E sample lacked socioeconomic information and its cohort was excluded from this study, we also excluded 1,337 patients whose SEER medical records were matched to this sample. We excluded 345 matched patients who were under 25 years of age at the time of their survey under the rationale that their reported family income was more likely reflective of their parents’ rather than their own. Thus, we limited our study to the individuals who were 25 years of age or older at the time of their survey. In addition, we excluded 3,369 patients whose cancer was diagnosed before their survey and 1,392 patients who had been diagnosed with only non-invasive cancers. Hence, 20,149 matched patients were eligible for inclusion in this study.

For the cancer incidence part of the analysis (Tables 2, 3, 4, 5), an additional 8,685 matched patients were excluded. This included 3,334 patients whose SEER records were matched to the March 1973 and February 1978 CPS cohorts (because they lack follow-up information for vital status), 2,356 matched patients who were residents of one SEER registry territory at time of their CPS survey but diagnosed in another SEER area, and 2,995 patients whose cancers were diagnosed after 1998 because the NLMS mortality follow-up for the cohorts ended by 12/31/1998. Hence, 11,464 matched patients were included for the incidence analyses. Analyses on late-stage diagnoses (Table 6) are based on 15,357 patients, after excluding the 4,792 cancer patients lacking information on tumor stage from the 20,149 eligible patients.
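
The exclusion counts reported in this and the preceding paragraph can be tallied to confirm the three analytic sample sizes. The short script below uses only the figures given in the text; the dictionary labels are paraphrases and the variable names are illustrative.

```python
matched_patients = 26_844

# Exclusions applied to define the eligible study population.
eligibility_exclusions = {
    "incomplete CPS survey data": 146,
    "records in more than one SEER registry": 106,
    "matched to the 1980 Census E sample": 1_337,
    "under 25 years of age at survey": 345,
    "cancer diagnosed before survey": 3_369,
    "non-invasive cancers only": 1_392,
}
eligible = matched_patients - sum(eligibility_exclusions.values())

# Additional exclusions for the incidence analyses.
incidence_exclusions = {
    "March 1973 and February 1978 CPS cohorts": 3_334,
    "surveyed in one SEER area, diagnosed in another": 2_356,
    "diagnosed after 1998": 2_995,
}
incidence_n = eligible - sum(incidence_exclusions.values())

# Late-stage analyses drop patients lacking tumor stage information.
late_stage_n = eligible - 4_792

print(eligible, incidence_n, late_stage_n)  # prints 20149 11464 15357
```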

Demographic, socioeconomic, and other variables

All demographic and socioeconomic variables used in this analysis are from survey self-reports, except age at diagnosis, stage at diagnosis, and sex (for matched cancer cases), which are from SEER data. For the incidence analyses, the sex variable therefore came from the NLMS for survey participants who did not have a cancer diagnosed as of December 31, 1998, i.e., whose survey record did not link to the SEER database prior to this date. For late-stage diagnosis analyses, the sex variable is from SEER data.

Race and ethnic variables were categorized as non-Hispanic white, non-Hispanic black, American Indian or Alaska Native (AI/AN), Asian or Pacific Islander (API), Hispanic with its two subcategories of Mexican Hispanic and Other Hispanic, and Other or Unknown. The “Other or Unknown” category grouped all racial and ethnic categories other than the categories specified above, including those patients with missing race or ethnicity data. Marital status was classified as married, widowed, divorced/separated, never married, and unknown status. Place of residence at the time of the survey was classified into urban, rural, and unknown based on the definitions from the 1970 census (CPS cohorts 1973–1985), the 1980 census (CPS cohort 1986–1993), or the 1990 census (CPS cohorts 1994–1998) [20, 21].

Educational attainment was grouped by years of education into four categories, plus a category for unknown: less than high school (<12 years), high school graduate (12 years), some post high school education (13–15 years), and college education or beyond (16 years or more). Family income refers to the total combined income of all family members during the 12 months preceding the survey; it was adjusted for inflation to 1990 dollars for individuals from different NLMS cohorts. The 1989 [22] median family income in the US was $35,255, with a poverty threshold of $12,674 for a four-person family. Thus, we categorized family income as <$12,500, $12,500–$24,999, $25,000–$34,999, $35,000–$49,999, $50,000 or more, and unknown. The poverty status for all individuals in the database was measured as of the 1990 census in terms of the ratio of the family income to the poverty threshold for a four-person family and grouped into ≤100%, 100 to <200%, 200 to <400%, 400 to <600%, and 600% or above.
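
The income and poverty groupings described above amount to a pair of threshold lookups. The following sketch is a hypothetical illustration, not the code used in the study: it assumes income is already expressed in 1990 dollars, uses ASCII category labels, and assigns a ratio of exactly 100% to the lowest poverty group.

```python
from typing import Optional

POVERTY_THRESHOLD = 12_674  # four-person family, as quoted in the text


def income_category(income: Optional[float]) -> str:
    """Assign family income (1990 dollars) to the categories in the text."""
    if income is None:
        return "unknown"
    if income < 12_500:
        return "<12500"
    if income < 25_000:
        return "12500-24999"
    if income < 35_000:
        return "25000-34999"
    if income < 50_000:
        return "35000-49999"
    return "50000+"


def poverty_category(income: float) -> str:
    """Group the ratio of family income to the poverty threshold."""
    ratio = income / POVERTY_THRESHOLD
    if ratio <= 1.0:
        return "<=100%"
    if ratio < 2.0:
        return "100 to <200%"
    if ratio < 4.0:
        return "200 to <400%"
    if ratio < 6.0:
        return "400 to <600%"
    return "600%+"


# The 1989 median family income falls in the fourth income band and at
# roughly 2.8 times the poverty threshold.
print(income_category(35_255), "|", poverty_category(35_255))
# prints 35000-49999 | 200 to <400%
```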

Employment status was determined on the basis of employment activity during the week prior to the survey and was classified into five categories for the present analysis: employed, unemployed (seeking work during the past 4 weeks), retired, unable to work (long-term physical or mental disability), and outside the labor force (consisting of homemakers and those in school) [10]. Employment sector was defined for those employed and included the following groupings: government (federal, state, local), private, and self-employed.

Late stage is defined as the distant stage of cancer presentation at the time of diagnosis by the SEER Historical Staging scheme. Distant-stage cancer indicates that cancers have spread from the organ/site of origin to distant sites.

Statistical analysis

Incidence analyses were conducted for all cancers combined and for six major cancers separately: lung and bronchus, colon/rectum, breast, prostate, uterine cervix, and melanoma of the skin. Age-specific cancer incidence rates were calculated by dividing the number of cancer patients in each 5-year age group by the follow-up time (in person-years) accumulated for that age group of survey participants. These age-specific rates were then age-adjusted by the direct method using the age composition of the 2000 U.S. standard population (Census p25-1130). Follow-up time for each individual started from the CPS survey date up until the date of the underlying cancer diagnosis, loss to follow-up (available only for matched patients), death, or end of study (12/1998), whichever occurred first. It was accumulated into different age groups as the individual aged. In computing the incidence rates for all cancers combined, only the first primary cancer diagnosed in a patient was counted, regardless of the cancer site, and follow-up time was allowed to accumulate only until the date of diagnosis of that first cancer. When computing the incidence rate for a specific cancer, such as female breast cancer, only the first primary breast cancer occurring in a patient was considered and the follow-up time contribution for that individual stopped at the date of diagnosis of that first breast cancer although the patient might have been diagnosed with another cancer prior to her breast cancer diagnosis.
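
The two computational steps in this paragraph, allocating each person's follow-up time to 5-year age groups and directly standardizing the age-specific rates, can be sketched as follows. The case counts, person-years, and standard weights in the example are hypothetical and cover only two age bands; they are not the 2000 U.S. standard population weights or study data, and the function names are illustrative.

```python
from typing import Dict


def split_person_years(age_start: float, age_end: float,
                       width: int = 5) -> Dict[int, float]:
    """Allocate follow-up between two ages to consecutive age bands.

    Keys are the lower bound of each band (e.g. 45 for ages 45-49).
    """
    allocation: Dict[int, float] = {}
    age = age_start
    while age < age_end:
        lower = int(age // width) * width
        upper = min(lower + width, age_end)
        allocation[lower] = allocation.get(lower, 0.0) + (upper - age)
        age = upper
    return allocation


def age_adjusted_rate(cases: Dict[int, int],
                      person_years: Dict[int, float],
                      std_weights: Dict[int, float],
                      per: float = 100_000.0) -> float:
    """Directly standardized rate: weighted sum of age-specific rates."""
    total_weight = sum(std_weights.values())
    rate = 0.0
    for band, weight in std_weights.items():
        rate += (weight / total_weight) * cases.get(band, 0) / person_years[band]
    return per * rate


# A person surveyed at age 48 and followed to age 57 contributes
# time to three age bands.
print(split_person_years(48, 57))  # prints {45: 2.0, 50: 5.0, 55: 2.0}

# Hypothetical two-band example.
cases = {45: 10, 50: 30}
person_years = {45: 10_000.0, 50: 15_000.0}
std_weights = {45: 0.6, 50: 0.4}
adjusted = age_adjusted_rate(cases, person_years, std_weights)
print(f"{adjusted:.1f} per 100,000")  # prints 140.0 per 100,000
```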

Adjusted incidence rate ratios (i.e., hazard ratios) and their 95% confidence intervals were derived using Cox regression models that stratified baseline risks of cancer diagnosis by NLMS cohort and by age at the survey. The six age strata used were: 25–34, 35–44, 45–54, 55–64, 65–74, and 75 years or older. Follow-up times were recoded in months.
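
In generic form, a stratified Cox model of this kind allows each stratum its own baseline hazard while the covariate effects are shared across strata. The notation below is an illustrative assumption rather than notation taken from the analysis: s indexes a stratum (an NLMS cohort by age-group combination), x is the covariate vector, and the rate ratio for covariate j is the exponentiated coefficient.

```latex
h_{s}(t \mid \mathbf{x}) = h_{0s}(t)\,\exp\!\left(\boldsymbol{\beta}^{\top}\mathbf{x}\right),
\qquad
\mathrm{RR}_{j} = \exp(\beta_{j})
```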

To analyze disparities in the likelihood of late- or distant-stage diagnoses for colorectal, prostate, and breast cancer, logistic regression models adjusting for age at diagnosis (25–54, 55–64, 65–74, and 75+ years), period of diagnosis (1973–1989, 1990–1994, and 1995–2001), and SEER registry were used. Results of the late-stage diagnosis analyses are presented as adjusted odds ratios with their corresponding 95% confidence intervals. All analyses were performed using SAS statistical software (SAS Institute, Inc., Cary, North Carolina). All statistical tests are two-sided and the level of statistical significance is 0.05.
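
For readers less familiar with how adjusted odds ratios are obtained from a fitted logistic model, the sketch below exponentiates a coefficient and its Wald-type confidence limits. The coefficient and standard error are hypothetical, not values from this study, and the use of Wald limits is an assumption about the reporting convention.

```python
import math
from typing import Tuple


def odds_ratio_ci(beta: float, se: float,
                  z: float = 1.959964) -> Tuple[float, float, float]:
    """Return (odds ratio, lower limit, upper limit) for a coefficient.

    Uses a Wald-type interval: exp(beta +/- z * se), with z set for a
    two-sided 95% interval by default.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical coefficient and standard error.
odds_ratio, lower, upper = odds_ratio_ci(beta=0.8329, se=0.25)

# At the two-sided 0.05 level, the effect is significant when the
# 95% interval excludes 1.
significant = lower > 1.0 or upper < 1.0
print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f}), "
      f"significant: {significant}")
```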


Representativeness of matched cancer cases included for study

Table 1 compares the distribution of selected characteristics among matched SEER-NLMS patients who were included in the incidence analysis with that for the full SEER registry case file originally submitted for matching. Due to the large size of the study population, comparisons within each category of characteristics (age group, sex, etc.) were statistically significant. The magnitude of most of the differences, however, is small, and thus likely not of practical importance. Men are slightly over-represented among matched cases included in these analyses. While whites form essentially the same percentage of submitted and included cases, blacks are under-represented and Asian/Pacific Islanders are over-represented in included cases. The percentages of non-Hispanic whites and Hispanics included in the incidence analysis are similar to those for the originally submitted cases. Differences in years of diagnosis reflect the higher likelihood of being matched to NLMS cohorts for patients diagnosed in later years than for those diagnosed in earlier years. Overall, the magnitude of the differences is small and the population of patients included in these analyses can be considered to be reasonably representative of the total SEER patient population from which they were drawn.

Table 1 Comparison of SEER cancer patient demographic characteristics, year of cancer diagnosis, and cancer site between matched cancer patients (used in incidence analyses) and original SEER case file

Selected findings on individual-level SES disparities in cancer

Differentials in cancer incidence

Tables 2, 3, 4, 5 show site-specific cancer incidence counts, age-adjusted rates, standard errors, rate ratios, and corresponding 95% confidence intervals, by race/ethnicity, educational attainment, family income, poverty status, employment status, employment sector, marital status, and rural/urban residence. Although data are provided for all cancers combined for the purpose of showing how the total cancer incidence burden varies by SES characteristics, the emphasis is placed on interpreting SES disparities in incidence of specific cancers, as they are likely to reveal important clues regarding cancer etiology and the distribution of risk factors by measures of socioeconomic status.

Table 2 Age-adjusted incidence rates, standard errors (SE), covariate-adjusted rate ratios (RR), and 95% confidence intervals (CI) by selected socioeconomic and demographic characteristics: all cancers combined
Table 3 Age-adjusted incidence rates, standard errors (SE), covariate-adjusted rate ratios (RR), and 95% confidence intervals (CI) by selected socioeconomic and demographic characteristics: lung cancer
Table 4 Age-adjusted incidence rates, standard errors (SE), covariate-adjusted rate ratios (RR), and 95% confidence intervals (CI) by selected socioeconomic and demographic characteristics: colorectal cancer, prostate cancer, and female breast cancer
Table 5 Age-adjusted incidence rates, standard errors (SE), covariate-adjusted rate ratios (RR), and 95% confidence intervals (CI) by selected socioeconomic and demographic characteristics: melanoma and cervical cancer

There were consistent gradients in incidence rates for major cancers such as lung, female breast, prostate, cervix, and melanoma of the skin by self-reported educational attainment, family income, and poverty status. For example, during 1979–1998, men with less than a high school education and those with a high school education had lung cancer rate ratios of 3.01 and 2.32, respectively, compared to their college-educated counterparts (Table 3). Educational gradients in lung cancer for women were smaller than those for men. Women with less than a high school education and those with a high school diploma had lung cancer rate ratios of 2.02 and 1.74, respectively, compared to women with at least a college degree. For prostate and female breast cancers (Table 4), higher educational attainment was associated with higher cancer incidence. Compared to their college-educated counterparts, men and women with less than a high school education had rate ratios of 0.79 and 0.74 for prostate and breast cancer incidence, respectively. Educational differences in colorectal cancer were small but statistically significant, with those with a high school education or less having a rate 1.45 times that of those with a college education. Educational differentials in melanoma of the skin and cervical cancer were significant, although the numbers of cases are much smaller than for the cancer sites described above (Table 5). Compared to those with a college education, those with less than a high school education had a reduced risk for melanoma of the skin (rate ratio = 0.55), but an elevated risk for cervical cancer (rate ratio = 3.24).

Income gradients in male and female lung cancer incidence were significant (Table 3), with those with family incomes less than $12,500 having an incidence rate more than 1.7 times that of those with family incomes of $50,000 or more. The income gradient for prostate cancer incidence (Table 4) shows men with lower incomes at reduced risk relative to those with a family income of $50,000 or more. An income gradient was also observed for melanoma of the skin. Those with family incomes less than $12,500 and $12,500–$24,999 had rate ratios of 0.59 and 0.88, respectively, relative to those with a family income of $50,000 or more. There were substantial gradients for both income and poverty in cervical cancer incidence. Women at or below 100% and at 100–200% of the poverty threshold had cervical cancer rate ratios of 4.30 and 3.35, respectively, relative to women with family incomes exceeding 600% of the poverty threshold.

Substantial racial/ethnic variations in incidence rates are noted for all cancers combined as well as for the specific cancers examined (Tables 2, 3, 4). Compared to non-Hispanic whites, Hispanics and Asian/Pacific Islanders had significantly lower incidence rates for all cancers combined as well as for several other cancers. Specifically, compared to non-Hispanic whites, Mexicans had a lower overall cancer rate (rate ratio = 0.73), lower rates of lung cancer (male rate ratio = 0.55, female rate ratio = 0.25), and a lower rate of female breast cancer (rate ratio = 0.73). Compared to non-Hispanic whites, Asian/Pacific Islanders had lower rates of overall cancer (rate ratio = 0.74), male lung cancer (rate ratio = 0.65), female lung cancer (rate ratio = 0.56), colorectal cancer (rate ratio = 0.77), prostate cancer (rate ratio = 0.59), and female breast cancer (rate ratio = 0.82). Compared to non-Hispanic white men, non-Hispanic black men had a higher overall cancer rate (rate ratio = 1.49), with higher rates of lung cancer (rate ratio = 1.73) and prostate cancer (rate ratio = 1.87), while non-Hispanic black women had a higher rate of cervical cancer (rate ratio = 2.00) relative to non-Hispanic white women. Colorectal cancer rates were also higher among non-Hispanic blacks (rate ratio = 1.44).

Tables 2, 3, 4, 5 also show site-specific incidence rates and rate ratios by marital status, employment status, employment sector/class of worker, and rural/urban residence. Worth noting are the significantly increased rates of lung cancer associated with divorce or separation and with unemployment. Divorced or separated men and women had higher rates of lung cancer than their married counterparts (rate ratios = 1.34 and 1.83, respectively); as did unemployed men and women compared to their employed counterparts (rate ratios = 1.83 and 2.09, respectively). Relative to married women, women who were divorced/separated, or never married had higher risks of cervical cancer (rate ratios = 1.74 and 1.80, respectively). Incidence rates did not vary significantly by rural–urban residence for any of the cancers examined.

Differentials in late-stage cancer diagnosis

Table 6 shows demographic and socioeconomic effects on the likelihood of late-stage cancer diagnoses. The P-values are from testing for the overall effect of each demographic and SES characteristic by using the Wald test statistic. The overall test (with more than one degree of freedom) was not a trend test (with one degree of freedom), because we did not assume that the effect of an SES characteristic is linear. Lower income was statistically significantly associated with an increased likelihood of being diagnosed with a late-stage prostate (P = 0.002) or breast cancer (P = 0.02). For example, men with family incomes less than $12,500 and between $12,500 and $24,999 had elevated odds of late-stage disease compared to men with family incomes ≥$50,000. The odds for late-stage breast cancer for the two lowest income categories are 2.3 and 1.8 times higher than those of the highest income group, respectively. In terms of racial/ethnic differences, the odds of being diagnosed with late-stage prostate cancer for non-Hispanic black males were 2.6 times higher, and the odds of being diagnosed with late-stage breast cancer for non-Hispanic black females were 2.2 times higher, than those of their non-Hispanic white counterparts. The likelihood of a diagnosis of late-stage colorectal cancer did not vary significantly for any of the SES characteristics examined.

Table 6 Differentials in distant-stage cancer diagnoses among those aged 25+ years at cancer diagnosis by selected baseline socioeconomic and demographic characteristics


Reducing disparities in overall health and in cancer outcomes is a major priority of the U.S. Department of Health and Human Services and of the National Cancer Institute [6]. Reliable data on cancer-related health disparities among socioeconomic and demographic groups are required to set and track the national goals for reducing such disparities. Using data from the SEER-NLMS record linkage study, we have documented for the first time disparities in cancer incidence and late-stage diagnosis by a variety of self-reported individual-level socioeconomic and demographic characteristics for a major segment of the US population. The findings reported here should serve as important baseline statistics for the United States and aid in making future domestic and international comparisons of cancer rates based on individual-level social inequalities in cancer incidence and stage at diagnosis.

The magnitude of individual-level SES disparities in cancer incidence and patient survival shown here may differ from those based on area-level SES data. In the absence of individual socioeconomic information, researchers have often used area-based socioeconomic characteristics of places of residence (e.g., county, zip code, census tract, or block group) appended to cancer and other disease/health records to analyze socioeconomic disparities [23–28]. However, area-based socioeconomic measures are qualitatively and conceptually different from individual-level SES variables [29]. They should not be viewed as proxies for the individual information when the latter is not available. Rather, they should be viewed as community, neighborhood, or social structural influences, which may contribute to individual cancer risks, independently from individual socioeconomic characteristics [29, 30]. We plan in our future studies to employ a multilevel framework to examine both area- and individual-level socioeconomic inequalities in cancer incidence, stage, and patient survival utilizing the SEER-NLMS linked data.

The major findings of this study are generally consistent with the patterns identified in the literature [31–41]. The racial/ethnic patterns in cancer incidence based on this linkage study are generally consistent with those obtained from the cross-sectional SEER data in California for the period 1979–1998 [42]. Significant ethnic and SES disparities in overall cancer incidence were found in the California study, with Asian/Pacific Islanders, Mexicans, and other Hispanics experiencing lower incidence rates and non-Hispanic blacks and those in lower education and income strata experiencing higher rates. However, the magnitude and the direction of the relationship between SES and cancer incidence varied by cancer site and gender. In a study of cancer patients in the San Francisco Bay area SEER registry, the inverse socioeconomic gradients in lung and cervical cancer incidence were particularly pronounced, whereas breast and prostate cancer and melanoma incidence increased substantially with increasing SES [43]. Others have reported socioeconomic patterns in cancer stage that were generally consistent with our study results across the cancers examined; e.g., late-stage diagnosis associated with lower SES [36, 44–46].

Social disparities in cancer incidence may be related to socioeconomic and demographic differences in cancer-related risk factors and behaviors, such as cigarette smoking, poor diet, physical inactivity, obesity, reproductive factors, human papillomavirus (HPV) infection, and sun exposure [31, 47, 48]. Disparities in health care access and use [49], particularly in preventive health services, such as cancer screening [8, 50–52], may contribute to differentials in cancer stage distributions, especially in the late stage diagnosis. Individuals at lower levels of SES, particularly with low educational attainment, are more likely than those with higher education or higher SES levels to be current smokers, to be physically inactive, and to be obese [47]. Marked marital status differentials in cancer incidence may partly reflect differences in SES, behavioral factors [49], social networks, and social support characteristics. More research is needed to determine the causal factors underlying socioeconomic risk gradients, in order to develop innovative and targeted health promotion strategies. For example, Harris [31] noted that smoking behavior was sensitive to price: a tax reform policy may then reduce smoking in low socioeconomic populations, who are most at risk of lung cancer.

Our study is limited by small numbers of cancers diagnosed in some groups. In addition, cancer incidence rates shown in this paper may be underestimated if CPS respondents moved to a non-SEER area and were subsequently diagnosed with cancer. Other limitations of the study include the exclusion of the institutionalized population in the CPS and the time-fixed nature of the covariates over the relatively long cancer incidence follow-up. It is important to point out that socioeconomic characteristics measured closer to the time of cancer diagnosis may be a poor indicator of the effects of socioeconomic position accumulated over the life course [53]. Some characteristics, such as educational attainment, are nearly stable or fixed after 25 years of age, while others, such as income [15], marital status, and employment status, are more likely to change over time. However, because we used broad family income and occupation categories, the relative impact of any expected changes in social mobility or time-varying covariates should be somewhat minimized. It is also possible that cases matched to the NLMS cohorts are a biased subset of cancer cases identified by SEER Program registries. While analyses of the representativeness of cases included in this study show statistically significant differences, this is not surprising given the large number of cases involved. The magnitude of the differences is small, however, decreasing their epidemiologic importance.

The analytic potential of this linked longitudinal database is not limited to the types of analyses shown here. The database can be used to analyze individual-level variations in site-specific cancer incidence, patient survival, mortality, stage at diagnosis, extent of disease, and treatment by a variety of self-reported characteristics. In addition to the variables we included in our analyses, there are data available from the survey on detailed race/ethnicity, ethnic origin, household size and composition, housing type and tenure, residential mobility, internal migration, veteran status, metropolitan/suburban/non-metropolitan residence, industry, earnings, welfare assistance, labor supply (annual number of hours worked), unemployment duration, availability and type of health insurance coverage, cigarette smoking, and self-assessed health status. In this study we focused on the individual effects of the various socioeconomic factors on cancer rates, controlling for age and period of diagnosis, SEER registry area, and sex when relevant. In our future analyses, we will examine the effects of these factors on cancer outcomes simultaneously, because they may confound one another.

The SEER-NLMS record linkage study has enabled an evaluation of the quality of demographic data (e.g., race/ethnicity and place of birth) available from medical records and reported by SEER registries as compared with the self-reported data and its impact on health disparity studies [16]. It will also allow multilevel modeling of the effects of area deprivation, environmental factors, health services, and individual socioeconomic status on various cancer outcomes; and assess changing socioeconomic and geographic patterns in cancer incidence, mortality, stage of disease, and survival over time. Moreover, since the SEER-NLMS is being expanded to include additional CPS cohorts and additional cancer patients both from more recent years of diagnoses and from the participation of all SEER registries, the expansion will add greatly to the analytic capability of the linked SEER-NLMS data, which is currently partly limited by its small numbers in certain sociodemographic subgroups. The addition of Medicare enrollment and claims data (from 1990 onward) increases even further the research potential of the linked SEER-NLMS data.