Background

The Surveillance, Epidemiology, and End Results (SEER) program is an authoritative source for cancer statistics that President Richard Nixon initiated on January 1, 1973. The program is funded by the National Cancer Institute to provide cancer data to the public for clinical studies, with the goal of lowering the cancer burden in the United States (US) [1]. The SEER program collects demographic, clinical, and outcome data on all malignancies diagnosed in representative geographic regions and subpopulations in the US. The program began with 9 tumor registries and now includes 22 US geographic areas, encompassing about 48% of the total cancer patients in the US population. Information about detailed surgical procedures has been included in the program since 1983, and tumor types have also been covered since 1998. In addition, specific tumor hallmarks have been included for testicular, breast, and prostate cancers since 2004. Based on the 7th edition of the American Joint Committee on Cancer Staging Manual, SEER data were markedly enriched with tumor grades, invasion/metastasis status (bone, brain, lung, and liver), site-specific variables, and tumor stages. Information about the types of radiotherapy, surgical procedures, and the status of chemotherapy was later included in the program.

One of the main aims of the SEER program is to record cancer incidence and mortality rates for the entire US population. To provide insight into potential etiologies, the program monitors trends in annual cancer incidence to detect unusual changes in certain cancers that occur in populations stratified by demographic, geographic, and social characteristics. In addition, it facilitates the accumulation of information about disease progression, the identification of prognostic factors, the patterns of healthcare and clinical practices, as well as the variables for determining patient survival quality. As one of the most widely used open-access databases, SEER has facilitated the development of precision medicine and individualized therapies, which could enhance the quality of health care, cut unnecessary costs, improve prevention strategies, and encourage healthy lifestyles at the population level [2,3,4]. The SEER database can also be used in observational studies and in national and local public health programs that promote health through the prevention and control of diseases [5,6,7]. Moreover, SEER-based studies have proven useful in the dissection of disease etiologies and have provided guidance for measures that aim to eliminate ethnic disparities [8, 9]. More than 17,000 articles published from 1973 to 2020 used the SEER database as the primary source of data, and more than 86,000 articles referenced SEER in their studies. Figure 1 shows the progressive growth in the number of published articles based on SEER data in PubMed over the past 25 years (1998–2022). Considering that a handy user guide for the application of the SEER database is still lacking, this review aims to discuss the commonly used methodologies and study designs for SEER-based research.

Fig. 1
Research articles based on the Surveillance, Epidemiology, and End Results (SEER) database (not SEER-Medicare) that were published in journals from 1998 to 2022, as retrieved from PubMed. The joinpoint analysis program chose the most suitable log-linear regression model to detect calendar years (known as “joinpoints”) with significant changes in APCs, allowing for the minimum number of joinpoints necessary to fit the data. Joinpoint regression analyses detected three segments (1998–2008, 2008–2015, and 2015–2022) that had significant APC changes in the number of published papers. The diamond dots reflect the observed values, whereas the line formed via joinpoint analysis represents the predicted values. The data were accessed on April 23, 2023. Asterisks (*) represent a P-value less than 0.05. APC: annual percentage change

Data are of paramount importance in today’s world [10]. In particular, “big data” is thought to have a considerable positive impact on the healthcare system, as in finance and other systems [11]. High degrees of dimensionality, continuous and rapid renewal, scarcity, and irregularity are characteristics of clinical data [12]. To better use big data, it is necessary to overcome various challenges related to technologies, populations, and organizational differences [13]. In addition, identifying the available medical databases, data-mining methodologies, and data standardization procedures is essential for successful and reliable clinical and epidemiological studies [14, 15]. To facilitate the use of the SEER database, we discuss 10 commonly used analytic approaches and 7 study designs. Typical examples are provided for each topic to make it easier for readers to understand the practical application of the SEER database (Fig. 2). SEER regularly updates the patient-specific and tumor-specific variables in the database. The common variables currently used in the SEER database, including patient demographics, socioeconomic and geographic characteristics, primary tumor locations, tumor morphologies, stages at diagnosis, first-course treatments, follow-up for vital status, causes of death, and other descriptions, are shown in Table 1.

Fig. 2
The available methodologies and study designs used in Surveillance, Epidemiology, and End Results (SEER)-based analyses. There are more than 10 analytic methodologies and 7 study designs available for the analysis of the SEER data. The selection of proper study design and analytic methodologies is crucial for utilizing SEER data to generate clinical benefits

Table 1 Commonly used variables in SEER database

Statistical methods

Logistic regression model

The logistic function was developed during the nineteenth century to describe population expansion and the progress of autocatalytic chemical processes [16]. The binary logistic regression model is one of the most extensively used prediction models in medicine for predicting the occurrence of a clinical event, such as disease, recurrence, mortality, or recovery. A closed exponential formula is applied to calculate the probability of an occurrence based on a set of parameters [17]. Odds ratios (ORs), which compare the odds of a binary outcome between groups, are commonly reported in the medical literature [18]. Logistic regression analysis is a type of generalized linear model [19, 20] that is frequently used in SEER-based studies for short-term survival analysis (less than 1 year) [21, 22]. As a measure of short-term surgical outcome, the 1-month survival rate has been widely used for the evaluation of treatment effectiveness [23]. For example, logistic regression was used to identify two covariables associated with 1-month mortality in 5428 surgically treated brain tumor patients [24]. The authors found that pediatric patients under 1 year old had a significantly higher risk of 1-month mortality [OR = 5.9, 95% confidence interval (CI) 3.4–10.4]. Identifying the patients who are suitable for a given treatment is also an efficient way to implement precision medicine. Whether or not patients should undergo treatment has been one of the hot topics in cancer research [25]. In one study, external-beam radiation was independently related to higher 1-year survival in postoperative patients with gallbladder cancer. In addition, patients of a younger age with tumor spread beyond the serosa, intermediate to poorly differentiated tumors, and lymph node metastases were more likely to have received external-beam radiation treatment (OR > 1) [26].
Logistic regression has become a standard statistical tool for SEER-based research, such as risk assessments in the presence of synchronous metastases. In particular, the associations of age and sex with the presence of synchronous brain metastases (SBMs) have been studied intensively [20, 27]. In addition, logistic regression is widely used to estimate propensity scores by regressing the binary treatment or exposure indicator on pretreatment variables [28].
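
As a minimal sketch of how an OR and its 95% CI relate to the logistic model, the following Python example computes both from a 2×2 table using the standard Wald interval. The counts and function names are hypothetical illustrations and are not taken from any SEER-based study.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a: exposed, event      b: exposed, no event
    c: unexposed, event    d: unexposed, no event
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

def logistic(x):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical counts (not SEER data):
# 20 of 100 exposed and 10 of 200 unexposed patients had the event.
odds_ratio, ci_low, ci_high = odds_ratio_ci(20, 80, 10, 190)

# With a single binary covariate, the logistic model is
# logit(p) = intercept + beta * exposed, where beta = log(OR).
intercept = math.log(10 / 190)   # log-odds in the unexposed group
beta = math.log(odds_ratio)
p_exposed = logistic(intercept + beta)  # recovers the observed risk 20/100
```

In a multivariable model the coefficients are estimated jointly by maximum likelihood, but each exponentiated coefficient is still read as an adjusted OR.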

Cox proportional-hazards model

David Cox established the proportional-hazards model in 1972 to evaluate how multiple covariates influence the time to failure of a system [29]. Cox proportional-hazards regression is one of the most commonly used regression methods for survival analysis and is used to correlate multiple risk variables or exposure types with survival time [30, 31]. In most cases, different groups are compared based on their hazards, and thus the hazard ratio (HR) is used; its interpretation is analogous to that of an OR in logistic regression analysis [32]. The SEER program provides long-term follow-up outcome data that are regularly updated, making it ideal for Cox regression analyses. For example, in a multivariable Cox regression for all-cause mortality among breast cancer patients with SBMs from the SEER database, the triple-negative subtype of the primary tumor (vs. hormone receptor+/HER2: HR = 1.98, 95%CI 1.56–2.50) had the highest adjusted risk of death [33]. The best treatment modality for patients with malignancies has been intensively studied by Cox regression analyses. Pausch et al. [34] found that chemotherapy and cancer-directed surgery are significant protective prognostic factors (HR < 1, P < 0.05) for patients with oligometastatic pancreatic ductal adenocarcinoma (PDAC). In addition, several SEER-based studies reported the application of Cox proportional-hazards models to evaluate the associations of examined lymph node count [35], socioeconomic status [36], insurance status [37], marital status [38], and other clinicopathological variables with the prognosis of cancer patients. Within this class of analysis, the Kaplan–Meier method has been used to estimate survival, and a stratified log-rank test has been used to assess differences in survival [39]. It should be noted that a restricted cubic spline (RCS) function is needed to model continuous covariates when nonlinearities appear [20, 40].
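
The Kaplan–Meier estimate mentioned above can be sketched in a few lines of Python. This is an illustrative implementation under simple assumptions (right censoring only, made-up follow-up times); in practice, established survival-analysis packages would be used.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.

    times:  follow-up time for each patient
    events: 1 if the patient died at that time, 0 if censored
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve

# Hypothetical follow-up in months (not SEER data).
km_curve = kaplan_meier(times=[2, 3, 3, 5, 8, 9], events=[1, 1, 0, 1, 0, 1])
```

Censored patients contribute to the number at risk until their last follow-up, which is why the estimate at month 5 is 4/9 rather than the naive 3/6.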

Competing-risks model

Prognostic models should consider competing events because they affect assessments of the impact of the event of interest and, thus, the benefit of an intervention [41]. The competing-risks data inherent in medical research can be analyzed using proportional cause-specific hazard and proportional subdistribution hazard (SDH) models [42]. The Fine-Gray regression method introduced by Fine and Gray [43] in 1999 is one of the most widely used models for proportional-hazards modeling of the SDH. SDH models are considered to be more desirable for direct evaluations of actual hazards, and, therefore, they can be used for prognosis assessment and in medical decision-making [44]. Although the cause of mortality can be difficult to define accurately, the SEER program divides the cause of death into cancer-specific death and other causes of death. Either of these two groups can be set as the main event, with the other as the competing event. For example, Li et al. [45] used a Cox regression model in a SEER-based analysis and found that the risk of death from other causes increased with age. This finding was supported by a competing-risks model, which indicated an association between advanced age and an increased risk of all-cause death. Another SEER-based study found that the prognosis was worse in Medicaid patients than in insured patients (subhazard ratio = 1.87, 95%CI 1.72–2.04, P < 0.0001) based on a Fine-Gray competing-risks model. It should be noted that the cumulative incidence function is typically used instead of Kaplan–Meier curves in the case of competing risks since the Kaplan–Meier estimator often overestimates the cumulative incidence in the presence of competing risks [46, 47].
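
The overestimation described above can be illustrated with a small Python sketch that computes the nonparametric cumulative incidence function alongside the naive complement of the Kaplan–Meier estimate, in which competing deaths are treated as censored. The data and the cause coding (0, 1, 2) are hypothetical choices made for this example.

```python
def cumulative_incidence(times, causes, cause_of_interest=1):
    """Cumulative incidence of one cause in the presence of competing risks.

    causes: 0 = censored, 1 = event of interest (e.g., cancer-specific death),
            2 = competing event (e.g., death from other causes)
    Returns (cif, naive), where naive is 1 - KM with competing events censored.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    overall_survival = 1.0   # probability of being free of any event
    km_survival = 1.0        # naive KM that censors competing events
    cif = 0.0
    i = 0
    while i < len(data):
        t = data[i][0]
        d_interest = d_any = removed = 0
        while i < len(data) and data[i][0] == t:
            cause = data[i][1]
            d_interest += int(cause == cause_of_interest)
            d_any += int(cause != 0)
            removed += 1
            i += 1
        # Probability of surviving all causes so far, then failing from the cause of interest.
        cif += overall_survival * d_interest / n_at_risk
        overall_survival *= 1 - d_any / n_at_risk
        km_survival *= 1 - d_interest / n_at_risk
        n_at_risk -= removed
    return cif, 1 - km_survival

# Hypothetical data (not SEER data).
cif, naive = cumulative_incidence(times=[1, 2, 3, 4, 5, 6], causes=[1, 2, 1, 2, 0, 1])
```

In this toy data set the naive estimate reaches 1.0, whereas the cumulative incidence is 2/3, because patients who die of the competing cause can no longer experience the event of interest.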

RCS model

As reported previously [48], cubic spline functions are computationally easy to use, and they can define various geometries if sufficient knots are included. RCSs, developed by Stone and Koo [49], are cubic splines that are constrained to be linear in the tails of a distribution. Herndon and Harrell [50] demonstrated that in a homogeneous setting (i.e., with no covariables), the RCS hazard function has enough flexibility to describe a wide range of hazard-function shapes without becoming computationally intractable. However, only a few continuous variables appropriate for RCS analysis can be obtained from the SEER program [20].
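
A minimal sketch of an RCS basis is shown below, using the truncated-power parameterization commonly attributed to Harrell. The knot locations and the use of age as the covariate are assumptions made purely for illustration.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for one value x.

    With k knots, returns k - 1 terms: x itself plus k - 2 nonlinear terms.
    Each nonlinear term is zero below the first knot and linear above the last.
    """
    def cube_plus(u):
        return max(u, 0.0) ** 3

    t_last, t_prev = knots[-1], knots[-2]
    width = t_last - t_prev
    terms = [x]
    for t_j in knots[:-2]:
        terms.append(
            cube_plus(x - t_j)
            - cube_plus(x - t_prev) * (t_last - t_j) / width
            + cube_plus(x - t_last) * (t_prev - t_j) / width
        )
    return terms

# Hypothetical knots for age at diagnosis, in years.
knots = [40.0, 55.0, 65.0, 80.0]
basis_at_60 = rcs_basis(60.0, knots)
```

The resulting terms can be entered as covariates in a logistic or Cox model, so that the fitted relationship between age and the outcome is smooth inside the knots and linear in both tails.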

Poisson regression model

Poisson regression is a generalized linear model that is used when the dependent variable consists of count data [51]. It is suitable for summarizing relative risks and analyzing complicated interactions among factors. In addition, Poisson regression can be broadly applied to the estimation of disease incidence based on assumptive etiological processes of exposure or disease-related features in a population [52]. For example, Tsikitis et al. [53] used Poisson regression to evaluate trends in incidence rates of gastrointestinal neuroendocrine tumors over time, with the year of diagnosis as a continuous variable. In addition, Muskens et al. [54] utilized a Poisson regression model to compare age-adjusted incidence rates of pediatric glioma and medulloblastoma in a multiple-variable analysis. A Poisson regression model was also used in a SEER-based study to examine the characteristics of Wilms tumors that impacted lymph node density [55].
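
For the simplest case of two groups, a Poisson model with a log person-time offset reduces to an incidence rate ratio, which can be sketched as follows. The case counts and person-years are hypothetical, and the Wald interval on the log scale is one common choice among several.

```python
import math

def incidence_rate_ratio(cases_a, person_years_a, cases_b, person_years_b, z=1.96):
    """Incidence rate ratio (group A vs. group B) with a Wald 95% CI.

    Equivalent to exp(beta) from a Poisson regression with one binary
    covariate and log(person-years) as the offset.
    """
    rate_a = cases_a / person_years_a
    rate_b = cases_b / person_years_b
    irr = rate_a / rate_b
    se_log_irr = math.sqrt(1 / cases_a + 1 / cases_b)
    lower = math.exp(math.log(irr) - z * se_log_irr)
    upper = math.exp(math.log(irr) + z * se_log_irr)
    return irr, lower, upper

# Hypothetical counts (not SEER data): 150 vs. 100 cases per 1,000,000 person-years.
irr, irr_low, irr_high = incidence_rate_ratio(150, 1_000_000, 100, 1_000_000)
```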

Nomogram

A nomogram provides an easy-to-interpret graphical depiction of a statistical prediction model that can predict the probability of a particular clinical event [56]. Because of their ability to provide personalized predictions, nomograms can be used to identify high-risk populations and stratify patients in clinical trials. The combination of a user-friendly interface with easy online access has led to their widespread acceptance by both oncologists and patients [57]. Iasonos et al. [56] described the following steps for constructing a nomogram for cancer patients: (1) screening patients; (2) determination of outcome; (3) screening significant predictors; (4) construction of a nomogram; (5) validation of the nomogram; and (6) interpretation of the nomogram. The nomograms in previous SEER-based studies have primarily been constructed based on logistic regression, Cox regression, and competing-risks models. Pan et al. [58] applied a Cox regression model to screen 9 prognostic factors for the overall survival (OS) of patients with inflammatory breast cancer. They developed a nomogram that was internally and externally validated to predict the 1-, 3-, and 5-year OS rates for patients with inflammatory breast cancer. Wu et al. [59] used a logistic regression model to identify 3 independent factors for the construction of a nomogram that can predict the lymph node metastatic status of breast mucinous carcinoma. The nomogram can also be constructed using a competing-risks model to predict the survival of patients with node-negative localized renal cell carcinoma [60]. In addition, nomograms can be used for the clinical risk stratification of malignancies [61, 62].
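
The construction step can be illustrated with a sketch of the linear point-scaling convention that nomograms typically use: the predictor with the largest possible effect on the linear predictor is assigned 100 points, and the other predictors are scaled relative to it. The coefficients, predictor names, and ranges below are hypothetical and do not come from any of the cited nomograms.

```python
def make_point_scale(coefs, ranges):
    """Build a function that converts predictor values into nomogram points.

    coefs:  {name: regression coefficient}
    ranges: {name: (minimum, maximum)} of each predictor
    """
    spans = {name: abs(coefs[name]) * (ranges[name][1] - ranges[name][0])
             for name in coefs}
    max_span = max(spans.values())

    def points(name, value):
        low, high = ranges[name]
        beta = coefs[name]
        # The reference value is the one with the lowest contribution (0 points).
        reference = low if beta >= 0 else high
        return 100.0 * beta * (value - reference) / max_span

    return points

# Hypothetical Cox coefficients (log hazard ratios), for illustration only.
coefs = {"age": 0.03, "tumor_size_cm": 0.20, "surgery": -0.60}
ranges = {"age": (20, 90), "tumor_size_cm": (0, 10), "surgery": (0, 1)}
points = make_point_scale(coefs, ranges)

# Total points for a hypothetical 65-year-old patient with a 4 cm tumor and no surgery.
total_points = points("age", 65) + points("tumor_size_cm", 4) + points("surgery", 0)
```

The total points are then mapped back to a predicted probability through the underlying model, which is the part of the nomogram drawn as the bottom axes.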

Regression using the least absolute shrinkage and selection operator (LASSO)

The LASSO was developed by Tibshirani [63]. The merit of this method is that it shrinks certain coefficients and sets others to zero, thereby retaining the best characteristics of both subset selection and ridge regression. The validity of LASSO regression was later demonstrated mathematically by Zhao et al. [64]. It can be used in SEER-based studies to identify predictors for a binary outcome. Che et al. [20, 27] used LASSO regression models to identify predictors associated with the presence of SBMs in patients with breast cancer. The prognostic variables impacting OS and cancer-specific survival in patients with pancreatic adenosquamous carcinoma were also identified using LASSO analyses [65].
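
The shrink-and-select behavior comes from the soft-thresholding operator, which the following sketch applies in a coordinate-descent loop. For brevity it assumes a continuous outcome, no intercept, and an already-scaled design matrix, whereas the SEER-based studies cited above used binary or survival outcomes; the data are made up.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1 / 2n) * ||y - X beta||^2 + lam * ||beta||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0    # covariance of feature j with the partial residual
            norm = 0.0   # sum of squares of feature j
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta

# Hypothetical data: the outcome depends strongly on the first feature only.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2.1, -1.9, 1.9, -2.1]
beta = lasso_coordinate_descent(X, y, lam=0.5)
```

Here the weak second predictor is dropped (its coefficient is exactly zero), while the strong first predictor is retained but shrunk from 2.0 to 1.5.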

Artificial intelligence (AI)

There are two types of AI applications in medicine: virtual and physical. Machine learning is a virtual type of AI [66], which is implemented by mathematical algorithms that increase learning ability via experience [67]. There is an increasing and irreversible trend of discipline convergence between medical science and AI [68]. Yu et al. [69] developed the DeepSurv model, which combines machine learning with a multilayer neural network to predict the survival of patients with rectal adenocarcinoma. They showed that the AI-based prediction model had a higher C-index and better predictive capacity than traditional Cox regression survival analysis [69]. Senders et al. [70] further constructed an AI-based online calculator for predicting the survival rates of patients with glioblastoma. A comparison of the prediction accuracies of 15 statistical and machine-learning methods revealed that the accelerated failure-time model performed the best [70]. However, whether AI provides superior performance in the field of medicine requires more investigation.
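
The C-index used to compare such models can be sketched as follows. This is a simplified version of Harrell's concordance index that ignores tied survival times, and the follow-up times and risk scores are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when patient i died before patient j's
    follow-up ended. The pair is concordant when the patient who died
    earlier was assigned the higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical follow-up (months), death indicators, and model-predicted risks.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
c_index = concordance_index(times, events, risk_scores=[0.9, 0.4, 0.6, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the same function can be applied to the risk scores of a Cox model and of a machine-learning model for a like-for-like comparison.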

Joinpoint regression model

Kim et al. [71] developed a joinpoint regression model for analyzing changes in cancer mortality and incidence trends, using a grid-search method to fit the regression function. Their algorithm determines the calendar years (known as “joinpoints”) at which the annual percentage change shifts significantly by choosing the best-fitting log-linear regression model that needs the fewest joinpoints to fit the data. Lim et al. [72] used a joinpoint regression model to analyze the incidence and mortality rates of thyroid cancer in the US, using data obtained during 1974–2013 from 9 registries in the SEER database. They found that the overall incidence and mortality rates of thyroid cancer increased annually by 3% and 1.1%, respectively. Some studies further suggest that the joinpoint regression model is a topical ecological research method in SEER-based studies [73, 74].
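
Within a single segment, the annual percentage change follows directly from the slope of the log-linear model, ln(rate) = a + b × year, as APC = (e^b − 1) × 100. The sketch below fits one segment by ordinary least squares; it does not perform the joinpoint selection itself, and the rates are simulated.

```python
import math

def annual_percent_change(years, rates):
    """APC from a least-squares fit of ln(rate) on calendar year."""
    n = len(years)
    log_rates = [math.log(r) for r in rates]
    mean_x = sum(years) / n
    mean_y = sum(log_rates) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, log_rates))
    slope = sxy / sxx
    return (math.exp(slope) - 1) * 100

# Simulated rates that grow by exactly 3% per year (not SEER data).
years = list(range(2000, 2010))
rates = [5.0 * 1.03 ** k for k in range(10)]
apc = annual_percent_change(years, rates)
```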

Propensity-score matching (PSM)

Rosenbaum and Rubin [75] developed the PSM method for constructing a small control group with a covariate distribution comparable to the distribution of the treatment group in an observational study. Propensity-score analyses have been shown to successfully imitate various randomized clinical trials that assess diverse target groups. These analyses also showed that the method could eliminate bias in comparisons between treated and control populations [76]. PSM has become a well-established method for estimating causal treatment effects [77]. The most popular PSM technique uses 1:1 nearest-neighbor matching (also known as greedy matching), in which each person who received treatment A is sequentially matched to the person who received treatment B with the closest propensity score, typically within a predetermined bound on the difference in propensity scores [78]. The influence of treatment (surgery, chemotherapy, or radiotherapy) on the prognosis of patients with malignancies has been examined in numerous SEER-based studies using PSM [34, 79, 80]. PSM is ideal for adjusting pertinent confounding variables when studies focus on different subtypes of a particular malignancy [81].
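
The greedy 1:1 nearest-neighbor procedure described above can be sketched as follows. The propensity scores are assumed to have been estimated already (for example, by logistic regression), and the patient identifiers, scores, and bound of 0.05 are hypothetical.

```python
def greedy_match(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: {patient_id: propensity score}
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)
    pairs = []
    for t_id, t_score in treated.items():
        if not available:
            break
        # Closest remaining control on the propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]   # each control is used at most once
    return pairs

# Hypothetical propensity scores (not SEER data).
treated = {"T1": 0.30, "T2": 0.62, "T3": 0.90}
controls = {"C1": 0.28, "C2": 0.33, "C3": 0.60, "C4": 0.10}
pairs = greedy_match(treated, controls)
```

Patient T3 remains unmatched because no control lies within the bound, which illustrates why matched cohorts are usually smaller than the original sample and why covariate balance should be checked after matching.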

Other models

SEER-based studies may also employ several other research methodologies. For example, mediation analysis is typically used to identify the indirect impact of a covariate on cancer survival through one or a few mediating factors [82, 83]. Possible interactions of treatment and other variables with mortality have been explored in subgroup analyses, which could enhance the reliability of the results [84, 85]. Exploratory factor analysis based on varimax rotation has been used to reduce the dimensionality of the data set, leading to the discovery of the intricately connected structure of county-level socioeconomic status indicators [36, 86].

Study designs

Real-world study

The real-world study, a method first introduced by Kaplan et al. [87] in 1993, involves acquiring real-world data from various sources, including electronic health records, administrative data, health insurance claims and billing data, product and disease registries, personal devices, and health applications [88, 89]. Furthermore, real-world evidence has shown that variables such as clinical settings, provider features, and health-system characteristics could affect treatment effects and outcomes [88]. As one of the most important cancer registries in the US, the SEER program collects complete and accurate data on all cancers diagnosed among the inhabitants of specified geographic regions. It is maintained with a continuous quality control and improvement program to ensure that high-quality data are obtained. The SEER program is therefore an important source of real-world data. Provided that appropriate analytic tools and methods are used, SEER-based real-world studies can generate valuable real-world evidence [90]. For example, in a SEER-based investigation, Yuan et al. [91] used a real-world study design and discovered that the overall mortality risk was higher for focal treatment than for active surveillance or watchful waiting, indicating that the latter could offer OS benefits. Nevertheless, a careful examination of the literature revealed that very few real-world studies have made use of the SEER database. We hope that this review will raise awareness of the availability of real-world data from the SEER program.

Ecological study

Being one of the most fundamental types of observational studies, the ecological study is ideally suited for SEER-based research. This type of study examines groups of individuals who are typically categorized according to their geographic location or chronological associations [92, 93]. It can also estimate the prevalence of diseases in a community by assigning a single exposure level for each unique group. An elegant example of a SEER-based ecological study involved a description of incidence trends and disparities in cancers related to Helicobacter pylori reported by Lai et al. [94]. They found that the incidence of Helicobacter pylori-related cancers showed a significant downward trend from 2000 to 2019 and identified the racial/ethnic and geographic disparities in incidence rates. In addition, the demographic disparities in the incidence rates of SBMs [95], gliomas [74], and thyroid cancer [72] have been reported using this approach. The Rate Session in SEER*Stat software can be used to obtain the data when demographic covariables are considered in the exposure indicator and the outcome is a cancer diagnosis.

Proportional mortality ratio (PMR) study

The SEER database contains information obtained from state-issued death certificates about the causes of death [96], and data collected from the US Census Bureau can be used to compute mortality statistics. These data are used for PMR studies. The PMR and standardized mortality ratio (SMR) are the epidemiological outcomes of this type of study. The PMR represents the proportion of cause-specific deaths relative to all deaths in each exposure group [97], whereas the SMR compares the observed number of deaths with the number expected from the mortality rates of a reference population. Long-term follow-up analysis revealed that the PMR is likely to be higher for cardiovascular disease than for classic Hodgkin lymphoma among patients with stage I and stage II classic Hodgkin lymphoma [98]. By using the SMR to analyze the relative risk of mortality compared with the general population, Zaorsky et al. [99] identified variations in the risks of death from index and non-index cancers among primary cancer locations. It should be cautioned that assessments by PMR may not always be reliable because information about the populations at risk is lacking. Therefore, it is suggested that SMRs be used instead of PMRs, even though the numerator or denominator of the ratio may be skewed. In fact, the SMR is used more frequently than the PMR in SEER-based studies. The corresponding statistical data can be obtained using SEER*Stat software under the MP-SIR session.
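
The difference between the two measures can be made concrete with a short sketch. The PMR needs only death counts, whereas the SMR also needs person-years at risk and reference mortality rates by stratum; all numbers below are hypothetical.

```python
def proportional_mortality(cause_specific_deaths, all_deaths):
    """Proportion of all deaths in a group that are due to one cause."""
    return cause_specific_deaths / all_deaths

def standardized_mortality_ratio(observed_deaths, person_years, reference_rates):
    """Observed deaths divided by the deaths expected from reference rates.

    person_years and reference_rates are given per stratum (e.g., age group),
    with rates expressed as deaths per person-year.
    """
    expected = sum(py * rate for py, rate in zip(person_years, reference_rates))
    return observed_deaths / expected

# Hypothetical cohort (not SEER data): 12 of 60 deaths were cardiovascular.
pmr = proportional_mortality(12, 60)

# 30 observed deaths; two age strata with 1000 and 2000 person-years of follow-up.
smr = standardized_mortality_ratio(30, [1000, 2000], [0.005, 0.0075])
```

An SMR above 1 (here 1.5) indicates more deaths than expected in the reference population, an interpretation that the PMR cannot provide because it has no population-at-risk denominator.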

Cohort study

The term “cohort” was first used in medical applications in 1935 by Wade Hampton Frost, an epidemiologist who studied age-specific mortality rates [100]. In epidemiology, the term now refers to a group of people with defined characteristics who are followed up for the assessment of incidence or mortality from a specific cause of death, all causes of death, or some other outcomes [101]. In a typical cohort study, a group of participants is followed over time. As an exemplary cohort study, Pausch et al. [34] conducted a SEER-based study and discovered a relationship between cancer-directed surgery and a better prognosis of patients with PDAC. Based on the final follow-up date for recorded survival on December 31, 2015, the authors found that cancer-directed surgery significantly increased the median OS of patients with PDAC from 5 to 10 months. It should be noted that SEER-based studies tend to be imbalanced in baseline characteristics, and PSM can be utilized to reduce the associated bias in this situation. The SEER data used in cohort, case–control, and case-series studies can be obtained using SEER*Stat software under the Case-Listing session.

Cross-sectional study

A cross-sectional study is conducted at a specific point in time or over a relatively short time frame. These studies are often used to estimate the prevalence of an outcome of interest in a specific population, especially for planned public health strategies. Together with outcome information, data on individual characteristics such as exposure to risk factors can be obtained from these studies [101]. Depending on whether the results are evaluated for potential association with risk variables or exposures, cross-sectional studies can be descriptive or analytical [102]. In a previous cross-sectional study, we recorded the prevalence of SBMs and analyzed the relationship between SBMs and clinicopathological data in midlife patients [27]. The outcome was the presence or absence of SBMs, whereas the exposure factors were patient age, sex, race, marital status, and other covariables. Specifically, we analyzed the clinicopathological data of patients, assessed their SBM status, and evaluated the outcomes and exposure data simultaneously. Given that cross-sectional studies estimate prevalence rates, they are particularly useful for analyzing the burden of a disease or condition for planning health care services. The data can be obtained using SEER*Stat software under the Survival and Case-Listing sessions.

Case–control study

The case–control study design has been widely used to address significant public health issues [103]. This design was first applied in a breast cancer study by Lane-Claypon in 1926 [104], leading to the conclusion that a low fertility rate increases the risk of breast cancer. Because of the inherent characteristics of the two databases, the SEER-Medicare database is more suitable than the SEER database for case–control studies. However, researchers will need to further evaluate whether SEER-based studies use this study design appropriately.

Case-series study

A case series includes multiple individuals across time who were diagnosed with the same disease or received the same treatment [105]. Case-series studies are subsets of descriptive studies that do not explore the effectiveness of hypothesized treatments [106]. Because a case series does not use randomization or comparison groups, it is a relatively efficient and cost-effective approach. However, although SEER is one of the most representative large tumor databases in North America, SEER-based case-series investigations are uncommon. The description of the defining characteristics of patients with malignant thyroid teratomas is a typical SEER-based case series. A study of 8 patients with malignant thyroid teratoma indicated a high rate of extrathyroidal extension and nodal involvement, as well as easy recurrence and metastases, which are characteristic features of these neoplasms [107]. The main goal of a case-series study is to generate hypotheses that can be further validated by rigorous statistical methods.

Conclusions and perspectives

The effective use of the SEER database for cancer research depends on the appropriate application of study designs and statistical models. The purpose of this review is to assist clinical researchers in understanding the types of advanced statistical modeling methods and study designs. Appropriate use of the SEER database can ensure that correct research conclusions are drawn and maximize the benefits to clinicians and patients. Through recently published exemplary cases, we have shown that there are diverse statistical methodologies and study designs that can be applied to SEER-based research. It is important to point out that a SEER-based study usually has a complex integrated design and involves various statistical methods. It is hoped that the structural framework of this review will help readers obtain relevant data and better understand and choose their study designs and methods.

The types of study designs used in the SEER-based studies have been progressively refined [108]. The SEER program currently records information on around 400,000 cancer cases annually. The volume of SEER data has been growing fast [109]. The analysis of greater volumes of big data with higher dimensionality necessitates novel ideas and methodologies. The present review offers several implications for data collection, standardization of analysis, and cancer surveillance for national and military health systems surveillance institutes.