Background

Cancer is a multi-pathway disease, assembled as a heterogeneous and hierarchically organized system, and it remains one of the major causes of death worldwide, with an increasing burden given the aging population [1,2,3]. Cancer data have grown exponentially in the last decade as new advanced technologies have produced highly diverse, mixed data types and huge volumes of information (e.g., a PubMed search for the terms ‘cancer’ AND ‘data’ in August 2017 retrieved 542,045 publications). Due to the nature of this emerging “Big Cancer Data” and the demand for highly sensitive and highly specific biomarkers, there is a need for large sample sizes and advanced mathematical and statistical models [4, 5] capable of extracting relevant clinical and biological information [6, 7]. These more systematic approaches, which replace single-biomarker analyses with multiple-profile testing, may provide new avenues for biomarker development in cancer diagnosis and management [8, 9]. Recent studies have adopted such integrative approaches, assessing multiple serum markers simultaneously for cancer diagnosis [10,11,12,13]. Furthermore, the concept of the exposome has been introduced into the field of cancer epidemiology [14]. It refers to every non-genetic exposure to which an individual is subjected from conception to death [14, 15]. Specifically, metabolites, part of the internal exposome, are both genetically and environmentally determined and can consequently be used to characterize environmental exposures and reveal biochemical mechanisms that link exposure to disease [15,16,17,18]. Hence, the internal distribution of metabolites and their interactions might help unravel cancer susceptibility in a population.

With the overall goal of identifying statistical methods to stratify individuals based on their underlying risk of cancer and mortality, we took a data-driven approach, using standard serum markers available from routine health check-ups to study susceptibility to cancer and death in a well-defined cohort of 13,615 participants from the AMORIS study (Apolipoprotein MOrtality RISk) [19, 20]. More specifically, the study set out to explore population heterogeneity and cancer susceptibility by investigating serum metabolic profiles using latent class analysis (LCA). This data reduction method clusters covariates based on models of data distribution probabilities. It allows clusters of biomarkers linked to carcinogenesis, and their intrinsic associations, to be evaluated, which ultimately helps us assess their possible role in predicting long-term cancer risk and mortality.

Results

Characteristics of the study population

A total of 1,956 individuals (14.37%) developed cancer after at least 3 years of follow-up, including 655 breast and genito-urinary cancers, 330 digestive cancers, 133 respiratory cancers and 129 lymphatic and hematopoietic cancers. Mean follow-up time for cancer was 16.6 years; median follow-up time in the cohort was 17.22 years (minimum 3.01 years, maximum 24.77 years). A total of 3,158 participants (23.20%) died during a mean follow-up of 17.3 years, including 706 cancer-specific deaths. Study population characteristics by cancer status are shown in Table 1.

Table 1 Characteristics of the study population by cancer status defined at the end of the follow up period. All the serum markers are dichotomized using standard clinical cut-offs

Latent class analysis characterizes the study population into four metabolic profiles

LCA was run on the dichotomized biomarker values to facilitate the biological interpretation of the results. The Chi-squared criterion for model selection indicated that the best-fitting model comprised four latent classes, while AIC and BIC stabilized at four classes (Fig. 1a, b) [21]. From 12 classes onwards, the models no longer converged to a local maximum. The class allocation of the observations (individuals), the class-conditional probability of each biomarker and the latent mixing proportions were obtained with the poLCA package in the R statistical language.

Fig. 1

a Line graph depicting the goodness-of-fit indicators AIC and BIC. The model that best fits the dataset comprises four latent classes, as determined by the minimum value reached by the AIC and BIC criteria before the values stabilize. The models did not converge to a local maximum from 12 classes onwards. b Line graph depicting the goodness-of-fit indicator χ² (Chi-square). The model that best fits the dataset comprises four latent classes, as determined by the minimum value reached by Chi-square. The models did not converge to a local maximum from 12 classes onwards

Table 2 and Fig. 2 outline the LCA-derived classes with the estimated class population proportions, the class conditional probabilities of belonging to each latent class for each of the biomarkers and the biological interpretation of the LCA-derived classes. The four mutually exclusive classes characterize the population in metabolic profiles based on class conditional probabilities: (1) those with probabilities for all abnormal values of the markers under 0.3; therefore, considered the normal class (63% of population); (2) those with abnormal values for lipid markers (22%); (3) those with abnormal values for liver function markers (9%); (4) those with abnormal values for iron and inflammation metabolism (6%).

Table 2 Predicted class memberships of the clinically abnormal biomarkers cut-off values for the 4 latent classes model. Estimated class population shares for the four different LCA classes
Fig. 2

Class membership probabilities for abnormal clinical values of the serum markers for the four LCA-derived metabolic classes. The four different biomarker profiles are represented in the graph

A validation of the characterization of the population performed with the Latent class methodology is outlined in Additional file 1: Table S3. The baseline clinical characteristics of the individuals by LCA-derived metabolic classes (Additional file 1: Table S3) replicate the results displayed in Table 2 for the class conditional probabilities.

LCA derived metabolic profiles as cancer and mortality predictors

We then investigated how well the four LCA-derived metabolic profiles predicted overall cancer risk, risk of specific cancer types, cancer mortality and overall mortality, using the healthy metabolic profile (Class 1) as the reference level (Tables 3 and 4).

Table 3 Hazard ratios and 95% confidence intervals for the association of LCA-derived metabolic classes with overall cancer risk and cancer-specific risk
Table 4 Hazard ratios and 95% confidence intervals for the association of LCA-derived metabolic classes with all-cause death and cancer death

All abnormal metabolic profiles were associated with an increased risk of cancer and mortality compared to Class 1. For instance, individuals in Class 3 (abnormal liver function profile) had a higher risk of overall cancer (HR: 1.28 (95% CI: 1.10–1.50)), as well as worse cancer-specific survival and overall survival, compared to those in Class 1 (Tables 3 and 4). Class 2 (abnormal lipid profile) and Class 4 (abnormal iron and inflammatory markers) were positively associated with overall death, while Class 2 was also associated with cancer-specific death. The results were consistent for both time-scales (Tables 3 and 4).

When assessing the risk of specific cancer types, several patterns emerged (Tables 3 and 4). Individuals in Class 2 (abnormal lipid markers) had a higher risk of lymphatic and hematopoietic tissue cancer (HR: 1.72 (95% CI: 1.15–2.56)). There was a greater risk of digestive cancers in individuals in Class 3 (abnormal values of liver enzymes) (HR: 2.12 (95% CI: 1.54–2.91)), while individuals in Class 4 (abnormal iron markers and inflammation) had a higher risk of buccal and oral system cancers than individuals in Class 1 (HR: 3.94 (95% CI: 1.38–11.30)) (Table 3).

Moreover, the risk of connective tissue and endocrine gland cancers was higher in individuals in the liver metabolic profile (HR: 2.65 (95% CI: 1.00–7.02)) and in participants in the iron and inflammation profile (HR: 3.00 (95% CI: 1.11–8.11)). Similar associations were observed when using age as the time-scale for the multivariable Cox proportional hazards regression model (Tables 3 and 4).

Discussion

We demonstrated that standard-of-care baseline serum markers, when assembled into meaningful metabolic profiles, can help stratify the population for cancer risk, cancer mortality and overall mortality. More specifically, we observed that abnormal values for markers of lipid metabolism, liver function, and inflammatory and iron metabolism separate participants into metabolic profiles that are predictive of long-term cancer risk and/or mortality.

Metabolic profiles

Among the biological pathways addressed in our LCA, abnormalities in lipid metabolism were the most common. Hyperlipidemia was present in about a quarter of the study population, accounting for the largest abnormal metabolic profile. The weight of the lipid profile in the analysis was consistent with the global prevalence of hypercholesterolemia among adults (37% for males and 40% for females) reported in the 2008 Global Health Observatory estimates by the World Health Organization (WHO) and with the results for the Swedish population in the WHO MONICA project [22]. Dyslipidemias are associated with a higher risk of cardiovascular disease (CVD) and other chronic diseases such as cancer, as also observed in our study [23]. Liver dysfunction, iron deficiency and altered inflammatory marker profiles also distinguished important subgroups in our study population. About 9% of our population had abnormal values for markers of liver function (GGT, AST and ALT), which is similar to the results of a population-based survey in the United States that estimated that abnormal alanine aminotransferase (ALT) was present in 9% of respondents in the absence of viral hepatitis C or excessive alcohol consumption [24]. Moreover, these enzymes are known to be linked to cancer because of their role in preserving intracellular homeostasis under oxidative stress [25,26,27], which is concordant with the results of these analyses. The iron and inflammatory marker profile clustered 6% of individuals in the study and was predominantly driven by low levels of serum iron and TIBC, as well as high levels of CRP and leukocytes. This could potentially point towards anemia of inflammation, a condition of chronic inflammation with low iron values that occurs because iron deficiency provides the body with resistance to infection, which demonstrates the tight connection between the inflammatory response and iron homeostasis [28]. This condition has been reported in more than 30% of cancer patients at time of diagnosis.

Metabolic profiles as a risk factor for long term cancer and mortality

The three classes of abnormal metabolic profiles described above were all associated with an increased risk of cancer and worse survival compared to the healthy class. The findings therefore confirm the key importance of these metabolic pathways in maintaining intracellular homeostasis and show how their imbalance can be related to the etiology of cancer and to mortality [2]. The LCA applied in this study thus illustrates how a biomarker-wide approach can help assess markers of the blood exposome in the context of carcinogenesis and mortality [29] (Fig. 3).

Fig. 3

Study statistical pipeline describing the methodology followed in the project. We explored the blood exposome using metabolic markers of the population to assess how population heterogeneity is associated with cancer risk and mortality

More specifically, individuals with abnormal liver function markers had worse outcomes in terms of overall cancer risk and cancer death, and showed a positive association with diagnosis of digestive, connective tissue and endocrine cancers. Moreover, participants with this profile had a higher probability of overall death. These results are consistent with previously published data. A positive association between elevated GGT and overall cancer risk, with no interaction with ALT, was previously found in the AMORIS cohort [30] and was also reported in other large cohort studies [31, 32]. These studies also found strong associations between elevated levels of GGT and the incidence of digestive and respiratory cancers. In a large US cohort, elevated GGT was associated with mortality from all causes, liver disease, cancer and diabetes, while ALT was associated only with death from liver disease [33]. However, a study of an elderly population found that GGT was associated with increased cardiovascular disease mortality, and ALP and AST with increased cancer-related mortality [34]. Moreover, a meta-analysis evaluating the associations between liver enzymes and all-cause mortality found positive independent associations of baseline levels of GGT and ALP with all-cause mortality [35]. In the present study, the liver biomarker profile was positively associated with all the outcomes studied, suggesting a key role of this pathway in the development of cancer, probably related to its active role in maintaining intracellular redox regulation. Further investigations are necessary to establish the potential of the altered liver enzyme profile as a tool for cancer risk stratification.

Individuals allocated to the lipid profile showed positive associations with cancer mortality and overall mortality, and a higher risk of lymphatic and hematopoietic cancers. The link between hyperlipidemia and mortality has been studied broadly, with established links to cancer and all-cause mortality [36,37,38]. The association between lipids and lymphatic and hematopoietic cancers is more controversial, as other studies found an inverse association between these cancers and high levels of serum cholesterol [39, 40], whereas a systematic literature review from 2016 found no association [41].

Participants clustered in the unbalanced iron and inflammation profile had an increased risk of endocrine, buccal and oral cancers and a higher risk of all-cause death. Altered inflammation and iron metabolism are key metabolic ‘hallmarks of cancer’ [2, 42, 43]. Our observation of an association with an increased risk of buccal and oral cancer corroborates previous findings in AMORIS [42].

Population heterogeneity and risk stratification: the need for data reduction techniques

The modulating effect of population heterogeneity on the association between potential risk factors and disease is a new avenue for understanding the variability of risk in the population [44]. For instance, in a targeted metabolomics exercise, Shan et al. performed a principal component analysis and time-to-event analysis to identify metabolic profiles that predict risk of CVD [13]. Another study used Monte Carlo cross-validation and Lasso logistic regression to evaluate serum biomarkers as an alternative to fecal immunochemical testing to improve detection of colorectal cancer [11]. In 2010, the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort reported that a specific prediagnostic plasma phospholipid fatty acid profile could predict the risk of gastric cancer [45]. As rationalized in the HELIX project, these multiple-profiling approaches aim to identify groups of individuals in the population who share a similar exposome that might account for differences in the specific risk under study [46]. Together with these studies, our systematic data integration approach based on LCA demonstrates the potential of investigating population heterogeneity using metabolic profiles as risk factors for predicting long-term cancer risk and mortality. However, to establish the predictive capability of these LCA metabolic profiles and implement their use in a clinical setting, further validation studies that allow sensitivity and specificity to be measured will need to be conducted, such as a nested case-control study in AMORIS that could determine how well the metabolic profiles estimate cancer risk and mortality.

Strengths and limitations

The present study was conducted in a large and well-defined population, applying a multifaceted approach covering the main biological pathways to assess biomarker profiles that could indicate cancer risk, cancer survival and mortality. The major strength of these analyses lies in the innovative avenue to study population heterogeneity and susceptibility to disease and mortality in a large cohort of participants with multiple measurements, all measured on fresh blood samples on the same day at the same clinical laboratory. We included all the markers available in the cohort for a large population (n > 13,000); however, not every marker of the central metabolic pathways was available in the database (e.g., complete blood count). Lifestyle factors established as cancer risk factors, such as tobacco smoking, low physical activity, poor diet, alcohol intake, obesity and hypertension, were only partially available in AMORIS, which limited their use in the study. To mitigate the lack of some of these external factors, such as BMI, the analyses were adjusted for the Charlson Comorbidity Index, which includes comorbidities such as obesity and hypertension. The lack of other lifestyle factors, such as alcohol consumption, was mitigated by using information on serum biomarkers such as gamma-glutamyl transferase and other liver enzymes. All participants were selected by analyzing blood samples from health check-ups in non-hospitalized individuals from the greater Stockholm area, ensuring good internal validity in the study. Future studies will benefit from a longitudinal approach with repeated serum marker measurements that will capture the population’s phenotypic variations in relation to disease over long periods of time and will help improve our understanding of the biomarkers’ impact on carcinogenesis and mortality.

Conclusion

Our findings support the recently expressed need for a shift from the classical epidemiological approach of assessing one exposure to a systemic approach with multiple exposures. The LCA applied in this study illustrates how a biomarker-wide approach can help assess population susceptibility to disease and provide insight into disease etiology in the context of carcinogenesis and mortality (Fig. 3). Given the environmental and genetic modulation of metabolic molecules, metabolic profiling based on standard-of-care serum markers could become a useful non-invasive predictive signature for risk stratification and an important area of research into mechanisms and clinical relevance.

Methods

Study design and study population

The AMORIS study, a large prospective cohort study, has been described in detail elsewhere [19, 47, 48]. Briefly, the AMORIS database is based on linkages with the Central Automation Laboratory (CALAB) database, which analyzed fresh blood samples from subjects from the greater Stockholm area. All individuals were either healthy individuals referred for clinical laboratory testing as part of a general health check-up or outpatients between 1985 and 1996. The AMORIS cohort has been linked to several Swedish national registries, such as the National Cancer Register, the Patient Register, the Cause of Death Register, the consecutive Swedish Censuses during 1970–1990, and the National Register of Emigration, using the Swedish 10-digit personal identity number. These linkages provide detailed information on demographics, lifestyle, socio-economic status, vital status, cancer diagnosis, comorbidities and emigration. The AMORIS study conformed to the Declaration of Helsinki and was approved by the ethics board of the Karolinska Institute.

From the AMORIS cohort, we included all individuals aged 20 years or older with measurements for the following serum biomarkers (n = 13,615): total cholesterol (TC) (mmol/L), triglycerides (TG) (mmol/L), apolipoprotein A-1 (ApoA-I) (g/L), apolipoprotein B (ApoB) (g/L), high density lipoprotein (HDL) (mmol/L), low density lipoprotein (LDL) (mmol/L), glucose (mmol/L), fructosamine (FAMN) (mmol/L), gamma-glutamyl transferase (GGT) (IU/L), alanine aminotransferase (ALT) (IU/L), aspartate aminotransferase (AST) (IU/L), albumin (g/L), leukocytes (WBC) (10^9 cells/L), C-reactive protein (CRP) (mg/L), iron (FE) (μmol/L), total iron binding capacity (TIBC) (mg/dL), creatinine (μmol/L), phosphate (mmol/L) and calcium (mmol/L). All biomarkers were measured on the same day, using fully automated methods with automatic calibration performed on fresh blood samples, at the same laboratory (CALAB), which is of high quality according to international blinded testing [49] (Additional file 1: Table S1 and Table S2). All methods have previously been described [48].

These biomarkers were selected to reflect common metabolic pathways: lipid metabolism (TC, TG, ApoA-I, ApoB, HDL and LDL), glucose metabolism (glucose and FAMN), liver function (GGT, ALT and AST), inflammation (albumin, WBC and CRP), iron metabolism (FE and TIBC), kidney function (creatinine) and phosphate metabolism (phosphate and calcium). The blood metabolites included in the analysis were all the standard serum markers available from routine health check-ups. Most of the markers included have previously been studied individually in AMORIS; however, no systemic integrative approach to examine the interactions between metabolic markers and susceptibility to cancer has been conducted to date [30, 42, 50,51,52,53,54,55,56,57,58,59]. All participants were free from cancer at the time of study entry, and none were diagnosed with cancer within the first three years of follow-up, to avoid reverse causation.

The main exposure variables for the analyses were the above-mentioned metabolic biomarkers, whose values were categorized using standardized clinical cut-offs based on recognized medical criteria to facilitate interpretation of the results (Additional file 1: Table S2). The main outcomes were first cancer diagnosis, as registered in the National Cancer Register (coded using ICD-9 for the years 1987–1992, ICD-O/2 for the years 1993–2004 and ICD-O/3 from 2005 onwards), and mortality. As secondary outcomes, we explored those cancer types for which there were more than 30 events during follow-up. Cancer mortality was also explored. Follow-up time was assessed specifically for each of the outcomes studied. For cancer diagnosis, follow-up time was defined as the time from blood draw until the date of first cancer diagnosis, death, emigration or the study closing date (31 December 2012), whichever occurred first. For death, follow-up time was defined as the time from blood draw until the date of death, emigration or the study closing date (31 December 2012), whichever occurred first.
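The "whichever occurred first" rule for follow-up time can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function name `follow_up_years`, the keyword names for the events, and the conversion of days to years by dividing by 365.25 are all assumptions introduced here.

```python
from datetime import date

# Study closing date as stated in the text.
STUDY_END = date(2012, 12, 31)


def follow_up_years(blood_draw, study_end=STUDY_END, **event_dates):
    """Return (follow-up in years, name of the event that ended follow-up).

    event_dates maps hypothetical event names (e.g. cancer, death,
    emigration) to a date, or to None when the event never occurred.
    """
    candidates = {name: d for name, d in event_dates.items() if d is not None}
    candidates["study_end"] = study_end
    # Follow-up ends at the earliest of the observed dates.
    end_reason = min(candidates, key=candidates.get)
    days = (candidates[end_reason] - blood_draw).days
    # Assumption: 365.25 days per year; the text does not state the convention.
    return days / 365.25, end_reason


# Cancer outcome: diagnosis, death and emigration all end follow-up.
years, reason = follow_up_years(
    date(1990, 1, 1),
    cancer=date(2000, 1, 1),
    death=date(2005, 6, 1),
    emigration=None,
)
```

For the mortality outcome, the same function would simply be called without a `cancer` date, so that only death, emigration or the closing date can end follow-up.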

Information on the following potential confounders was also incorporated: age, sex and comorbidities. The latter was quantified using the Charlson Comorbidity Index (CCI) calculated based on data from the National Patient Register. The CCI comprises 19 disease categories, all assigned a weight. The sum of an individual’s weights was used to create the CCI ranging from no comorbidity to severe comorbidity (0, 1, 2, and ≥ 3) [60].
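The grouping of the CCI into the four levels used as a covariate can be illustrated as follows. This is a minimal sketch: the function name, the condition labels and their weights are hypothetical placeholders, since the text does not list the weights actually applied.

```python
def cci_category(weights_by_condition):
    """Sum the weights of an individual's recorded conditions and
    collapse the total into the levels 0, 1, 2 and >=3."""
    total = sum(weights_by_condition.values())
    return ">=3" if total >= 3 else str(total)


# Hypothetical individuals; condition names and weights are placeholders.
no_comorbidity = cci_category({})
severe = cci_category({"condition_a": 1, "condition_b": 2})
```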

Data analysis

First, we calculated Pearson correlation coefficients to measure the strength of association between the biomarkers included in the analysis. These analyses showed strong correlations between the biomarkers of lipid metabolism (TC, LDL and ApoB (r > 0.7); HDL and ApoA-I (r > 0.8)). We replaced the individual lipid biomarkers with the established ApoB/ApoA-I ratio and log (TG/HDL) ratio [20, 49, 61, 62] to avoid collinearity and to comply with the principle of local independence required by latent class analysis [63]. Most of the markers were normally distributed, except for the liver biomarkers.
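As an illustration of this preprocessing step, the sketch below computes a Pearson correlation coefficient and the two lipid ratios in plain Python. The function names are hypothetical, and the base-10 logarithm is an assumption, because the text does not state which base was used for log (TG/HDL).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lipid_ratios(apob, apoa1, tg, hdl):
    """Return (ApoB/ApoA-I, log10(TG/HDL)) for one individual.

    Assumption: base-10 logarithm; TG and HDL must share the same unit.
    """
    return apob / apoa1, math.log10(tg / hdl)
```

Pairs of markers whose coefficient exceeds a chosen threshold (the text reports r > 0.7 and r > 0.8 for the lipid markers) would then be replaced by the ratios rather than entered individually.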

Latent Class Analysis (LCA) [63, 64] is a model-based clustering method that reduces the dimension of the data by clustering covariates into latent classes, using a probabilistic model that describes the data distribution, and it estimates the probability that individuals belong to each latent class. LCA avoids the use of a linear combination or an arbitrary distance definition to reduce the number of covariates [65] and has recently been employed in health sciences [21, 66]. More specifically, we applied LCA to characterize different classes of individuals based on their metabolic profiles [67] and to evaluate intrinsic associations between the biomarkers, using the poLCA package [68] in the R statistical programming language. We first determined the optimal number of LCA-derived classes by fitting step-wise models with different numbers of classes, starting with the null model and adding one extra class in each model until reaching the total number of biomarkers in the data, for as long as the model kept converging to a local maximum of the likelihood. The criteria used for model selection (Akaike information criterion (AIC), Bayesian information criterion (BIC) and Chi-squared statistic) were evaluated to identify the best-fitting model and to define the optimal number of LCA-derived metabolic classes that characterized our dataset. To identify which sets of biomarkers predominantly explained each latent class, how the classes were distributed across the study population and which individuals were allocated to each class, we assessed the conditional probabilities, mixing proportions and class memberships of the best-fitting latent class model.
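To make the procedure concrete, the following sketch fits a latent class model to dichotomized (0/1) indicators with a plain expectation-maximization loop and reports AIC and BIC. It is not the poLCA implementation used in the study: the function names, the random starting values, the fixed number of iterations and the toy data are all assumptions made for illustration, and the Chi-squared criterion is omitted.

```python
import math
import random


def _e_step(data, pi, theta):
    """Posterior class probabilities for each row, plus the log-likelihood."""
    post, loglik = [], 0.0
    for row in data:
        joint = [
            pi[c] * math.prod(t if y else 1.0 - t for t, y in zip(theta[c], row))
            for c in range(len(pi))
        ]
        total = sum(joint)
        loglik += math.log(total)
        post.append([p / total for p in joint])
    return post, loglik


def fit_lca(data, n_classes, n_iter=200, seed=0):
    """EM for a latent class model with binary indicators (illustrative)."""
    rng = random.Random(seed)
    n, m = len(data), len(data[0])
    pi = [1.0 / n_classes] * n_classes  # mixing proportions
    # theta[c][j]: probability that indicator j is abnormal (1) in class c
    theta = [[rng.uniform(0.2, 0.8) for _ in range(m)] for _ in range(n_classes)]
    for _ in range(n_iter):
        post, _ = _e_step(data, pi, theta)
        for c in range(n_classes):  # M-step: posterior-weighted averages
            weight = sum(p[c] for p in post)
            pi[c] = weight / n
            for j in range(m):
                frac = sum(p[c] * row[j] for p, row in zip(post, data)) / weight
                theta[c][j] = min(max(frac, 1e-6), 1.0 - 1e-6)
    post, loglik = _e_step(data, pi, theta)
    n_params = (n_classes - 1) + n_classes * m
    return {
        "mixing": pi,
        "conditional": theta,
        # Each individual is allocated to the class with the highest posterior.
        "membership": [max(range(n_classes), key=p.__getitem__) for p in post],
        "aic": -2.0 * loglik + 2.0 * n_params,
        "bic": -2.0 * loglik + math.log(n) * n_params,
    }


# Toy data (invented): four binary markers, two clearly separated groups.
toy = ([[0, 0, 0, 0]] * 50 + [[1, 0, 0, 0]] * 5 + [[0, 1, 0, 0]] * 5
       + [[1, 1, 1, 1]] * 30 + [[1, 1, 1, 0]] * 5 + [[0, 1, 1, 1]] * 5)
fits = {k: fit_lca(toy, k) for k in (1, 2, 3)}
```

Comparing `fits[k]["aic"]` and `fits[k]["bic"]` across values of `k` mirrors the step-wise model selection described above. In practice, several random starts per model would be advisable, because EM only guarantees convergence to a local maximum of the likelihood.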

Once each subject was assigned to an LCA-derived metabolic class, we conducted multivariable Cox proportional hazards regression to examine whether the LCA-derived metabolic classes were associated with long-term risk of overall cancer as well as specific cancer types. In addition, we evaluated how the classes were associated with all-cause death and cancer-specific death. All models were adjusted for age, sex, and CCI. We performed a sensitivity analysis using age as the time-scale, as age is potentially a strong confounder. Moreover, Schoenfeld residuals were tested to verify the proportional hazards assumption of the multivariable Cox proportional hazards regression analysis.
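The way a hazard ratio and its 95% confidence interval arise from the Cox partial likelihood can be sketched for the simplest case of a single covariate, for example an indicator of membership in one abnormal class versus Class 1. This is a deliberately reduced illustration, not the adjusted multivariable model fitted in the study: the function name, the Breslow handling of tied event times, the Wald interval and the toy data are assumptions made here.

```python
import math


def cox_single_covariate(times, events, x, n_iter=25, tol=1e-10):
    """Newton-Raphson fit of a one-covariate Cox model (Breslow ties).

    times: follow-up times; events: 1 if the outcome occurred, 0 if censored;
    x: covariate values (e.g. 1 for an abnormal class, 0 for the reference).
    """
    beta = 0.0
    for _ in range(n_iter):
        score, info = 0.0, 0.0
        for t_i, d_i, x_i in zip(times, events, x):
            if not d_i:
                continue
            # Weighted sums over the risk set: everyone still followed at t_i.
            s0 = s1 = s2 = 0.0
            for t_k, x_k in zip(times, x):
                if t_k >= t_i:
                    w = math.exp(beta * x_k)
                    s0 += w
                    s1 += w * x_k
                    s2 += w * x_k * x_k
            mean = s1 / s0
            score += x_i - mean              # first derivative of log PL
            info += s2 / s0 - mean * mean    # minus the second derivative
        if info <= 0.0:
            raise ValueError("no information about the covariate in the data")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    se = 1.0 / math.sqrt(info)
    return {
        "hr": math.exp(beta),
        "ci95": (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)),
    }


# Toy data (invented): the exposed group (x = 1) tends to fail earlier.
result = cox_single_covariate(
    times=[1, 2, 3, 4, 5, 6, 7, 8],
    events=[1, 1, 1, 1, 1, 1, 1, 1],
    x=[1, 1, 0, 1, 0, 1, 0, 0],
)
```

A real analysis would include age, sex and CCI as additional covariates and check the proportional hazards assumption, which this sketch does not attempt.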

Data management and statistical analyses were performed using Statistical Analysis Systems (SAS) release 4.3 (SAS Institute, Cary, NC) and R version 3.0.2 (R Foundation for Statistical Computing, Vienna, Austria).