Breast cancer is the most common cancer among women worldwide [1]. Known modifiable hormonal and lifestyle risk factors, however, are estimated to be responsible for only around 30% of breast cancers in high-income countries [2,3,4,5,6,7,8], so a better understanding of the etiology of the disease and of the biological mechanisms is needed.

The metabolome reflects endogenous processes and environmental and lifestyle factors [9,10,11,12,13]. Metabolomics can detect subtle differences in metabolism; therefore, it is a promising tool to identify new etiological pathways. Previous prospective studies of breast cancer that employed metabolomics have used either targeted (analysis of a pre-defined panel of metabolites) [14] or untargeted (where as many metabolites as possible are measured and then characterized [15]) approaches [16,17,18]. In previous studies, lysophosphatidylcholine a C18:0 [14], various lipids, acetone, and glycerol-derived compounds [16], 16a-hydroxy-DHEA-3-sulfate, 3-methylglutarylcarnitine [17], and caprate (10:0) were associated with breast cancer development [18]. The number of cases included in these studies was, however, limited (from 200 to 621), and heterogeneity by subtype was investigated in only one study [18].

In the current study, we employed a targeted metabolomics approach to prospectively investigate the associations between 127 metabolites measured by mass spectrometry in pre-diagnostic plasma samples and risk of breast cancer, overall, and by breast cancer subtype, accounting for established breast cancer risk factors.


Study population, blood collection, and follow-up

The European Prospective Investigation into Cancer and Nutrition (EPIC) is an ongoing multi-center cohort study including approximately 520,000 participants recruited between 1992 and 2000 from ten European countries [19]. Female participants (n = 367,903) were aged 35–75 years at inclusion. At recruitment, detailed dietary, lifestyle, reproductive, medical, and anthropometric data were collected [19]. Around 246,000 women from all countries provided a baseline blood sample. Blood was collected according to a standardized protocol in France, Germany, Greece, Italy, the Netherlands, Norway, Spain, and the UK [19]. Serum (except in Norway), plasma, erythrocytes, and buffy coat aliquots were stored in liquid nitrogen (−196 °C) in a centralized biobank at the International Agency for Research on Cancer (IARC). In Denmark, blood fractions were stored locally in the vapor phase of liquid nitrogen containers (−150 °C), and in Sweden, they were stored locally at −80 °C in standard freezers.

Incident cancer cases were identified through record linkage with cancer registries in most countries and through health insurance records, cancer and pathology registries, and active follow-up of study subjects in France, Germany, and Greece. For each EPIC center, closure dates of the study period were defined as the latest dates of complete follow-up for both cancer incidence and vital status (dates varied between centers, from June 2008 to December 2012).

All participants provided written informed consent to participate in the EPIC study. This study was approved by the ethics committee of the International Agency for Research on Cancer (IARC) and all centers.

Selection of cases and controls

Subjects were selected among participants who were cancer-free (other than non-melanoma skin cancer) and had donated blood at recruitment into the cohort. Cancers were coded according to the Third Edition of the International Classification of Diseases for Oncology (code C50). Women diagnosed with first primary invasive breast cancer at least 2 years after blood collection and before December 2012, for whom estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) statuses of the tumors were available, were selected as cases for the current study.

For each breast cancer case, one control was chosen at random from the appropriate risk set, comprising all female cohort members who were alive and without a cancer diagnosis (except non-melanoma skin cancer) at the time of diagnosis of the index case. Using incidence density sampling, controls were matched to cases on center of recruitment, age (± 6 months), menopausal status (premenopausal, perimenopausal, postmenopausal, surgically postmenopausal [20]), phase of the menstrual cycle [20], use of exogenous hormones at blood collection, time of the day (± 1 h), and fasting status at blood collection (non-fasting (< 3 h since last meal), in between (3–6 h), fasting (> 6 h), unknown).
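The risk-set sampling described above can be sketched as follows. This is a minimal illustration only: the `Member` record, its field names, and the toy cohort are hypothetical, and only two of the matching factors (center and age) are shown. It is not the selection program used in EPIC.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    """Hypothetical cohort record; field names are illustrative only."""
    id: int
    center: str
    age_at_blood: float   # age at blood collection, in years
    event_time: float     # years from blood collection to diagnosis, death, or censoring
    is_case: bool


def sample_control(case: Member, cohort: List[Member],
                   rng: random.Random, age_tol: float = 0.5) -> Optional[Member]:
    """Draw one control at random from the case's risk set.

    The risk set holds members still under follow-up and cancer-free at the
    case's diagnosis time who match on center and on age (within age_tol years).
    A woman diagnosed later can still serve as a control before her own diagnosis.
    """
    risk_set = [
        m for m in cohort
        if m.id != case.id
        and m.event_time > case.event_time            # still at risk at the index date
        and m.center == case.center
        and abs(m.age_at_blood - case.age_at_blood) <= age_tol
    ]
    return rng.choice(risk_set) if risk_set else None


# Toy cohort: only member 2 is eligible as a control for case 1.
cohort = [
    Member(1, "A", 55.0, 6.0, True),
    Member(2, "A", 55.2, 10.0, False),
    Member(3, "A", 54.8, 4.0, False),    # left follow-up before the index date
    Member(4, "B", 55.1, 12.0, False),   # different center
    Member(5, "A", 60.0, 12.0, False),   # outside the age window
]
control = sample_control(cohort[0], cohort, random.Random(0))
```

Sampling from the set of women still at risk at each case's diagnosis date, rather than from women who remained cancer-free throughout follow-up, is what distinguishes incidence density sampling from simpler control selection schemes.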

Initially, 1626 cases and 1626 controls were eligible for the study, but after the exclusion of women who were pregnant at blood collection, a final population of 1624 cases and 1624 controls was included in the analysis.

Laboratory measurements

All plasma samples were assayed in the Biomarkers laboratory at IARC, using the AbsoluteIDQ p180 platform (Biocrates Life Sciences AG, Innsbruck, Austria) and following the procedure recommended by the vendor. A QTRAP5500 mass spectrometer (AB Sciex, Framingham, MA, USA) was used to measure 147 metabolites (19 acylcarnitines, 21 amino acids, 13 biogenic amines, 79 glycerophospholipids, 14 sphingolipids and hexoses). Samples from matched case-control sets were assayed in the same analytical batch. Laboratory personnel were blinded to case-control status of the samples.

Selection of metabolites

Metabolites were analyzed in samples from 3247 distinct subjects (one subject included in 2 pairs). Completeness of measures and coefficients of variation (median = 5.3%, interquartile range = 1.4%) are shown in Additional file 1: Table S1. Values lower than the lower limit of quantification (LLOQ), or higher than the upper limit of quantification (ULOQ), as well as lower than batch-specific limit of detection (LOD) (for compounds measured with a semi-quantitative method: acylcarnitines, glycerophospholipids, sphingolipids), were considered out of the measurable range. Metabolites were excluded from the statistical analyses if more than 20% of observations were outside the measurable range (n = 20). A total of 127 metabolites (8 acylcarnitines, 20 amino acids, 6 biogenic amines, 78 glycerophospholipids, 14 sphingolipids and hexoses) were finally retained for statistical analyses. Of these 127 metabolites, 113 had all values included in the measurable range. For the remaining 14 metabolites, values outside the quantifiable range (all lower than LLOQ or LOD) were imputed with half the LLOQ or half the batch-specific LOD, respectively.
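The exclusion and imputation rules above can be illustrated with a short sketch. It makes simplifying assumptions that are not in the study: values outside the measurable range are represented by `None`, and each metabolite has a single lower limit (`lower_limit`, a hypothetical name) rather than an LLOQ or a batch-specific LOD depending on the compound class.

```python
from typing import Dict, List, Optional


def outside_fraction(values: List[Optional[float]]) -> float:
    """Share of observations flagged (as None) as outside the measurable range."""
    return sum(v is None for v in values) / len(values)


def select_and_impute(data: Dict[str, List[Optional[float]]],
                      lower_limit: Dict[str, float],
                      max_outside: float = 0.20) -> Dict[str, List[float]]:
    """Drop metabolites with too many unmeasurable values, impute the rest.

    Metabolites with more than max_outside of their observations outside the
    measurable range are excluded; for the others, each unmeasurable value
    (assumed to lie below the lower limit) is replaced by half that limit.
    """
    kept: Dict[str, List[float]] = {}
    for name, values in data.items():
        if outside_fraction(values) > max_outside:
            continue
        fill = lower_limit[name] / 2.0
        kept[name] = [fill if v is None else v for v in values]
    return kept


# Toy data: metabolite "A" has 1 of 6 values unmeasurable (kept and imputed),
# metabolite "B" has 2 of 5 (excluded).
toy = {
    "A": [1.0, None, 2.0, 3.0, 4.0, 5.0],
    "B": [None, None, 1.0, 2.0, 3.0],
}
cleaned = select_and_impute(toy, lower_limit={"A": 0.4, "B": 0.2})
```

In a real data set, the limit applied would need to be looked up per batch for the semi-quantitative compounds, as described in the text.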

Statistical analysis

Characteristics of cases and controls were described using mean and standard deviation (SD) or frequency. Geometric means were used to describe non log-transformed metabolite concentrations among cases and controls. Log-transformed metabolite concentrations were used in all other analyses. Partial Pearson’s correlations between metabolites, adjusted for age at blood collection, were estimated among controls.
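As a small illustration of the descriptive steps above, the sketch below computes a geometric mean and rescales log-transformed concentrations so that one unit corresponds to one SD. The function names are hypothetical, and the choice to compute the SD on whatever values are supplied is an assumption; the text does not state which group the SD was estimated in.

```python
import math
import statistics
from typing import List


def geometric_mean(concentrations: List[float]) -> float:
    """Exponential of the mean log concentration (all values must be > 0)."""
    logs = [math.log(c) for c in concentrations]
    return math.exp(statistics.mean(logs))


def per_sd_scale(concentrations: List[float]) -> List[float]:
    """Log-transform, then divide by the SD of the logs so one unit equals 1 SD."""
    logs = [math.log(c) for c in concentrations]
    sd = statistics.stdev(logs)
    return [value / sd for value in logs]


# Toy concentrations (arbitrary units).
toy_concentrations = [1.0, 10.0, 100.0]
toy_gm = geometric_mean(toy_concentrations)
toy_scaled = per_sd_scale(toy_concentrations)
```

Scaling in this way is what allows the odds ratios to be read as "per SD increase in log-transformed concentration", which makes metabolites measured on very different scales comparable.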

We used conditional logistic regression to estimate the risk of breast cancer per standard deviation (SD) increase in metabolite concentration. The analysis was conditioned on the matching variables. Likelihood ratio tests were performed to compare linear models with cubic polynomial models in order to assess departure from linearity. Multiple testing was addressed by controlling the family-wise error rate at α = 0.05 using a permutation-based stepdown minP adjustment of P values, as this method better accounts for the dependence of the tests [21, 22]. For comparison with previous studies, we also adjusted the raw P values using Bonferroni correction (P < 0.05/127) and by controlling the false discovery rate (FDR) at α = 0.05 [23]. All statistical tests were two-sided.
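For 1:1 matched sets and a single covariate, the conditional likelihood has a simple closed form, which the sketch below maximizes by Newton-Raphson. This is an illustrative re-implementation under those simplifying assumptions, with hypothetical function and variable names and made-up data; it is not the SAS or R code used for the analyses.

```python
import math
from typing import List, Tuple


def fit_matched_pairs(case_x: List[float], control_x: List[float],
                      n_iter: int = 25) -> Tuple[float, float]:
    """Return (beta, standard error) for one covariate in 1:1 matched sets.

    With one case and one control per set, the conditional likelihood of a set
    is 1 / (1 + exp(-beta * d)), where d = x_case - x_control, so the matching
    factors drop out and only within-pair differences matter.
    """
    diffs = [c - k for c, k in zip(case_x, control_x)]
    beta = 0.0
    for _ in range(n_iter):
        probs = [1.0 / (1.0 + math.exp(-beta * d)) for d in diffs]
        score = sum(d * (1.0 - p) for d, p in zip(diffs, probs))
        information = sum(d * d * p * (1.0 - p) for d, p in zip(diffs, probs))
        beta += score / information            # Newton-Raphson step
    # Recompute the information at the final estimate for the standard error.
    probs = [1.0 / (1.0 + math.exp(-beta * d)) for d in diffs]
    information = sum(d * d * p * (1.0 - p) for d, p in zip(diffs, probs))
    return beta, 1.0 / math.sqrt(information)


# Made-up standardized log concentrations for eight matched pairs.
toy_case_x = [1.2, 0.4, -0.3, 0.9, 0.1, -0.5, 0.8, 0.3]
toy_control_x = [0.5, 0.6, -0.1, 0.2, -0.4, 0.1, 0.9, -0.2]
toy_beta, toy_se = fit_matched_pairs(toy_case_x, toy_control_x)
toy_or_per_sd = math.exp(toy_beta)             # odds ratio per 1-unit (1 SD) increase
```

Further covariates such as waist circumference would require a multivariable version of the same likelihood, which standard statistical packages provide.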

Metabolites showing a statistically significant association with risk of breast cancer after correcting for multiple testing were categorized into quintiles based on the distribution of the concentrations among controls, and odds ratios (OR) for risk of breast cancer were estimated in each category. For tests of linear trend, participants were assigned the median value in each quintile and we modeled the corresponding variable as a continuous term. To identify potential confounders, models of the metabolites of interest (continuous and quintiles) were adjusted separately for each potential confounder and estimates obtained were compared with estimates from models with matching variables only. Only variables that changed parameter estimates by more than 10% were retained in the multivariable model. Variables tested were as follows: age at first menstrual period (continuous), number of full-term pregnancies (0/1/2/≥ 3), age at first full-term pregnancy (never pregnant/quartiles), breastfeeding (ever/never/never pregnant/missing; duration in quintiles), ever use of oral contraceptive (yes/no), ever use of menopausal hormone therapy (MHT) (yes/no/missing), smoking status (never/former/current), level of physical activity (Cambridge index [24]: inactive/moderately inactive/moderately active/active), alcohol consumption (nondrinkers/> 0–3/3–12/12–24 g/day), education level (no schooling or primary/technical, professional or secondary/longer education), energy intake (continuous, quintiles), height (continuous, quintiles), sitting height (missing/quartiles), weight (continuous, quintiles), body mass index (continuous, quintiles), waist circumference (continuous, quintiles), hip circumference (continuous, quintiles), and hypertension (yes/no). For these variables, missing values were assigned the median (continuous variables) or mode (categorical variables) if they represented less than 5% of the population, or were otherwise classified in a “missing” category (breastfeeding, ever use of MHT, sitting height). Only waist circumference (continuous), hip circumference (continuous), and weight (continuous) were included in the final models. Because these three variables were highly correlated (correlations > 0.77), each was included in a separate model.
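The quintile categorization and the change-in-estimate rule described above can be sketched as follows. The helper names are hypothetical, the quantile interpolation method is an arbitrary choice, and applying the 10% rule to the regression coefficient (rather than to the odds ratio) is an assumption, since the text only refers to "parameter estimates".

```python
import bisect
import statistics
from typing import List


def control_quintile_cutpoints(control_values: List[float]) -> List[float]:
    """Four cut-points splitting the control distribution into fifths."""
    return statistics.quantiles(control_values, n=5, method="inclusive")


def assign_quintile(value: float, cutpoints: List[float]) -> int:
    """Quintile index 1..5; a value equal to a cut-point falls in the upper group."""
    return bisect.bisect_right(cutpoints, value) + 1


def changes_estimate(crude_beta: float, adjusted_beta: float,
                     threshold: float = 0.10) -> bool:
    """True if adjustment shifts the coefficient by more than the threshold."""
    return abs(adjusted_beta - crude_beta) / abs(crude_beta) > threshold


# Cut-points come from controls only and are then applied to cases and controls alike.
toy_controls = [float(v) for v in range(1, 11)]
toy_cuts = control_quintile_cutpoints(toy_controls)
toy_groups = [assign_quintile(v, toy_cuts) for v in (1.0, 3.0, 10.0)]
```

Deriving the cut-points from controls keeps the reference distribution representative of the source population, which is the usual rationale for this choice in nested case-control studies.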

For those metabolites showing a significant association with breast cancer risk after controlling for multiple testing, heterogeneity was investigated by introducing interaction terms in the models for the following variables: menopausal status at blood collection, use of exogenous hormones at blood collection, fasting status at blood collection, age at diagnosis (age 50 or older/younger than age 50), breast cancer subtype (ER+PR+/−HER2+, ER+PR+/−HER2−, ER−PR−HER2+, ER−PR−HER2−), time between blood collection and diagnosis (2–8.6 years/more than 8.6 years), waist circumference (WC) at recruitment (< 80 cm/≥80 cm), BMI at recruitment (< 25 kg/m2/≥25 kg/m2), and country. Subgroup analyses were conducted on the raw models (conditioned on the matching variables only). For WC, unconditional logistic regression adjusted for each matching factor was used. P values were not corrected for multiple tests since heterogeneity was investigated only for metabolites showing statistically significant associations with risk overall, after correction for multiple testing.

A sensitivity analysis of all 127 metabolites was performed on hormone non-users (1124 cases and 1124 controls) and by cancer subtype.

Analyses were conducted using SAS software for Windows (version 9.4, Copyright© 2017, SAS Institute Inc.) and R software (packages Epi and NPC) [25, 26].


Cases were diagnosed on average 8.3 years after blood collection, at a mean age of 60.8 years. The majority of tumors were ER-positive (80.7%), PR-positive (68.2%), and HER2-negative (78.2%) (Table 1). Mean concentrations of metabolites by case/control status are shown in Additional file 1: Table S2.

Table 1 Main characteristics of the study population

Overall, positive, moderate correlations were observed among some of the amino acids, phosphatidylcholines (PCs), lysoPCs, and sphingomyelins (see Additional file 1: Figure S1); the average absolute correlations within each class were 0.36, 0.39, 0.45, and 0.55, respectively (data not tabulated).

Associations of metabolites with breast cancer risk

Prior to correction for multiple testing, 29 metabolites were significantly associated with the risk of breast cancer with a raw P value lower than 0.05 (Fig. 1a and Table 2), mainly amino acids, PCs (inversely associated), and acylcarnitines (directly associated). However, after adjusting for multiple testing (Fig. 1b), only acylcarnitine C2 (OR for 1 SD increment = 1.15, 95% CI = 1.06–1.24, corrected P value = 0.031) and phosphatidylcholine PC ae C36:3 (OR for 1 SD increment = 0.88, 95% CI = 0.82–0.95, corrected P value = 0.044) remained significantly associated with risk of breast cancer (Table 2). Adjustment for multiple testing using the FDR procedure identified similar significant metabolites, while with Bonferroni correction, only C2 remained associated with risk of breast cancer, with a borderline significant P value (Bonferroni P value = 0.051) (Table 2). Departure from linearity was suggested for glutamate, C0, kynurenine, and SDMA. However, when non-linear models were examined, and after controlling for multiple tests, no non-linear association remained significant (results not shown).
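The two comparison adjustments mentioned above follow simple arithmetic, sketched below with hypothetical function names and made-up P values. Assuming the Bonferroni-adjusted value is the raw P value multiplied by the 127 tests, a Bonferroni P value of 0.051 corresponds to a raw P value of about 0.0004, just above the 0.05/127 ≈ 0.00039 threshold. The permutation-based stepdown minP procedure used for the main results is not reproduced here, since it requires resampling the full data.

```python
from typing import List


def bonferroni(p_values: List[float]) -> List[float]:
    """Bonferroni-adjusted P values: raw P times the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg FDR-adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw P values for four tests.
toy_p = [0.01, 0.04, 0.03, 0.005]
toy_bonf = bonferroni(toy_p)
toy_fdr = benjamini_hochberg(toy_p)

# Per-test threshold implied by a Bonferroni correction over 127 metabolites.
bonferroni_threshold = 0.05 / 127
```

Bonferroni treats the 127 tests as independent, which is conservative when metabolites are correlated, as they are within lipid classes here.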

Fig. 1 Odds ratios (ORs) for the associations between metabolites and breast cancer. a Raw P values. b Adjusted P values. PC: phosphatidylcholine; SM: sphingomyelin. ORs are estimated per standard deviation (SD) increase in log-transformed metabolite concentrations, from logistic regression conditioned on matching variables. a Statistical significance based on raw P values (significant metabolites above dotted line). b Statistical significance based on P values adjusted by permutation-based stepdown minP (see “Methods” section for details); adjusted P values below 0.05 (points above the dotted line) were considered statistically significant after correction for multiple tests

Table 2 Associations between metabolites (continuous) and risk of breast cancer, for metabolites with raw P values < 0.05

When C2 and PC ae C36:3 were further analyzed as categorical variables, results similar to those of the linear analysis were obtained; logistic regression conditioned on the matching variables showed a linear trend across quintiles of C2 (OR quintile 5 versus quintile 1 = 1.54, 95% CI = 1.21–1.95, P trend = 0.0002) and of PC ae C36:3 (OR quintile 5 versus quintile 1 = 0.73, 95% CI = 0.58–0.91, P trend = 0.0003) (Table 3). Adjusting for anthropometric variables in separate models had little effect on the risk estimates (Table 3).

Table 3 Associations between C2 and PC ae C36:3 and risk of breast cancer

Stratification by hormone therapy

Statistically significant heterogeneity was observed by use of hormones at blood collection for the associations of C2 (P homogeneity = 0.035) and PC ae C36:3 (P homogeneity = 0.017) with breast cancer, with statistically significant associations restricted to hormone non-users (C2: OR per SD = 1.23, 95% CI = 1.11–1.35; PC ae C36:3: OR per SD = 0.83, 95% CI = 0.76–0.90) and no associations observed in users (C2: OR per SD = 1.03, 95% CI = 0.91–1.17; PC ae C36:3: OR per SD = 1.00, 95% CI = 0.88–1.13; Fig. 2).
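As a rough plausibility check on the heterogeneity results above, the stratum-specific odds ratios and confidence intervals can be compared with a simple Wald-type test on the log scale. This is an approximation swapped in for illustration, not the interaction-term test used in the study, and it relies on the rounded published estimates; the function names are hypothetical. Under these assumptions, it gives P values of the same order as the reported ones (0.035 and 0.017).

```python
import math
from typing import Tuple

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def log_or_and_se(odds_ratio: float, ci_low: float, ci_high: float) -> Tuple[float, float]:
    """Recover the log odds ratio and its standard error from a 95% CI."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * Z_975)
    return math.log(odds_ratio), se


def heterogeneity_p(stratum_a: Tuple[float, float, float],
                    stratum_b: Tuple[float, float, float]) -> float:
    """Two-sided P value for the difference between two independent log ORs."""
    beta_a, se_a = log_or_and_se(*stratum_a)
    beta_b, se_b = log_or_and_se(*stratum_b)
    z = (beta_a - beta_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))


# (OR per SD, lower 95% limit, upper 95% limit), taken from the text above.
p_c2 = heterogeneity_p((1.23, 1.11, 1.35), (1.03, 0.91, 1.17))   # non-users vs users
p_pc = heterogeneity_p((0.83, 0.76, 0.90), (1.00, 0.88, 1.13))   # non-users vs users
```

Small differences from the published P values are expected, because the confidence limits are rounded to two decimals and the model-based interaction test is not identical to this two-sample comparison.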

Fig. 2 Associations between C2 (a) and PC ae C36:3 (b) and breast cancer, by selected variables. CI: confidence interval; ER: estrogen receptor; HER2: human epidermal growth factor receptor 2; PC: phosphatidylcholine; PR: progesterone receptor; SM: sphingomyelin. Odds ratios (ORs) are estimated per standard deviation (SD) increase in log-transformed metabolite concentrations, from logistic regression conditioned on matching variables. Homogeneity was tested by adding an interaction term in the conditional logistic regression model for menopausal status, use of hormones at blood collection, fasting status, breast cancer subtype, and age at diagnosis (all matching factors or case characteristics). For waist circumference (non-matching factor), logistic regression adjusted for each matching factor was used

In an analysis of the 127 metabolites restricted to hormone non-users (n = 2248) (Fig. 3), we identified additional metabolites showing statistically significant inverse associations with risk of breast cancer after adjustment of P values for multiple testing, for which heterogeneity was also investigated. These metabolites were as follows: arginine (OR per SD = 0.79, 95% CI = 0.70–0.90; P homogeneity = 0.002), asparagine (OR per SD = 0.83, 95% CI = 0.74–0.92; P homogeneity = 0.12), PC aa C36:3 (OR per SD = 0.84, 95% CI = 0.77–0.93; P homogeneity = 0.12), PC ae C34:2 (OR per SD = 0.85, 95% CI = 0.78–0.94; P homogeneity = 0.04), PC ae C36:2 (OR per SD = 0.85, 95% CI = 0.78–0.88; P homogeneity = 0.04), and PC ae C38:2 (OR per SD = 0.84, 95% CI = 0.76–0.93; P homogeneity = 0.10).

Fig. 3 Adjusted P values for associations between metabolites and breast cancer, hormone non-users (1124 cases, 1124 controls). PC: phosphatidylcholine; SM: sphingomyelin. Odds ratios (ORs) are estimated per standard deviation (SD) increase in log-transformed metabolite concentrations, from logistic regression conditioned on matching variables. Raw P values were adjusted by permutation-based stepdown minP (see “Methods” section for details); adjusted P values below 0.05 (points above the dotted line) were considered statistically significant after correction for multiple tests

No significant heterogeneity was observed for the association of C2 and PC ae C36:3 with breast cancer by menopausal status, fasting status at blood collection, breast cancer subtype, age at diagnosis, WC (P homogeneity all > 0.12, Fig. 2), country (P homogeneity of 0.50 for C2 and 0.12 for PC ae C36:3) or by time between blood collection and diagnosis (2–8.6/≥8.6 years (median); P homogeneity of 0.17 for C2 and 0.98 for PC ae C36:3) (data not shown).

Stratification by breast cancer subtype for all metabolites (see Additional file 1: Figure S2) showed that no metabolite reached statistical significance after correction for multiple testing in any subtype, although for ER+PR+/−HER2− cases (n = 1084 cases), PC ae C36:3 and PC aa C36:3 had adjusted P values close to statistical significance (0.066 and 0.074, respectively).


In this prospective analysis of the association of 127 circulating metabolites with breast cancer incidence, among women not using hormones at baseline, and after control for multiple tests, acylcarnitine C2 was positively associated with risk of breast cancer, while levels of a set of phosphatidylcholines (ae C36:3, aa C36:3, ae C34:2, ae C36:2, and ae C38:2) and the amino acids arginine and asparagine were inversely associated with disease risk. In the overall population (hormone users and non-users), only C2 and PC ae C36:3 were associated with risk of breast cancer, independently of breast cancer subtype, age at diagnosis, fasting and menopausal status at blood collection, or adiposity.

Acylcarnitine C2 plays a key role in the transport of fatty acids into the mitochondria for β-oxidation [27, 28]. In human intervention studies, its plasma concentrations have been seen to vary according to the activity of the fatty acid oxidation pathway [28, 29]. High C2 levels are associated with other known mechanisms involved in breast cancer development, such as hyperinsulinemia and insulin resistance [30], consistent with some studies showing increased plasma concentrations of acetylcarnitine in pre-diabetic or diabetic women [31,32,33]. One explanation for the associations being observed only in women not using hormones, for C2 and for other metabolites, could be that MHT users, owing to their increased exposure to estrogens, are already at a higher risk of breast cancer than non-users [34], similar to what is observed for BMI and postmenopausal breast cancer risk [35].

Phospholipids are a major component of cell membranes and play a major role in cell signaling and cell cycle regulation. Previous studies of phospholipids showed that PC ae C36:3 concentrations were decreased in type 2 diabetes [36, 37] and that lower serum levels were predictive of future diabetes [38]. Lower concentrations of PCs ae C38:2 and ae C34:2 were also observed in diabetic men compared with non-diabetic men [37]. A biological basis for such inverse associations could lie in the observed antioxidant effect of PCs [39].

In line with the inverse association observed between arginine and risk of breast cancer in hormone non-users, decreased plasma concentrations of arginine have been observed in breast cancer patients compared with controls [40]. Both human [41] and animal [42] studies have observed a reduction in anti-tumor immune responses in the context of arginine depletion in breast cancer, suggesting a link between arginine and immunity. In addition, higher plasma concentrations of arginine were correlated with lower estradiol and insulin-like growth factor 1 concentrations in premenopausal women [43], linking arginine to known mechanisms leading to breast cancer development. Regarding asparagine, a recent animal and in vitro study suggested that reduced asparagine bioavailability resulted in slower disease progression [44]. However, the role of asparagine in cancer development is not clear.

Prospective data on metabolomics and risk of breast cancer are limited [14, 16,17,18], and differences in approaches (targeted or untargeted metabolomics), analytical methods (NMR or MS), and samples (serum or plasma) make comparisons of the results difficult. Only one previous analysis used a similar targeted metabolomics approach with measurement of the same metabolites [14] and showed that lysophosphatidylcholine a C18:0 was inversely associated with risk of breast cancer after Bonferroni correction of P values, and that an inverse association close to statistical significance was observed for PC ae C38:1. However, none of the metabolites identified in the present work were associated with risk of breast cancer in this previous study, which did not investigate heterogeneity by use of hormones.

In a previous study applying NMR-based metabolomics analyses in the SU.VI.MAX cohort [16], several amino acids, lipoproteins, lipids, and glycerol-derived compounds were identified as significantly associated with breast cancer risk, suggesting that modifications in amino acid metabolism and energy homeostasis in the context of developing insulin resistance could play a role in the disease. Results from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening (PLCO) study, based on an MS-based metabolomics approach in serum samples, indicated that some metabolites correlated with alcohol intake (androgen pathway metabolites, vitamin E, and animal fats) [18], and with BMI (metabolites involved in steroid hormone metabolism and branched-chain amino acids) [17], were also associated with breast cancer risk.

Heterogeneity by subtype was investigated only in the PLCO study, which showed that some metabolites (allo-isoleucine, 2-methylbutyrylcarnitine [17], etiocholanolone glucuronide, 2-hydroxy-3-methylvalerate, pyroglutamine, 5α-androstan-3β,17β-diol disulfate [18]) were associated with risk of ER+ breast cancer, but not with breast cancer overall, indicating that the etiology of breast cancer differs by subtype. In our work, however, we did not observe any heterogeneity of results according to receptor status of the cancers.

This study is the largest prospective investigation of metabolomics and risk of breast cancer to date. Strengths of this work include its large sample size, which allowed us to examine associations by breast cancer subtype. In addition, the exclusion of cases diagnosed less than 2 years after blood collection reduces the risk of reverse causation in our findings. Finally, the assessment of numerous lifestyle factors and anthropometric measures allowed us to examine and control for potential confounding.

A potential limitation of our work is that blood was collected from participants at one time point only. Nevertheless, concentrations of the plasma metabolites analyzed here have been shown to be relatively stable over 4 months to 2 years, suggesting that a single measurement might be sufficient [45, 46, 47]. In addition, although fasting samples might be preferable to non-fasting samples, in our study, cases and controls were matched on fasting status and the results did not differ by fasting state. Another limitation is that the technologies used for some of the metabolites (such as PCs and lysoPCs) do not allow for a precise identification of the compounds measured, since the signal observed is not specific and may correspond to several compounds. Lastly, it is important to note that the aim of the present work was to screen metabolites associated with risk, and that further work is needed to identify the factors that influence biological levels of these metabolites and to understand their biological connection with breast cancer development. Future studies should also integrate other molecular markers known to be linked to breast cancer to gain insight into biological mechanisms.


We observed a positive association between acetylcarnitine (C2) and risk of breast cancer, and an inverse association between PC ae C36:3 and risk of breast cancer. These associations were limited to women not using hormones, as were inverse associations with arginine, asparagine, PCs aa C36:3, ae C34:2, ae C36:2, and ae C38:2. These metabolites might be biomarkers of future breast cancer development. These results need to be replicated in other epidemiological studies, and more research is needed to identify determinants of these metabolites.