Breast cancer is the most common cancer and the leading cause of cancer death for women worldwide with a higher incidence among women in developed countries [1]. Besides several reproductive and lifestyle-associated risk factors [2], exposure to light-at-night has been suggested to promote breast cancer [3]. In 2007, the International Agency for Research on Cancer (IARC) classified shift work that includes circadian disruption as probably carcinogenic to humans (group 2A) [4]. In their 2019 re-evaluation the IARC confirmed and specified this classification to night-shift work. Studies on the effects of light in animal bioassays were key to this evaluation [5]. Although risk estimates between epidemiological studies vary due to different exposure assessments and study populations, a large pooled analysis of case–control studies confirmed the association between a high number of night shifts and breast cancer [6].

The circadian system is orchestrated by a multisynaptic pathway that is governed by a master clock, the suprachiasmatic nucleus (SCN), located in the hypothalamus. Following photic input, the pathway is set into operation via the retino-hypothalamic tract by intrinsically photosensitive retinal ganglion cells. The light signal is directly projected into the SCN to finally synapse with the pineal gland via complex networks including the sympathetic nervous system, superior cervical ganglions as well as other participating hypothalamic areas (paraventricular nucleus, PVN) [7, 8]. In the circadian clock, mitogen-activated protein kinase (MAPK) pathways function both as input pathways to maintain or reset the oscillator to 24 h environmental cycles, and output pathways that connect the timekeeping oscillator through control of the expression of a large number of functionally related genes [9]. Several variants in circadian genes have been linked to general breast-cancer susceptibility [10, 11].

Melatonin (N-acetyl-5-methoxytryptamine) is the key-player for this synchronization of bodily circadian rhythms. Its biosynthesis follows a multistep process starting with the hydroxylation of the precursor molecule L-tryptophan catalyzed by tryptophan hydroxylase (TPH). Decarboxylation of 5-hydroxy-L-tryptophan by L-aromatic amino acid decarboxylase (AADC) results in the neurotransmitter serotonin, the acetylation of which by aralkylamine N-acetyltransferase (AANAT) and methylation by N-acetylserotonin O-methyltransferase (ASMT, alias HIOMT) finally yields melatonin [12]. During darkness, AANAT activity increases via phosphorylation thereby blocking its proteasomal proteolysis, and its high affinity to serotonin leads to a strong increase in melatonin production [12].

Melatonin is mainly secreted from the pineal gland upon photic neural input, but also produced by other ocular tissues such as photoreceptors and ciliary body epithelium, albeit to a lesser extent, as well as other bodily tissues [13, 14]. With its secretion being affected by the light–dark cycle, melatonin synchronizes bodily circadian rhythms relevant to many endogenous processes including the production of sex hormones [15, 16]. The desynchronization of SCN activity either by day length or timing/phasing of light exposure consequently affects the production of melatonin by the pineal gland and is referred to as circadian disruption [3, 17].

The light-at-night-associated breast-cancer risk has been attributed to a reduced nocturnal biosynthesis and lower secretion of melatonin [3, 17]. In particular, an increased risk for hormone-sensitive breast cancer has been mechanistically accredited to a modified crosstalk between melatonin-receptor and estrogen-receptor pathways triggered upon reduced melatonin and modulated estrogen exposure [18]. Here we investigated the putative contribution of genetic polymorphisms of key enzymes of melatonin biosynthesis and signaling to the risk of developing breast cancer, and highlight a cooperative role in favor of this risk based on a large international association study of more than 44,000 breast-cancer cases and controls.

Material and methods

Study population

We screened 106,621 breast-cancer cases and control subjects with available pheno- and genotype data deposited in the database of the Breast Cancer Association Consortium (BCAC) [19, 20] at the University of Cambridge. Exclusion criteria at the study and individual level are specified in Fig. 1. Studies were not included if reference age (age at diagnosis for cases, age at interview for controls) was missing in > 30% of the study participants, relevant epidemiological variables were not recorded, or controls were not population-based or a case-only design was used. All subjects had to be women as well as of European descent, and cases were required to have a diagnosis of primary breast cancer. Based on these criteria, 44,405 eligible women (22,992 cases and 21,413 controls) from 14 population-based case–control studies were included in the analysis. Individual study descriptions are given in Supplementary table S1. All studies were approved by local ethics committees and all participants gave informed consent.

Fig. 1
figure 1

Flow chart of BCAC data set for the inclusion of case-control studies in the analysis, detailed information on individual studies is provided in Suppl. table S1

Polymorphisms and genotype data

We focused on 97 single nucleotide polymorphisms (SNPs) at 18 genes including melatonin biosynthesis (e.g. TPH1 and TPH2), melatonin receptors (MTNR1A and MTNR1B) as well as various MAP kinases (e.g. MAP2K1, MAP2K2, MAPK1, MAPK8). All 97 SNPs are listed in Supplementary table S2 together with their characteristics in the study population. Corresponding genotypes were retrieved from the BCAC database Cambridge. They were previously generated within the framework of the Collaborative Oncological Gene-environment Study (COGS) using a custom Illumina iSelect array with 211,155 SNPs as described elsewhere [19]. For the SNP selection, all available SNPs on the array at the aforementioned genes were considered.

Statistical analysis

Quality criteria

We checked for Hardy–Weinberg-Equilibrium (HWE) by Χ2-tests and analyzed the heterogeneity between studies by calculating Cochran’s Q for the heterozygous and homozygous rare genotypes for each SNP (Suppl. table S2). To consider multiple testing, we used 0.05 divided by the number of analyzed SNPs as threshold for p values.

Main-effect analysis and confounder selection

For the main-effect analysis of each SNP, we estimated odds ratios (ORs) and 95% confidence intervals (CIs) for each SNP assuming different genetic models (dominant, recessive, and additive model). All models were adjusted for reference age, study, and a set of eight principal components to consider a possible population stratification effect. Furthermore, we adjusted these models for parity (nulliparous/1+ full term pregnancies/unknown), breast-feeding status (never/ever/unknown), smoking status (never/ever/unknown), and current use of estrogen-progesterone combined menopausal hormone therapy (MHT) (no/yes/unknown). Regarding menopausal hormone therapy, current use was defined as ‘use at reference date or within six months prior to the reference date’. Missing values in categorical covariates were coded as ‘unknown’. In a sensitivity analysis, we additionally adjusted the models for menopausal status (pre- and peri-menopausal/post-menopausal/unknown). Post-menopausal was defined as ‘last menstruation more than 12 months before the reference date’. We also calculated ORs separately for estrogen-receptor (ER) positive/negative (ER±) and progesterone-receptor (PR) positive/negative (PR±) cases.

Interaction analyses with logic regression

To analyze interactions between SNPs, we used a two-staged procedure based on logic regression models [21]. In short, a logic regression model is a so-called logic tree embedded in a generalized linear model. The logic tree consists of binary covariates linked by logic expressions with the AND-expression (conjunction) representing interactions (notation of interactions: A \(\wedge\) B (A and B); C \(\wedge\) !D (C and not D)). An optimization algorithm is used to select interactions for the logic tree. Here, we used the logit as link function of the framing generalized linear model with the case–control status as outcome and the simulated annealing algorithm to select interactions for the logic tree as independent variable. To express SNP-SNP interactions in logic regression models, SNP coding in the dominant and recessive genetic model was required [21].

In the first stage of our procedure, we selected interactions for the logic tree by using the logic Feature Selection (logicFS) algorithm to avoid overfitting [21]. Here, we used logicFS to fit 100 logic regression models from bootstrap samples and to calculate a variable importance measure for the multiple tree approach based on the number of correctly classified out-of-bag observations for each bootstrap sample for every interaction consisting of up to six terms included in these models [21]. We ran the algorithm three times with a different random seed and selected the 20 most important interactions each. In the second stage, we fitted individual adjusted logistic regression models with the selected terms.

To account for multiple testing and an increased type I error rate, we calculated the Bayesian false-discovery probability (BFDP) for SNPs/interaction terms with a p value < 0.05 in the adjusted models, assuming a four-fold cost of a false non-discovery compared to a false discovery as suggested by Wakefield [22]. Effects with BFDP < 0.8 are termed noteworthy. We calculated the BFDP for three different prior probabilities (0.1, 0.05, 0.01) for a true association and the OR corresponding to the 97.5% quantile of the prior OR was set to 1.2 for positive associations and to 0.83 for negative associations. Linkage disequilibrium was checked for noteworthy SNPs.

The statistical software R, version 3.4.2, was used for all calculations [23]. The R-packages ‘logicFS’ and ‘LogicReg’ were used for the interaction analysis [24, 25]. All statistical models were fitted as complete-case analyses, including the category ‘unknown’ for missing values in categorical variables, therefore the number of individuals available for calculations varied respectively. This also accounts for slight differences in ORs between the main-effect analysis and the interaction analysis, when an interaction term consists of only one SNP.


The study population of 44,405 women was contributed by 14 case–control studies (Suppl. table S1) of which the smallest study comprised 243 women (NBHS) and the largest 16,746 women (SEARCH). Among the eligible 22,992 cases and 21,413 controls, the mean reference age was 56 years for cases and 57 years for controls with a standard deviation of 10 and 9 years, respectively. Most women had at least one full term pregnancy (76%) and nearly 50% had ever breastfed (Table 1).

Table 1 Characteristics of the study population composed of 14 eligible case-control studies from the BCAC data base

Most of the 97 analyzed SNPs had very few missing values, with a maximum of 4% for rs10217741 (RORB). The minor allele frequency (MAF) in controls ranged from 2–49%. Detailed information for all SNPs is provided in Supplementary table S2. HWE was not met for rs10765576 (MTNR1B) and rs14303 (MAP2K1), and therefore, these polymorphisms were not followed up further.

Main-effect analysis

Noteworthy associations (BFDP < 0.8) between individual SNPs and breast-cancer risk have been identified particularly for TPH2 intronic polymorphisms. ORs (95% CIs) of individual SNPs with a p value < 0.05 and respective BFDPs with different priors (0.1, 0.05, and 0.01) for an effect in the adjusted dominant or recessive model are given in Table 2. ORs and CIs for all SNPs and models are given in Supplementary table S3. The largest OR (adjusted for study, reference age, parity, breast feeding, smoking status, current intake of estrogen-progesterone MHT, and principle components) was observed for recessive rs10857561 (MAPK8, OR = 1.11, 95% CI 1.04–1.18, BFDP < 0.8 at prior 0.01). Increasing the prior to 0.05 also revealed eight linked noteworthy dominant TPH2 SNPs (rs7300641, rs1386492, rs1473473, rs4760751, rs1487276, rs1386489, rs1487281, rs7299582) with similar adjusted ORs around 1.07 (95% CIs 1.02–1.12) in the dominant model. At a prior of 0.1, protective effects were revealed for two TPH2 SNPs in the recessive model (rs17110627, OR = 0.91, 95% CI 0.83–0.99; and rs2129575, OR = 0.91, 95% CI 0.83–0.99), as well as recessive rs13515 (MAPK1, OR = 0.89, 95% CI 0.79–0.99). Recessive rs7075976 (MAPK8, OR = 1.06, 95% CI 1.01–1.12) was also noteworthy. The additive model showed similar results (Suppl. table S3). A sensitivity analysis with additional adjustment for menopausal status did not change the results. When we compared our results with those obtained in the meta GWAS analysis, none of the noteworthy SNP associations reported here were identified at the genome wide association level (5E-08) [20, 26, 27].

Table 2 Breast-cancer risk associations and Bayesian false-discovery probability (BFDP, bold text indicates BFDP < 0.8) of individual SNPs with p values < 0.05 in adjusted dominant and recessive logistic regression models*, sorted by ORb

Analysis by tumor hormone-receptor status

Tumor hormone-receptor status was ER-positive (ER+) for 14,724 patients and ER-negative (ER−) for 3516 patients, as well as PR-positive (PR+) for 10,016 patients and PR-negative (PR−) for 4,768 patients (Table 1). Noteworthy breast-cancer risk associations (BFDP < 0.8) in the ER+ , ER−, PR+ , and PR− subgroups for the dominant and recessive model are listed in Table 3. Results for all SNPs and models are given in Supplementary table S4. None of the four tumor-hormone-receptor subgroups showed noteworthy associations at a prior of 0.01. At a prior of 0.05, six SNPs showed noteworthy associations. These comprised MAPK8 SNP rs10857561 in the ER+ subgroup (OR = 1.10, 95% CI 1.02–1.18; recessive model) and two tightly linked TPH2 SNPs rs7300641 and rs1386492 in the PR+ group (both: OR = 1.07, 95% CI 1.02–1.13; dominant model). In the ER− subgroup the two linked TPH2 SNPs rs1473473 and rs1487276 (both: OR = 1.12, 95% CI 1.04–1.22; dominant model) and the RORA SNP rs17237290 (OR = 0.85, 95% CI 0.73–0.97; dominant model) showed noteworthy associations, while none of the associations in the PR− subgroup were noteworthy at prior 0.05. Besides rs17237290, the variants mentioned above were also noteworthy in the main analysis at prior 0.05 under the identical genetic models. At a prior of 0.1, in total 25 noteworthy associations of 19 SNPs were observed for all four breast cancer subtypes.

Table 3 Breast-cancer risk associations and Bayesian false-discovery probability (BFDP, bold text indicates BFDP < 0.8) of individual SNPs with p values < 0.05 in adjusted logistic regression models for dominant or recessive models* stratified by tumor hormone-receptor status, each subgroup sorted by ORb

Interaction analysis

The 20 most important interaction terms each from three different starting seeds for the logicFS algorithm resulted in 53 unique interactions terms (Suppl. table S5). The adjusted logistic regression models for these terms yielded ten interaction terms with a p-value < 0.05 (Suppl. table S6), hence suitable for BFDP calculation.

With a prior probability for an effect of 0.01, we found three noteworthy interaction terms in the adjusted models (BFDP < 0.8, Table 4): rs10857561R ∧ !rs1347069D (OR = 1.15, 95% CI 1.05–1.25), rs10857561R, (OR = 1.11, 95% CI 1.04–1.18), and rs1386483R ∧ rs1473473D ∧ rs3729931D (OR = 1.20, 95% CI 1.09–1.32). With increased priors of 0.05 and 0.1, in total six and eight interaction terms reached noteworthiness, respectively. All noteworthy interactions (all priors) included at least one SNP that showed noteworthy associations, respectively, in the main effect analysis (TPH2: rs1386489, rs1473473, rs7299582; MAPK8: rs10857561, rs7075976).

Table 4 Breast-cancer risk associations and Bayesian false-discovery probability (BFDP, bold text indicates BFDP < 0.8) for interactions/main effects with p value < 0.05 in adjusted logistic regression models*, sorted by ORb


This hypothesis-based breast-cancer association study focused on the putative role of modulators of the pineal gland hormone melatonin and their potential influence on breast-cancer risk. In line with the light-at-night hypothesis, according to which altered light-induced nocturnal melatonin production and signaling increases the risk of breast cancer [3], our findings point to a cooperative role of genetic variations that may modulate serotonergic brain networks and/or the signaling of melatonin within the context of breast-cancer susceptibility.

The strongest observed risk effects were driven by various interactions of polymorphisms at TPH2 and MAPK genes (MAPK8, MAP2K1, RAF1). The triple interaction of TPH2 rs1386483 and rs1473473 as well as RAF1 rs3729931 increased breast-cancer risk by 20%, a dual interaction of MAPK8 rs10857561 and MAP2K1 rs1347069 by 15%, and MAPK8 rs10857561 alone by 11% (all observed at a prior probability of 0.01). In most instances, risk effects were evident at the individual SNP level both, in main and stratified risk analyses by hormone-receptor status. To the best of our knowledge and based on the combined iCOGs/Oncoarray meta-analysis of the BCAC cohort [20] as well as the Catalogue of Curated Breast Cancer Genes [28], these breast-cancer risk associations are newly described. Yet, some TPH2 polymorphisms have been reported in the literature within the context of psychiatric disorder related endpoints such as antidepressant response and GABA concentration, conditions in which effects of serotonin are underlying biological mechanisms [29, 30].

The involvement of TPH2 as a putative breast cancer susceptibility locus is plausible, as the enzyme is exclusively expressed in neuronal cells of the central nervous system where it catalyzes the rate limiting step in serotonin (5-HT) biosynthesis, the chemical precursor of melatonin [31]. Its pertinent role in females has been inferred from expression studies of post mortem brain tissues in which female thalamic and hypothalamic brain showed higher TPH2 mRNA expression levels compared to male counterparts [32], thereby highlighting the critical role of an intact serotonergic pathway for female neurohormone/neurotransmitter production. In general, TPH2 is present in various brain regions with particular abundance in the major central serotonergic neuronal networks that localize to median and dorsal raphe nuclei in the brainstem known to participate in basal functions such as temperature regulation, feeding and energy balance, as well as mood and sleep [33, 34]. The synthesis and periodical secretion of serotonin from these brain stem regions and neuronal terminal fields are regulated at the level of TPH2 gene expression that was shown to be under the circadian control of melatonin, estrogen, and corticoids in rodents and primates [34, 35]. This underscores the various feedback mechanisms to the serotonergic system that are involved in the entrainment of the hypothalamic SCN [7, 36, 37]. To a smaller extent, TPH2 is also rhythmically expressed in ocular tissues with rhythmical release of melatonin, the levels of which are highest in darkness and lowest in the light [8, 38]. Intrinsic photosensitive retinal ganglion cells by virtue of their concurrent rhythmic melatonin-receptor expression may therefore contribute to the output signal of the retino-thalamic tract to the SCN [39]. Hence, our findings of an association between TPH2 variants and increased breast-cancer risk are well in line with the notion that such variants impact on serotonin formation, thereby disrupting the SCN pacemaker circuitry feedback. In particular, the observed increased breast-cancer risk in TPH2 variant carriers may result from modified brain melatonin levels due to a dysfunctional SCN (together with reduced melatonin levels in the retina) that upon light-at-night periods may reduce pineal melatonin secretion.

If the disruption of pineal gland melatonin biosynthesis during extended light-at-night periods affects nocturnal serotonin levels and transmission to the hypothalamic nuclei SCN and PVN [7, 40], we need to consider further that these nuclei also make up major parts of the hypothalamic-pituitary–gonadal (HPG) axis. In the HPG axis, serotonin modulates the hypothalamic secretion of the gonadotropin releasing hormone (GnRH) as well as the pituitary luteinizing and follicle stimulating hormones (LH, FSH), and prolactin that stimulate the production of sex hormones in peripheral target organs such as the ovaries or testes [41]. Similar to melatonin, prolactin secretion may be driven by the central circadian pacemaker located in the SCN of the hypothalamus [42], and our observed TPH2 variant-associated increased breast-cancer risk may in addition relate to local serotonergic effects accountable for increased prolactin production in the anterior pituitary gland. Of note, circulating prolactin is an established breast-cancer risk factor that has been confirmed in a series of analyses from the prospective Nurses’ Health Study, particularly with respect to ER-positive breast cancer [43], and a large analysis from the prospective EPIC cohort [44] with emphasis on users of hormone replacement therapy. This is underpinned by a vast body of evidence from animal and in vitro studies. Together with estradiol and progesterone, it exerts effects on normal epithelial cell expansion, ductal side branching of the breast during puberty, and formation of lobuloalveolar structures during pregnancy [45]. As prolactin and progesterone have synergic roles to induce cell growth and proliferation during adult gland maturation and alveologenesis of the breast terminal duct-lobular units, the site of origin for most breast cancers, a crosstalk between progesterone, prolactin and receptor signaling pathways may not only be relevant in normal, but also malignant breast cells [46].

MAPK8 (JNK, c-Jun N-terminal kinase) in our study has been associated with breast-cancer risk at rs10857561, both in the individual main and hormone-receptor positive (ER + and PR +) stratified analyses, as well as in the SNP-SNP interaction analyses. While all three canonical MAPK pathways (ERK MAPK, p38 MAPK, and JNK MAPK) serve as input to the circadian clock in distinct ways [9], JNK in particular has been shown to be essential for the normal oscillation of the mammalian circadian clock and its photic regulation. The JNK-imparted transmission of light signals to the BMAL1-CLOCK complex controls oscillation speed and phase response of the master clock [9].

Aside from their role as master clock regulators via SCN or peripheral tissue, MAPK genes play a role in cellular processes like proliferation and cell death [47]. It is therefore not surprising that they were associated with the development of cancer at many sites [48]. While rs10857561 in MAPK8 showed noteworthy risk estimates for breast cancer in general as well as ER-positive and PR-positive subgroups in our analyses, this SNP has been previously shown to be associated with rectal cancer once more highlighting the role in carcinogenesis [49].

A major strength of this study is the large number of study participants (22,992 cases and 21,413 controls) retrieved from population-based case–control studies with defined reference age (age at diagnosis for cases, age at interview for controls) and availability of comprehensive epidemiological and tumor immunohistochemistry as well as genotype data. This allowed us to calculate precise SNP-specific OR main effect and interaction estimates. Our hypothesis-driven approach of a putative role of polymorphic regulators or signaling mediators of melatonin in breast-cancer risk limited the number of potentially detectable false-positive associations. Moreover, we used the BFDP to measure the noteworthiness of our effect estimates and to account for false-positive results via multiple testing. Furthermore, our interaction analysis based on logic regression models enabled us to model the effects of complex interaction scenarios considering the multivariate structure of SNP interplay.

In spite of the large study size, limitations of the study include a high number of missing values in included confounders. We used the category ‘unknown’ for categorical variables in these participants to maintain the remaining information. Moreover, the sample size in subgroup analyses was reduced, for example, ER status was missing for 21% of samples and PR status for 36% of samples. However, missingness is likely to be random with respect to genotypes. There was also minor heterogeneity in definition of stage, grade, and cut-off levels for ER and PR across studies. The tumor subtypes were strictly defined by immunohistochemical markers as other data, such as intrinsic subtypes from expression profiles nowadays used for subtype definition, are not available in large-scale epidemiological studies. Finally, our interpretations strictly rely on functional and physiological data reported in the literature and include in vitro, animal in vivo as well as post mortem findings.

Our newly identified breast-cancer risk associations justify continued research into the relationship between breast-cancer risk and putative modulators of the intricate network of rhythmic circadian regulators such as melatonin and serotonin upon photic stimulation at night. The observed interactions between genetic variants of TPH2 and MAPK8 highlight their cooperation as putative breast-cancer risk modulators and call upon the comprehensive scrutiny of the circadian clock system and its input and output effectors in large breast-cancer cohorts. This research holds the potential to reveal new insights into the breast-cancer risk of women exposed to light-at-night which is particularly relevant for female night shift workers.