Introduction

The tumor microenvironment can promote carcinogenesis by secretion of stroma-derived factors that are master regulators of cell division, growth, motility, and morphology [1]. Hepatocyte growth factor (HGF) is one of the major components secreted by the tumor stroma that drives tumorigenesis [1,2,3,4,5]. Takayama et al. conducted an in vivo study in transgenic mice and found that overexpression of HGF in mammary cells led to neoplasms [5]. In breast cancer cell lines, activation of the HGF pathway via binding of HGF to its receptor c-MET can lead to increased cell survival, proliferation, and resistance to cancer inhibitors [1]. In breast tumors, clinical studies have correlated activation of the HGF pathway (as defined by c-MET over-expression) with increased tumor size, high tumor grade, and distant metastasis [6]. In addition, a meta-analysis indicated that c-MET overexpression was associated with overall and disease-free survival in breast cancer patients [6]. However, in clinical trials that target the HGF-c-MET pathway, there is a lack of biomarkers for HGF expression that accurately identify patients prone to respond to targeted therapy [7].

The HGF pathway may also play a role in breast cancer disparities by race. HGF germline variations that differ by race have been found to modulate the expression of HGF in blood, normal tissue, and breast cancer tumor samples [8]. Ma et al. found that African American breast cancer patients had a higher frequency of truncating mutations (51%) in the promoter region of HGF compared to Caucasian breast cancer patients (15%), and these mutations were shown to result in increased expression of HGF in breast cancer tissue. The truncating mutations were also discovered in the germline (normal breast and normal blood tissue) and were found to be associated with case vs. non-case status [8]. We previously published an HGF gene expression signature that was associated with poorer survival in a small study [9], but our study population had insufficient racial diversity to evaluate the role of HGF gene expression by race.

Black women experience 40% higher mortality from breast cancer compared to white women [10]. It is important to identify biological pathways that may contribute to these disparities, and due to its role in cancer progression and differences in its expression, HGF is a plausible contributor. To evaluate the role of HGF pathway activation in breast cancer disparities, we assessed a 38-gene HGF gene expression signature in invasive breast cancer cases in a population-based resource, the Carolina Breast Cancer Study. We sought to understand associations between the HGF pathway and demographic characteristics, clinical features, and tumor subtypes within this racially diverse population.

Methods

Study population

The Carolina Breast Cancer Study (CBCS) population has been described in previous publications [11, 12]. In brief, CBCS is a population-based study that used rapid case ascertainment to identify breast cancer cases from the North Carolina Cancer Registry. Phases 1 and 2 of CBCS were conducted between 1993 and 2001 in 24 counties, and phase 3 subsequently expanded the study to a total of 44 counties in 2008–2013. Eligible participants were women aged 20–74 years who were North Carolina residents. Black and younger women (age < 50) were oversampled using randomized recruitment methods. Informed consent was obtained from each participant. This study was approved by the University of North Carolina at Chapel Hill Office of Human Ethics and Institutional Review Board. In total, 4806 invasive breast cancer cases were enrolled in CBCS (phases 1–3). Within this population, 1188 participants were removed due to inadequate tissue for analysis, and quality control analysis removed a further 241 participants for low-quality RNA. Of the remaining 3377 participants, 1957 were analyzed on the HGF gene expression assay. Participants who were not included in the study differed on certain clinical variables, including smaller tumor size, lower grade, and lower stage. However, the clinical and demographic features of the analysis set were similar to the distribution of the Carolina Breast Cancer Study as a whole, except that tumors with a higher grade were more likely to be sampled. For the purposes of this analysis, 1957 invasive breast cancer cases with expression data for the HGF signature from all 3 CBCS phases (phase 1: n = 252, phase 2: n = 454, phase 3: n = 1251) were included.

Demographic and clinical characteristics

Home interviews were conducted by a trained nurse, and all demographic and lifestyle information was self-reported, except body mass index (BMI), which was calculated from body measurements obtained by the nurse. Clinical tumor characteristics (estrogen receptor status, progesterone receptor status, HER2 receptor status, combined tumor grade and AJCC stage) were obtained from medical records, pathology reports, and immunohistochemical staining analysis at the University of North Carolina at Chapel Hill. Combined tumor grade was only available for CBCS phase 1 and phase 3 tumors and was assigned by a single pathologist to respective grading categories using the Nottingham breast cancer grading system [13]. Similar distributions for grade were observed within CBCS phase 1 and phase 3 (chi-square p value = 0.07). Phase 2 participants were excluded from analyses of the association of HGF with tumor grade.

Gene expression data

Gene expression analysis for CBCS was described in prior publications [14]. Briefly, formalin-fixed paraffin-embedded (FFPE) tumor specimens were used (n = 2 (1 mm) cores; n = 2 (10 μm) FFPE slides). RNA was isolated from FFPE specimens using the Qiagen FFPE RNeasy isolation kit (Germantown, MD) and quantified using NanoString nCounter technology (Seattle, WA). A custom code set was used to measure the genes in the PAM50 predictor (to characterize RNA-based intrinsic breast cancer subtypes, namely luminal A, luminal B, HER2-enriched, basal-like, and normal-like), the risk of recurrence score (ROR-PT), and the HGF 38-gene signature [14, 15]. The ROR-PT score predicts the risk of distant recurrence by incorporating information on subtype, proliferation score, and tumor size [15, 16]. After quantification of the RNA targets, NanoString gene expression values were normalized as previously described with the remove unwanted variation (RUV) approach, using the RUVg function from the RUVSeq Bioconductor package [17, 18]. We controlled for unwanted technical variation using the set of housekeeping genes that had expression above background in > 98% of samples and the highest correlation with expression of other housekeeping genes (Spearman coefficient ≥ 0.85). Six of the eleven housekeeping genes on our code set met these criteria, namely GUSB, ACTB, GAPDH, PGK1, RPLP0, and SF3A1. Ultimately, we removed 2 dimensions of unwanted variation with RUVg (k = 2). Data were median centered across genes for heatmap visualization in R version 3.5.3 (RStudio).
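The median-centering step used for heatmap visualization can be illustrated with a minimal Python sketch. It assumes that centering is applied per gene (each gene's median across samples is subtracted), which is one reading of the description above; the function name and the toy values are hypothetical and are not CBCS data.

```python
from statistics import median


def median_center_genes(expr):
    """Subtract each gene's median (taken across samples) from its values.

    expr maps a gene name to a list of normalized log2 expression values,
    one per sample, in a fixed sample order.
    """
    centered = {}
    for gene, values in expr.items():
        m = median(values)
        centered[gene] = [v - m for v in values]
    return centered


# Toy values for illustration only (not CBCS data).
toy = {"AQP5": [1.0, 3.0, 5.0], "TFF3": [2.0, 2.5, 9.0]}
centered = median_center_genes(toy)
```

After centering, each gene has a median of zero, so heatmap colors reflect expression relative to the typical sample rather than absolute abundance.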

HGF 38-gene signature

A 38-gene hepatocyte growth factor signature was derived from a 280-gene HGF signature previously described by our research group in Casbas-Hernandez et al. [9]. This 280-gene signature was mapped to three public gene expression datasets [NK1295 [19], UNC337 [20], and Naderi and colleagues [21]], and 109 unique genes were identified across all three datasets and used to classify tumors [9]. Using the shrunken centroid method [22], we identified 38 genes that could recapitulate the classification of samples based on the 109-gene set. The 38-gene HGF classifier includes the following genes: TMEM45B, AKR7L, AQP5, C1QTNF3, C2ORF27A, C4ORF31, C9ORF98, CAPN13, CASKIN1, CMYA5, DTX3, EFHD1, F7, FMNL2, FUT8, GCNT2, HRC, INPP4B, ISLR2, KCNMA1, KCNN4, KIF3A, MAGI2, MARVELD2, NME5, PKIB, PRRG2, PRRT2, PVRL2, REEP6, RIMS4, SCUBE2, SHROOM3, SKAP1, SYBU, TFF3, and TMSB15B.

To classify each sample as HGF-positive or HGF-negative, we applied the 38-gene signature as a weighted sum score. The normalized, log2-transformed expression values of the 38 genes in the consolidated HGF signature were summed after multiplying upregulated genes by 1 and downregulated genes by − 1, which preserves the directionality of each gene in the reference signature from our training cohort (Eq. 1).

$$\sum_{g} {W}_g \cdot {Z}_g \tag{1}$$
  • W_g = weight of gene g (− 1 or 1, based on prior knowledge of downregulation or upregulation in the HGF signature)

  • g = gene in the HGF expression signature

  • Z_g = normalized, log2-transformed expression of gene g from the known HGF signature
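As an illustration of Eq. 1, the following Python sketch computes the weighted sum for a single sample. The function name and the toy expression values are hypothetical; the three example weights follow the directions reported for those genes in the signature (AQP5 upregulated, TFF3 and SCUBE2 downregulated). This is a sketch of the arithmetic, not the analysis code used in the study.

```python
def hgf_weighted_sum(expression, weights):
    """Return the sum over signature genes of W_g * Z_g.

    expression: gene -> normalized, log2-transformed expression (Z_g)
    weights:    gene -> +1 (upregulated by HGF) or -1 (downregulated) (W_g)
    Genes in `expression` that are not in `weights` are ignored.
    """
    missing = [g for g in weights if g not in expression]
    if missing:
        raise KeyError(f"missing signature genes: {missing}")
    return sum(weights[g] * expression[g] for g in weights)


# Toy three-gene example (hypothetical values, not CBCS data).
toy_weights = {"AQP5": 1, "TFF3": -1, "SCUBE2": -1}
toy_sample = {"AQP5": 2.0, "TFF3": -1.5, "SCUBE2": 0.5, "ACTB": 7.0}
score = hgf_weighted_sum(toy_sample, toy_weights)
```

A higher score indicates that upregulated genes are high and downregulated genes are low, that is, an expression profile closer to the HGF-responsive reference.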

Within the original HGF signature [9], there were 6 genes upregulated (FMNL2, KCNN4, AQP5, GCNT2, TMSB15B, and DOCK3), and 32 were downregulated by HGF. The HGF weighted sum score was dichotomized using the mclust R package version 5.4.5, which determines cutpoints for classification based on Gaussian mixture analysis [23]. For this analysis, HGF positivity was defined as having the directional expression profile of tumors that are responsive to HGF protein treatment in breast cancer cells as assessed in Casbas-Hernandez et al. [9]. The modified HGF expression signature was concordant with the original signature trained on TCGA data (data not shown, 86% agreement, p value < 0.001).
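The idea behind dichotomizing a continuous score with a Gaussian mixture can be sketched in Python as follows. This is a simplified stand-in and not the mclust procedure: it assumes exactly two components with unequal variances and fits them by a basic expectation-maximization loop, whereas mclust selects among several model forms. All function names and the toy scores are hypothetical.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(xs, n_iter=200, var_floor=1e-6):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns (weights, means, variances), each a list of length 2.
    """
    xs = sorted(xs)
    n = len(xs)
    half = n // 2
    # Initialize the means from the lower and upper halves of the data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    grand = sum(xs) / n
    var0 = max(sum((x - grand) ** 2 for x in xs) / n, var_floor)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 1.
        resp = []
        for x in xs:
            p0 = weights[0] * normal_pdf(x, means[0], variances[0])
            p1 = weights[1] * normal_pdf(x, means[1], variances[1])
            total = p0 + p1
            resp.append(p1 / total if total > 0 else 0.5)
        # M-step: update weights, means, and variances from the posteriors.
        n1 = max(sum(resp), 1e-9)
        n0 = max(n - sum(resp), 1e-9)
        means = [
            sum((1 - r) * x for r, x in zip(resp, xs)) / n0,
            sum(r * x for r, x in zip(resp, xs)) / n1,
        ]
        variances = [
            max(sum((1 - r) * (x - means[0]) ** 2 for r, x in zip(resp, xs)) / n0, var_floor),
            max(sum(r * (x - means[1]) ** 2 for r, x in zip(resp, xs)) / n1, var_floor),
        ]
        weights = [n0 / n, n1 / n]
    return weights, means, variances


def is_hgf_positive(x, params):
    """Assign x to the component with the larger mean if it is more probable."""
    weights, means, variances = params
    high = 0 if means[0] > means[1] else 1
    low = 1 - high
    p_high = weights[high] * normal_pdf(x, means[high], variances[high])
    p_low = weights[low] * normal_pdf(x, means[low], variances[low])
    return p_high > p_low


# Toy weighted-sum scores (hypothetical, not CBCS data).
toy_scores = [-2.4, -2.1, -1.9, -2.0, -1.6, -2.2, -1.8, 1.9, 2.3, 1.7, 2.1]
params = fit_two_gaussians(toy_scores)
labels = [is_hgf_positive(s, params) for s in toy_scores]
```

The implied cutpoint is the score at which the two weighted component densities are equal; samples above it fall in the higher-scoring component and would be called HGF-positive under this sketch.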

Statistical analysis

Demographic variables, including age at diagnosis (< 40, 40–49, 50+ years old), race (black, non-black), parity and breastfeeding (nulliparous, parous and never breastfed, parous and breastfed), and family history of breast cancer (yes or no), were defined as categorical variables. Body mass index (BMI) was measured as a continuous variable but was categorized and stratified by menopausal status (normal/underweight [BMI < 25], overweight [25 ≤ BMI < 30], obese [BMI ≥ 30]). Clinical characteristics and tumor subtypes were defined as follows, based on the clinical record: estrogen receptor status [positive (> 10% positivity), negative (0% positivity), or borderline (1–10% positivity; not included in this analysis and set to missing, n = 43)], progesterone receptor status [positive (> 10% positivity), negative (0% positivity), or borderline (1–10% positivity; not included in this analysis and set to missing, n = 100)], HER2 receptor status (positive or negative), hormone receptor tumor type (hormone receptor-positive/HER2-negative, triple-negative breast cancer, hormone receptor-negative/HER2-positive), triple-negative status (non-triple-negative breast cancer, triple-negative breast cancer), and clinical stage (AJCC: stage I, stage II, stage III/IV).

RNA-based variables were defined as follows: PAM50 intrinsic subtypes (luminal A, luminal B, HER2-enriched, basal-like, and normal-like), basal-like status (basal vs. non-basal), ROR-PT score (high, medium/low), and HGF (positive, negative). Normal-like samples were assumed to have insufficient tumor cellularity to produce a tumor call and were therefore removed from analysis (n = 67).

The HGF gene signature was assessed for associations with demographic and clinical data using generalized linear models to estimate relative frequency differences (RFDs) in univariate and multivariable models [24]. The generalized linear models used an identity link function with a binomial distribution to calculate the RFDs. Multivariable models were adjusted for age, race, or both. Covariates (age and race) were selected based on literature review and directed acyclic graph analysis. Race-stratified analyses compared black vs. non-black women. A sensitivity analysis comparing black women (n = 1033) with white women (n = 879) yielded associations with the HGF signature that did not differ statistically from the black vs. non-black associations. To retain power to examine associations of the HGF signature with breast cancer features, all subsequent analyses combined white and “other” racial groups into the non-black category (“other race” n = 45). To address multiple hypothesis testing for associations of HGF with patient and clinical features, we applied the Benjamini-Hochberg false discovery rate (FDR) method to all RFD models [25]. Statistical analyses were completed in Stata 15 SE and the R statistical environment version 3.5.3.
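For a single binary predictor, a binomial model with an identity link reduces to a difference in proportions, so an unadjusted RFD and its Wald 95% confidence interval can be sketched directly from counts. The Python sketch below also shows the Benjamini-Hochberg adjustment. The function names, counts, and p values are hypothetical illustrations; age- or race-adjusted RFDs require fitting the full generalized linear model and are not reproduced here.

```python
import math


def relative_frequency_difference(pos_exposed, n_exposed, pos_ref, n_ref, z=1.96):
    """Unadjusted RFD (difference in proportions) with a Wald confidence interval."""
    p1 = pos_exposed / n_exposed
    p0 = pos_ref / n_ref
    rfd = p1 - p0
    se = math.sqrt(p1 * (1 - p1) / n_exposed + p0 * (1 - p0) / n_ref)
    return rfd, rfd - z * se, rfd + z * se


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: 40 of 100 positive in one group vs. 20 of 100 in the reference.
rfd, lower, upper = relative_frequency_difference(40, 100, 20, 100)

# Hypothetical p values from four separate RFD models.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

An association is typically reported as significant after correction when its adjusted p value falls below the chosen FDR threshold (for example, 0.05).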

Results

In the Carolina Breast Cancer Study, 32% of participants were classified as HGF-positive by our 38-gene assay. Clustering the HGF signature genes across all of the CBCS patients, we found two main gene clusters that corresponded well with expression patterns from the original reference signature [9]. Specifically, HGF-positive tumors had few (n = 6) genes highly expressed, while most genes had a characteristic pattern of lower expression (Fig. 1). TNBC samples were enriched in the HGF-positive cluster.

Fig. 1 Heatmap of the 38-gene HGF gene signature in the Carolina Breast Cancer Study. Expression of genes in the HGF (hepatocyte growth factor) classifier clustered by gene and sample in CBCS, with indicators for race and triple-negative breast cancer (TNBC) subtype

To identify demographic characteristics associated with HGF-positive tumors, we evaluated relative frequency differences (RFDs) by race, age, and other variables. Table 1 shows that HGF positivity was more common among black women than among non-black women [42% vs. 21%; RFD = + 19.93%, 95% CI (16.00, 23.87)]. Compared to women over 50, women under the age of 40 had the highest frequency of HGF-positive tumors [47% vs. 26%; RFD = + 20.33%, 95% CI (14.05, 26.61)], followed by women between the ages of 40 and 49 [34% vs. 26%; RFD = + 8.51%, 95% CI (4.29, 15.98)]. Tumor grade III was strongly associated with HGF positivity [53% vs. 9%; RFD = + 41.11%, 95% CI (36.68, 45.55)]. Higher breast cancer stage was also associated with increased prevalence of HGF positivity when compared to stage I [36% for stage II vs. 24%; RFD = + 6.16%, 95% CI (0.21, 12.11), and 37% for stage III/IV vs. 24%; RFD = + 7.89%, 95% CI (3.54, 12.22)]; however, after adjustment for multiple comparisons, the association between HGF and stage III/IV was no longer statistically significant (FDR p value = 0.068).

Table 1 Participant characteristics according to 38-gene HGF signature expression, Carolina Breast Cancer Study, 1993–2013

Most breast cancer risk factors showed little association with HGF positivity. Body mass index and family history of breast cancer were not associated with HGF-positive tumors (Table 1). However, there was an inverse relationship between HGF-positive tumors and reproductive history factors. Specifically, nulliparous women and parous women with a history of breastfeeding had a lower frequency of HGF-positive tumors compared to parous women who have never breastfed.

Breast cancer subtypes are defined by tumor markers, either clinically based on IHC or molecularly based on RNA expression, and both approaches have been shown to have prognostic value. We considered both IHC-based and RNA-based subtypes in association with the HGF signature (Table 2). Among IHC-based subtypes, HGF positivity was strongly associated with triple-negative breast cancer, with 82% of these tumors being HGF-positive compared to 13% of non-TNBC tumors [RFD = + 65.85%, 95% CI (61.71, 69.98)]. When molecular subtypes were defined by PAM50, HGF positivity was a defining feature of the basal-like subtype. HGF positivity was present in 92% of basal-like tumors, 32% of HER2-enriched tumors, and 6% or less of luminal tumors (Table 2). HGF positivity was also significantly associated with high risk of recurrence scores (ROR-PT) in CBCS [71% vs. 19%; RFD = + 48.20%, 95% CI (43.42, 52.99)]. Given that HGF positivity was highly prevalent in basal-like tumors, and that basal-like status could be a mediator of the associations between HGF status and tumor aggressiveness, we performed a sensitivity analysis among non-basal tumors to see whether the associations between HGF positivity and clinical characteristics remained. HGF positivity was still significantly associated with race, age, grade, and risk of recurrence scores within non-basal tumors, despite the much lower prevalence of HGF positivity among these tumors (Table 3).

Table 2 Association of 38-gene HGF signature with breast cancer clinical and genomic subtypes and risk of recurrence (ROR) genomic score
Table 3 Distribution of 38-gene HGF signature expression within non-basal-like tumors (N = 1338), Carolina Breast Cancer Study, 1993–2013

Discussion

This paper presents a novel biomarker of HGF positivity, a 38-gene signature developed through experimental methods and fine-tuned through application in breast tumor data. Using this signature, HGF positivity was associated with aggressive breast cancer subtypes and was strongly associated with the basal-like subtype. In non-basal-like tumors, HGF positivity was less common; however, significant associations with many aggressive clinical features remained. Given that HGF positivity may ultimately be clinically targetable and is correlated with a number of poor-prognosis clinical characteristics in both basal-like and non-basal-like tumors (including tumor stage, hormone receptor-negative markers, tumor grade, and higher risk of recurrence scores), it is important to understand its distribution and contribution to outcomes. The distribution of HGF positivity also mirrors some long-standing breast cancer outcome disparities: HGF positivity is more prevalent among black participants and among women under the age of 50 (and especially those under 40). Taken together, this population-based study contributes important information on the distribution of HGF-positive tumors in breast cancer.

Our findings on the relationship between HGF and patient characteristics are consistent with previous literature on HGF, notably the associations with black race, high stage, high grade, and younger age, but our findings extend the literature in several ways. Previous studies focused mainly on germline genetic sequence and included relatively small numbers of patients. Ma et al. observed truncating mutations in the HGF promoter region in 51% of African American breast cancer patients compared to 15% of Caucasian patients [8]. Our findings showed not only that HGF positivity is associated with race at the level of somatic tumor gene expression, but also that HGF positivity is associated with other features such as increased risk of recurrence (ROR) score and basal-like phenotype, providing a plausible link between HGF and racial disparities in breast cancer. We recognize that race is a social construct and that this variable may encompass effects of environment, social inequities, and discrimination that are not captured in this analysis. However, we also note that self-reported race and ancestry are highly concordant in the CBCS population [26].

In our assessment of the association between age and HGF expression, women under the age of 50 had a higher frequency of HGF-positive tumors. Ma et al. also observed a statistically significant association with age, where younger breast cancer patients were more likely to have the HGF promoter mutation [8]. However, a recent meta-analysis of c-MET expression by Zhao et al. found no association between age and c-MET expression [6]. The meta-analysis combined studies that used different methods to detect c-MET expression, including protein-based (75% of studies) and RNA-based (25% of studies) techniques [6, 27], which could have contributed to between-study differences. Our analysis was concordant with a number of clinical studies [28,29,30,31,32] showing that higher stage and tumor grade were associated with HGF-positive tumors.

The HGF/c-MET axis is an attractive pathway in breast cancer research because it is targetable with existing therapeutics [1, 7, 30, 33]. Moreover, HGF positivity appears to be a feature of the triple-negative/basal subtype [9, 31, 34,35,36,37], which does not currently have any targeted therapies [38]. However, the lack of an established HGF biomarker has been problematic. Multiple biomarkers representing the HGF pathway appear in the literature, including c-MET RNA expression, c-MET exon skipping, c-MET protein expression, c-MET amplification, c-MET receptor, and/or HGF protein expression; however, none has been validated, and efficacy in clinical trials has been shown only for the c-MET exon skipping marker, for which the demonstrated predictive benefit was minimal [7, 39]. Here we demonstrate a multi-gene HGF signature that can retain complex biological information on the pathway. This pathway could be targetable in both basal-like and non-basal tumors. Although HGF positivity was highly prevalent among triple-negative breast cancers, it was a defining feature of basal-like tumors. Basal-like and triple-negative breast cancer subtypes are often used interchangeably, but there is heterogeneity in gene expression within triple-negative breast cancers [40, 41]. The associations between clinical characteristics and HGF positivity that we observed may have been partially mediated by the aggressive phenotype of the basal-like subtype.

HGF positivity also occurs in non-basal-like tumors. HGF positivity was present among 8% of non-basal-like tumors and was associated with more aggressive features, suggesting the pathway may also affect some of these tumors. Raghav et al. examined the relationship between c-MET expression, as measured by reverse phase protein array, and breast cancer recurrence among 257 invasive breast cancers [42]. The study found that total c-MET levels in hormone receptor-positive subtypes and phosphorylated c-MET levels in HER2 subtypes were associated with recurrence [42]. This suggests that the association we observed between HGF positivity and aggressive tumor phenotypes may have consequences for recurrence. Others have suggested that HGF/c-MET expression may also influence prognosis specifically in HER2-overexpressing tumors via resistance to HER2-targeted therapies [43, 44]. These associations between HGF positivity and outcome should be assessed in future studies using the 38-gene assay developed here.

A strength of this analysis was the use of a novel 38-gene biomarker and a pathway-based approach, rather than classifying tumors based on a single gene. The signature was developed to be concordant with a larger signature in The Cancer Genome Atlas Project. Furthermore, NanoString technology has increased sensitivity and reproducibility when compared to traditional methods such as qPCR [45], particularly when using FFPE specimens. Another strength is the large, racially diverse, population-based study design; the large sample size provided statistical power for the current analysis.

Some limitations also affect this work. While we describe the distribution of a novel signature for HGF, we lacked data to assess whether this signature predicted response to HGF-targeted therapy. We also do not have data on specific HGF-pathway proteins, which limits our ability to directly compare RNA-based and protein-based biomarkers. Although we assayed RNA and did not specifically evaluate whether protein levels of HGF were concordant with RNA in this population, our previous research suggests that RNA-based findings were concordant with protein-based findings [9]. That earlier result, together with the concordance of our findings with patterns in previous literature, mitigates this concern somewhat. We were also unable to fully disentangle the role of basal-like subtype in driving HGF associations with tumor aggressiveness. HGF positivity was very common among basal-like tumors and relatively uncommon among non-basal-like tumors, leaving these assessments somewhat underpowered, though even in these small strata, the associations with tumor aggressiveness appear consistent.

Currently, one of the leading challenges in targeting c-MET in clinical trials is the difficulty of selecting appropriate patient populations for targeted therapy [46]. Biomarkers are needed to improve the efficacy of therapies that target the c-MET/HGF signaling pathway, especially within breast cancer. With further validation, this novel gene signature could be used to identify high-risk populations or to guide targeted treatment options.

Conclusion

This study observed that the novel HGF gene expression signature was a defining feature in basal-like breast cancer tumors. This signature was also found to be more prevalent in women under 50 and black women, populations most severely affected by breast cancer outcome disparities. The prevalence of this signature among populations adversely affected by breast cancer suggests this pathway may be a candidate for targetable molecular therapy that influences breast cancer disparities.