Background

Breast cancer is one of the most common human neoplasms, accounting for approximately one quarter of all cancers in females. Invasive breast cancer is the most common carcinoma in women [1, 2]. Most cases arise from epithelial cells of the mammary ductal system. In non-pregnancy and non-lactating periods, these epithelial cells produce a secretion that, when collected, is called the nipple aspirate fluid (NAF) [3]. As a protein-rich breast-proximal fluid closely related to the tumor microenvironment in cancer patients, NAF constitutes a valuable biological sample to study secreted proteins from tumor cells without contamination by other interstitial fluids or cells [4, 5]. Proteomic studies of human body fluids and tissues are challenging, especially due to the high biological variability. Since breast is a “paired” organ, in unilateral breast cancers, the contralateral non-diseased breast from the same individual can be used as an ideal negative control of the cancerous breast [6], ultimately increasing the statistical power.

In 2014, we evaluated for the first time 14 paired NAF samples from seven patients with unilateral breast cancer by PAGE, zymography, and DIGE strategies. Our results revealed very distinct proteomic profiles among patients (i.e., individual differences). However, NAF profiles from both breasts of the same woman were qualitatively very similar, although important quantitative differences in protein spot intensities could be observed. Patients with less aggressive tumors shared a similar, homogeneous profile, with a typical set of proteins identified. In contrast, patients with more aggressive tumors presented unique (i.e., heterogeneous) profiles [7].

DIGE is the state-of-the-art method for sample comparisons by two-dimensional gel electrophoresis. With this technique, a statistical analysis of spot intensity differences between two conditions (Cy3- and Cy5-labeled) requires overlaying the images using the Cy2-labeled internal standard as the reference image. This standard is made up of equal amounts of all samples in the study [8]. Due to the substantial individual heterogeneity found in NAF samples, it was not possible to confidently overlay their gel images, which hampered the use of statistical tests for pinpointing differentially abundant candidate markers between the cancer and the control samples. Although the study failed to provide valuable candidate markers, it was fundamental in demonstrating that, despite substantial qualitative differences between individuals, the electrophoretic patterns of NAF samples from both breasts of the same patient were very similar, regardless of cancer status [7].

To overcome the limitations imposed by our previous gel-based analytical strategy, a shotgun label-free proteomic approach was applied to further advance the proteomic characterization of NAF samples. However, attempts to use classical data analysis tools (e.g. Proteome Discoverer and Progenesis) [9, 10] were not successful in providing differential results. The significant inter-individual variability of NAF samples confounds chromatogram alignment, which constitutes an important first step of many algorithms. Most importantly, such traditional shotgun proteomic statistical algorithms do not capitalize on the sample pairing. As normalization across patients is not trivial, data analysis strategies that test for differential abundance by comparing the average values of each group (cancer vs control) for each protein are not suitable for the task at hand; these tools work considerably better for models with lower biological variation, such as cell cultures or mouse models [11, 12].

Taken together, our cumulative experience on various studies made clear that the substantial individual heterogeneity of these clinical samples required further development of proteomic data analysis tools. The pairing of the NAF samples constitutes the core of our strategy and capitalizes on the subtle variations within the same patient. Therefore, we developed an extension to the PatternLab for Proteomics suite that was tailored for the data analysis challenges at hand, which finally enabled us to confidently perform a differential proteomic comparison of breast cancer secretome samples (NAF) from patients with unilateral breast cancer. In summary, here we propose a consistent quantitative analysis workflow for the evaluation of a heterogeneous biological fluid that constitutes a valuable source of information with potential applications in clinical evaluation of breast cancer patients.

Methods

Sample collection

NAF samples were collected from both breasts of 10 patients with biopsy-proven unilateral invasive ductal carcinoma, yielding 20 biological samples (10 cancerous and 10 control), and from both breasts of three individuals with no positive diagnosis of breast disease in either breast (providing six more biological samples). All samples were collected at the Mastology Service of the Fernandes Figueira Institute (IFF) of Fiocruz or at the Gynecology Ambulatory of Lagoa Federal Hospital (Table 1). Eligibility criteria for all subjects were: a) to be post-menopausal; b) no intake of exogenous hormones during the previous six months; c) no breast surgery or chemotherapy; d) no previous clinical evidence of breast disease or cancer. After obtaining written informed consent (IFF Research Ethics Committee, license 0083/10) and clinical and imaging confirmation of the diagnosis status, NAF collection and protein quantification were performed as previously described [7]. Briefly, the breast was gently massaged from the chest wall toward the nipple for 5 min, followed by a warm compress for an equal time. The nipple fluids were then aspirated using breast pumps and the fluid droplets were collected using a 10 μL micropipette (Gilson, Inc., Middleton, WI, USA). The NAF samples were immediately diluted 10-fold in phosphate-buffered saline (pH 7.4) and centrifuged at 250 x g for 10 min at 6 °C, and the supernatant was collected and stored at −80 °C. The NAF protein concentrations were determined using the bicinchoninic acid protein assay kit (Sigma-Aldrich, St. Louis, MO, USA).

Table 1 Reproductive and tumor characteristics of the ten unilateral breast cancer cases and three individuals without breast disease analyzed

Sample preparation

One hundred micrograms of lyophilized NAF proteins were dissolved in 20 μL of 400 mM ammonium bicarbonate/8 M urea, followed by digestion as described elsewhere [13]. The digested peptide mixture was desalted using homemade tip columns packed with Poros R2 resin (Applied Biosystems, USA). Samples were finally dried in a vacuum centrifuge [14].

Mass spectrometry data acquisition

Desalted tryptic peptides were resuspended in 100 μL of 0.1% (v/v) trifluoroacetic acid. Samples were then analyzed by nLC-MS/MS using an UltiMate 3000 RSLC system (Dionex, USA) coupled to an Orbitrap Elite mass spectrometer (ThermoFisher Scientific, Germany). Initially, peptides were loaded (normalized TIC values between 5 × 10^8 – 1 × 10^9, corresponding to 1–4 μL) with 0.1% TFA at 20 μL/min to a 2-cm long (100 μm i.d.) Acclaim® PepMap100 NanoViper Trap column packed with 5 μm silica particles, 100 Å pore size, followed by separation at 250 nL/min on a 50 cm × 75 μm i.d. Acclaim® PepMap100 NanoViper column, both at 60 °C. Peptides were eluted with a gradient of 3 to 45% of 0.1% (v/v) formic acid and 84% (v/v) acetonitrile over 187 min. The spray voltage was set to 1.8 kV with capillary temperature of 275 °C and no sheath or auxiliary gas flow. Full MS spectra were acquired with 1 microscan on the Orbitrap analyzer at a 60,000 resolution (FWHM at m/z 400) with a target AGC value set to 1 × 10^6. For each survey scan (300 to 1500 m/z range), up to 10 most abundant precursor ions were sequentially submitted to CID fragmentation and MS2 analysis in the LTQ using the following parameters: MSn AGC target value of 1 × 10^4, normalized collision energy of 35%, minimum signal threshold of 2000 counts and dynamic exclusion time of 30 s.

Data analysis

Peptide-spectrum matching (PSM) was performed using the Comet [15] search engine (version 2016.01), which is embedded in PatternLab for Proteomics (version 4.1, http://patternlabforproteomics.org) [16]. Sequences from Homo sapiens were downloaded from UniProtKB/Swiss-Prot on September 17, 2018 (42,402 target entries, http://www.uniprot.org/). The final search database, generated using PatternLab’s Search Database Generator tool, included a reverse decoy for each target sequence plus sequences from 127 common contaminants, such as BSA, keratin, and trypsin. The search parameters included: fully tryptic and semi-tryptic peptide candidates with masses between 550 and 5500 Da, up to two missed cleavages, 40 ppm for precursor mass and bins of 1.0005 m/z for MS/MS. The modifications were carbamidomethylation of cysteine (fixed) and oxidation of methionine (variable). The validity of the PSMs was assessed using the Search Engine Processor (SEPro) [16]. Identifications were grouped by tryptic status, resulting in two distinct subgroups. For each result, XCorr, DeltaCN, and Comet’s secondary score values were used to generate a Bayesian discriminator. A cutoff score was established to accept a false-discovery rate (FDR) of 1%. A minimum sequence length of 6 amino acid residues was required and the results were further filtered to only accept PSMs with a precursor mass error of less than 6 ppm. Proteins identified by only one spectrum (i.e. 1-hit-wonders) having an XCorr below 2.0 were excluded from the identification list. The post-processing filter resulted in a global FDR, at the protein level, of less than 1% and was independent of the tryptic status [17].
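The precursor mass error filter mentioned above can be illustrated with a minimal sketch. The function names and the example masses below are hypothetical and are not part of Comet, SEPro, or PatternLab; the sketch only shows how a mass error in ppm is computed and compared against the 6 ppm cutoff.

```python
def ppm_error(observed_mass: float, theoretical_mass: float) -> float:
    """Signed mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def passes_mass_filter(observed_mass: float, theoretical_mass: float,
                       max_ppm: float = 6.0) -> bool:
    """True when the absolute precursor mass error is below max_ppm."""
    return abs(ppm_error(observed_mass, theoretical_mass)) < max_ppm


# Hypothetical (observed, theoretical) precursor masses in Da.
psms = [(1000.004, 1000.000), (1000.010, 1000.000)]
kept = [psm for psm in psms if passes_mass_filter(*psm)]
```

In this toy input the first PSM has an error of about 4 ppm and is kept, whereas the second has an error of about 10 ppm and is discarded.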

Experimental design and statistical rationale

For breast cancer patients, NAF samples were collected (up to three attempts) in a brief time window between the diagnosis and surgery. Even though proteomic differences in NAF due to the activity of ovarian hormones are believed to be negligible [18,19,20], we were cautious to only include post-menopausal individuals. Through a workflow of only a few steps, a high-resolution and sensitive nLC-MS/MS analysis [21, 22] was carried out for shotgun evaluation of NAF samples.

PatternLab’s XIC extraction tool was used for obtaining the XICs of peptides confidently identified according to SEPro. The XIC extraction of precursor intensity measurements was performed with a tolerance of 9 ppm and accepted charge states of +2 and +3. PatternLab’s XIC Explorer was then used to visually assess the distribution of intensities of the label-free quantitations, label each run as control or disease, and tag which samples were from the same patient for further paired analysis. Additionally, PatternLab’s TFold module was used to demonstrate a standard comparison of mean values between two groups: NAF from diseased breasts versus non-diseased ones.

The .xic file provided by PatternLab served as input to a tool named Paired Analyzer (PA), specifically developed for this study. PA begins by normalizing the XICs from each peptide according to the total ion current from each run. The paired analysis of each unique (i.e. proteotypic) peptide required six or more sequential precursor intensity measurements and a minimum fold change of 1.5. Then, for each peptide, the software extracts a list of values according to one of four possibilities: i) when a peptide’s XIC is obtained from data originating from both breasts, an XIC ratio (cancer:control) is recorded; ii) when an XIC is not obtained from either breast, a “0” (zero) is recorded; iii) when an XIC is obtained only from the diseased sample, a “+” (plus) is recorded; iv) when an XIC is obtained only from the control sample, a “–” (minus) is recorded (Table 2). PA then relies on a peptide-centric approach to assign each peptide a p-value for being differentially abundant. For this, we follow a paired binomial approach. Our model assumes a 50% chance for a randomly selected peptide to be a success relative to each individual patient for which an XIC was obtained from at least one breast, where success is to be understood as that peptide having a ratio greater than 1 or a “+” for the patient in question. A peptide’s number of successes is the random variable X, and we calculate its p-value as the probability P(X ≥ x), given by a sum of binomial probabilities, where x is the number of patients for which success was observed. Thus, for the peptide Pa (Table 2), the number of successes (x) is 3 and the number of trials (n, the number of columns not having 0 as a value) is 5, which yields P(X ≥ 3) = 0.5. For Pb, x = 2 and n = 10, yielding P(X ≥ 2) = 0.99. For Pc, x = 6 and n = 6, yielding P(X ≥ 6) = 0.02. In summary, low p-values (e.g. p < 0.025) link a peptide to the cancer condition; on the other hand, high values (e.g. p > 0.975) link the peptide to the control condition.
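The peptide-level test can be sketched in a few lines. This is an illustrative reimplementation, not PA's source code: the function names and the input lists are hypothetical, and the TIC normalization and the 1.5 fold-change requirement are left out. The p-value is computed as the binomial upper tail that includes the observed number of successes, which reproduces the worked values quoted above (0.5, 0.99, and 0.02).

```python
from math import comb


def classify(cancer_xic, control_xic):
    """Encode one patient's observation as a ratio, '+', '-', or '0'."""
    if cancer_xic is None and control_xic is None:
        return "0"
    if control_xic is None:
        return "+"
    if cancer_xic is None:
        return "-"
    return cancer_xic / control_xic


def paired_binomial_p(values):
    """Upper-tail probability of at least x successes under Binomial(n, 0.5).

    n counts the patients with an XIC from at least one breast;
    x counts the patients with a ratio above 1 or a '+'.
    """
    informative = [v for v in values if v != "0"]
    n = len(informative)
    x = sum(1 for v in informative if v == "+" or (v != "-" and v > 1))
    return sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n


# Hypothetical peptide with 3 successes among 5 informative patients.
p_a = paired_binomial_p([2.0, "+", 1.8, "-", 0.5, "0"])  # 0.5
```

A peptide observed with a ratio above 1 in all six of six patients would receive 1/64, or approximately 0.02.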

Table 2 Theoretical example for the peptide-centric approach in the PA module of the PatternLab tool

Finally, multiple p-values originating from the peptides mapped to a protein are used to perform a meta-analysis to help determine whether that protein can be considered differentially abundant. This analysis is the determination of Stouffer’s Z-score [23] for the data at hand, denoted by Z and given as a function of the various peptide p-values as

$$ Z=\frac{\sum \limits_{i=1}^k{w}_i{Z}_i}{\sqrt{\sum \limits_{i=1}^k{w}_i^2}}. $$

In this expression, k is the number of peptides; $Z_i = \Phi^{-1}(1 - p_i)$, where $\Phi$ is the standard normal cumulative distribution function and $p_i$ is the i-th peptide’s p-value; $w_i$ is the square root of the count of individuals in which that peptide was identified.
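As a minimal sketch of this roll-up, the weighted Stouffer score can be computed with Python's standard library. The function names are hypothetical and this is not PA's code. Two details are assumptions: p-values are clamped away from exactly 0 and 1 so that the inverse normal stays finite, and the combined p-value is taken as 1 − Φ(Z) so that it lies on the same scale as the peptide p-values.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def stouffer_z(p_values, patient_counts, eps=1e-12):
    """Weighted Stouffer Z; each weight is sqrt(patients with that peptide)."""
    weights = [sqrt(n) for n in patient_counts]
    # Assumption: clamp p-values so inv_cdf never receives exactly 0 or 1.
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - min(max(p, eps), 1.0 - eps))
                for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    return numerator / sqrt(sum(w * w for w in weights))


def stouffer_p(p_values, patient_counts):
    """Combined p-value (assumed to be the upper tail of Z)."""
    return 1.0 - _STD_NORMAL.cdf(stouffer_z(p_values, patient_counts))
```

With a single peptide the combined p-value equals that peptide's p-value, and several concordant peptides yield a more extreme Z than any one of them alone.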

Average fold-changes were calculated considering the logarithms of the ratios to the base 2, allowing for symmetry in the expression rates of more (positive values, Table 2) and less (negative values) abundant proteins in cancer.
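The symmetry argument can be made concrete with a short sketch. The function name and the ratios are hypothetical, and only numeric cancer:control ratios are averaged here, since the handling of “+” and “–” entries is not specified above.

```python
from math import log2


def mean_log2_fold_change(ratios):
    """Average of log2(cancer:control) over the numeric ratios supplied."""
    logs = [log2(r) for r in ratios]
    return sum(logs) / len(logs)


# A two-fold increase and a two-fold decrease cancel out on the log2 scale,
# whereas their arithmetic mean (1.25) would suggest a net increase.
balanced = mean_log2_fold_change([2.0, 0.5])   # 0.0
increased = mean_log2_fold_change([4.0, 2.0])  # 1.5
```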

The differentially abundant proteins were categorized in pathways according to the Reactome v60 (https://www.reactome.org/) database. The distribution of those proteins was plotted in a PatternLab graph showing the mean normalized parent ion intensity abundance factor (NIAF) [24] of each protein identified in the NAF samples of the ten patients.

Selected reaction monitoring (SRM)

From the list of differentially abundant proteins, 12 related to the glycolysis, complement cascade and platelet activation pathways were selected for further validation. The spectral library was built from the shotgun analysis described in the “Mass spectrometry data acquisition” and “Data analysis” sections and loaded into Skyline (https://skyline.ms/project/home/software/Skyline/begin.view, version 4.1). A total of 87 transitions were selected for SRM according to the following criteria: peptide uniqueness in the human genome; presence in the spectral library with relatively high signal intensity; no ragged ends (KK, RR, KR or RK); minimum and maximum lengths of 8 and 25 amino acids, respectively; only “y” ion types. Six pairs of samples were prepared as described above and the desalted tryptic peptide mixtures were quantified by the Pierce Quantitative Colorimetric Peptide Assay (ThermoFisher Scientific, USA). A total of 0.5 μg of peptides for each sample, spiked with 32 fmol of Pierce Retention Time Calibration Mixture (ThermoFisher Scientific, USA) in 1% formic acid (FA), was loaded onto a 2 cm precolumn of 75 μm i.d. with 3 μm silica particles and 100 Å pore size (Acclaim PepMap™ 100, Thermo) in 12 μL of 0.1% (v/v) FA and 5% (v/v) acetonitrile in water, using an EASY II (Proxeon, USA). Then, separation was performed at 320 nL/min in a PicoChip column, 75 μm i.d. × 15 μm tip × 10.5 cm of H354 ReproSil-Pur C18-AQ 120 Å (New Objective, USA) using an elution gradient of 5 to 45% of 0.1% (v/v) FA and 5% (v/v) water in acetonitrile over 40 min followed by 45–95% over 10 min. The nLC was coupled to a TSQ Quantiva mass spectrometer (ThermoFisher Scientific, Germany). The spray voltage was set to 2.6 kV with a capillary temperature of 280 °C; the 60 min acquisition was done with a 2 s cycle time, 0.7 Q1 and Q3 resolution (FWHM 508.2 m/z), 1.5 mTorr for collision induced dissociation (CID) fragmentation, and collision energy adjusted according to the theoretical equation of this mass spectrometer.
After manual refinement of each transition for each sample, the areas of 48 transitions, which refer to 9 proteins, were exported from Skyline and imported into the Paired Analyzer tool for statistical analysis as described above.
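Two of the peptide selection criteria listed above (length between 8 and 25 residues and no ragged ends) lend themselves to a short sketch. This is not Skyline's implementation: the function name and the peptide sequences are hypothetical, and reading “ragged ends” as a dibasic motif in the first or last two residues of the peptide is an assumption, since a stricter check would also inspect the flanking residues in the protein sequence.

```python
RAGGED_ENDS = ("KK", "RR", "KR", "RK")


def is_srm_candidate(peptide: str, min_len: int = 8, max_len: int = 25) -> bool:
    """Apply the length and ragged-end criteria to one peptide sequence."""
    if not (min_len <= len(peptide) <= max_len):
        return False
    # Assumption: a ragged end is a dibasic motif at either peptide terminus.
    return peptide[:2] not in RAGGED_ENDS and peptide[-2:] not in RAGGED_ENDS


# Hypothetical candidate sequences.
candidates = ["LVNELTEFAK", "AEFVEVTKK", "PEPTIDE"]
selected = [p for p in candidates if is_srm_candidate(p)]
```

Only the first sequence passes: the second ends in KK and the third is shorter than 8 residues.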

Results

PatternLab’s TFold comparison of the mean values of protein abundance between the cancerous group versus the non-cancerous one showed no protein as being differentially abundant (Fig. 1).

Fig. 1

PatternLab’s TFold pairwise analysis of the two biological conditions. Each dot represents a protein mapped according to its log2 (fold-change) as the ordinate and its -log2 (t-test p-value) as the abscissa. White dots indicate proteins that do not satisfy either the fold-change cutoff or the FDR cutoff α (0.05). Grey dots depict protein entries that satisfy the fold-change cutoff but not FDR α. Dashed dots indicate proteins that satisfy both fold-change and FDR α, but present low fold-changes. Dots filled with vertical lines would represent protein entries that satisfy all statistical filters. Since no dashed or vertical-line-filled dots are visible, no protein was considered differentially abundant between the biological conditions.

The shotgun approach disclosed a total of 1227 protein entries (Additional file 1: Table S1), of which 87 proteins (Table 3) were differentially abundant between cancerous and non-diseased breasts from unilateral breast cancer patients, according to our paired statistical approach. All 87 differentially abundant proteins were quantified with more than 6 peptides and, except for three immunoglobulin forms (Ig heavy constant gamma 2, Ig kappa variable 3–20, and Ig heavy variable 2–5), are included in the Plasma Proteome Database (http://www.plasmaproteomedatabase.org/). Nine differentially abundant proteins were detected at lower levels in NAF samples originating from the cancerous breast (Stouffer p-values ≥ 0.975).

Table 3 List of 87 proteins found as differentially abundant after paired comparison of NAF samples from breast cancer patients

We also performed a differential analysis between samples from the right and left breasts of three women without breast disease (Table 1, individuals 1–3). From the list of 578 statistically evaluated proteins (Additional file 2: Table S2), Ig heavy constant alpha-1 (Stouffer’s p-value = 0.0054), alpha-1-antichymotrypsin (Stouffer’s p-value = 0.9888), and alpha-1-antitrypsin (Stouffer’s p-value = 0.9943) were flagged as differentially abundant. Since alpha-1-antichymotrypsin and alpha-1-antitrypsin were also found in the comparison between the breasts of cancer patients, they were excluded from the following analyses (Additional file 1: Table S1). Our motivation was to reduce the chance of false positive identifications as, in principle, there should be no reason for differentially abundant proteins between NAF samples originating from normal right and left breasts.

Among the 87 differentially abundant proteins observed between cancerous and non-diseased paired breasts, it is worth mentioning the frequent identification of proteins associated with the glycolysis pathway, the complement cascade and the platelet activation/degranulation systems (Additional file 3: Table S3). Furthermore, taking as reference the average NIAF value plotted for each protein ordered by abundance, the 87 differentially abundant proteins were among the more abundant ones (Fig. 2).

Fig. 2

Graph showing the (−1*10^7) * Log of the average of the normalized ion abundance factor (NIAF) of all the proteins identified in the NAF. Each protein was given a number as an identifier, and the abscissa represents these numbers in descending order of abundance. Gray, black, and white dots represent proteins with no differential abundance, proteins more abundant in the contralateral non-diseased breasts, and proteins more abundant in the breasts with cancer, respectively.

We also performed an additional differential analysis between NAF samples from a subgroup of patients bearing estrogen receptor-positive tumors (ERpos) (Table 1). Of the 873 statistically evaluated proteins (Additional file 4: Table S4), 14 proteins (Table 4) were classified as differentially abundant; 10 of them (alpha-1B-glycoprotein, ceruloplasmin, alpha-2-macroglobulin, serotransferrin, immunoglobulin heavy constant mu, alpha-1-acid glycoprotein 1, ferritin heavy chain, protein S100-A8, and serum albumin) were also found as differentially abundant in the total set of breast cancer patients. All differentially abundant proteins found in both datasets, ERpos cancer NAF and total cancer NAF, presented the same abundance tendencies. Among the proteins found as more abundant in ERpos samples, representatives of protein metabolism and the platelet degranulation system were frequently identified (Additional file 3: Table S3).

Table 4 List of 14 non-redundant proteins by maximum parsimony criteria found as differentially abundant after paired comparison of NAF samples from ER positive breast cancer patients

To validate the results obtained from the cancer versus non-diseased comparison, 12 differentially abundant proteins were selected according to their overall high MS signal and their presence in the well-represented Reactome pathways described here. Selected Reaction Monitoring (SRM) targeted 4 proteins of glycolysis (pyruvate kinase, glyceraldehyde-3-phosphate dehydrogenase, triosephosphate isomerase, and fructose-bisphosphate aldolase A), 4 proteins of the complement cascade (complement C5, complement C3, complement factor B, and complement factor H), and 4 proteins of platelet activation and signaling (alpha-2-macroglobulin, apolipoprotein A-I, fibronectin, and annexin A5). Of the initial 87 transitions, 48 were successfully monitored with a CV lower than 15% among replicates, after normalization using global standards. The normalized areas were statistically analyzed by our paired setup, and the higher abundance in cancer samples was confirmed for pyruvate kinase, alpha-2-macroglobulin, and complement factor B (p-value < 0.05). Although fructose-bisphosphate aldolase A, complement C5, complement C3, complement factor H, apolipoprotein A-I, and annexin A5 did not reach p-values below this threshold, their fold changes were consistent with a higher abundance in cancer samples (Additional file 5: Table S5).
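The replicate CV criterion can be illustrated with a minimal sketch. The transition names and peak areas below are hypothetical, and the use of the sample standard deviation is an assumption, since the exact CV definition applied in the study is not stated here.

```python
from statistics import mean, stdev


def coefficient_of_variation(areas):
    """Percent CV: sample standard deviation divided by the mean."""
    return stdev(areas) / mean(areas) * 100.0


# Hypothetical normalized peak areas of two transitions across replicates.
replicate_areas = {
    "transition_1": [100.0, 110.0, 90.0],
    "transition_2": [100.0, 200.0, 50.0],
}
accepted = [name for name, areas in replicate_areas.items()
            if coefficient_of_variation(areas) < 15.0]
```

Here the first transition has a CV of 10% and is accepted, while the second varies far more than 15% and is rejected.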

Discussion

Differential analysis performed with individually paired NAF samples from unilateral breast cancer patients (using the contralateral non-diseased breast sample as negative control) is a powerful strategy for discriminating which proteins are related to the disease, as it helps overcome the challenge of individual heterogeneity observed between patients [7, 25]. We applied PatternLab’s TFold analysis as a representative of a widely adopted proteomic approach to assess how such methods perform on datasets with high biological variation; no proteins were found as differentially abundant. Thus, to quantitatively analyze proteins by label-free shotgun proteomics, individual pair-by-pair analysis was performed by the new Paired Analyzer tool of the PatternLab for Proteomics software [26]. As mentioned above, our software performs a quantitative peptide-centric approach that relies on the binomial distribution to attribute a p-value to each peptide as being related to the disease or not. These p-values are then rolled up to the protein level as a Stouffer’s Z-score, a widely adopted meta-analysis procedure that combines independent statistical tests bearing upon the same hypothesis into a single score. The tool also allows quickly verifying whether individual peptides belonging to the same protein followed the same trend in differential abundance across the breast cancer NAF samples (Additional file 6: Graphical abstract).

Our shotgun approach revealed proteins presenting higher abundances in breast cancer samples that were previously known as related to cancer progression. Among these proteins are members of the glycolysis pathway, components of the platelet activation/degranulation systems, and proteins associated with the complement cascade [27]. By SRM, at least 2 proteins per pathway corroborated the higher abundance in cancer samples. Increased levels of glycolytic enzymes have been previously related to higher glucose consumption, oncogene activation and loss of function of tumor suppressor genes, promotion of metastasis, angiogenesis stimulation, chemotherapy resistance, and immune evasion [28, 29]. Another known pathway related to tumor progression is the complement cascade that mediates the innate immune system activation, resulting in inflammatory cell and fibroblast recruitment to the tumor microenvironment, which sustains the extracellular matrix remodeling and, consequently, supports cancer progression [30, 31].

In this work we were able to identify 20 representatives of platelet degranulation, activation, signaling or aggregation as more abundant in NAF cancer samples, showing a typically coagulant tumor microenvironment. Several studies report that cancer progression and metastasis (specifically angiogenesis promotion, apoptosis suppression, and extracellular matrix degradation) can be supported by elements of the hemostatic system, such as platelets, coagulation, and fibrinolysis [32]. Therefore, our approach seems to be suitable for more detailed analysis of coagulation cascade proteins, eventually providing further information on its mechanistic relation with breast cancer progression.

Although proteins related to glycolysis and platelet function are commonly found in cancer differential proteomic data [32, 33, 34], their putative roles as moonlighting proteins [35] have been largely overlooked. According to the MultitaskProtDB [36], glucose-6-phosphate isomerase, triosephosphate isomerase, and phosphoglycerate kinase 1, glycolytic proteins that we found to be more abundant in the cancer samples, may show moonlighting functions in differentiation/stimulation of cell migration (as a cytokine or a growth factor) [37], thrombosis/homeostasis [38], and angiogenesis (as a disulphide reductase) [39], respectively. Additionally, peptidyl-prolyl cis-trans isomerase, an enzyme related to platelet degranulation that was found to be more abundant in cancer NAF, presents a proinflammatory cytokine function when located in the extracellular space [40]. Interestingly, our comparative evaluation of the paired breast secretion in unilateral cancer cases showed the presence of these cancer-related proteins extracellularly, which may add important new information to the understanding of the human functional proteome.

Characteristically, ERpos breast tumors are a well-differentiated type of cancer and present better treatment response and overall survival [41]. This group of samples showed increased levels of proteins related to the regulation of IGF transport and to the platelet degranulation system. Although some findings about the role of the IGF system in breast cancer are conflicting, many components of this system are known to be altered during breast cancer establishment and progression regardless of the expression patterns of receptors (ER, PR and HER2) [42]. Overall, proliferation mechanisms that are present in these tumors were observed in this work.

Conclusions

In NAF cancer samples, the higher abundances of proteins involved in cell-stroma communication, glycolysis (Warburg effect), and immune system activation (to maintain a stimulated stroma) corroborate previous breast cancer data from the literature. Additionally, this paired comparative proteomic analysis strategy provides valuable information on the mechanisms described above that are known to be related to the disease, even with the inter-individual heterogeneity characteristic of NAF samples. Although we performed an SRM experiment and confirmed the higher abundance of 3 proteins in cancer samples, further verification of the higher levels of glycolytic enzymes, complement components, and platelet activators in a larger cohort (> 20 NAF paired samples per cancer subtype, throughout the entire pathways) using the targeted proteomic strategy may contribute to new advances in breast cancer evaluation. Taken together, these results demonstrate that protein analysis of NAF, an easily obtained clinical sample, could become a pillar of precision medicine, guiding a protein-based prognosis.