Background

Breast cancer is worldwide the second most common type of cancer after lung cancer. According to the American Cancer Society, approximately 192,370 women in the US will be diagnosed with breast cancer in 2010, and about 40,170 women will die from the disease.

Molecular biomarkers have become increasingly important clinical tools for cancer screening, diagnosis, treatment customizations. There has been an increasing number of research reports on developing breast cancer biomarkers, especially in blood[1]. Many molecular biomarkers with expression level changes have been identified in breast cancer tissue samples or blood, for example, HER2[2], PNCA[3], Lipofilin B[4], Cyclin D1[5], CEACAM6[6], Osteopontin-c[7], RCP[8], and FOXA1[9].

Most current breast cancer biomarker identification is achieved using functional genomics studies of established breast cancer cell lines[10][11][12][13][14]. Cell lines are widely used in many aspects of laboratory research and particularly as in vitro models in cancer research. They have a number of advantages, including being easy to access and offering “clean” results with statistically significant signals. However, human systems are quite complex[15], and many candidate biomarkers discovered in cell lines do not readily transfer to human tissues or blood, in which clinical testing will be performed. Therefore, profiling human plasma using proteomics techniques offers an appealing alternative to cell lines or tissue biospecimens in developing protein biomarkers[16], although the debate over this issue is heated[17].

The question whether protein biomarker identified in blood can be valuable rests primarily on our ability to address the complexity associated with the human plasma proteome. The inherent presence of measurement noise, inconsistencies due to individual differences, and sample biases of the plasma proteomics approach have been reported[18]. However, our recent studies also showed, by collecting plasma proteomics into a common proteomics data repository, the HIP2 database[19], we could start to reduce the perceived coverage biases for plasma proteomics, and explore a promising goldmine of candidate cancer biomarkers and drug targets[20]. In addition, bioinformatics and systems biology techniques can help reduce this complexity significantly. For example, one can use plasma proteomics to derive breast cancer candidate protein markers and then use gene expression mapping to validate candidate protein biomarkers that are known to be secreted. One can also use advanced visualization or network biology techniques such as[21][22]to model and monitor global patterns of changes in proteomics, instead of candidate biomarkers at the individual protein level[23].

In this paper, we adopted a systems biology approach to the study of panel protein biomarker discovery in breast cancer using plasma. For polygenic diseases such as breast cancer and a complex detection platform such as human blood, we recognize that a single protein biomarker approach using “expressions” will not suffice for the high performance requirement of breast cancer screening and diagnosis. Therefore, by enlisting multiple proteins as analytes that are mechanistically linked to breast cancer pathways or functional networks, we believe that the chance of success would be higher than the simpler conventional single-marker approach.

Our computational analysis involves several steps. First, we used a t-statistics and permutation procedure to identify protein biomarker candidates that are significantly differentially detected among different individual plasma samples between the case and the control for breast cancer. Second, we performed an extensive literature curation to determine the constituents of the plasma protein biomarker panel. Third, we performed gene ontology analysis and pathway analysis to validate the list to reveal the intricate breast cancer related molecular mechanism that exists among the protein biomarkers on the panel. Fourth and last, we validated our results using two independent publicly-available breast cancer proteomics datasets derived from other groups. Our results showed that such an integrative systems approach is essential to development and validation of panel protein biomarkers that may withstand rigorous testing for the future steps.

Results

Normality test

The plasma proteome profiling dataset in Study A contains 4832 peptides, two states (health and breast cancer) and 40 samples each state. Q-Q plot and one sample Kolmogorov-Smirnov test in Figure 1 showed that the log2 transformation intensity values for all 4832 peptides from 40 health women are not normally distributed (One-sample Kolmogorov-Smirnov test, D = 0.0419, p-value < 2.2e-16). We also found the intensity values from 40 women diagnosed with cancer in Study A, women diagnosed with cancer from Study B and C, and healthy women form Study B and C are likewise not normally distributed.

Figure 1
figure 1

Q-Q Plot for Data from 40 Health women in Study A

Pathway analysis and gene ontology categorization of significant proteins

4832 peptides in the Study A were mapped to 1422 proteins by searching against the IPI database. Using a t statistics and permutation process described in the Methods section and setting a p-value cut-off of 0.001 after initial ANOVA analysis of mass spectra data, we identified 254 statistically significant differentially expressed proteins (PFER=1.422, FDR=0.0056), among which 208 are over-expressed and 46 are under-expressed in breast cancer plasma. Compared to the results of traditional statistical testing (PFER=2.5596, FDR=0.01), our results show that the coupled statistical process outperforms the sensitivity of a parametric traditional statistical test that requires strong and sometimes untenable data assumptions since it is non-parametric and requires no assumption about the distribution under the null hypothesis.

A comparison of the set of 254 proteins with published findings from proteomic analysis of human breast cancer cell lines yielded 25 differentially expressed proteins in common. Top networks were identified by using Ingenuity Pathway Analysis (Table 1, and Figure 2). Figures 3a and 3b quantified the biological significance of gene ontology biological processes

Table 1 Top Networks Involved
Figure 2
figure 2

The 25 Proteins Are Involved in an Endocrine System Disorders Network. Red stands for over-expressed and green for under-expressed

Figure 3
figure 3

Gene Ontology Biological Processes Enrichment Analysis for 25 Protein Biomarkers.

Cross-validation of candidate biomarkers

In order to validate our computational results, the same methods and procedures we used in Study A were applied to Studies B and C. As shown in the Venn diagram (Figure 4), 20 candidate protein biomarkers were identified in Study B, of which 13 were found in common with study A, and 25 candidate protein biomarkers were identified in Study C, of which 13 were found in common with study A. The similarity measurements with the protein method are 40% for biomarker sets from Study A and Study B and 35% for biomarker sets from Study A and Study C.

Figure 4
figure 4

Venn Diagram of Markers Venn diagram indicating the degree of overlap among the markers identified from three data sets: Study A, B and C.

The pathway-protein association matrix is shown in the Additional file 1. The top three pathways: ‘Complement and coagulation cascades’, ‘Regulation of actin cytoskeleton’, and ‘Focal adhesion’ are ranked top in all three studies (Figure 5). The similarity measurements with the pathway-protein matrix method are 86% for biomarker sets from Study A and Study B and 83% for biomarker sets from Study A and Study C.

Figure 5
figure 5

Heatmap of Pathway-Protein Frequency Count Matrix Vertical axis represents pathway and horizontal axis represents studies. Each cell corresponds to the count of represented proteins in pathway for the corresponding Study.

Using Ingenuity Pathway Analysis and DAVID GO analysis, we also found that biomarkers identified from the Study B and Study C have similar network and function as the 25 candidate protein biomarkers identified from study A.

From the cross-validation offered by our ‘systems biology’ approach, we found similar pathway, network and function in those biomarkers identified from the three studies. The systems biology approach helps to unravel the intricate pathways, networks and functional contexts in which genes or proteins function and is essential to the understanding molecular mechanisms of panel protein biomarkers.

Discussion

Permutation test

Most of protein identification methods were based on fold change. Fold change is easy to use and implement, fast and works with any number of arrays, but it does not take the variability of a protein into account and offers no significance measure of p-value. In this paper, we used a t-statistics to calculate the initial p-value that can takes the variance into account.

Theoretically, the t-test can be used even if the sample sizes are very small (e.g., as small as 10; some researchers claim that even smaller samples are manageable), as long as the variables are normally distributed within each group and the variation of scores in the two groups is not reliably different. If populations from which data to be analyzed by a t-test were sampled violate one or more of the t-test assumptions, the results from the analysis may be incorrect or misleading. For example, if the assumption of independence is violated, then the two-sample unpaired t-test is simply not appropriate. If the assumption of normality for the t-test is violated, or outliers are present, then the t-test may not be the most powerful test available. However, our plasma protein profiling shows no normal distribution (Figure 1). In order to detect a true difference between health and control samples, a permutation process was used, and should be reliable regardless of whether or not the sample distribution is known.

Our results showed that the permutation test was very similar to the t-test in its p-value estimate. The t-test is a parametric test and the permutation process is non-parametric. By using the permutation test we made no assumption about the distribution under the null hypothesis. Usually, the assumptions in the null hypothesis are weakened, and it becomes harder to reject. The permutation process rivals the sensitivity of a parametric t-test assuming equal variances.

To compare the power of the t-test and permutation test, that is, how likely they are to reject the null hypothesis when an alternate hypothesis is true, we assume specific distributions for the alternate hypothesis. For the t-test, the most natural alternate hypothesis is that the two samples are from different normal distributions. For large samples, it has been shown that the power of the permutation process using the difference in sample means is equal to the t-test for normally distributed alternates.

We used quantitative method to compare the results of ttperm function in the Category package of the R language using the method described in the method section. All 16 biomarkers identified using ttperm function are among the 25 panel biomarkers identified by our permutation test method in Study A. 13 of them are identified in Study C. Comparing them with the 20 candidate protein biomarkers we identified from study B, there are only 7 in common. But the remaining 9 proteins not identified by our permutation method are also not chosen as candidate protein biomarkers by previous findings using breast cell lines [10][13][14][24]. Gene ontology analysis using the 16 biomarkers showed that response to external stimulus was annotated, but cellular metabolic process was not, and acute inflammatory response was annotated but proteolysis not. The cellular metabolic process (proteolysis) annotated by our permutation method was reported to be related to cancer progression[11]. All the results show that our permutation test method is highly robust to the equality of the variances, regardless of whether the same sizes are similar and carries more conviction than the other permutation test which doesn’t consider the affect of variance inequality.

Candidate protein biomarkers identified

A total of 416 peptides were identified from analyzing the plasma protein profiling from 40 women diagnosed with breast cancer and from 40 healthy women using the permutation test, corresponding to 254 unique UNIPROT protein names. A comparison of the 254 proteins with previously published findings from proteomic analysis of human breast cancer cell lines yielded 25 candidate protein biomarkers. The 25 proteins were categorized by their interactive pathway, network and annotated biological process on Gene Ontology.

An interesting finding from Pathway Analysis is that those top networks shown in the Table 1 and Figure 1, especially the top 1 network (Endocrine System Disorders, Metabolic Disease, Antigen Presentation) and top 3 pathways (‘Complement and coagulation cascades’, ‘Regulation of actin cytoskeleton’, and ‘Focal adhesion’), are validated by our B and C dataset results, and are similar to previously reported works[25][26][27][28]. For example, Ana-Teresa et al. studied 12 candidate genes that are implicated in the etiology of breast cancer and found these genes are functionally involved in complement and coagulation cascades pathway[29]. Carol et al. reported that the cell migration in breast cancer lines can also be regulated by actin cytoskeleton dynamics[30]. And Michael et al. reported that increased focal adhesion kinase expression correlates with TGF-β1-mediated activation of p38 MAPK in metastatic human breast cancer cells and concluded that focal adhesion is essential in mediating oncogenic signaling by transforming growth factor (TGF)-β[31].

Another interesting finding from our Gene Ontology work is the role of cellular metabolic process and response to external stimulus (especially proteolysis and acute inflammatory response) in Figure 3a and 3b in breast cancer was also reported by other authors. For example, cancer, like other diseases, is accompanied by strong metabolic disorders[11]. And It was also reported that stress and external stimulus such as microbial infections, ultraviolet radiation, and chemical stress from heavy metals and pesticides affect the progression of breast cancer[32].

Conclusions

254 statistically significantly differentially expressed proteins between 40 healthy women and 40 women diagnosed with breast cancer were identified from initial LC-MS/MS experiments using a t statistics and permutation process which is useful in independent two-sample hypothesis testing. Top breast cancer activated networks and pathways were identified through systems biology approach. 25 candidate protein biomarkers were validated from the pathway/network analysis, literature curation from previous published findings in breast cell lines, and two additional studies. Gene ontology analysis confirmed that cellular metabolic process and response to external stimulus (especially proteolysis and acute inflammatory response) were enriched in the 25 protein biomarker panel. Pathway analysis identified three top enriched pathways: ‘Complement and coagulation cascades’, ‘Regulation of actin cytoskeleton’, and ‘Focal adhesion’. Our approach integrating computing, basic biomedical research, and clinical applications promises to be able to “translate” between scientific innovations and clinical diagnostic needs for breast cancer. Assay Development and Clinical Trials for Panel Biomarker from breast cancer patients are needed to assess which of the identified proteins may have diagnostic utility.

Methods

Biomarker identification and characterization holds great promise for more precise diagnoses and for tailored therapies. The heterogeneity of human cancers and unmet medical needs in these diseases provides a compelling argument to focus biomarker development in cancer. Mass spectrometry based proteomics approaches have provided insight into biomarkers of cancer and other diseases with femtomole sensitivity and high analytical precision. The schema of methods in this paper is shown in the following Figure 6.

Figure 6
figure 6

Schema of Methods First, proteins were prepared and subjected to LC/MS/MS analysis. Next, a permutation procedure was used to determine the p-value. Next, 4 previously published proteomic studies of breast cancer cell lines were used for comparison. Then, top networks and pathways were identified with Ingenuity Pathways analysis, KEGG and HPD pathway databases. And Level 2 and 5 of biological process in gene ontology are mainly studied. And then, two testing cancer dataset were used to validate the result. Last, assay development and clinical trials for panel biomarkers are planned for the future.

First, proteins were prepared and subjected to LC/MS/MS analysis. Samples were run on a Surveyor HPLC (ThermoFinnigan) with a C18 microbore column (Zorbax 300SBC18, 1 mm × 5 cm). All tryptic peptides (100 μL or 20 μg) were injected onto the column in random order. Peptides were eluted with a linear gradient from 5% to 45% acetonitrile developed over 120 min at a flow rate of 50 μL/min, eluant was introduced into a ThermoFinnigan LTQ linear ion-trap mass spectrometer. The data were collected in the “triple-play” mode (MS scan, Zoom scan, and MS/MS scan). Next, a permutation procedure was used to determine the p-value. The 80 samples for each peptide were permuted 100,000 times and the complete set of t-tests between the first 40 values and the last 40 values, was performed for each permutation. Next, 4 previously published proteomic studies of breast cancer cell lines were used for comparison. Then, top networks and pathways were identified with Ingenuity Pathways analysis, KEGG and HPD pathway databases. And Level 2 and 5 of biological process in gene ontology were mainly studied. And then, two testing cancer dataset were used to validate the result. Last, assay development and clinical trials for panel biomarkers are planned for the future.

Database

Plasma protein profiles were collected in three batches, which we refer to as Studies A, B and C. All 3 of these studies were processed in the same laboratory but at different times. Each sample was analyzed in a single batch by mass spectrometry. In both Studies A and B, 80 plasma samples were collected (40 samples collected from women with breast cancer and 40 from healthy volunteer woman who served as controls). In Study C, 40 plasma samples were collected (20 samples collected from women with breast cancer and 20 from healthy volunteer woman who served as controls). The demography and clinical distribution of breast cancer stages/subtypes for study A, B and C are comparable, although the total sample number of Study C is somewhat smaller than Study A and B.

We compared our results with 4 previously published proteomic studies of breast cancer cell lines. Their methods and results presented in peer reviewed journals [10, 12-14] have established a high reliability. A total of 3085 protein biomarkers were identified from the five breast cell lines, MCF-10A, BT474, MDA-MB-468, MD-MB-468, and T47D/MCF7 in their papers.

Protein identification and quantification

For protein identification, Tryptic peptides were analyzed using Thermo-Finnigan linear ios-trap mass spectrometer (LTQ) coupled with a HPLC system. Peptides were eluted with a gradient from 5 to 45% Acetonitrile developed over 120 minutes and data were collected in the triple-play mode (MS Scan, zoom scan, and MS/MS scan). The acquired raw peak list data were generated by XCalibur (version 2.0) using default parameters and further analyzed by the label-free identification and quantitative algorithm using default parameters described by Higgs et al[33]. MS database searches were performed against the combined protein data set from International Protein Index (IPI; version 3.60) and the non-redundant NCBI-nr human protein database (updated 2009), which totaled 22,180 protein records. Carious data processing filters for protein identification were applied to control false-discovery rate at below 5% levels.

For protein quantification, first, all extracted ion chromatograms (XICs) were aligned by retention time. Each aligned peak were matched by precursor ion, charge state, fragment ions from MS/MS data, and retention time within a one-minute window. Then, after alignment, the area-under-the-curve (AUC) for each individually aligned peak from each sample was measured, normalized, and compared for relative abundance—all as described in[33]. Here, a linear mixed model generalized from individual ANOVA (Analysis of Variance) was used to quantify protein intensities. In principle, the linear mixed model considers three types of effects when deriving protein intensities based on weighted average of quantile-normalized peptide intensities: 1) group effect, which refers to the fixed non-random effects caused by the experimental conditions or treatments that are being compared; 2) sample effect, which refers to the random effects (including those arising from sample preparations) from individual biological samples within a group; 3) replicate effect, which refers to the random effects from replicate injections from the same sample preparation.

“Systems biology” analysis

We applied a “systems biology” approach to the study of panel biomarker discovery problem in breast cancer proteomics data study in this study. Our strategies for analyzing potentially noisy proteomics data set are three-fold. First, we used a t-statistics and permutation procedures to calculate p-value for proteins changed in all samples, instead of fold change or t-test for a given sample that were commonly used in previous studies. This allowed us to better filter the proteomics results. Second, we used extensive literature curation to focus only on breast cancer relevant differentially expressed proteins. This literature curation step allowed us to concentrate on breast cancer relevant signals, with generally noisy proteomics data sets. Third, we used gene ontology analysis and pathway analysis to identify and validate correlated changes due to cancer cell signaling that may, individually, elude the detection.

T statistics and permutation process

Our test statistic for study A and B is a mean of 40 values (protein intensities in health samples) minus the mean of another 40 values (protein intensities in cancer samples). For study C, the total sample number is 40 with 20 healthy samples and 20 cancer samples. A permutation procedure was used to determine the p-value for each protein, representing the chance of observing a test statistic at least as large as the value actually obtained. The 80 samples for each protein in Study A and B (40 samples for Study C) were permuted 100000 times and the complete set of t-tests was performed for each permutation. The permutation p-value for a particular protein is the proportion of the permutations in which the permuted test statistic exceeds the observed test statistic in absolute values. We chose a significance level α=0.001 to select proteins where we estimated significant differences in the health and cancer sampled. The corresponding “per-family Type 1 error rate, PFER”, that is, the expected number of false positives for such a multiple test procedure is PFER = number of genes x 0.001. Alternatively, the nominal “false discovery rate, FDR”, or expected proportion of false positive among the genes declared differentially expressed, is FDR = PFER/number of genes declared expressed.

Network and function annotation analysis

Ingenuity Pathway Analysis was used for building network. DAVID database was used to study level 2 and 5 of biological process in gene ontology. Fisher’s exact test is used to test the statistical significance for association between the gene list with expression changes and the function set[34].

Pathway-protein association matrix

Pathway comparison are performed using the following databases: Kyoto Encyclopedia of Genes and Genomes (http://www.genome.ad.jp/kegg/)[35]and HPD[36]. The visualization for the pathway-protein association matrix was created by Excel 2007 VBA.

Pathway-protein frequency count matrix

A pathway-protein frequency count matrix (PPFCM) contains pathway on vertical axis and studies represented on horizontal axis. Each cell of PPFCM corresponds to the count of represented proteins in pathway for the corresponding Study.

Biomarker set similarity

We presented two approaches to measure the similarity of biomarker sets: 1) protein method and 2) pathway-protein matrix method.

The biomarker set similarity measure with the protein method can be defined as the extent of overlaps, e.g., common number of biomarkers, shared between two different biomarker sets. The set-set similarity score S i,j is defined as

(1)

where, N denotes total number of biomarker sets. P i and P j denote two different biomarker sets, while |P i | and |P j | are the numbers of biomarkers in these two sets. Their intersection P i P j is the set of all biomarkers that appear in both P i and P j , while their union P i P j is a set of all biomarkers either appearing in the P i or in the P j . Duplicates are eliminated in the intersection set and union set.

The biomarker set similarity measure with the pathway-protein matrix method can be defined as the correction coefficient of the number of represented biomarkers in pathways for sets. The higher the correction coefficient is, the more similar the two sets. The equation is expressed as

(2)

where Corr is the Pearson correlation coefficient, Q i is the biomarker numbers in pathways for set i, N denotes total number of biomarker sets.