1 Introduction

Breast cancer is a complex biological system: tumors are shaped by gene aberrations, their cellular context, patient-specific characteristics, and environmental factors. Management of breast cancer relies on robust clinical and pathological prognostic and predictive factors to guide patient decision-making and the selection of treatment options [1]. Tumor size, indicated by the T-category, is a strong prognostic indicator for breast cancer, independent of lymph node status, and is one of the factors taken into account when deciding how and whether to treat a patient. Significantly better survival can be expected for tumors categorized as T1. It is common practice to distinguish the T1 (0.1 cm < size ≤ 2.0 cm) and T2 (2.0 cm < size ≤ 5.0 cm) groups by this 2-cm rule [1]. The T1-T2 distinction is reflected in prognosis: tumors in the T2 group are more aggressive and may have already spread.

Gene expression profiling has in the last decade entered the field of molecular classification. An array-based approach to characterizing T1 and T2 tumors was recently attempted, based on microarray data giving the expression level of each feature (gene or probe), and it revealed distinct molecular pathways characterizing each stage [2]. The differential expression (DE) of a feature is measured by a two-group comparison, for which several statistical methods have been proposed, such as t-statistics, significance analysis of microarrays (SAM), fold changes, and B-statistics [3]. However, DE measures obviously depend on the threshold chosen to distinguish T1 from T2 tumors. In fact, the study by Riis et al. [2] suggested that using T-size expression signatures instead of tumor size leads to a significant difference in the risk of distant metastases, and that the molecular signature can be used to select patients with tumor category T1 who may need more aggressive treatment and patients with tumor category T2 who may benefit less from it. To stratify breast cancer patients into two groups each requiring a different treatment, 'Cutoff Finder' was developed by [4]. The cutoff point is determined from the distribution of the marker under investigation by optimizing the correlation of the dichotomization with an outcome or survival variable. That method was designed for stratifications based on the expression of specific genes, estrogen receptor, and progesterone receptor, not on whole genomic regions or tumor size. In this article, we develop an algorithm to evaluate the traditional 2.0-cm threshold in the light of gene expression differences between breast cancer patients below and above the threshold. We use two measurements from meta-analysis theory, which is suited to handling multiple genetic studies that may involve different pre-processing techniques, platforms, and lab environments.
The choice of meta-analysis technique depends on the type of response and the objective. When the objective is to identify DE between two conditions, methods include vote counting and combining ranks, p values, or effect sizes [3]. Campain and Yang provided an intuitive measure, called meta differential expression via distance synthesis (mDEDS) [5], which uses DE via distance synthesis (DEDS) [6] to aggregate multiple DE measurements. The performance of mDEDS was compared with existing meta-analysis methods, such as Fisher's inverse chi-square, GeneMeta, metaArray, RankProd, and naïve meta-methods, using a simulation study and two case studies [3]. The results mostly showed better performance for mDEDS, while some cases favored Fisher's inverse chi-square [7], a simple procedure that combines the p values from independent datasets. Therefore, we apply both mDEDS and the Fisher score in our proposed algorithm to analyze different thresholds. To confirm the reliability of the proposed algorithm, we performed a simulation study. We then applied the algorithm to three expression datasets gathered at two Norwegian hospitals. To validate the optimum threshold estimated from the Norwegian datasets, we applied our algorithm to five publicly available expression datasets. Based on the estimated optimum threshold for the Norwegian datasets, we investigated the prognostic status from the viewpoint of local recurrence and the associated network and canonical pathways.

2 Method

Given \( i = 1, \cdots, I \) genes from \( k = 1, \cdots, K \) datasets, we use the two measures of comparison described below.

2.1 Fisher’s inverse chi-square statistic

Let \( p_{ik} \) denote the p value obtained by a DE statistic for the ith gene in the kth dataset. The Fisher summary statistic \( S_i \) [6] for each gene i is defined as

$$ {S}_i = -2 \sum_{k=1}^{K} \log\left({p}_{ik}\right) $$
(1)

This statistic tests the null hypothesis that gene i is not differentially expressed between the two groups in any of the K datasets. Under this null hypothesis, \( S_i \) is chi-square distributed with 2K degrees of freedom. In our case, the p value is calculated by the Wilcoxon-Mann-Whitney (WMW) test for each gene and each dataset.
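As a language-neutral illustration (the paper's own computations use Matlab and R), Eq. (1) can be sketched in a few lines of Python; the p values are assumed to have been computed elsewhere, e.g., by a WMW test:

```python
import math

def fisher_score(p_values):
    """Fisher summary statistic S_i = -2 * sum_k log(p_ik) for one gene.

    `p_values` holds the K per-dataset p values for that gene.  Under
    the null, S_i is chi-square distributed with 2K degrees of freedom.
    """
    return -2.0 * sum(math.log(p) for p in p_values)

# A gene with small p values in all K = 3 datasets scores much
# higher than one with uniform (null-like) p values.
s_de = fisher_score([0.001, 0.004, 0.002])
s_null = fisher_score([0.47, 0.61, 0.52])
```

Larger scores thus correspond to stronger combined evidence of differential expression across the K datasets.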

2.2 Differential expression via distance synthesis (DEDS)

Various statistics can be calculated to describe the differences in expression between the two groups, including the WMW test, t-statistics, and fold change (FC). DEDS integrates and summarizes these statistics using a weighted distance approach [6] for two-group comparisons: it measures the distance between the aggregated point and an extreme origin assumed to represent the largest value of every statistic. These procedures can be performed by the R package 'DEDS' (http://www.bioconductor.org/). In the procedure, the t-, SAM, FC, B-, moderated t-, and moderated F-statistics were selected as \( t_j \). Campain and Yang expanded DEDS to a meta-analysis method called mDEDS [5]. The analysis by mDEDS proceeds as follows:

1. Apply \( J \) appropriate statistics \( t_{ij} \), with \( 1 \le J \le 6 \), to each of the \( i = 1, \cdots, I \) genes. The observed coordinate-wise extreme point over all genes is \( E_0 = (\max_i(t_{i1}), \cdots, \max_i(t_{iJ})) \).
2. For each permuted dataset \( b = 1, \cdots, B \), obtain the permutation extreme point \( E_b \), and evaluate the coordinate-wise extreme point over all permutations, \( E_p = (\max_b(E_{b1}), \cdots, \max_b(E_{bJ})) \).
3. Obtain the overall maximum \( E = \max(E_0, E_p) \).
4. Calculate the distance \( d_i \) from each gene to \( E = (E_1, \cdots, E_J) \), defined by \( {d}_i = \sum_{j=1}^{J} \frac{\left({t}_{ij}-{E}_j\right)^2}{\mathrm{MAD}\left({t}_{ij}\right)^2} \), where MAD is the median absolute deviation from the median.
5. Perform steps 1-4 for all \( k = 1, \cdots, K \) studies and summarize the distances coordinate-wise.

The package outputs the list of estimated statistics and the distance for each dataset. To perform step 5, we summarize the obtained distances over all datasets and order them according to the genes.
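The distance computation of step 4 can be sketched in Python as follows. This is a simplified single-study illustration that omits the permutation step, so the extreme point is just the observed \( E_0 \); the toy input values are arbitrary:

```python
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def deds_distances(stats_matrix):
    """stats_matrix[i][j]: statistic j (e.g. t, SAM, FC, ...) for gene i.

    Returns the distance d_i of each gene to the coordinate-wise
    extreme point E, scaled by the MAD of each statistic (step 4).
    Assumes each column has nonzero MAD.
    """
    n_genes = len(stats_matrix)
    n_stats = len(stats_matrix[0])
    cols = [[stats_matrix[i][j] for i in range(n_genes)]
            for j in range(n_stats)]
    extreme = [max(col) for col in cols]   # E_0, coordinate-wise maxima
    mads = [mad(col) for col in cols]
    return [sum((row[j] - extreme[j]) ** 2 / mads[j] ** 2
                for j in range(n_stats))
            for row in stats_matrix]
```

A gene whose statistics all reach the extreme point has distance 0 and is ranked as the most differentially expressed.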

2.3 An extension to DEDS

The original mDEDS study [5] did not address how the extreme origin used to measure distances should be handled when the measurements change across different cohorts. The original DEDS procedure selects the larger of the original-data and permuted-data points as the extreme origin, without accounting for changes in this origin; in fact, both the extreme origin and the coordinate-wise extreme origin change when the dataset changes. When mDEDS is calculated for thresholds shifting at 0.1 intervals within the region from 1.5 to 3.5, the origin should also change accordingly: \( E_q = \max\left(E_0^{(q)}, E_p^{(q)}\right) \) for each \( q = 1.5, 1.6, \cdots, 3.5 \), where \( E_0^{(q)} \) and \( E_p^{(q)} \) denote the extreme points obtained from the original and permuted data, respectively. Therefore, we define the following extreme point, named the 'totally extreme point (TEP)': \( E_{\max} = \max(E_{1.5}, \cdots, E_q, \cdots, E_{3.5}) \) for \( q \in [1.5, 3.5] \).

Then the scaled distance for each gene across the K studies is \( {d}_i = \sum_{k=1}^{K} \sum_{j=1}^{6} \frac{\left({t}_{ikj}-{E}_{\max}\right)^2}{\mathrm{MAD}\left({t}_{ikj}\right)^2} \).
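Reading \( E_{\max} \) coordinate-wise (our assumption, since each per-threshold extreme point \( E_q \) is a vector over the J statistics), the TEP construction could be sketched as:

```python
def total_extreme_point(extremes_by_threshold):
    """'Totally extreme point' (TEP): coordinate-wise maximum of the
    per-threshold extreme points E_q, q in {1.5, 1.6, ..., 3.5}.

    `extremes_by_threshold` maps each candidate threshold q to its
    extreme point E_q = max(E_0^(q), E_p^(q)), given here as a list
    of J statistic values.
    """
    points = list(extremes_by_threshold.values())
    n_stats = len(points[0])
    return [max(p[j] for p in points) for j in range(n_stats)]
```

Using the TEP keeps the origin fixed while the threshold varies, so distances computed at different thresholds remain comparable.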

2.4 Estimation of optimal threshold q between T1 and T2

Our intention is to identify the optimal threshold for dividing the sample into two groups, such that it best distinguishes the differential expression patterns between these two groups. To identify this threshold, we define the following optimization problem for an optimal threshold \( q_0 \) within a set Q of candidate thresholds. Let \( S_i(q) \) be the Fisher score (1) applied to the two-group comparison with threshold q, i.e., \( {S}_i(q) = -2\sum_{k=1}^{K} \log\left({p}_{ik}(q)\right) \). Then

$$ {q}_{0\_\mathrm{Fisher}} = \underset{q \in Q}{\arg\max} \sum_{i=1}^{I} {S}_i(q) \quad \text{for Fisher's score} $$
(2)

and similarly for

$$ {q}_{0\_\mathrm{mDEDS}} = \underset{q \in Q}{\arg\min} \sum_{i=1}^{I} {d}_i(q) \quad \text{for mDEDS} $$
(3)

For the TEP introduced in Section 2.3, we take the sum of the distances over all genes and estimate the threshold that minimizes this value as

$$ {q}_{0\_\mathrm{TEP}} = \underset{q \in Q}{\arg\min} \sum_{i=1}^{I} {d}_i $$
(4)

This is motivated by the idea that we are looking for the threshold that best divides the two tumor groups from each other based on the genome-wide expression profiles.
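The grid search behind Eqs. (2)-(4) reduces to evaluating a genome-wide sum at each candidate threshold and taking the arg max or arg min. A minimal Python sketch, where the hypothetical `score_per_gene` callback stands in for the Fisher or mDEDS computation after splitting samples at threshold q:

```python
def optimal_threshold(thresholds, score_per_gene, maximize=True):
    """Grid search for q_0 over the candidate set Q.

    `score_per_gene(q)` returns the list of per-gene scores when the
    samples are split into 'small'/'large' groups at threshold q.
    Use maximize=True for Fisher's score (Eq. 2) and maximize=False
    for mDEDS/TEP distances (Eqs. 3 and 4).
    """
    totals = {q: sum(score_per_gene(q)) for q in thresholds}
    pick = max if maximize else min
    return pick(totals, key=totals.get)
```

In the paper, `thresholds` would be the 0.1-spaced grid (e.g., 1.5 to 3.5 cm), and each call to `score_per_gene` re-dichotomizes the samples by tumor size before computing the per-gene statistics.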

For each possible threshold q in Q, we evaluated the Fisher score and the mDEDS value. A flow chart of our proposed algorithm covering the above procedures is illustrated in Fig. 1. For the computations, we used Matlab® (The MathWorks, http://www.mathworks.com/products/matlab) for (1)-(4) and the R package DEDS.

Fig. 1
figure 1

Flow diagram of the analysis process

3 Simulation study

To confirm the accuracy of our proposed algorithm, we performed a simulation study. We considered three artificial datasets of 10,000 arrays each, named 'simdat1,' 'simdat2,' and 'simdat3.' We first generated artificial tumor sizes as uniform random numbers between 1.0 and 2.9 and between 3.0 and 5.0, so the border between small and large was set at 3.0. Simdat1 contains 55 small-sized and 45 large-sized samples, simdat2 contains 35 small-sized and 45 large-sized samples, and simdat3 contains 120 small-sized and 80 large-sized samples. Next, we generated artificial array data using random variables following different probability distributions to mimic the higher and lower expression levels of real data; these higher and lower expression levels apply to the large-sized samples. For simdat1, the higher expression levels of 3500 arrays were generated by a normal distribution with mean 10 and standard deviation 10 (denoted N(10, 10)) and the lower expression levels of 3500 arrays by N(−2, 10) for the 45 large samples; the remaining arrays, those classified as neither higher nor lower, were generated by N(3, 1) for all samples. For simdat2, the higher expression levels of 2000 arrays were generated by a gamma distribution with shape 5 and scale 10 (denoted Γ(5, 10)) and the lower expression levels of 4500 arrays by Γ(3, 6) for the 45 large samples; the remaining arrays were generated by N(0.5, 10). For simdat3, the higher expression levels of 2500 arrays were generated by a Poisson distribution with parameter 10 (denoted Pois(10)) and the lower expression levels of 3500 arrays by Pois(8) for the 80 large samples; the remaining arrays in the 120 small and 80 large samples were generated by N(0.1, 20). These three datasets are illustrated in Additional file 1: Figure S1.
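A small-scale sketch of a simdat1-style generator in pure Python (far fewer probes than the 10,000 used in the study; placing the 'higher'/'lower' blocks at the leading probe indices is an arbitrary choice for illustration):

```python
import random

def simulate_dataset(n_small, n_large, n_probes, n_high, n_low, seed=0):
    """Simdat1-style simulation: tumor sizes are uniform on
    [1.0, 2.9] (small) and [3.0, 5.0] (large), so the true threshold
    is 3.0.  For large tumors, `n_high` probes are drawn from
    N(10, 10) and `n_low` probes from N(-2, 10); all remaining probes
    (and every probe of a small tumor) come from N(3, 1).
    """
    rng = random.Random(seed)
    sizes = ([rng.uniform(1.0, 2.9) for _ in range(n_small)] +
             [rng.uniform(3.0, 5.0) for _ in range(n_large)])
    data = []
    for size in sizes:
        large = size >= 3.0
        row = []
        for probe in range(n_probes):
            if large and probe < n_high:
                row.append(rng.gauss(10, 10))      # higher expression
            elif large and probe < n_high + n_low:
                row.append(rng.gauss(-2, 10))      # lower expression
            else:
                row.append(rng.gauss(3, 1))        # background
        data.append(row)
    return sizes, data
```

Simdat2 and simdat3 follow the same pattern with `random.gammavariate` and a Poisson generator substituted for the expression-level draws.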
Using a grid with step 0.1 over the range from 1.5 to 3.5, we estimated the optimal q 0 satisfying Eqs. (2) and (3). Fisher's scores over this range are illustrated in Fig. 2. The left panel shows that the maximum was at 3.0; the right panel shows the scores at 0.01 intervals between 2.9 and 3.1. Taken together, these results indicate that the search by Fisher's score found the optimal threshold to be 3.0, where the difference in expression level is greatest. mDEDS was then applied, using all six statistics: t-, SAM, FC, B-, moderated t-, and moderated F-statistics. Figure 3 shows the DEDS score over the range, with the minimum indicating the optimal threshold 3.0.

Fig. 2
figure 2

Plots for Fisher’s score and maximum point indicating the optimal threshold 3.0 (left). Right panel indicates plots of scores for more precise regions around 3.0

Fig. 3
figure 3

Plots for DEDS and minimum point indicating the optimal threshold 3

To test how robust the proposed method is when only a small portion of the features are DE, we also generated simulation data from the same statistical distributions but with 5, 20, 40, 60, and 80 % DE genes. The upper and lower panels of Fig. 4 plot the sums for S and mDEDS, respectively, for the different DE ratios. For the smallest fraction of DE genes (5 %), the curves are flatter; nevertheless, the maximum of S indicated an optimal threshold of 3.0 for each percentage of DE genes. The results for mDEDS appeared less stable than those for S. When TEP was applied, the estimated thresholds were 3.4 for 5 %, 3.1 for 20 %, 2.8 for 40 %, and 2.9 for the others. This suggests that TEP yields a more robust threshold when the DE percentage is higher. In our breast cancer data, the percentage of DE genes is about 25 % in the largest case.

Fig. 4
figure 4

Plots for Fisher’s score (upper) and DEDS (lower) (black small square 5 %, O 20 %, multiplication sign 40 %, triangle 60 %, and asterisk 80 %)

Summarizing these simulation studies, both the Fisher and mDEDS scores found the optimal threshold of 3.0, which was the boundary set when generating the small and large groups. This demonstrates the validity of our proposed algorithm.

4 Materials

4.1 Norwegian datasets

Three datasets were gathered at two Norwegian hospitals. The first two consist of one-color expression data (mdata1; 27 samples and 43,376 probes) and two-color expression data (mdata2; 46 samples and 41,674 probes), collected at Akershus University Hospital, Lørenskog, Norway. The third dataset (mdata3) comprises one-color mRNA expression over 40,995 probes for 102 tumor samples from patients with early-stage breast cancer [8] managed by Oslo University Hospital Radiumhospitalet, Norway. All datasets were processed on the Agilent platform, and pre-processing was performed with the methods provided by Bioconductor (http://www.bioconductor.org/help/workflows/oligo-arrays/). We applied quantile normalization to the one-color data and lowess normalization to the two-color data; no background correction was performed. The probes were matched across datasets, leaving 40,995 probes for the analysis. Although the full range of tumor sizes is relatively large (0.1-5.0 cm), very few samples were below 1.0 cm or above 4.0 cm, depending on the dataset. Therefore, we fixed 1.0-3.0 cm as the search range for the optimum size.

4.2 Validation datasets

To validate the optimum threshold estimated from the above datasets, we used five different expression datasets, collectively called the Affy947 expression dataset [9]. It is a collection of six published datasets of breast cancer microarray samples, all measured on Affymetrix Human Genome HG-U133A arrays and normalized by the same protocol. Since one dataset (the Pawitan et al. dataset [10]) did not include tumor size data, we excluded it from further analysis. The datasets are accessible from NCBI's Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) with the following identifiers: GSE6532 for the Loi et al. dataset [11], GSE3494 for the Miller et al. dataset [12], GSE7390 for the Desmedt et al. dataset [13], and GSE5327 for the Minn et al. dataset [14]. The Chin et al. dataset [15] is available from ArrayExpress (http://www.ebi.ac.uk/, identifier E-TABM-158). This pooled dataset was pre-processed and normalized as described in Zhao et al. [16]. Microarray quality control was carried out with the R affyPLM package from the Bioconductor web site (http://www.bioconductor.org, [17]); the relative log expression (RLE) and normalized unscaled standard error (NUSE) tests were applied, and chip pseudo-images were produced to assess artifacts on arrays that failed these quality-control tests. The selected arrays were normalized by a three-step procedure based on the robust multi-array average (RMA) algorithm (http://www.bioconductor.org; [18]): RMA background correction by convolution, median centering of each gene across arrays separately for each dataset, and quantile normalization of all arrays. Gene mean centering has been shown to effectively remove many dataset-specific biases, allowing effective integration of multiple datasets [19].

5 Results and discussion

5.1 Optimal tumor size

Our proposed algorithm, summarized in Fig. 1, was applied to the data across the three cohorts; the plots for the Fisher score and mDEDS are shown in Fig. 5. For mDEDS, we used all six statistics from [6]: t-, SAM, FC, B-, moderated t-, and moderated F-statistics. The Fisher score estimated 2.5 cm as the optimal threshold, larger than the classical 2.0 cm, while mDEDS determined 2.2 cm. For TEP, \( q_{0\_\mathrm{TEP}} \) is summarized in Fig. 6: the minimum of Eq. (4) was at 2.4 cm, a clearer minimum than the results in Fig. 5 and closer to the 2.5 cm obtained by the Fisher score. This suggests that the TEP-based \( q_{0\_\mathrm{TEP}} \) gives a more robust threshold size. Given the results from the Fisher score, it does not seem feasible to decide whether 2.2 or 2.4 cm is the best size. However, our proposed analysis supports the possibility that a size larger than 2.0 cm is where the expression patterns show the greatest difference.

Fig. 5
figure 5

Fisher’s score (upper) found 2.5 cm and mDEDS (lower) found 2.2 cm as the optimal threshold

Fig. 6
figure 6

Plots obtained by TEP

It is important to note that the optimal value of q, obtained by optimizing the objective functions (2)-(4), cannot be equipped with a bootstrap confidence interval. This is similar to other situations in statistics where parameters are obtained by optimization, for example the smoothing parameter in non-parametric regression or the penalty in lasso regression, chosen by optimizing a cross-validation criterion. To explain this, let us follow the bootstrap paradigm. Fix a value \( q_1 \). We can compute the p values \( p_i(q_1) \) and the Fisher score \( S^*(q_1) \). We can bootstrap the data, obtain bootstrap distributions for all p values, and compute the corresponding bootstrap distribution of \( S(q_1) \), whose mean equals \( S^*(q_1) \). Repeating this for the various q in Q gives the scores \( S^*(q) \) and the bootstrap distributions of \( S(q) \) for all q in Q. What we do in this article is optimize over q the score \( S^*(q) \), which can be interpreted as the bootstrap mean. But we cannot optimize the sum of the bootstrap distributions of \( S(q) \) over all q in Q: we must summarize these distributions by a point estimate, and our method uses the mean (the bootstrap medians could be used instead). In any case, the resulting optimal q cannot carry bootstrap-based uncertainty. On the other hand, we can repeat the threshold selection separately on each of the three datasets. For mDEDS, this gave optimal values of 2.1, 2.2, and 2.2 cm; for the Fisher score, we obtained 1.7 (only slightly preferable to 2.5), 2.4, and 2.5 cm. Three values do not allow an estimate of variability, but they appear consistent.

5.2 Validation study

To validate our proposed algorithm, the five additional expression datasets were analyzed with the same approaches. For mDEDS, we used the six statistics: t-, SAM, FC, B-, moderated t-, and moderated F-statistics. The plots for the Fisher score, mDEDS, and TEP are shown in Fig. 7. Some studies include few samples smaller than 1.5 cm or larger than 3.5 cm; therefore, the plots are shown within the range 1.5-3.5 cm. The optimum sizes were 2.1 cm by the Fisher score, 2.5 cm by mDEDS, and 2.6 cm by TEP, all larger than 2.0 cm. If the first local maximum at 2.0 cm is ignored for the Fisher score, the second peak indicated 2.6 cm. These results suggest that the five datasets support the possibility of an optimum threshold larger than 2.0 cm. On the other hand, mDEDS and TEP indicated 2.0 and 2.1 cm as second peaks, confirming that the 2.0-cm rule does distinguish different tumor characteristics in the expression data. Furthermore, the 2.0-cm rule appears robust also with respect to gene expression analysis, since it is conservative, recommending stronger treatment a few millimeters below the threshold that gene expression would indicate [2].

Fig. 7
figure 7

Optimum thresholds by Fisher’s score (top), mDEDS (middle), and TEP (bottom) in the case of validation datasets

5.3 Survival analysis using optimum threshold

Usually, the goal of tumor staging based on tumor size and other factors is to guide the choice of treatment and predict patient outcomes. We therefore evaluate our threshold also with respect to clinical outcomes, namely survival time and time to local recurrence. Monthly survival time (time to death) and time to local recurrence are available only for mdata3. We divided these patients into two classes according to the thresholds 2.0, 2.2, and 2.4 cm. The survival functions of the corresponding classes were compared by Kaplan-Meier analysis and the log-rank test. Survival was defined either as overall survival (death from any cause as the observed time, alive as the censored time) or as breast cancer (BC)-specific survival (death from BC only as the observed time, others as the censored time). Table 1 summarizes the p values of the log-rank test for each survival time and each threshold. The 2.0-cm threshold discriminates best in terms of overall survival. Interestingly, the 2.0- and 2.2-cm thresholds appear preferable in terms of BC-specific survival, and the 2.2-cm threshold appears to provide the best classification for local recurrence. The survival curves of the two groups, shown in Fig. 8, are more clearly separated for the thresholds larger than 2.0 cm. This suggests that the optimum threshold, which maximizes the total differential expression, is also supported by a larger difference in time to local recurrence; local recurrence is known to be better predicted by expression than overall survival. In summary, despite the limitations of our data, there is some indication that a slightly larger threshold between 2.0 and 2.2 cm, which maximizes differential expression, also leads to better-separated survival curves for time to local recurrence compared with the traditional 2.0-cm rule.
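For reference, the two-group log-rank comparison behind Table 1 can be sketched in pure Python. This is a minimal illustration, not the software used in the study; a real analysis would rely on an established survival package (e.g., the R survival routines):

```python
import math

def logrank_test(group_a, group_b):
    """Minimal two-group log-rank test.

    Each group is a list of (time, event) pairs, with event = 1 for an
    observed death/recurrence and 0 for censoring.  Returns the
    chi-square statistic and its p value (1 degree of freedom,
    computed via the erfc identity for a squared standard normal).
    """
    event_times = sorted({t for t, e in group_a + group_b if e == 1})
    obs_a = exp_a = var = 0.0
    for t in event_times:
        n_a = sum(1 for ti, _ in group_a if ti >= t)  # at risk in A
        n_b = sum(1 for ti, _ in group_b if ti >= t)  # at risk in B
        d_a = sum(1 for ti, e in group_a if ti == t and e == 1)
        d_b = sum(1 for ti, e in group_b if ti == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_a += d_a
        exp_a += d * n_a / n                 # expected events in A
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = (obs_a - exp_a) ** 2 / var
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))
```

Splitting mdata3 patients at 2.0, 2.2, or 2.4 cm and passing the two resulting (time, event) lists to such a routine yields the per-threshold p values of the kind summarized in Table 1.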

Table 1 p values obtained by the log-rank test for survival time (months) and time to local recurrence (in months)
Fig. 8
figure 8

Survival curves in the case of local recurrence for each threshold 2.0 cm (top), 2.2 cm (middle), and 2.4 cm (bottom)

5.4 Associated network and canonical pathway analyses based on the gene lists of expression differences between the T1 and T2 groups defined by the 2.4- and 2.0-cm thresholds

We are interested in the specific biological features of the genes discriminating between tumors below and above the 2.0- and 2.4-cm thresholds. First, we applied SAM [20] to obtain the probes with significant gene expression differences at each threshold. Table 2 summarizes the number of significant probes and the corresponding false discovery rate (FDR) [21].

Table 2 Estimated number of significant probes (genes) by SAM

As shown in the table, since the two-color dataset (mdata2) maintains the 5 % FDR level, we focus on this dataset for the associated network and canonical pathway analyses. For the probes obtained by SAM, we counted the unique significant probes for each threshold as well as the number of overlapping probes (see Additional file 2: Table S1). Figure 9 summarizes, in a Venn diagram, the numbers of probes unique to 2.4 cm (part A), unique to 2.0 cm (part B), and overlapping (part C). To investigate the biological functional interactions within the gene lists, we used Ingenuity Pathway Analysis (IPA) [22], which delivers a rapid assessment of the signaling and metabolic pathways, molecular networks, and biological processes that are most significantly perturbed in a dataset of interest. IPA offers many options for investigating relationships, mechanisms, functions, and pathways of relevance. We selected the options for associated network functions and canonical pathways; the outputs of the pathway analyses and biological functions (diseases and disorders, molecular and cellular functions) are summarized in Table 3. The p value associated with a biological process or pathway annotation measures its statistical significance with respect to the Functions/Pathways/Lists Eligible molecules for the dataset and a reference set of molecules (which defines the molecules that could possibly be Functions/Pathways/Lists Eligible); it is calculated with the Benjamini-Hochberg FDR [21]. The ratio for a canonical pathway is defined as the number of molecules in the pathway that meet the cutoff criteria divided by the total number of molecules in that pathway. Networks are scored based on the number of network-eligible molecules they contain; in Table 3, a score above 10 is recognized as meaningfully high. The network score is based on the hypergeometric distribution (source: IPA online manual).
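The Benjamini-Hochberg adjustment applied to these p values can be sketched in a few lines of Python; the example p values are arbitrary:

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values (FDR q values).

    Sorts the p values, applies the step-up correction p * m / rank
    with a running minimum to enforce monotonicity, and returns the
    q values in the original input order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):       # walk from largest p to smallest
        i = order[rank - 1]
        prev = min(prev, p_values[i] * m / rank)
        q[i] = prev
    return q
```

Probes whose q value falls below the chosen level (here 5 %) are then called significant.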

Fig. 9
figure 9

Summary of unique probes for 2.4 and 2.0 cm and overlapped probes

Table 3 Summary of pathway analyses

The associated network functions indicate a tendency toward cellular assembly and tumor interaction in the early stage of tumors and toward energy production in the progressive stage. Part C represents a transitional stage from early to progressive, involving associated network functions such as lipid metabolism, cell signaling, nucleic acid metabolism, and small molecule biochemistry.

Among the common genes in part C, besides genes known in breast cancer such as AKT, ERBB2, and PTEN, we also found MTDH. When it was first described, Metadherin (MTDH) was shown to affect the expression of many genes relevant to the metastatic and chemo-resistance phenotypes [23]; MTDH may thus represent a novel mediator of malignant breast cancer progression. Furthermore, we found interesting genes in part A, such as MYC, an oncogene frequently deregulated in breast cancer; TP53, associated with high risk for various cancers; RAD50, known to moderately increase breast cancer risk; and BRCA2, whose mutations are associated with significantly elevated risk of breast and ovarian cancers [24].

6 Conclusions

We study various tumor size thresholds that can be used to create two groups of patients. We proposed a numerical algorithm involving Fisher’s score and mDEDS using gene expressions. Both measurements found that the difference in gene expression between smaller and larger tumors appears to be slightly larger than 2.0 cm. The over 2.0-cm optimum thresholds were supported by a validation using the five published expression datasets. We also found that the thresholds over 2.0 cm lead to the most distinct Kaplan-Meier curves of time to local recurrence. From the associated network and canonical pathway analyses for Norwegian datasets, the lists of DE genes for the 2.4-cm threshold also included some genes related to the metastasis of breast cancer. The same approach can be extended to also controlling other factors such as tumor grades and estrogen receptor (ER) status, which are also important prognostic indicators for breast cancer. It could also apply to other cancer considering tumor size as a prognostic indicator. A further extension of our approach would be to determine more than two groups of patients, on the base of two (or more) thresholds. This would indicate that tumor dimension has a similar role with tumor grades. We decided to remain within the consolidated clinical practice with just the T1/T2 distinction. In summary, our analysis based on gene expressions indicates that the 2.0-cm rule applied to determine patients who will benefit from more aggressive therapy appears to be justified. However, we find indications that a slightly larger threshold, of 2.2 cm could instead be applied, thus reducing therapy for some borderline patients. This could spare negative effects of strong therapies to patients that possibly do not need them. We interpret our results as a call for a critical revision of the 2.0-cm rule in the light of individual genomic data.