Breast Cancer Research and Treatment, Volume 120, Issue 3, pp 567–579

Data driven derivation of cutoffs from a pool of 3,030 Affymetrix arrays to stratify distinct clinical types of breast cancer

Authors

    • Department of Obstetrics and Gynecology, J. W. Goethe University
  • Dirk Metzler
    • Department of Biology II, Ludwig-Maximilians-University
  • Eugen Ruckhäberle
    • Department of Obstetrics and Gynecology, J. W. Goethe University
  • Lars Hanker
    • Department of Obstetrics and Gynecology, J. W. Goethe University
  • Regine Gätje
    • Department of Obstetrics and Gynecology, J. W. Goethe University
  • Christine Solbach
    • Department of Obstetrics and Gynecology, J. W. Goethe University
  • Andre Ahr
    • Department of Obstetrics and Gynecology, J. W. Goethe University
  • Marcus Schmidt
    • Department of Obstetrics and Gynecology, Johannes Gutenberg University
  • Uwe Holtrich
    • Department of Obstetrics and Gynecology, J. W. Goethe University
  • Manfred Kaufmann
    • Department of Obstetrics and Gynecology, J. W. Goethe University
  • Achim Rody
    • Department of Obstetrics and Gynecology, J. W. Goethe University
Preclinical study

DOI: 10.1007/s10549-009-0416-z

Cite this article as:
Karn, T., Metzler, D., Ruckhäberle, E. et al. Breast Cancer Res Treat (2010) 120: 567. doi:10.1007/s10549-009-0416-z

Abstract

Pooling of microarray datasets seems to be a reasonable approach to increase sample size when a heterogeneous disease like breast cancer is concerned. Different methods for the adaptation of datasets have been used in the literature. We have analyzed the influence of these strategies using a pool of 3,030 Affymetrix U133A microarrays from breast cancer samples. We present data on the resulting concordance with biochemical assays of well-known parameters and highlight critical pitfalls. We further propose a method for the inference of cutoff values directly from the data without prior knowledge of the true result. The cutoffs derived by this method displayed high specificity and sensitivity. Markers with a bimodal distribution like ER, PgR, and HER2 discriminate different biological subtypes of disease with distinct clinical courses. In contrast, markers displaying a continuous distribution, such as the proliferation marker Ki67, rather describe the composition of the mixture of cells in the tumor.

Keywords

Breast cancer; Microarray; Cutoff; Distribution; Pooling; Meta-analysis; Bimodal markers

Introduction

Breast cancer is a heterogeneous disease of many different subtypes. This is one of the reasons that large cohorts of hundreds to thousands of patients are often needed to analyze treatment effects and the prognosis of specific subgroups [1–3]. In contrast, microarray datasets encompass only tens to hundreds of samples because of the expenditure and complexity of these analyses compared to standard parameters like age, tumor size, or hormone receptor status. Thus pooling of microarray datasets or meta-analyses are required to enlarge sample size [4]. In the majority of cases an adaptation of the raw values is necessary before pooling different datasets. To this end common methods like scaling by Z-transformation [5] or magnitude normalization [6] have been applied. In some studies normalization across genes has also been performed [7]. Here we analyze the influence of these methods on the resulting concordance with data from biochemical assays. A previous report already demonstrated that estrogen receptor (ER) and human epidermal growth factor receptor 2 (HER2) status can be deduced from Affymetrix microarray data with high confidence [8]. However, in that study specific cutoff values were derived and optimized by comparison with immunohistochemistry as the gold standard. In contrast, in the present study we aimed to derive the cutoff values directly from the data without prior knowledge of the biochemical status of the samples. Finally we compared the results for bimodal and continuous markers with respect to their biological impact on disease. Our results demonstrate that reliable cutoffs can be derived from the distribution of expression values in a pooled dataset of individually normalized microarrays. These cutoffs led to exceptionally high concordance with biochemical data for bimodal markers like the ER. Clinical follow up data demonstrate that they correctly identify distinct subtypes of the disease.

Materials and methods

We compiled a database of n = 3,030 Affymetrix HG-U133A microarrays from treatment-naïve primary breast cancer samples (Table 1). We included 238 of our own samples (datasets Frankfurt, Frankfurt-2, and Frankfurt-3) which have been described previously (Ahr et al. 2002 [9], Rody et al. 2007 [10], Rody et al. 2008 [11], Ruckhäberle et al. 2008 [12], and Rody et al. 2007 [13], respectively) as well as 2,792 samples from 22 different publicly available datasets (Table 1): Rotterdam [14–16], Mainz [17], TransBIG [18], Oxford-Untreated [19], London [20], London-2 [21], Oxford-Tamoxifen, Veridex-Tam [22], Stockholm [23], Uppsala [24, 25], San Francisco [26], New York [27], MDA133 [28], EORTC [29], Edinburgh [30], ExpO [31], Signapore [32], Genentech [33], Boston [34], Berlin [35], Paris [36], and Tampa [37]. For comparability, only the ProbeSets that are also present on the Affymetrix HG-U133A microarray were used from the seven datasets in which HG-U133+ microarrays were applied. The clinical characteristics of the patients in the different datasets are given in Table 1.
Table 1

Summary of Affymetrix microarray datasets used in this study

| Dataset | Data source | No. of samples | Age ≤ 50 (% of samples) | Tumor size ≤ 2 cm (% of samples) | LNN (% of samples) | ER pos. (% of samples) | G3 (% of samples) | Systemic treatment | Median follow up (months) | No. of relapses | Event type | References |
| Rotterdam | GSE2034, GSE5327 | 344 | n.a. | n.a. | n.a. | 62 | n.a. | 286 untreated, 58 n.a. | 86 | 118 | DMFS | [14, 15] |
| Mainz | GSE11121 | 200 | 35 | 56 | 100 | 84 | 23 | Untreated | 92 | 41 | DMFS | [17] |
| TransBIG | GSE7390 | 198 | 69 | 37 | 100 | 68 | 42 | Untreated | 117 | 91 | RFS | [18] |
| Oxford-Untreated | GSE2990 (n = 61), GSE6532 (n = 8) | 69 | 44 | 64 | 100 | 71 | 41 | Untreated | 121 | 29 | RFS | [19] |
| London | GSE6532 | 87 | 6 | 35 | 33 | 98 | 23 | Endocrine | 137 | 28 | RFS | [20] |
| London-2 | GSE9195 | 77 | 5 | 44 | 53 | 96 | 41 | Endocrine | 98 | 13 | RFS | [21] |
| Oxford-Tamoxifen | GSE6532 | 109 | 14 | 34 | 64 | 98 | 19 | Endocrine | 61 | 30 | RFS | [20] |
| Veridex-Tam | GSE12093 | 136 | n.a. | n.a. | 100 | 100 | n.a. | Endocrine | 85 | 20 | DMFS | [22] |
| Frankfurt-3 | This study | 52 | 6 | 9 | 61 | 98 | 10 | Endocrine | 56 | 19 | RFS | [12] |
| Stockholm | GSE1456 | 159 | n.a. | n.a. | n.a. | 79 | 42 | Yes/no | 85 | 40 | RFS | [23] |
| Uppsala | GSE3494 (n = 251), GSE6232 (n = 5), GSE4922 (n = 1), GSE2990 (n = 1) | 258 | 22 | 51 | 65 | 80 | 22 | Yes/no | 118 | 91 | RFS | [24, 25] |
| San Francisco | E-TABM-158 | 118 | 46 | 33 | 43 | 69 | 54 | Yes/no | 68 | 36 | DMFS | [26] |
| New York | GSE2603 | 99 | 37 | 9 | 34 | 58 | n.a. | n.a. | 65 | 27 | DMFS | [27] |
| Frankfurt | This study | 119 | 55 | 50 | 56 | 66 | 47 | Chemotherapy | 39 | 29 | RFS | [10] |
| Frankfurt-2 | This study | 67 | 51 | 0 | 49 | 58 | 30 | Chemotherapy | n.a. | n.a. | | [13] |
| MDA133 | www.mdanderson.org | 133 | 41 | 9 | 30 | 63 | 58 | Chemotherapy | n.a. | n.a. | | [28] |
| EORTC | GSE1561 | 49 | n.a. | n.a. | n.a. | 57 | n.a. | Chemotherapy | n.a. | n.a. | | [29] |
| Edinburgh | GSE5462 | 116 | n.a. | n.a. | n.a. | 100 | n.a. | Endocrine | n.a. | n.a. | | [30] |
| expO | GSE2109 | 301 | 31 | 32 | 47 | 67 | 49 | n.a. | n.a. | n.a. | | [31] |
| Signapore | GSE5364 | 183 | n.a. | n.a. | n.a. | 55 | n.a. | n.a. | n.a. | n.a. | | [32] |
| Genentech | GSE12763 | 30 | n.a. | n.a. | n.a. | 70 | n.a. | n.a. | n.a. | n.a. | | [33] |
| Boston | GSE3744 | 40 | n.a. | n.a. | n.a. | 30 | 100 | n.a. | n.a. | n.a. | | [34] |
| Berlin | GSE6596 | 24 | 21 | 63 | n.a. | 67 | 46 | n.a. | n.a. | n.a. | | [35] |
| Paris | GSE13787 | 23 | n.a. | n.a. | n.a. | 0 | 100 | n.a. | n.a. | n.a. | | [36] |
| Tampa | GSE10780 | 39 | n.a. | n.a. | n.a. | 70 | n.a. | n.a. | n.a. | n.a. | | [37] |
| Total | | 3,030 | 35 | 36 | 70 | 74 | 38 | | 80 | 629 | | |

Notes: The TransBIG cohort contains independent replicate samples from 19 patients of the Uppsala cohort and 22 patients of the Oxford-Untreated cohort. Affymetrix HG-U133A microarrays were applied in all studies except for datasets expO, London, London-2, Genentech, Boston, Paris, and Tampa, where the identical ProbeSets from HG-U133Plus arrays were used.

Affymetrix expression data were analyzed by using the MAS5.0 [38] algorithm of the affy package [39] of the Bioconductor software project [40] (http://www.bioconductor.org/). Subsequently data were log2 transformed and median centered across arrays. Further scaling was performed in two different ways. In the first method the expression values of all the genes on the array were multiplied by a scale factor S so that the magnitude (sum of the squares of the values) equals 1 (we refer to these data as “magnitude-normalized”). This method is similar to scaling by Z score transformation, but the latter uses mean-centering instead of the more robust median-centering. In addition, the applied magnitude-normalization is sensitive to the total feature size, but this has no effect as long as the same number of ProbeSets is used for all samples. In a second approach, mean-centering and magnitude-normalization were first applied across arrays and subsequently also across genes in each individual dataset. We refer to these data as “gene-normalized”.
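The first scaling method can be sketched in a few lines. The following is a minimal illustration, assuming strictly positive MAS5 signal values for a single array; the function name and the five example intensities are hypothetical and not taken from the study.

```python
import math
from statistics import median

def magnitude_normalize(signals):
    """Log2-transform, median-center, and scale one array to unit magnitude.

    Assumes strictly positive signal intensities (hypothetical input).
    """
    logged = [math.log2(v) for v in signals]
    center = median(logged)
    centered = [x - center for x in logged]
    # Scale factor S = 1 / sqrt(sum of squares), so the sum of squares becomes 1.
    norm = math.sqrt(sum(x * x for x in centered))
    if norm == 0:
        raise ValueError("all values identical; magnitude is undefined")
    return [x / norm for x in centered]

# Hypothetical signal intensities for five ProbeSets on one array.
array = [120.0, 850.0, 45.0, 2300.0, 310.0]
normalized = magnitude_normalize(array)
print(sum(x * x for x in normalized))
```

Because the sum of squares is fixed at 1, individual values shrink as the number of ProbeSets grows. With roughly 22,000 ProbeSets on an array, typical magnitudes are on the order of 1/sqrt(22,000), about 0.007, which is the scale on which the cutoffs in the Results are expressed.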

ER status as determined by immunohistochemistry (IHC) or biochemical assay was available for 2,198 samples from 18 of the 25 datasets (see Supplementary Table S1). We further refer to this parameter as “biochemical ER status” in this manuscript. Data on PgR status were available for 1,474 patients from 13 of the 25 datasets and HER2 status was available for only 618 patients from 8 of the 25 datasets (Supplementary Table S1). Supplementary Table S1 also gives further information on the specific methods and cutoffs used in the different studies for the definition of the ER, PgR and HER2 status. Nine different ProbeSets of the ER (ESR1) gene are present on the Affymetrix U133A array. ProbeSet 205225_at was selected for most analyses because of its highest concordance with the biochemical ER status (see Results). The progesterone receptor (PgR) is represented by only one ProbeSet (208305_at). From the two ProbeSets for HER2 which are present on the U133A array ProbeSet 216836_s_at was used (see Results). Regarding Ki67, four different ProbeSets exist on the U133A array (212020_s_at to 212023_s_at). However, there is no established cutoff for Ki67 IHC [41, 42] and a gold standard is missing. In addition all four ProbeSets display similar strong correlations to each other. Thus, we used the mean of the magnitude-normalized data of all four ProbeSets in subsequent analyses. Cutoffs for ER, PgR and HER2 expression from microarray were derived from fitting two normal distributions to the observed distribution of Affymetrix expression values by maximum likelihood optimization using the optim function in R as described by Venables and Ripley [43].
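The cutoff derivation can be illustrated with a small, self-contained sketch. The study used the optim function in R; the version below swaps in a plain expectation-maximization (EM) loop, which also maximizes the mixture likelihood, and then locates the point between the two means where the weighted component densities cross. The simulated expression values, the function names, and the choice to weight the densities by their mixing proportions are assumptions made for illustration.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_two_normals(data, iterations=300):
    """EM estimate of weights, means, and SDs for a low and a high component."""
    n = len(data)
    ordered = sorted(data)
    mu = [ordered[n // 4], ordered[(3 * n) // 4]]  # start at the quartiles
    overall_mean = sum(data) / n
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    sigma = [overall_sd, overall_sd]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each value belongs to the high component.
        resp = []
        for x in data:
            low = weight[0] * normal_pdf(x, mu[0], sigma[0])
            high = weight[1] * normal_pdf(x, mu[1], sigma[1])
            resp.append(high / (low + high))
        # M-step: re-estimate the parameters from the weighted data.
        n_high = sum(resp)
        n_low = n - n_high
        mu = [sum((1 - r) * x for r, x in zip(resp, data)) / n_low,
              sum(r * x for r, x in zip(resp, data)) / n_high]
        sigma = [math.sqrt(sum((1 - r) * (x - mu[0]) ** 2
                               for r, x in zip(resp, data)) / n_low),
                 math.sqrt(sum(r * (x - mu[1]) ** 2
                               for r, x in zip(resp, data)) / n_high)]
        weight = [n_low / n, n_high / n]
    return weight, mu, sigma

def intersection_cutoff(weight, mu, sigma, tol=1e-9):
    """Bisection for the crossing of the two weighted densities between the means."""
    def diff(x):
        return (weight[1] * normal_pdf(x, mu[1], sigma[1])
                - weight[0] * normal_pdf(x, mu[0], sigma[0]))
    lo, hi = mu[0], mu[1]
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if diff(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Simulated (hypothetical) magnitude-normalized values: 25% low, 75% high.
rng = random.Random(0)
data = ([rng.gauss(-0.010, 0.004) for _ in range(250)]
        + [rng.gauss(0.015, 0.005) for _ in range(750)])
weight, mu, sigma = fit_two_normals(data)
cutoff = intersection_cutoff(weight, mu, sigma)
print(round(cutoff, 4))
```

Two normal densities with unequal variances cross at two points; only the crossing that lies between the two fitted means is meaningful as a cutoff, which is why the search is restricted to that interval.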

Follow up data were available for 2,058 of the samples (11 datasets without follow up, see Table 1). Survival intervals were measured from the time of surgery. For nine datasets relapse free survival (RFS) was used as an endpoint (n = 1,180), while for five datasets only distant metastasis free survival (DMFS) was available (n = 879). Thus any local recurrence events are missing from these five datasets. However, the effect of using these different endpoints was rather small in the overall dataset. Supplementary Figure S1 shows that no significant difference in relative survival was found when comparing the 879 samples where only the DMFS endpoint was available to the 1,180 samples using the RFS endpoint. Thus, in this study we used either the RFS endpoint as disease free survival (DFS) or, if RFS was not available, the DMFS endpoint. Data for women in whom the envisaged end point was not reached were censored as of the last follow-up date or at 120 months. We constructed Kaplan–Meier curves and used the log rank test to determine the univariate significance of the variables. A Cox proportional-hazards regression model was used to examine simultaneously the effects of multiple covariates on survival. The effect of each variable was assessed with the use of the Wald test and described by the hazard ratio, with a 95% confidence interval. Subjects with missing values were excluded from the analyses and all reported P values are two sided. P values of less than 0.05 were considered to indicate a significant result. All analyses were performed using the R software environment (http://www.r-project.org/) and SPSS version 17.0 (SPSS Inc., Chicago, IL).
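The censoring rule and the Kaplan–Meier estimate can be sketched as follows. This is a minimal product-limit estimator written for illustration, not the R or SPSS routines used in the study; the six follow-up records are hypothetical.

```python
def censor_at(time, event, horizon=120.0):
    """Censor follow-up at the horizon (months): later events are not counted."""
    if time > horizon:
        return horizon, False
    return time, event

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        relapses = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            relapses += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if relapses:
            survival *= 1.0 - relapses / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Hypothetical records: (months of follow-up, relapse observed?).
records = [(12, True), (30, False), (45, True), (80, True), (130, True), (150, False)]
censored = [censor_at(t, e) for t, e in records]
curve = kaplan_meier([t for t, _ in censored], [e for _, e in censored])
for t, s in curve:
    print(t, round(s, 3))
```

In this toy example the relapse at 130 months is treated as censored at 120 months, so it does not lower the curve; only the three relapses within the 120-month window contribute steps.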

Results

Concordance of different Affymetrix ProbeSets with biochemical data of ER, PgR, and HER2 status

In a first approach, arrays from different datasets were adapted using magnitude normalization (see Materials and methods) and the concordance with biochemical data for ER, PgR, and HER2 status was assessed. For 2,198 of 3,030 total samples (72.5%), data on the estrogen receptor (ER) status from immunohistochemistry (IHC) or biochemical assay were available. Of these, 1,635 (74.4%) were characterized as ER positive and 563 (25.6%) as ER negative. We used receiver operating characteristics (ROC) analysis to demonstrate the correlation of magnitude normalized data of different ProbeSets from the Affymetrix HG-U133A microarray with the biochemical ER status (Supplementary Figure S2). The area under the curve (AUC) of the ROC analysis provides a quantitative value of the concordance with the biochemical data. The Affymetrix ProbeSet 205225_at displayed the highest concordance with an AUC of 0.949 (95% CI 0.938–0.960). This confirms the results of Gong et al. [8], who obtained the strongest correlation of this ProbeSet with ER status by IHC in their training set of 193 samples.
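The AUC used here has a simple interpretation: it is the probability that a randomly chosen biochemically positive sample has a higher expression value than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the eight expression values and their labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical expression values with biochemical status (1 = positive).
scores = [0.012, 0.009, 0.003, 0.015, -0.004, 0.006, -0.010, 0.001]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
auc = roc_auc(scores, labels)
print(auc)
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch; a rank-based formulation gives the same value for larger cohorts.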

For 1,474 of 3,030 samples (48.6%), data on the progesterone receptor (PgR) status from immunohistochemistry (IHC) or biochemical assays were available. Of these, 858 (58.2%) were characterized as PgR positive and 616 (41.8%) as PgR negative. ROC analysis of the single ProbeSet (208305_at) for PgR on the HG-U133A array resulted in an AUC of 0.786 (95% CI 0.763–0.809; Supplementary Figure S3A).

The HER2 status of the tumor was available for 618 of the 3,030 samples (20.4%). 139 (22.5%) of them were characterized as HER2 positive (3+ IHC or FISH positive) and 479 (77.5%) as HER2 negative. Affymetrix ProbeSet 216836_s_at revealed a slightly better result in ROC analysis with an AUC of 0.856 (95% CI 0.814–0.897; Supplementary Figure S3B) than ProbeSet 210930_s_at (AUC 0.799; 95% CI 0.752–0.846). The superiority of this ProbeSet for HER2 status was also demonstrated previously by Gong and coworkers [8].

Derivation of a cutoff value for the ER status from the distribution of ER microarray data

We selected the ER Affymetrix ProbeSet 205225_at, which worked best in ROC analysis, for further study. Figure 1a presents the distribution of the expression values for this ProbeSet separately in ER positive and ER negative samples as defined by IHC/biochemical assay. In Fig. 1b we analyzed the combined distribution of the expression values among all 3,030 samples from the combined datasets. A mixture of two normal distributions was fitted to these data as demonstrated by the blue and red lines in Fig. 1b. Subsequently, the intersection of the two fitted distributions was selected as a cutoff value (0.0075) for the definition of ER positive samples based on microarray data. This cutoff resulted in a specificity of 86.7% and a sensitivity of 93.3% when compared with the biochemical ER status available for 2,198 of the samples (Table 2). The positive predictive value (PPV) was 95.3%, the negative predictive value (NPV) 81.7% and the overall accuracy 91.6%. Among the individual datasets the specificities ranged from 66.7 to 100%, sensitivities from 80.0 to 100%, the PPVs from 76.7 to 100%, and the NPVs from 55.6 to 100% (see Table 2).
https://static-content.springer.com/image/art%3A10.1007%2Fs10549-009-0416-z/MediaObjects/10549_2009_416_Fig1_HTML.gif
Fig. 1

Distribution of ER expression values in the combined dataset. a Distribution of ER expression values (ProbeSet 205225_at) stratified by biochemical ER status in the 2,198 samples with data from immunohistochemistry/biochemical assay. b Distribution of ER expression values (ProbeSet 205225_at) among all 3,030 samples
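The concordance measures reported for the general ER cutoff all follow from a single 2 × 2 table of microarray status against biochemical status. The sketch below shows the arithmetic. The four cell counts are not reported in the text; they are back-calculated here as an assumption, chosen so that they sum to the 1,635 ER positive and 563 ER negative samples and reproduce the published percentages after rounding.

```python
def concordance(tp, fn, tn, fp):
    """Standard 2 x 2 concordance measures, with biochemical status as reference."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Assumed, back-calculated counts (not taken from the paper):
# 1,526 + 109 = 1,635 ER positive and 488 + 75 = 563 ER negative samples.
metrics = concordance(tp=1526, fn=109, tn=488, fp=75)
percent = {name: round(100 * value, 1) for name, value in metrics.items()}
print(percent)
```

Sensitivity and specificity depend only on the row they are computed in, whereas PPV and NPV also depend on the share of ER positive samples, which is one reason these two values vary so much between individual datasets.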

Table 2

Concordance of ER status based on microarray and biochemical data

| | | General cutoff (0.0075) | | | | | Dataset specific cutoff | | | | | |
| Dataset | ER detection method (a) | Sensitiv. (%) | Specific. (%) | PPV (%) | NPV (%) | Accuracy (%) | Sensitiv. (%) | Specific. (%) | PPV (%) | NPV (%) | Accuracy (%) | Cutoff |
| Rotterdam | LBA, EIA, IHC | 89.0 | 83.7 | 89.4 | 83.1 | 86.9 | 90.0 | 83.0 | 89.1 | 84.2 | 87.2 | 0.0072 |
| TransBIG | IHC | 90.3 | 82.8 | 91.7 | 80.3 | 87.9 | 92.5 | 78.1 | 89.9 | 83.3 | 87.9 | 0.0051 |
| Oxford-untreated | Not given | 100.0 | 66.7 | 76.7 | 100.0 | 84.1 | 100.0 | 40.0 | 64.7 | 100.0 | 71.4 | 0.0034 |
| London | Not given | 97.7 | n.a. | 100.0 | n.a. | 97.7 | 97.7 | n.a. | 100.0 | n.a. | 97.7 | 0.0082 |
| London-2 | Not given | 94.8 | n.a. | 100.0 | n.a. | 94.8 | 89.6 | n.a. | 100.0 | n.a. | 89.6 | 0.0101 |
| Oxford-Tamoxifen | Not given | 97.2 | n.a. | 100.0 | n.a. | 97.2 | 100.0 | n.a. | 100.0 | n.a. | 100.0 | 0.0018 |
| Veridex-Tam | IHC, LBA | 99.3 | n.a. | 100.0 | n.a. | 99.3 | 99.3 | n.a. | 100.0 | n.a. | 99.3 | 0.0031 |
| Frankfurt-3 | IHC | 91.8 | n.a. | 97.8 | n.a. | 90.0 | 98.0 | n.a. | 98.0 | n.a. | 96.0 | 0.0040 |
| Uppsala | EIA | 88.8 | 88.2 | 97.9 | 55.6 | 88.8 | 84.7 | 88.2 | 97.8 | 47.6 | 85.1 | 0.0092 |
| San Francisco | Not given | 94.7 | 81.4 | 89.9 | 89.7 | 89.8 | 96.0 | 79.1 | 88.9 | 91.9 | 89.8 | 0.0064 |
| New York | Not given | 94.7 | 95.2 | 96.4 | 93.0 | 94.9 | 94.7 | 92.9 | 94.7 | 92.9 | 93.9 | 0.0066 |
| Frankfurt | IHC | 94.4 | 90.5 | 94.4 | 90.5 | 93.0 | 97.2 | 88.1 | 93.3 | 94.9 | 93.9 | 0.0056 |
| Frankfurt-2 | IHC | 89.7 | 88.5 | 92.1 | 85.2 | 89.2 | 89.7 | 88.5 | 92.1 | 85.2 | 89.2 | 0.0064 |
| MDA133 | IHC | 93.9 | 92.2 | 95.1 | 90.4 | 93.2 | 96.3 | 92.2 | 95.2 | 94.0 | 94.7 | 0.0072 |
| EORTC | IHC | 96.4 | 100.0 | 100.0 | 95.2 | 97.9 | 96.4 | 95.0 | 96.4 | 95.0 | 95.8 | 0.0067 |
| Edinburgh | IHC | 99.1 | n.a. | 100.0 | n.a. | 99.1 | 100.0 | n.a. | 100.0 | n.a. | 100.0 | 0.0061 |
| expO | Not given | 90.2 | 88.2 | 93.9 | 81.8 | 89.5 | 88.2 | 90.2 | 94.7 | 79.3 | 88.9 | 0.0080 |
| Boston | Not given | 80.0 | 100.0 | 100.0 | 88.9 | 92.3 | 73.3 | 100.0 | 100.0 | 85.7 | 89.7 | 0.0093 |
| All combined | | 93.3 | 86.7 | 95.3 | 81.7 | 91.6 | 93.4 | 84.0 | 94.4 | 81.4 | 91.0 | 0.0075 |

(a) LBA ligand binding assay (≥10 fmol/mg), EIA enzyme immunoassay (>0.05 fmol/μg DNA), IHC immunohistochemistry (≥10% positive tumor cells)

n.a. not applicable (if all samples were ER positive by the biochemical method)

We also performed the fitting on the distribution of ER expression values separately in each individual dataset. This procedure yielded only slightly different cutoff values (range 0.0018–0.0101, Table 2 and Supplementary Figure S4). When these dataset specific cutoffs were used, a somewhat lower overall specificity of 84.0% and a nearly identical sensitivity of 93.4% were obtained (Table 2). Thus the differences between the individual datasets are small and the simultaneous use of all samples seems to improve the fitting to the distribution.

Derivation of cutoff values for PgR and HER2 microarray data

The same method of fitting two normal distributions to the expression data of the combined sample cohort was applied to identify cutoff values for the expression of the progesterone receptor gene (PgR, ProbeSet 208305_at) and HER2 (ProbeSet 216836_s_at). The corresponding graphs are given in Fig. 2a, b, respectively. The resulting cutoff (−0.0078) from Fig. 2a for PgR expression corresponded to an overall accuracy of 71.8%, a specificity of 67.4% and a sensitivity of 74.9%. The positive predictive value (PPV) was 76.2% and the negative predictive value (NPV) 65.9%. Again, as shown in Table 3, fitting each dataset separately (Supplementary Figure S5) resulted in similar cutoffs (range −0.0099 to −0.0047) and an identical overall accuracy. The HER2 cutoff (0.0135) from Fig. 2b led to an accuracy of 89.2%, a specificity of 97.9% but a rather low sensitivity of 59.0% when compared to HER2 status based on “3+” staining in immunohistochemistry or FISH ratio >2.0 (Table 4). The PPV was 89.1% and the NPV 89.2%. Similar cutoffs (range 0.0119–0.0146) were obtained when datasets were fitted separately (Supplementary Figure S6). In contrast, with either the general or the dataset specific cutoffs the sensitivity for HER2 detection differed markedly between datasets (range 32–100%, Table 4).
https://static-content.springer.com/image/art%3A10.1007%2Fs10549-009-0416-z/MediaObjects/10549_2009_416_Fig2_HTML.gif
Fig. 2

Distribution of PgR and HER2 Affymetrix expression values in the combined dataset. a Distribution of PgR expression values (ProbeSet 208305_at) among all 3,030 samples. b Distribution of HER2 expression values (ProbeSet 216836_s_at) among all 3,030 samples

Table 3

Concordance of PgR status based on microarray and biochemical data

| | | General cutoff (−0.0078) | | | | | Dataset specific cutoff | | | | | |
| Dataset | PgR detection method | Sensitiv. (%) | Specific. (%) | PPV (%) | NPV (%) | Accuracy (%) | Sensitiv. (%) | Specific. (%) | PPV (%) | NPV (%) | Accuracy (%) | Cutoff |
| Rotterdam | LBA, EIA | 68.3 | 80.4 | 86.2 | 58.7 | 72.7 | 68.3 | 80.4 | 86.2 | 58.7 | 72.7 | −0.0075 |
| London | Not given | 82.8 | 42.9 | 81.5 | 45.0 | 72.9 | 82.8 | 42.9 | 81.5 | 45.0 | 72.9 | −0.0088 |
| London-2 | Not given | 94.9 | 66.7 | 90.3 | 80.0 | 88.3 | 89.8 | 77.8 | 93.0 | 70.0 | 87.0 | −0.0064 |
| Frankfurt-3 | IHC | 80.8 | 50.0 | 67.7 | 66.7 | 67.4 | 80.8 | 45.0 | 65.6 | 64.3 | 65.2 | −0.0095 |
| Uppsala | EIA | 68.4 | 73.8 | 89.0 | 42.9 | 69.7 | 68.9 | 73.8 | 89.1 | 43.3 | 70.1 | −0.0084 |
| San Francisco | Not given | 75.8 | 72.5 | 78.1 | 69.8 | 74.4 | 78.8 | 66.7 | 75.4 | 70.8 | 73.5 | −0.0099 |
| New York | Not given | 67.4 | 90.9 | 85.3 | 78.1 | 80.6 | 67.4 | 90.9 | 85.3 | 78.1 | 80.6 | −0.0069 |
| Frankfurt | IHC | 77.4 | 64.4 | 66.1 | 76.0 | 70.5 | 77.4 | 69.5 | 69.5 | 77.4 | 73.2 | −0.0064 |
| Frankfurt-2 | IHC | 72.4 | 77.8 | 72.4 | 77.8 | 75.4 | 72.4 | 77.8 | 72.4 | 77.8 | 75.4 | −0.0081 |
| MDA133 | IHC | 72.7 | 68.0 | 62.5 | 77.3 | 70.0 | 72.7 | 66.7 | 61.5 | 76.9 | 69.2 | −0.0079 |
| EORTC | IHC | 66.7 | 93.1 | 85.7 | 81.8 | 83.0 | 61.1 | 93.1 | 84.6 | 79.4 | 80.9 | −0.0054 |
| expO | Not given | 91.0 | 19.2 | 54.6 | 66.7 | 56.3 | 91.0 | 19.2 | 54.6 | 66.7 | 56.3 | −0.0076 |
| Boston | Not given | 53.8 | 76.9 | 53.8 | 76.9 | 69.2 | 30.8 | 92.3 | 66.7 | 72.7 | 71.8 | −0.0047 |
| Combined | | 74.9 | 67.4 | 76.2 | 65.9 | 71.8 | 74.5 | 68.0 | 76.4 | 65.7 | 71.8 | −0.0078 |

LBA ligand binding assay (≥10 fmol/mg), EIA enzyme immunoassay (>0.05 fmol/μg DNA), IHC immunohistochemistry (≥10% positive tumor cells)

Table 4

Concordance of HER2 status based on microarray and biochemical data

| | | | General cutoff (0.0135) | | | | | Dataset specific cutoff | | | | | |
| Dataset | IHC/FISH HER2 available | HER2 positive (a) (%) | Sensitiv. (%) | Specific. (%) | PPV (%) | NPV (%) | Accuracy (%) | Sensitiv. (%) | Specific. (%) | PPV (%) | NPV (%) | Accuracy (%) | Cutoff |
| Frankfurt-3 | 19 | 2 (11%) | 0.0 | 94.1 | 0.0 | 88.9 | 84.2 | 0.0 | 94.1 | 0.0 | 88.9 | 84.2 | 0.0163 |
| San Francisco | 79 | 8 (10%) | 100.0 | 98.6 | 88.9 | 100.0 | 98.7 | 100.0 | 98.6 | 88.9 | 100.0 | 98.7 | 0.0119 |
| New York | 88 | 9 (10%) | 77.8 | 94.9 | 63.6 | 97.4 | 93.2 | 77.8 | 94.9 | 63.6 | 97.4 | 93.2 | 0.0151 |
| Frankfurt | 65 | 22 (34%) | 59.1 | 95.3 | 86.7 | 82.0 | 83.1 | 50.0 | 97.7 | 91.7 | 79.2 | 81.5 | 0.0146 |
| Frankfurt-2 | 57 | 20 (35%) | 35.0 | 97.3 | 87.5 | 73.5 | 75.4 | 35.0 | 97.3 | 87.5 | 73.5 | 75.4 | 0.0145 |
| MDA133 | 132 | 33 (25%) | 81.8 | 99.0 | 96.4 | 94.2 | 94.7 | 75.8 | 99.0 | 96.2 | 92.5 | 93.2 | 0.0144 |
| expO | 141 | 37 (26%) | 35.1 | 100.0 | 100.0 | 81.3 | 83.0 | 32.4 | 100.0 | 100.0 | 80.6 | 82.3 | 0.0141 |
| Boston | 37 | 8 (22%) | 87.5 | 100.0 | 100.0 | 96.7 | 97.3 | 87.5 | 100.0 | 100.0 | 96.7 | 97.3 | 0.0155 |
| Combined | 618 | 139 (22%) | 59.0 | 97.9 | 89.1 | 89.2 | 89.2 | 55.4 | 98.1 | 89.5 | 88.3 | 88.5 | 0.0135 |

(a) IHC 3+ or FISH > 2.0 if method given, see Supplementary Table S1

Influence of gene normalization on different cohorts

Some analyses of microarray datasets have used “gene normalization” to bring the data to a uniform scale. By this method the expression values of each gene are adjusted across all samples of the respective cohort. We analyzed the effect of this transformation on the distribution of ER expression values in the individual datasets. After “gene normalization” had been performed it was still possible to derive cutoff values from the mixed distribution as described above. However, the specific cutoffs are different for each individual dataset after “gene normalization” since they depend on the proportions of ER positive and ER negative samples in the specific dataset. We analyzed the impact of this effect by deliberately subdividing the dataset Frankfurt into two subgroups containing either only the ER positive or only the ER negative samples. Supplementary Figure S7 demonstrates the influence of gene normalization on the full cohort (Supplementary Figure S7A) and the two subcohorts (Supplementary Figure S7B, C). “Gene normalization” leads to a broadening of the distribution of expression values in the ER positive and the ER negative subsets as compared to the full cohort (Supplementary Figure S7). Importantly, the derivation of a cutoff from such gene normalized data by fitting two distributions was only possible when at least some ER negative samples were included in the ER positive cohort and vice versa.
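Why gene normalization makes cutoffs depend on cohort composition can be seen in a toy example. The sketch below is a simplification that only mean-centers a single gene across the samples of a cohort (the study additionally applied magnitude normalization across genes); the six expression values are hypothetical.

```python
from statistics import mean

def gene_center(values):
    """Mean-center one gene's values across the samples of a cohort."""
    m = mean(values)
    return [v - m for v in values]

# Hypothetical array-normalized ER values for three ER negative
# and three ER positive samples.
er_negative = [-0.012, -0.010, -0.008]
er_positive = [0.012, 0.015, 0.018]

mixed_cohort = gene_center(er_negative + er_positive)
positive_only = gene_center(er_positive)

print([round(v, 4) for v in mixed_cohort])
print([round(v, 4) for v in positive_only])
```

In the mixed cohort the two groups stay on opposite sides of zero, so a cutoff can still separate them. In the cohort containing only ER positive samples the values are re-centered around zero, the absolute expression level is lost, and the lowest ER positive sample ends up with a negative value.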

Plausibility of the derived cutoffs through analysis of patients’ prognosis

ER positive and ER negative breast cancers are generally regarded as separate types of disease with a different clinical course [44–46]. We reasoned that a correct classification according to ER status should result in patient subgroups with a distinct prognosis while further subdivision according to the level of ER expression should have no significant effect on survival. Thus we stratified both the microarray-derived ER positive and ER negative subgroups into four quartiles each according to ER expression (ProbeSet 205225_at). Follow up data were available for 2,058 of the samples. Figure 3 presents the results from Kaplan–Meier analyses of disease free survival of the patients in the eight resulting subgroups. Patients in the ER negative subgroup as defined by the distribution-derived cutoff had a high risk for a relapse, especially in the first three to five years. Survival in the ER positive subgroup is significantly better but steadily declining even after five to ten years (P = 0.001). These differences between ER negative and ER positive cancer types have been repeatedly described previously [44, 45]. In contrast, we observed no significant differences in survival within each subtype when the patients were further substratified according to ER expression.
https://static-content.springer.com/image/art%3A10.1007%2Fs10549-009-0416-z/MediaObjects/10549_2009_416_Fig3_HTML.gif
Fig. 3

Different prognosis of patients supports plausibility of the derived cutoff for ER expression. Samples were defined as ER positive and ER negative based on microarray. The two subgroups of patients were further stratified into quartiles based on the expression values of ER. Kaplan Meier analysis of disease free survival of 2,058 samples with follow up data is presented for the resulting eight subgroups. While ER positive and negative subtypes clearly differ in survival (P = 0.001) no correlation of the prognosis with the relative expression of ER within the subgroups of ER positive and ER negative tumors is detectable
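The eight subgroups in Fig. 3 arise from first applying the cutoff and then ranking samples within each ER group. A minimal sketch of that two-step stratification is given below. The cutoff of 0.0075 is the one derived above; the function names, the eight example values, and the convention that values above the cutoff count as positive are assumptions.

```python
def quartile_groups(values):
    """Assign each value to quartile 1..4 by its rank within the list."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, index in enumerate(order):
        groups[index] = 1 + (4 * rank) // len(values)
    return groups

def stratify(er_values, cutoff=0.0075):
    """Label each sample as (ER status, quartile within that status)."""
    # Assumption: values strictly above the cutoff are called ER positive.
    status = ["pos" if v > cutoff else "neg" for v in er_values]
    labels = [None] * len(er_values)
    for group in ("neg", "pos"):
        members = [i for i, s in enumerate(status) if s == group]
        quartiles = quartile_groups([er_values[i] for i in members])
        for i, q in zip(members, quartiles):
            labels[i] = (group, q)
    return labels

# Hypothetical magnitude-normalized ER values for eight samples.
er_values = [-0.012, -0.009, -0.006, -0.002, 0.009, 0.012, 0.016, 0.020]
labels = stratify(er_values)
print(labels)
```

Quartiles are computed separately inside each ER group, so the quartile boundaries of the ER positive samples are unrelated to those of the ER negative samples.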

Next we analyzed the impact of the combined stratification according to both ER and PgR determined by either biochemical methods (Fig. 4a) or microarray (Fig. 4b) on the prognosis of the 1,085 patients for whom the biochemical data as well as follow up information were available. The obtained results were similar, and detailed results for single markers are also presented in Supplementary Figure S8. Microarray data resulted in a higher proportion (4%) of ER negative PgR positive tumors than biochemical methods (1%), which might represent false positive PgR results (see Supplementary Figure S10 and the Discussion). Figure 4c–e presents the results of Kaplan–Meier analyses in which all 2,058 patients with available follow up data were included using the distribution derived cutoffs described above. Results for ER and PgR (Fig. 4c) were comparable to those of the smaller subset in Fig. 4b. HER2 positive patients had a worse prognosis in the complete cohort (Fig. 4d). As shown in Fig. 4e, the largest impact of HER2 expression was observed in the ER positive subgroup. To analyze the relative impact of the three variables (ER, PgR, and HER2) on the prognosis of patients simultaneously we performed univariate and multivariate Cox regression analysis as presented in Supplementary Table 2. While all three markers were highly significant in univariate analysis, only PgR remained significant in multivariate analysis, with a hazard ratio of 1.48 (95% CI 1.23–1.78, P < 0.001, Supplementary Table 2). For a subset of 1,589 of the patients in the analyzed cohorts information on endocrine treatment was available. As shown in Fig. 5, the worse prognosis of PgR negative tumors was observed among both 661 endocrine treated and 928 untreated samples (results were similar for the subset of 722 patients with biochemical status data; Supplementary Figure S9).
https://static-content.springer.com/image/art%3A10.1007%2Fs10549-009-0416-z/MediaObjects/10549_2009_416_Fig4_HTML.gif
Fig. 4

Disease free survival of patients according to stratifications using the derived cutoff values. a, b Samples with available biochemical status of ER and PgR (n = 1,085) were stratified according to either the biochemical status (a) or the microarray derived status (b). Disease free survival of the respective subgroups according to Kaplan–Meier analysis is presented. Detailed individual comparisons are given in Supplementary Figure S8. c–e All samples with available follow up information (n = 2,058) were stratified according to microarray derived status for ER and PgR (c), HER2 status (d), as well as HER2 and ER status (e)

https://static-content.springer.com/image/art%3A10.1007%2Fs10549-009-0416-z/MediaObjects/10549_2009_416_Fig5_HTML.gif
Fig. 5

Disease free survival of untreated and endocrine treated patients stratified according to ER and PgR status based on microarray. The 1,589 patients with available follow up information were selected which were either untreated (a) or treated only with adjuvant endocrine therapy (b). Patients were stratified according to ER and PgR status based on microarray and Kaplan–Meier analysis performed to determine the disease free survival in the respective subgroups
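The hazard ratio reported for PgR in the multivariate Cox model (1.48, 95% CI 1.23–1.78) can be checked for internal consistency with the stated P value. On the log scale a 95% confidence interval spans about 2 × 1.96 standard errors, so the standard error, the Wald statistic, and a two-sided P value can be approximated from the published numbers alone. The sketch below is such a back-calculation under a normal approximation; it is not the authors' computation, and the function name is hypothetical.

```python
import math

def wald_from_ci(hazard_ratio, lower, upper, z_crit=1.959964):
    """Approximate SE, Wald z, and two-sided P from a hazard ratio and its 95% CI."""
    log_hr = math.log(hazard_ratio)
    # The CI is symmetric on the log scale: width = 2 * z_crit * SE.
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = log_hr / se
    # Two-sided tail probability of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p

se, z, p = wald_from_ci(1.48, 1.23, 1.78)
print(round(se, 3), round(z, 2), p < 0.001)
```

The resulting Wald statistic is a little above 4, which corresponds to a two-sided P value well below 0.001 and agrees with the reported significance level.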

Analysis of continuous markers

The markers analyzed so far demonstrated bimodal distributions. For ER and PgR as well as HER2, we observed two clearly different subgroups of samples in the cohorts. These results are in line with the widely accepted concept that these subgroups characterize biologically distinct subtypes of breast cancer [47, 48]. In contrast, Ki67 as well as other proliferation markers represent a different type of parameter. The observed distribution of Ki67 expression among the samples is not bimodal but rather continuous, as shown in Fig. 6a. Thus the approach used above for bimodal distributions does not seem to be appropriate for Ki67. The continuous distribution of Ki67 expression might suggest that, in contrast to the ER status, tumors with high and low Ki67 expression values do not represent distinct types of disease. The level of Ki67 expression could rather be a surrogate marker for the proportion of Ki67 expressing cells in the tumor sample and display a quantitative correlation with prognosis. Consequently, a multiple substratification according to Ki67 expression should result in multiple groups with a different clinical prognosis, contrary to the results obtained for the ER above in Fig. 3. As shown in Fig. 6b, ER negative breast cancers are generally characterized by high expression of Ki67, while the influence of the HER2 status on the distribution of Ki67 expression values was less pronounced, as demonstrated in Fig. 6c. Thus, to avoid a confounding effect, ER positive tumors need to be analyzed separately from ER negative tumors. We therefore performed a quartile split according to Ki67 expression in the ER positive cohort. As shown in Fig. 7, Kaplan–Meier analysis of the four groups of ER positive tumors suggests that the higher the level of Ki67, the worse the prognosis of the patients.
https://static-content.springer.com/image/art%3A10.1007%2Fs10549-009-0416-z/MediaObjects/10549_2009_416_Fig6_HTML.gif
Fig. 6

Distribution of Ki67 Affymetrix expression values in the combined dataset. a Distribution of Ki67 expression values (ProbeSets 212020_s_at–212023_s_at) among all 3,030 samples. b Distribution of Ki67 expression values stratified according to ER status among all 3,030 samples. c Distribution of Ki67 expression values stratified according to HER2 among all 3,030 samples

https://static-content.springer.com/image/art%3A10.1007%2Fs10549-009-0416-z/MediaObjects/10549_2009_416_Fig7_HTML.gif
Fig. 7

Disease free survival of patients with ER positive tumors according to the level of Ki67 expression. Patients with ER positive tumors were stratified in quartiles according to the level of Ki67 expression by microarray. Kaplan–Meier analyses of disease free survival of the 1,549 patients in the respective four groups are given
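
The quartile split and Kaplan–Meier analysis described for the ER positive cohort can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' analysis code: the function names `quartile_groups` and `kaplan_meier` and all data values are hypothetical, and the quartile boundaries use the "inclusive" interpolation method as an assumption.

```python
from statistics import quantiles


def quartile_groups(values):
    """Assign each value to a quartile group 1..4 (1 = lowest expression)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return [1 + (v > q1) + (v > q2) + (v > q3) for v in values]


def kaplan_meier(times, events):
    """Product-limit estimate of survival.

    times  -- follow-up time per patient
    events -- True if the patient relapsed at that time, False if censored
    Returns a list of (time, survival) steps at each time with an event.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        relapsed = 0  # events observed exactly at time t
        leaving = 0   # events plus censored patients at time t
        while i < len(data) and data[i][0] == t:
            relapsed += data[i][1]
            leaving += 1
            i += 1
        if relapsed:
            survival *= 1.0 - relapsed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical toy data: expression value, follow-up (months), relapse flag.
expression = [7.1, 7.9, 8.4, 8.8, 9.3, 9.9, 10.4, 11.2]
months = [96, 120, 80, 110, 60, 72, 30, 24]
relapse = [False, False, True, False, True, False, True, True]

groups = quartile_groups(expression)
for g in (1, 2, 3, 4):
    idx = [i for i, q in enumerate(groups) if q == g]
    print(g, kaplan_meier([months[i] for i in idx], [relapse[i] for i in idx]))
```

With real data each quartile would contain several hundred patients, and the four curves would be compared with a log-rank or trend test, which is omitted here.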

Discussion

In this study we used a data driven model for the definition of cutoffs without a priori information on the true result (e.g., the biochemical ER status). We derived the cutoff by fitting a mixture of two normal distributions to the data as the simplest approach. Obviously this simple model need not be correct or even a good model at all. However, we used this straightforward approach to avoid overfitting of the data. In contrast to previous studies [8], no external gold standard is necessary to derive a cutoff by this method. Nevertheless, the derived cutoffs for ER, PgR and HER2 showed high concordance with results from biochemical methods despite, e.g., widespread concerns about the inaccuracy of immunohistochemistry [49, 50]. The comparability of array normalized data from 25 different datasets was good, leading to similar cutoffs. When individual datasets were used as “training” sets, similar accuracies were obtained in the resulting validation sets (Supplementary Figure 11). Our observed accuracy for ER (91.6%, Table 2) was higher than the concordance between local and central IHC in a recent study from the ECOG 2197 trial (90%) [51]. In the same study the PgR sensitivity and specificity between local and central IHC were 80.7 and 88.6%, respectively. Concordance of PgR IHC between core biopsy and surgical samples has been reported to be 83% [52]. Thus the observed concordance of 71% for PgR status in our study cannot be considered satisfactory (75% sensitivity and 67% specificity, Table 3). However, similar results for gene expression have been obtained by others [53] and might be related to substantially lower mRNA expression levels for PgR than for ER (>20 fold lower mean MAS5 values, see Supplementary Figure S10). This problem might be overcome by using signatures combining several genes. Creighton et al. [16] have developed two gene signatures consisting of 182 and 1,005 genes for ER+/PR+ and ER−/PR− tumors, respectively, using the Rotterdam and Uppsala datasets. When they applied these signatures to classify the tumors from the same datasets they reached concordances of 89 and 84%, respectively.
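
The idea of deriving a cutoff from a two-component normal mixture, without reference to a gold standard, can be sketched as follows. This is an illustration under stated assumptions and not the authors' implementation: the mixture is fitted with a plain EM algorithm, the cutoff is taken as the point between the two means where the weighted component densities are equal (one plausible rule; the text above does not specify how the cutoff is read off the fitted mixture), and the data are simulated values, not array measurements. The names `fit_two_normals` and `cutoff_from_mixture` are hypothetical.

```python
import math
import random
import statistics


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_normals(values, iterations=200, min_sigma=1e-3):
    """Fit w*N(mu1, s1) + (1-w)*N(mu2, s2) by EM; returns (w, mu1, s1, mu2, s2)."""
    xs = sorted(values)
    n = len(xs)
    # Crude start: means of the lower and upper halves, common spread.
    mu1 = statistics.fmean(xs[: n // 2])
    mu2 = statistics.fmean(xs[n // 2:])
    s1 = s2 = max(statistics.pstdev(xs), min_sigma)
    w = 0.5
    for _ in range(iterations):
        # E-step: posterior probability that each sample belongs to component 1.
        resp = []
        for x in xs:
            a = w * normal_pdf(x, mu1, s1)
            b = (1.0 - w) * normal_pdf(x, mu2, s2)
            resp.append(a / (a + b) if a + b > 0.0 else 0.5)
        # M-step: re-estimate weight, means and standard deviations.
        n1 = sum(resp)
        n2 = n - n1
        if n1 < 1e-9 or n2 < 1e-9:
            break  # one component collapsed; keep the last usable estimate
        mu1 = sum(r * x for r, x in zip(resp, xs)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / n2
        v1 = sum(r * (x - mu1) ** 2 for r, x in zip(resp, xs)) / n1
        v2 = sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, xs)) / n2
        s1 = max(math.sqrt(v1), min_sigma)
        s2 = max(math.sqrt(v2), min_sigma)
        w = n1 / n
    return w, mu1, s1, mu2, s2


def cutoff_from_mixture(w, mu1, s1, mu2, s2):
    """Bisect between the two means for the point of equal weighted density.

    Assumes the components are separated well enough that exactly one
    crossing lies between the means.
    """
    lo, hi = min(mu1, mu2), max(mu1, mu2)

    def diff(x):
        return w * normal_pdf(x, mu1, s1) - (1.0 - w) * normal_pdf(x, mu2, s2)

    sign_lo = diff(lo) > 0.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if (diff(mid) > 0.0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Simulated log-scale expression values: 30% "negative", 70% "positive".
random.seed(0)
simulated = ([random.gauss(6.0, 1.0) for _ in range(300)]
             + [random.gauss(11.0, 1.0) for _ in range(700)])
params = fit_two_normals(simulated)
print("fitted parameters:", [round(p, 2) for p in params])
print("cutoff:", round(cutoff_from_mixture(*params), 2))
```

Because the fit uses only the expression values themselves, the biochemical status is needed only afterwards, to evaluate the concordance of the derived cutoff.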

The derived cutoff for HER2 expression led to a specificity of 97.9% but a rather low sensitivity of 59.0% among the 618 samples with biochemical HER2 status. However, the sensitivity differed markedly between datasets (range 32–100%, Table 4), and the possibility that some of the biochemical data are false positives might be considered. Gong et al. [8] used a training approach to optimize the HER2 cutoff among 195 samples, leading to 91% sensitivity and 95% specificity. This cutoff resulted in sensitivity and specificity of 79 and 94% as well as 100 and 88% in two different validation datasets. Microarray data of 133 of the 195 samples from this training set were available for our analysis (dataset MDA133). Importantly, among these 133 samples our distribution derived cutoff resulted in an 82% sensitivity, 99% specificity, and 95% overall accuracy. In other words, our method resulted in a slightly higher cutoff value than the training approach of Gong et al. [8]. However, for an analytical approach rather than a clinical test, this higher cutoff value might be preferable. It results in a specificity of 97.9% and a positive predictive value (PPV) of 89.1% among all 618 samples with biochemical HER2 status. Thus only 10.9% false positive HER2 tumors would be included in subsequent analyses when using this cutoff value. On the other hand, because of the negative predictive value (NPV) of 89.2%, those samples erroneously categorized as HER2 negative by this cutoff represent only 10.8% of the total number of samples in the larger group of HER2 negative tumors.
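
The relationship between sensitivity, specificity, PPV and NPV used in this paragraph can be made explicit with a small calculation. The four counts below are hypothetical: they are not taken from Table 4 but were chosen only so that they sum to 618 and the resulting percentages round to the values quoted above. The function name `diagnostic_metrics` is likewise an illustrative choice.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Return sensitivity, specificity, PPV, NPV and accuracy as fractions."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among biochemically positive
        "specificity": tn / (tn + fp),  # true negatives among biochemically negative
        "ppv": tp / (tp + fp),          # correct calls among array-positive samples
        "npv": tn / (tn + fn),          # correct calls among array-negative samples
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Hypothetical counts (NOT from the paper), summing to 618 samples.
metrics = diagnostic_metrics(tp=82, fn=57, fp=10, tn=469)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")

# Share of wrongly included tumors in each array-defined group.
print(f"false positives among array-positive: {100 * (1 - metrics['ppv']):.1f}%")
print(f"false negatives among array-negative: {100 * (1 - metrics['npv']):.1f}%")
```

The share of misclassified samples within each array-defined group is simply the complement of the corresponding predictive value, which is why a cutoff with high specificity keeps the HER2 positive group clean even when sensitivity is modest.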

Some microarray studies performed normalization across genes [54]. However, the expression of many genes is highly correlated. For example, it has been shown repeatedly that a large set of genes is strongly associated with the ER status in breast cancer [55, 56]. As we have demonstrated, “gene normalization” in cohorts with varying proportions of tumors differing in ER status distorts the distribution of expression values of such genes. If subsequent analysis steps involve a relative split of the cohort to stratify samples, this can lead to strange results [57]. Thus gene normalization is a very critical point when combining or comparing datasets.
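
The distortion described above can be illustrated with a toy calculation. The sketch assumes that "gene normalization" means centering and scaling each gene within a cohort (one common form; the exact procedure in the cited studies may differ), and it uses two hypothetical cohorts in which an ER-associated gene takes one fixed value in ER positive and another in ER negative tumors. All values and the name `gene_normalize` are illustrative.

```python
import statistics


def gene_normalize(values):
    """Center one gene to mean 0 and scale it to SD 1 within a single cohort."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical expression of an ER-associated gene (arbitrary log-scale units).
ER_POS, ER_NEG = 11.0, 6.0

cohort_a = [ER_POS] * 80 + [ER_NEG] * 20  # 80% ER positive tumors
cohort_b = [ER_POS] * 40 + [ER_NEG] * 60  # 40% ER positive tumors

norm_a = gene_normalize(cohort_a)
norm_b = gene_normalize(cohort_b)

# The first sample in each cohort is ER positive, the last is ER negative.
print("ER positive tumor, cohort A:", round(norm_a[0], 2))
print("ER positive tumor, cohort B:", round(norm_b[0], 2))
print("ER negative tumor, cohort A:", round(norm_a[-1], 2))
print("ER negative tumor, cohort B:", round(norm_b[-1], 2))
```

In this toy setting the same raw value of 11.0 maps to 0.5 in cohort A but to about 1.22 in cohort B, so a common cutoff applied to the pooled normalized values would no longer separate the same biological groups in both cohorts.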

Other studies have also characterized bimodal markers from microarray datasets [58–60]. Some suggested that bimodal “switch-like” genes differ from non-bimodal genes in transcriptional regulation [61]. Assuming a precise quantitation by microarrays, the different mRNA levels can result either from the level in individual cells or from the proportion of cells in the sample (or both). Breast cancer is a heterogeneous disease containing subtypes with different clinical behavior. It has been suggested that such distinct cancer subtypes may be derived from distinct progenitor cells which are arrested in their maturation [62–66]. When we analyzed the prognosis of subgroups, we observed an essential difference between bimodal parameters like ER, PgR, and HER2 on the one hand and the continuous marker Ki67 on the other. The bimodal markers seem to stratify distinct subtypes of tumors, as revealed by their distinct clinical follow up. In contrast, Ki67 did not define two distinct subtypes. Instead we observed a continuous relationship: the higher the expression, the worse the prognosis. From immunohistochemical studies it is known that the proportion of Ki67 expressing cells is relatively low, with a median of 16–17% [41, 42]. It is not clear whether this represents a snapshot of transiently cycling cells or whether Ki67 expression defines a distinct type of carcinoma cells which differ in their differentiation state. Regardless of this question, it seems reasonable that the level of Ki67 mRNA measured by microarray predominantly results from the proportion of cells expressing the gene rather than the level of expression in the individual cells.
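
The two possible sources of a bulk mRNA signal can be separated in a simple linear mixing model. This is only an illustrative sketch: the symbols m, p, e_+ and e_- are introduced here and do not appear in the study, and the model assumes that the measured signal is linear in mRNA abundance.

```latex
m = p \, e_{+} + (1 - p) \, e_{-}
```

Here m is the measured bulk signal, p the proportion of cells expressing the gene, e_+ the mean expression per expressing cell, and e_- the mean expression in the remaining cells. If e_- is close to zero and e_+ varies little between tumors, m is approximately proportional to p, which corresponds to the interpretation that the Ki67 signal mainly reflects the proportion of expressing cells.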

In summary, our data demonstrate that pooling of microarray datasets can be recommended to enlarge sample size and to refine cutoffs derived from the data. Critical pitfalls which have to be considered include the introduction of bias from gene normalization, which has often been applied to adjust different platforms.

Acknowledgments

We thank Samira Adel and Katherina Kourtis for expert technical assistance and anonymous reviewers for their insightful suggestions. This work was supported by grants from the Deutsche Krebshilfe, the Margarete Bonifer-Stiftung, Bad Soden, the BANSS-Stiftung, Biedenkopf, and the Dr. Robert Pfleger-Stiftung, Bamberg. The efforts of the IGC and expO [31] are gratefully acknowledged.

Supplementary material

10549_2009_416_MOESM1_ESM.pdf (428 kb)
Supplementary material 1 (PDF 429 kb)

Copyright information

© Springer Science+Business Media, LLC. 2009