Background

HER2/neu (also known as c-erbB-2) is a member of the ErbB protein family, more commonly known as the epidermal growth factor receptor (EGFR) family. The HER2 protein is a cell membrane surface-bound receptor tyrosine kinase that is involved in signal transduction pathways leading to cell growth and differentiation [1]. HER2 is a proto-oncogene located on the long arm of human chromosome 17 (17q11.2-q12). Overexpression of the protein, typically caused by amplification of the HER2 gene, leads to constitutive activity of the HER2 receptor and breast tumor development through enhanced cell proliferation, survival, motility and adhesion [2]. HER2 gene amplification has been reported in 10–35% of invasive breast carcinomas, and it is associated with an aggressive disease course, increased disease recurrence, and decreased disease-free and overall survival in lymph node-positive patients [25]. In addition to its prognostic role, HER2 has now become more important as a predictive marker of treatment response to Trastuzumab, a humanized murine monoclonal antibody to the HER2 protein. In 1998, Trastuzumab (marketed as Herceptin, Genentech, San Francisco, California, USA) was approved for the targeted therapy of HER2-overexpressing metastatic breast cancer patients by the Food and Drug Administration (FDA) of the USA, and it has also recently been shown to be very effective in the adjuvant setting [2].

The effectiveness of Herceptin therapy depends on accurately evaluating HER2 status, which can be done either by immunohistochemical (IHC) assessment of HER2 protein expression or by evaluating HER2 gene amplification using in situ hybridization (ISH), most commonly fluorescent ISH (FISH). FISH shows excellent sensitivity and specificity in detecting HER2 gene amplification [6]. IHC assessment of HER2 status is an inexpensive and relatively standardized method that can be performed in all pathology laboratories. Of the various HER2 antibodies available, the FDA-approved Dako HercepTest (Dako, Glostrup, Denmark) has been considered the most reliable [7]. However, new antibodies such as Ventana PATHWAY anti-HER2/neu (4B5) rabbit monoclonal antibody also provide excellent sensitivity, specificity, and inter-laboratory reproducibility [8]. Based on the determination of staining intensity and percentage of cells with complete membrane staining, the results are scored semi-quantitatively on a range of 0 to 3+. According to these four-tier criteria, 0 and 1+ scores are considered negative, a 3+ score is positive, while 2+ is equivocal (weakly positive) and requires confirmation by FISH [9–11]. The intraobserver reproducibility is generally satisfactory for both the percentage of positive cells and membrane staining [12–15]. The inter-observer agreement is excellent for scoring classes 0, 1+ and 3+ [11, 16–19]. However, the determination of staining intensity and percentage of cells with complete membrane staining is subjective. This results in high inter-observer variability in assigning a 2+ score [11, 17, 20, 21] and in discriminating between 2+ and 3+ classes [12]. Consequently, this leads to a high rate of false-positives for intermediate IHC scores [22–24]. According to the HercepTest guidelines, cases with more than 10% of tumor cells showing strong circumferential membrane staining are classified as 3+. The American Society of Clinical Oncology/College of American Pathologists (ASCO/CAP) guidelines recommend using a 30% cut-off, in order to decrease the incidence of false positive cases [25].

It has been suggested that the use of digital microscopy improves the accuracy and inter-observer reproducibility of HER2 IHC analysis. Digital measurement of staining intensity is more accurate than assessment with a human eye because it is not influenced by factors such as the ambient light or pathologist fatigue [26, 27]. We have recently shown that automated quantitation of estrogen receptor (ER) immunostaining yields results that do not differ from human scoring against dextran-coated charcoal biochemical assay and the most important clinico-pathologic correlate, patient outcome [28]. Consistent, objective and reproducible results for HER2 assessment can be generated by a number of available automated scoring systems such as the automated cellular imaging system (ACIS) (ChromaVision, Inc, San Juan Capistrano, California, USA) [29, 30] optimized for use with Dako HercepTest, Micrometastasis Detection System (MDS, Applied Imaging, San Jose, California, USA) [31], Extended Slide Wizard (Tripath Imaging, Inc. Burlington, North Carolina, USA) and others [32–34].

To determine the inter-observer variability, we compared the results of visual and automated scoring of HER2 immunostaining on tissue microarrays (TMAs) constructed from invasive breast carcinomas. FISH data were available for 1,413 cases, and 616 cases were scorable by both IHC and FISH. We then evaluated the concordance of IHC and FISH results and performed Kaplan-Meier survival analysis to determine the prognostic significance of different analyses of HER2 status.

Methods

In this study, we used IHC data from 1,212 patients and FISH data from 616 patients. The data were derived from a series of 4,046 cases of invasive breast carcinoma diagnosed in 1986–1992, referred to the British Columbia Cancer Agency (BCCA) for treatment, and assembled into 17 tissue microarray (TMA) blocks. Ethical approval for the study was obtained from the Clinical Research Ethics Board of the BCCA [28]. Previously frozen breast cancer tissue samples were fixed in 10% neutral buffered formalin, embedded in paraffin and used to construct TMAs consisting of 0.6 mm tissue cores using a manual arrayer (Beecher Instruments, Inc., Silver Springs, Maryland, USA) as previously described [35, 36].

From each TMA block, 4 μm thick sections were cut and immunostained on Ventana Benchmark XT staining system (Ventana Medical Systems, Tucson, Arizona, USA). Sections were deparaffinized in xylene, dehydrated through three alcohol changes and transferred to Ventana Wash solution. Endogenous peroxidase activity was blocked in 3% hydrogen peroxide. Slides were then incubated with Ventana PATHWAY anti-HER2/neu (4B5) rabbit monoclonal antibody at 37°C for 32 min and developed in DAB for 10 min. Finally, sections were counterstained with hematoxylin and mounted.

HER2 was scored visually by two independent pathologists (BG, GT) according to the HercepTest guidelines: 0 (negative): no staining is observed, or membrane staining is observed in <10% of the tumor cells; 1+ (negative): a faint/barely perceptible membrane staining is detected in >10% of tumor cells; the cells exhibit incomplete membrane staining; 2+ (weakly positive, equivocal): a weak to moderate complete membrane staining is observed in >10% of tumor cells; and 3+ (strongly positive): a strong complete membrane staining is observed in >10% of tumor cells. Only six 3+ cases (0.5%) showed heterogeneous staining, i.e. would have been interpreted as 2+ by ASCO/CAP guidelines. Therefore, the scoring system used in this study would not impact the results and conclusions. Scores were entered into a standardized Excel worksheet with a sector map matching each TMA section. Cases were not included in the statistical analysis if there was no tumor tissue in the cores or the cores were cut through. Original scoring grids were converted to tables using Deconvoluter 1.10 [37] and combined in a single text file with TMA-Combiner 1.00 [38]. The resulting text files were imported into SPSS 15.0 and R 2.4.0 for Windows [39].
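The four-tier criteria above can be written out as a small decision rule. The following is only an illustrative sketch, not the procedure used by the pathologists or by the Ariol software: the `MembraneStaining` record, its field names and the intensity labels are hypothetical, and two points the criteria leave open are resolved by assumption (a value of exactly 10% is treated as passing the cut-off, and strong but incomplete staining falls through to 1+).

```python
from dataclasses import dataclass


@dataclass
class MembraneStaining:
    """Hypothetical per-core summary; names are illustrative only."""
    percent_stained: float  # % of tumor cells with membrane staining
    intensity: str          # "none", "faint", "weak_moderate" or "strong"
    complete: bool          # True if the membrane staining is circumferential


CUTOFF_PERCENT = 10.0  # HercepTest cut-off quoted in the text


def herceptest_score(s: MembraneStaining) -> str:
    """Map a staining summary to 0, 1+, 2+ or 3+ (sketch of the quoted criteria)."""
    # 0: no staining, or staining in fewer than 10% of tumor cells.
    if s.intensity == "none" or s.percent_stained < CUTOFF_PERCENT:
        return "0"
    # 3+: strong complete membrane staining.
    if s.intensity == "strong" and s.complete:
        return "3+"
    # 2+: weak to moderate complete membrane staining.
    if s.intensity == "weak_moderate" and s.complete:
        return "2+"
    # 1+: faint or incomplete staining (assumption: every remaining case).
    return "1+"


if __name__ == "__main__":
    core = MembraneStaining(percent_stained=40.0, intensity="weak_moderate", complete=True)
    print(herceptest_score(core))
```

In this sketch only the 2+ class would be referred for FISH confirmation; 0 and 1+ are read as negative and 3+ as positive.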

The same slides were digitized with the commercial image analysis system Ariol (Applied Imaging Inc., San Jose, California, USA). For clinical lab applications, Ariol has received FDA clearance as an aid to pathologists in the detection, classification, and counting of cells of a particular color, intensity, size, pattern, and shape. Applied Imaging has received additional FDA 510(k) clearances for specific applications, including immunohistochemical assessment of HER2 in breast cancer. The Ariol system is based on an Olympus microscope with motorized stage and autofocus capabilities, and equipped with a black and white video camera. We regularly performed bright-field calibration using the Calibration slide to ensure accurate scanning and analysis. The system was set to Köhler illumination to capture high quality images. Slides were scanned at 20× objective magnification with three filters: red, green and blue. Ariol software, which converts these three-channel images into color reconstructions, was used for image analysis. The program was trained by a pathologist (DT) using representative cores containing areas that would be scored as 1+ and 3+ visually. Using the color pickup tool within the Ariol image analyzer, we selected membranes with weak positive staining and assigned "1+ intensity"; we then selected the membranes with strong positivity and assigned "3+ intensity". Similarly, we selected counterstained nuclei with the color pickup tool, and adjusted the desired size, roundness and other shape parameters under visual control. Numeric values for colors of the positive objects, i.e. membranes, and negative objects, i.e. nuclei, were stored on the hard drive in a color classifier file. Numeric values for the shape of the nuclei were stored in a separate shape classifier file. The program used these two files for segmentation of the nuclei and the membranes in all other cores, and the same two files were transferred to the second Ariol system (machine 2).
Scores from "0" to "3+" were automatically generated by the Ariol image analysis software for each core, based on the intensity and completeness of the positively stained membranes, and the percentage of positive cells. The Ariol algorithm applies HercepTest criteria for the score calculations. Visual examples and a graphical explanation are given in Figure 1. The training step increases the specificity of the analysis as it ensures that extracellular matrix and most stromal cells are excluded from image analysis, and it allows the program to calculate the percentage of positive tumor cells more precisely. After the program was trained on one of the representative TMA cores, the rest of the analysis was performed without human supervision. All tissue cores were analyzed in toto; no specific pathologist selection of tumor tissue within the cores was made following the training step. For statistical analysis, we selected only cores with at least 50 tumor cells detected, i.e. all cores with fewer than 50 cells were considered unscorable. To get an estimate of the demands posed on the operator of the Ariol system, the same slides were scanned and processed on an identical Ariol system by an operator with less than one week of experience working with this particular Ariol script (KM). The descriptors of the color and shape of the positive and negative tumor cells were transferred from one system to the other; therefore, variations in the image analysis results depended only on the scanner settings, i.e. brightfield calibration, positioning and white balance, but not on the image analysis settings.

Figure 1

Schematic illustration of automated HER2 scoring. a) Image analysis system Ariol (Applied Imaging Inc., San Jose, CA). b) Training window displaying the 3+ membrane and nuclear colors with fill mask. c) Outline of membrane as detected by the color classifier for the 3+ membrane color class. d) The border mask of nuclei as detected by the color classifier for the 3+ nuclei color class.

The hematoxylin and eosin and IHC images of all cores used in this study are publicly available at the companion site [40]. The site was constructed using Genetic Pathology Evaluation Centre (GPEC) database and a Java applet provided by Bacus Laboratories, Inc. All slides were scanned with a BLISS scanner (Bacus Laboratories, Inc., Lombard, Illinois, USA), and posted on the site. WebSlide Browser for Windows (Bacus Laboratories, Inc., Lombard, Illinois, USA) can be used for viewing preview images of the arrays and images of individual cores.

Six-micron sections of the TMA slides were hybridized with probes to LSI HER2 and CEP17 with the PathVysion™ HER2 DNA Probe Kit using a modified protocol, as previously described [41]. Analysis of FISH signals was performed using the Metasystems™ automated image acquisition and analysis system, Metafer (Metasystems, Altlussheim, Germany). This automated system scores FISH signals by employing specific measurement algorithms to detect and quantify clustered signals. The average copy number for each probe was calculated and the amplification ratio (the ratio of the average HER2 copy number per cell to the average centromere 17 copy number) was determined (MC). HER2 amplification was defined as a HER2/CEP17 ratio of 2.2 or more. A HER2/CEP17 ratio <1.8 was considered negative for HER2 amplification, and a ratio at or near the cut-off (1.8–2.2) was interpreted as equivocal. Tumors that failed to hybridize were not included in the analysis. We only accepted scores if >40 tiles were counted. With the Metafer system, one tile is considered one cell, as the size of a tile is approximately the average size of a nucleus. Normal cells were excluded wherever possible, and the corresponding H&E slides were reviewed when needed.
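The FISH cut-offs described above amount to a simple threshold rule on the HER2/CEP17 ratio. The sketch below illustrates that rule only; the function name, argument names and status labels are hypothetical and are not taken from the Metafer software. The thresholds (2.2, 1.8) and the minimum of more than 40 tiles are those stated in the text.

```python
def fish_status(mean_her2_copies: float, mean_cep17_copies: float, tiles_counted: int) -> str:
    """Classify a core from average probe copy numbers (illustrative sketch)."""
    # Scores were only accepted if more than 40 tiles (about one cell each) were counted.
    if tiles_counted <= 40 or mean_cep17_copies <= 0:
        return "unscorable"
    ratio = mean_her2_copies / mean_cep17_copies
    if ratio >= 2.2:
        return "amplified"
    if ratio < 1.8:
        return "non-amplified"
    return "equivocal"  # ratio in the 1.8-2.2 band


if __name__ == "__main__":
    # Hypothetical copy numbers, chosen only to exercise each branch.
    print(fish_status(6.0, 2.0, tiles_counted=60))  # ratio 3.0
    print(fish_status(4.0, 2.0, tiles_counted=60))  # ratio 2.0
    print(fish_status(2.0, 2.0, tiles_counted=60))  # ratio 1.0
    print(fish_status(6.0, 2.0, tiles_counted=30))  # too few tiles
```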

For statistical analysis, we used data from 1,212 patients for the IHC and 616 patients for the IHC/FISH comparisons. Exclusion criteria included core drop-off during processing, insufficient or absent tumor tissue within the cores, and artifactual distortion of the tissue making discrimination of cellular structure impossible. Statistical analysis was performed in SPSS 15.0 for Windows (SPSS Inc., Chicago, Illinois) and R 2.4.0 [39]. All tests were two-sided and used a 5% alpha level to determine significance. 95% bootstrapped confidence intervals were calculated using the adjusted bootstrap percentile (bias-corrected and accelerated) method [42]. Breast cancer specific survival was estimated using Kaplan-Meier curves and survival differences were determined by log-rank tests. We used the open-source R 2.4.0 package to calculate differences between kappa statistics from visual to automated scoring comparisons; a permutation test with 10,000 permutations was implemented.
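For readers unfamiliar with the Kaplan-Meier estimator mentioned above, a minimal sketch of the product-limit calculation follows. It is not the SPSS or R code used in the study; the function name and the toy follow-up data are hypothetical. At each distinct event time the running survival estimate is multiplied by (1 − deaths / number at risk), and censored patients simply leave the risk set.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.

    times  -- follow-up time for each patient
    events -- True if the patient died of disease at that time, False if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share this follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored cases both leave the risk set
    return curve


if __name__ == "__main__":
    # Toy data (years of follow-up); not taken from the study cohort.
    print(kaplan_meier([1, 2, 3, 4], [True, False, True, True]))
```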

Results

IHC and FISH results

The number of cases scorable by all four observers (visual or machine) on IHC slides, regardless of FISH status, was 1,212 (30%). Of 4,046 cases analyzed, FISH was successfully performed in 1,413 cases (34.9%). Of 1,413 FISH scorable cases, HER2 was amplified (HER2/CEP17 ratio of 2.2 or more) in 252 cases (17.8%). Borderline HER2 amplification (HER2/CEP17 ratio 1.8–2.2) was seen in 77 cases (5.4%), and 1,084 cases (76.7%) were found to be non-amplified (HER2/CEP17 ratio <1.8). The number of cases scorable by both IHC and FISH, including FISH equivocal cases, was 616. Table 1 shows the full breakdown of data by FISH and IHC scored by the four observers.
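As a quick arithmetic check, the counts quoted above can be recombined to confirm that the three FISH categories sum to the number of FISH-scorable cases and that the stated percentages follow from the counts. The variable names are illustrative; the numbers are those given in the text.

```python
TOTAL_CASES = 4046
IHC_SCORABLE = 1212
FISH_SCORABLE = 1413
FISH_COUNTS = {"amplified": 252, "equivocal": 77, "non-amplified": 1084}


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    # The three FISH categories should account for every FISH-scorable case.
    print(sum(FISH_COUNTS.values()) == FISH_SCORABLE)
    print(percent(IHC_SCORABLE, TOTAL_CASES), percent(FISH_SCORABLE, TOTAL_CASES))
    for label, count in FISH_COUNTS.items():
        print(label, percent(count, FISH_SCORABLE))
```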

Table 1 Comparison of FISH and IHC results in 616 cases

Analysis of HER2 IHC inter-observer variability by Kappa statistics

Inter-observer variability was estimated by comparing the visual scores of two pathologists, and the automated scores generated by two operators on two different Ariol hardware systems. Comparison of categorized variables ({0, 1+} versus {2+} versus {3+}) from 1,212 patients using weighted kappa statistics (R function wkappa(ψ) using squared weights) showed excellent inter-observer agreement: for visual 1 versus visual 2 scores, kappa = 0.929 (95% CI: 0.909–0.946), visual 1 versus machine 1 scores, kappa = 0.835 (95% CI: 0.806–0.862), and visual 2 versus machine 1 scores, kappa = 0.837 (95% CI: 0.81–0.862); good agreement was seen between machine 2 and visual 1, kappa = 0.698 (95% CI: 0.672–0.723), machine 2 and visual 2, kappa = 0.709 (95% CI: 0.684–0.732), and machine 1 and machine 2, kappa = 0.806 (95% CI: 0.785–0.826) (Table 2).
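To make the statistic concrete, the sketch below computes a weighted kappa with squared (quadratic) disagreement weights for two observers after collapsing the IHC scores into the three categories used above. It is a plain re-implementation of the standard formula for illustration, not the R function used in the study; the function names, the category mapping dictionary and the toy scores are assumptions.

```python
# Collapse the four IHC scores into the three ordered categories used in the text.
CATEGORY = {"0": 0, "1+": 0, "2+": 1, "3+": 2}


def weighted_kappa(ratings_a, ratings_b, n_categories=3):
    """Cohen's kappa with squared disagreement weights for two raters."""
    n = len(ratings_a)
    observed = [[0] * n_categories for _ in range(n_categories)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_categories)) for j in range(n_categories)]
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            weight = (i - j) ** 2
            observed_disagreement += weight * observed[i][j]
            # Count expected under independence of the two raters.
            expected_disagreement += weight * row_totals[i] * col_totals[j] / n
    # Undefined (division by zero) if both raters use one and the same category throughout.
    return 1 - observed_disagreement / expected_disagreement


if __name__ == "__main__":
    # Toy scores for four cores; not study data.
    visual = [CATEGORY[s] for s in ["0", "2+", "3+", "1+"]]
    machine = [CATEGORY[s] for s in ["1+", "2+", "3+", "2+"]]
    print(weighted_kappa(visual, machine))
```

With squared weights, a disagreement between negative and 3+ counts four times as much as a disagreement between adjacent categories, which is why confusions involving the 2+ class lower kappa less than outright negative/positive reversals.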

Table 2 Weighted Kappa statistics on the whole cohort for comparison of inter-observer concordance for categorized HER2 IHC variables (n = 1212)

When comparing binarized IHC scores (0, 1+ {negative} versus 3+ {positive}) in a set of 849 patients (363 cases with 2+ scores were excluded), the kappa values were within 'excellent' agreement range: for two visual scores, kappa = 1.000 (95% CI: 1-1); for two machine scores, kappa = 1.000 (95% CI: 1-1); for visual 1 versus both machine scores, kappa = 0.898 (95% CI: 0.775–0.979); and for visual 2 versus both machine scores, kappa = 0.898 (95% CI: 0.775–0.979), (Table 3).

Table 3 Kappa statistics for comparison of inter-observer concordance for binarized HER2 IHC variables (n = 849)

We also performed a kappa permutation test to assess whether the HER2 IHC scores differed in their ability to match the gold standard. This test included categorized variables (n = 352) to assess the ability of the HER2 score to indicate negative (0, 1+) versus equivocal (2+) versus positive (3+) cases where the visual 1 IHC score is the gold standard (Table 4). The permutation test could not be done for binarized IHC scores because there were only 229 cases available for analysis when visual 1 IHC was used as the gold standard, and 382 cases were available when FISH was used as the gold standard. There were no discrepant cases between visual 1 and visual 2, and only one discrepant case between both visual scores and both machines.

Table 4 Permutation test to determine the inter-observer variability for categorized IHC variables (n = 352)

Concordance of IHC and FISH results by Kappa statistics

The concordance of IHC and FISH results was analyzed using binarized and categorized variables by Kappa statistics. When comparing categorized IHC scores (0, 1+ (negative) versus 2+ (equivocal) versus 3+ (positive)) with FISH results in a set of 616 patients, the agreement was excellent for visual 1 (kappa = 0.814, 95% CI: 0.768–0.856), good for visual 2 (kappa = 0.763, 95% CI: 0.712–0.81) and machine 1 (kappa = 0.665, 95% CI: 0.609–0.718), while machine 2 showed moderate agreement with FISH results (kappa = 0.535, 95% CI: 0.485–0.584) (Table 5).

Table 5 Concordance of IHC and FISH results by Kappa statistics

When comparing binarized IHC scores (0, 1+ {negative} versus 3+ {positive}) and FISH results in a set of 382 patients (234 cases with 2+ scores were excluded), FISH data only showed fair agreement with all four IHC scores: visual 1 (kappa = 0.328, 95% CI: 0.0955–0.537), visual 2 (kappa = 0.328, 95% CI: 0.0914–0.538), machine 1 (kappa = 0.343, 95% CI: 0.101–0.558), and machine 2 (kappa = 0.343, 95% CI: 0.0935–0.555) (Table 5). This was likely caused by the large number of 2+ scores excluded (n = 234) and the low number of 3+ scores (n = 6) available for this analysis. Therefore, the proportion of HER2-positive and HER2-negative cases was not fairly represented for the concordance analysis of the binarized data.

The clinical consequences of using a machine for HER2 scoring are summarized in Table 6. Automated scoring on the Ariol machine would result in more 2+ scores (2–3 times as many as visual scoring) with a consequent increase of FISH assessments in clinical practice.

Table 6 Comparison of automated IHC scores with visual scores and FISH results

Kaplan-Meier survival analysis

For 1,212 patients whose tissue cores were scorable by all four observers on IHC slides, median age at diagnosis was 59 years, and median follow-up time was 12.24 years. Clinical-pathological characteristics of these patients are summarized in Table 7.

Table 7 Clinical-pathological characteristics of 1212 patients

Kaplan-Meier survival analysis of cases stratified based on the HER2 status, as determined by visual or machine scoring of the immunostained slides, is shown in Figure 2. Results of the log-rank tests with P values in a set of 1,210 patients (outcome information was not available in 2 cases), stratified as 0 (negative), 1+ (weak), 2+ (equivocal) and 3+ (positive) are as follows: visual scoring 1 χ² = 60.281, P = 5.12 × 10⁻¹³; visual scoring 2 χ² = 56.037, P = 4.13 × 10⁻¹²; machine scoring 1 χ² = 57.453, P = 2.06 × 10⁻¹²; machine scoring 2 χ² = 62.232, P = 1.96 × 10⁻¹³ (Figure 2). After binarization of the scores as either HER2-positive or HER2-negative in a set of 848 patients, the results of the log-rank test were: visual scoring 1 χ² = 26.245, P = 3.01 × 10⁻⁷; visual scoring 2 χ² = 26.245, P = 3.01 × 10⁻⁷; machine scoring 1 χ² = 56.757, P = 4.93 × 10⁻¹⁴; machine scoring 2 χ² = 56.757, P = 4.93 × 10⁻¹⁴ (Figure 3).
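The relationship between the log-rank χ² statistics and the P values quoted above can be illustrated with the closed-form upper tail of the chi-square distribution. The sketch assumes the degrees of freedom equal the number of strata minus one (3 for the four-tier scores, 1 for the binarized scores); this is an assumption on our part, since the text does not state the degrees of freedom, although the reported P values appear consistent with it. The function name is hypothetical and only the standard library is used.

```python
import math


def chi2_upper_tail(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variate, for df = 1 or df = 3 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 3:
        # Closed form from the incomplete gamma recurrence for 3/2.
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("this sketch only covers df = 1 and df = 3")


if __name__ == "__main__":
    # Chi-square statistics quoted in the text, paired with the assumed df.
    for label, stat, df in [
        ("visual 1, four-tier", 60.281, 3),
        ("visual 2, four-tier", 56.037, 3),
        ("visual 1, binarized", 26.245, 1),
        ("machine 1, binarized", 56.757, 1),
    ]:
        print(f"{label}: chi2 = {stat}, df = {df}, P = {chi2_upper_tail(stat, df):.3g}")
```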

Figure 2

Kaplan-Meier survival analysis performed on the data categorized as negative (0, 1+), equivocal (2+) and positive (3+) (n = 1210). a) Visual scoring #1. b) Visual scoring #2. c) Automated system #1. d) Automated system #2.

Figure 3

Kaplan-Meier survival analysis performed on the binarized data (negative {0, 1+} and positive {3+}) (n = 848). a) Visual scoring #1. b) Visual scoring #2. c) Automated system #1. d) Automated system #2.

The permutation analysis in a set of 615 patients (outcome information was not available for one patient) showed that the differences in prognostic significance of these different analyses of HER2 status are not statistically significant, i.e. visual and machine scoring show similar results for categorized variables (Table 8). The permutation analysis could not be performed for binarized variables because, after excluding 2+ scores, only 382 cases were available for analysis; there were no discrepant cases between the visual scores or between the machine scores, only 1 discrepant score between visual 1 and machine 1, and 19 discrepant scores between visual 1 and FISH results.

Table 8 Permutation test to compare the differences between categorized IHC and FISH results using survival outcome as the gold standard (n = 615)

Discussion

In breast cancer patients, determination of prognosis and treatment strategies based on HER2 status greatly depends on the accurate evaluation of HER2 overexpression by IHC and/or FISH. HER2 immunohistochemistry is an inexpensive method that can be performed readily in all pathology laboratories on either standard paraffin sections or TMA sections [43]. However, the best methods, reagents, and cut-off points for determining HER2 status are still debated [25, 28, 44–46]. TMAs are useful for the assessment of automated unsupervised image analysis systems because of the careful selection of the areas of interest, the identical staining conditions for all cores on a single slide, and the small size of the tissue cores, each of which can be represented by a single image [37, 38, 47]. Problems inherent in TMA studies include sampling of non-cancerous areas and loss of cores during the staining procedure. We analyzed the results of visual (two pathologists) and automated (two operators on the Ariol image analysis system) scoring of HER2 immunostaining. Since only cores with at least 50 tumor cells detected were considered scorable on the Ariol system, the number of cases scorable by all four observers was 1,212. FISH was successfully performed in 1,413 cases (34.9%) with an amplification rate of 17.8%, which is within the reported range of 10–35% [25].

When using the four-tier criteria for HER2 IHC (0 and 1+ negative, 3+ positive, and 2+ equivocal), the inter-observer agreement is usually excellent for negative (0, 1+) and positive (3+) cases [11, 16–19]. To estimate the inter-observer variability in our study, we analyzed the results of two visual and two automated scores. When comparing binarized IHC scores, the inter-observer agreement was excellent between the two pathologists (kappa = 1.000, 95% CI: 1–1), between the two machines (kappa = 1.000, 95% CI: 1–1), and between both visual and both machine scores (kappa = 0.898, 95% CI: 0.775–0.979). This suggests that the Ariol automated system can be used successfully for scoring clearly positive or negative cases, whereas equivocal cases will always need follow-up through pathologist review and/or FISH.

Since the evaluation of staining intensity and percentage of cells with complete membrane positivity is subjective, the inter-observer variability tends to be higher for scoring 2+ cases [11, 17, 20, 21] and discriminating 1+ and 2+ [48] or 2+ and 3+ cases [12]. The percentage of disagreement in intraobserver reproducibility ranges from 0.9% to 3.7%. It is recommended that two expert pathologists evaluate all slides with a double-blind method and discuss discordant cases [49]. In our study, the inter-observer agreement was excellent for categorized variables (0, 1+ versus 2+ versus 3+) between the two pathologists (kappa = 0.929, 95% CI: 0.909–0.946). The first machine scores also showed excellent agreement with both pathologists (kappa = 0.835, 95% CI: 0.806–0.862; kappa = 0.837, 95% CI: 0.81–0.862). The worst concordance for categorized variables was observed between the second machine, operated by a less experienced operator, and either pathologist 1 (kappa = 0.698, 95% CI: 0.672–0.723), pathologist 2 (kappa = 0.709, 95% CI: 0.684–0.732) or the first machine scores (kappa = 0.806, 95% CI: 0.785–0.826). Although these kappa values are still considered to indicate good agreement, it is likely that lack of experience in operating the Ariol system using particular scripts can influence the results of automated scoring for categorized variables. However, the results of the IHC analysis for categorized scores by either pathologists or machines demonstrated similar accuracy in assessment of prognostic significance of HER2 expression in Kaplan-Meier analysis.

Discrepancies between HER2 IHC and FISH results are not uncommon and may be caused by errors in manual IHC interpretation, IHC reagent limitations [50, 51], different anti-HER2 primary antibodies [48, 52–57], and a lack of interlaboratory standardization of IHC and of reproducibility in interpretation of the results [58, 59]. When comparing categorized IHC scores and FISH results, only pathologist 1 showed excellent agreement with FISH results (kappa = 0.814, 95% CI: 0.768–0.856). There was good agreement between FISH and pathologist 2 scores (kappa = 0.763, 95% CI: 0.712–0.81), and machine 1 scores (kappa = 0.665, 95% CI: 0.609–0.718), while the less experienced operator showed moderate agreement with FISH results (kappa = 0.535, 95% CI: 0.485–0.584). In addition to the amount of experience working with particular Ariol scripts, variations in the image analysis results may depend on the scanner settings, such as calibration, positioning and white balance, because the image analysis settings were transferred from the first Ariol system to the other without retraining the program. It should also be noted that HER2 gene amplification is not always accompanied by protein overexpression and vice versa. The poor prognosis associated with HER2 amplification may be attributed to global genomic instability, as cells with high frequencies of chromosomal alterations are associated with increased cellular proliferation and aggressive behavior. This suggests that HER2 amplification may serve as a surrogate marker for underlying genomic instability [60]. The discrepancy between FISH and IHC results can also be explained by technical and interpretational limitations such as failure to hybridize, the scoring algorithm on the Metafer system, and the small size of the TMA core, which may not be representative of the tumor. For categorized variables, comparison of log-rank tests with 10,000 permutations detected no significant differences among the four observers.
The two pathologists successfully distinguished negative, positive and equivocal cases, but automated scoring led to 2–3 times as many 2+ cases as visual scoring. This suggests that fully automated scoring, regardless of operator experience, does not provide better distinction of 2+ cases in our study. This is inconsistent with previously reported results that machine scoring of HER2 is reproducible for 2+ cases [61]. However, the latter study only analyzed 65 cases using an Extended Slide Wizard (Tripath Imaging, Inc., Burlington, North Carolina, USA) workstation running prototype software. In theory, computer-assisted image analysis should provide more accurate results for IHC quantitation, in comparison with semiquantitative scoring by a pathologist, as image analysis systems can measure the intensity of staining much more precisely than a human eye [62]. In practice, however, the accuracy of automated quantitative analysis depends on a variety of factors other than technical issues. Fully automated systems cannot distinguish between malignant and benign lesions with a precision comparable to the expertise level of a pathologist [63, 64], and require pathologist input to identify the area to be analyzed. Since the machine interprets most visual 3+ scores as 2+, it is likely that automated HER2 scoring on the Ariol system would result in more FISH assessments in clinical practice. The automated system also leads to more 1+ cases in comparison to visual scoring, which may lead to more FISH-amplified cases being scored as 1+ (negative). However, this would not change patient management for 0 and 1+ cases as these are both interpreted as negative.

Conclusion

The present study shows that fully automated image analysis with a system operated by an experienced operator, but without continuous human supervision, can provide results consistent with the scoring of HER2 immunostaining by pathologists. The inter-observer agreement was excellent between the two pathologists and between the experienced operator and the pathologists for both binarized and categorized HER2 scores, as well as between the two machines for binarized scores. There was good agreement between the two machines, and between the less experienced operator and the pathologists, for categorized HER2 scores. We have previously reported that automated quantitation of ER immunostaining on the same TMA series can produce results that do not differ from pathologist scoring and dextran-coated charcoal biochemical assay [28]. Unlike ER quantitation, automated scoring of HER2 staining on the Ariol system did not yield excellent agreement either between the two machine scores or with the gold standard, FISH. Although Kaplan-Meier analysis showed similar accuracy of visual and machine scores in assessment of prognostic significance of HER2 status for categorized IHC variables, the automated quantitation could not distinguish 2+ scores better than the pathologists. It resulted in more 2+ cases, which would lead to more FISH assessments in clinical practice. Further development of image analysis systems will likely improve the accuracy of detection and categorization of membranous staining in histological sections, making this technique more sensitive, specific and thus suitable for use in quality assurance programs.