Reproducibility and predictive value of scoring stromal tumour infiltrating lymphocytes in triple-negative breast cancer: a multi-institutional study

Several studies have demonstrated a prognostic role for stromal tumour infiltrating lymphocytes (sTILs) in triple-negative breast cancer (TNBC). The reproducibility of scoring sTILs is variable with potentially excellent concordance being achievable using a software tool. We examined agreement between breast pathologists across Europe scoring sTILs on H&E-stained sections without software, an approach that is easily applied in clinical practice. The association between sTILs and response to anthracycline-taxane NACT was also examined. Pathologists from the European Working Group for Breast Screening Pathology scored sTILs in 84 slides from 75 TNBCs using the immune-oncology biomarker working group guidance in two circulations. There were 16 participants in the first and 19 in the second circulation. Moderate agreement was achieved for absolute sTILs scores (intraclass correlation coefficient (ICC) = 0.683, 95% CI 0.601–0.767, p-value < 0.001). Agreement was less when a 25% threshold was used (ICC 0.509, 95% CI 0.416–0.614, p-value < 0.001) and for lymphocyte predominant breast cancer (LPBC) (ICC 0.504, 95% CI 0.412–0.610, p-value < 0.001). Intra-observer agreement was strong for absolute sTIL values (Spearman ρ = 0.727); fair for sTILs ≥ 25% (κ = 0.53) and for LPBC (κ = 0.49), but poor for sTILs as 10% increments (κ = 0.24). Increasing sTILs was significantly associated with an increased likelihood of a pathological complete response (pCR) on multivariable analysis. Increasing sTILs in TNBCs improves the likelihood of a pCR. However, inter-observer agreement is such that H&E-based assessment is not sufficiently reproducible for clinical application. Other methodologies should be explored, but may be at the cost of ease of application.


Background
The role of the immune system in the pathogenesis and clinical course of cancer is well established [1,2] and has received renewed attention with the success of immunotherapies for several solid organ cancers such as melanoma and lung cancer. The assessment of tumour infiltrating lymphocytes (TILs) within a tumour has been used as a surrogate measure of the immune response and several studies from the 1980s onwards have reported on the prognostic role of TILs in a variety of different organ systems [3][4][5]. Breast cancer has historically been regarded as a non-immunogenic tumour, although a dense lymphoid infiltrate has long been observed in the rare medullary subtype [6], which is associated with a favourable outcome despite its otherwise high-grade morphological features.
The stromal TIL (sTIL) component in breast cancer has been examined in a number of recent clinical studies and a prognostic role has been most consistently observed in triple-negative breast cancer (TNBC) and HER2-positive cancers compared to other subtypes in both the adjuvant and neo-adjuvant setting [7][8][9][10][11][12][13][14][15][16]. In two adjuvant series of TNBCs, each 10% incremental increase in sTILs was associated with a 14-19% reduction in risk for recurrence or death [13,15]. sTIL evaluation was included as a secondary endpoint of the Geparsixto trial and, similarly, incremental increases of sTILs were positively associated with a pathological complete response (pCR) in TNBC patients. In that study, tumours with a dense sTIL component, termed lymphocyte predominant breast cancer (LPBC), were associated with the highest pCR rate of 74% in patients who received carboplatin [7]. LPBCs, whilst not representing a specific subtype, have sTILs occupying over 50 or 60% of the stroma and are uncommon amongst breast cancers [7, 9-13, 17]. Gene expression-based analysis of TNBC also shows that the TIL component in TNBCs is highly correlated with an immune-rich expression profile that is favourably prognostic for relapse-free survival [18].
In order for the potential of a biomarker to be realised in clinical practice, it must meet standards for analytic validity in terms of the reliability, accuracy and reproducibility of the assay. In breast cancer, sTILs are most commonly scored on haematoxylin and eosin (H&E)-stained tumour sections. Reports of the reproducibility for this methodology vary from moderate to excellent [7, 10, 13-15, 17, 19]. In 2015, an international immuno-oncology biomarker working group produced guidance aimed at standardising sTILs reporting in breast cancer [19]; and subsequently reported very high interobserver agreement in a large ring study when this guidance was combined with an interactive software tool [20].
The aim of our multi-institutional study was to evaluate the reproducibility between experienced breast pathologists across Europe for scoring sTILs in TNBCs in routine practice. Our assessment of sTILs was confined to light microscopy using the guidance of the immuno-oncology biomarker working group on the basis that this methodology is simple to perform and could be easily applied in routine practice; the software tool was not used because this would add a level of complexity that would make it more difficult to roll out in clinical practice. The case series was limited to TNBCs for two reasons: there is consistent evidence supporting a prognostic and predictive association for sTILs in this subtype, and because any differences between subtypes would not then be a cause of variation. As a secondary endpoint, the association between sTILs and the likelihood of attaining a pCR in TNBC was examined.

Materials and methods
The series comprised 75 consecutive TNBCs diagnosed in 72 patients in a symptomatic breast service of a single tertiary referral centre between 2004 and 2015 (Table 1). All but one patient received neoadjuvant chemotherapy (NACT). Three of the 72 patients had multiple synchronous TNBCs. Nine patients had more than one core biopsy taken from the same tumour and these additional biopsies were included to evaluate intratumoural heterogeneity. A representative H&E-stained section from each of the 84 needle core biopsies (NCBs) from the 75 tumours was selected and the slides were scanned using an Olympus VS120 slide scanner. The digitised slides were anonymised and uploaded to the PathXL online repository. pCR breast was defined as ypT0/is and pCR breast/axilla as ypT0/isN0 [21].
The study consisted of two circulations. In the first circulation, an email with instructions for the study was sent to 35 consultant pathologists who were members of the European Working Group for Breast Screening Pathology (EWG-BSP). The email included the review and the online tutorial from the immuno-oncology biomarker working group [19], links to the digitised slides, and an MS Excel template. Participants were asked to read the tutorial and the review before scoring sTILs in each slide and to record the absolute percentage of sTILs for each slide in the Excel template provided, which was then returned by the individual pathologists to the organising pathologist. Participants were also asked to record the length of time taken to complete the exercise.
After the first circulation, an independent pathologist, who was not part of the inter-observer study, reviewed those digital slides for which there was a noticeable variance in scores and noted the features pertinent to these cases, e.g., necrosis, difficult boundary, tumour heterogeneity. The 84 digitised slides were relabelled and reordered at random on the PathXL online repository. Four months after the completion of the first circulation, an email that contained links to the re-ordered slides, the TILWG tutorial and an Excel template was circulated to members of the EWG-BSP. The email for this second circulation highlighted the specific guidance in the working group tutorial that pertained to those features in the slides for which there was most disagreement in sTIL scores in the first circulation. Participants were asked to review the working group tutorial again and then record the absolute percentage of sTILs for each case in the Excel template and to return it to the organiser.
Slide selection, scanning and anonymisation were performed by a senior technician and an independent pathologist, neither of whom participated in the inter-observer study. All participating pathologists were blinded to the scores of other pathologists.

Statistical analysis
The relationships between the pathologists' scores in the different circulations and between each other were assessed as continuous variables (raw scores); as increments of ten percent; and as dichotomous categorical variables using a threshold of ≥ 25% and of ≥ 50%, the latter defined as LPBC. The intraclass correlation coefficient (ICC) was used to assess how closely the measurements of sTILs by different pathologists resembled each other for each slide [22,23]. The two-way mixed single measures figure was used as it reflects the values for a single typical rater. Spearman's correlation coefficient (ρ) was used to measure the strength of the relationship between scores in circulation one and circulation two for each individual pathologist for the raw sTIL scores that were given. Cohen's kappa statistic (κ) was used to measure the strength of association between circulation one and circulation two scores for each individual pathologist for the sTILs as a categorical variable. Univariate and multivariable logistic regression analyses were used to calculate odds ratios (OR) and 95% confidence intervals (CI), with the multivariable models adjusting for prognostic variables. The p-values reported were two-tailed and a p-value of less than 0.05 was considered statistically significant. Pearson χ² testing was also used to assess the association between sTILs categories and pCR. The sTIL results were collated in Microsoft Excel and were subsequently analysed in SPSS 24 and Stata/IC (v14.0).
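As an illustration of the single-measures ICC described above, the following Python sketch computes the coefficient from the mean squares of a two-way layout (slides as rows, pathologists as columns). It is not the SPSS procedure used in the study: the function name and the example scores are hypothetical, and the consistency definition is assumed because the text does not state whether consistency or absolute agreement was requested.

```python
def icc_two_way_single(scores):
    """Single-measures ICC from a two-way model (consistency definition).

    scores: list of rows, one row per slide, one column per pathologist.
    Assumption: consistency type; the study does not state which type was used.
    """
    n = len(scores)       # number of slides
    k = len(scores[0])    # number of pathologists
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Partition the total sum of squares into slides, raters and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Hypothetical sTIL percentages: 5 slides scored by 3 pathologists.
example = [
    [10, 15, 10],
    [40, 50, 45],
    [5, 5, 10],
    [70, 60, 80],
    [20, 30, 25],
]
print(round(icc_two_way_single(example), 3))
```

Under the consistency definition a constant offset between raters does not lower the coefficient; an absolute-agreement ICC would penalise it, so the choice of definition matters when comparing values between studies.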

Results
Sixteen pathologists participated in the first circulation; nineteen participated in the second circulation, comprising all sixteen pathologists who partook in the first circulation and an additional three pathologists. The average time taken by participants to score sTILs in each slide was 4 min (median 3 min; range 1-10 min).

Intra-observer agreement
The intra-observer agreement for the original 16 pathologists who partook in both circulations ranged from weak to very strong correlation for absolute sTIL values (Spearman ρ = 0.314 to 0.970; p-values ranging from < 0.001 to 0.015) with a strong average correlation (Spearman ρ = 0.727). The lowest intra-observer agreement for one pathologist (Spearman ρ = 0.314, p-value = 0.015) reflected a move from poor agreement between this pathologist's scores and those of the other participants in the first circulation (average inter-item correlation = 0.356) to strong agreement in the second circulation (average inter-item correlation = 0.740). Overall intra-observer agreement was fair using the 25% threshold (κ = 0.53; range 0.158-0.947) and for the LPBC category (κ = 0.49; range 0.021-0.868) but agreement was poor for sTILs as 10% increments (κ = 0.24; range 0.069-0.545).
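The intra-observer measures above can be sketched for a single pathologist as follows. This is a minimal illustration rather than the study's SPSS/Stata analysis: the scores are hypothetical, and the way percentages are binned into 10% increments (integer division by 10) is an assumption, since the text does not specify the bin edges.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks (inputs must vary)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of category labels."""
    n = len(a)
    categories = set(a) | set(b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # both lists contain one identical category throughout
    return (observed - expected) / (1 - expected)


def dichotomise(scores, threshold):
    """True where the score meets the threshold (e.g. 25 or 50 percent)."""
    return [s >= threshold for s in scores]


def to_increment(scores, step=10):
    """Assumed binning: 0-9 -> 0, 10-19 -> 1, and so on."""
    return [s // step for s in scores]


# Hypothetical scores from one pathologist in the two circulations.
round1 = [5, 10, 20, 30, 40, 60, 15, 25]
round2 = [10, 10, 25, 20, 45, 50, 20, 30]
print("rho:", spearman_rho(round1, round2))
print("kappa, >= 25%:", cohen_kappa(dichotomise(round1, 25), dichotomise(round2, 25)))
print("kappa, LPBC (>= 50%):", cohen_kappa(dichotomise(round1, 50), dichotomise(round2, 50)))
print("kappa, 10% increments:", cohen_kappa(to_increment(round1), to_increment(round2)))
```

Unweighted kappa treats a shift of one 10% bin the same as a shift of several bins, which is one reason agreement on fine increments can look much poorer than rank correlation on the raw scores.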

Features associated with poor agreement in sTIL scores
An independent pathologist selected the slides for which there was greatest inter-observer disagreement in sTIL scores on the basis of a standard deviation for absolute scores in the top 25%. The features that could explain this variation were intra-tumoural heterogeneity of sTILs (n = 11), necrosis (n = 5), fragmentation of the biopsy (n = 4), difficulties delineating the tumour border (n = 4), low (n = 3) and high (n = 2) tumour cellularity; some of these features co-existed in the same case. When sTIL scores for different biopsies from the same tumour were examined (n = 9), there was overall moderate agreement (Spearman ρ = 0.511) that was weak in three cases (lowest Spearman ρ = 0.276, p-value = 0.268).
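The selection rule described above (slides whose standard deviation of absolute scores falls in the top 25%) could be implemented along the following lines. This is a sketch under assumptions: the function and slide names are hypothetical, the sample standard deviation is used, and the top quarter is taken by rank; the text does not say which variant of either was applied.

```python
import statistics


def most_discordant_slides(scores_by_slide, top_fraction=0.25):
    """Return slide IDs whose score SD ranks in the top fraction.

    scores_by_slide: dict mapping a slide ID to the list of sTIL
    percentages given by the participating pathologists.
    """
    # Sample standard deviation per slide (an assumption, see lead-in).
    sds = {slide: statistics.stdev(scores)
           for slide, scores in scores_by_slide.items()}
    ranked = sorted(sds, key=sds.get, reverse=True)
    n_selected = max(1, round(len(ranked) * top_fraction))
    return ranked[:n_selected]


# Hypothetical scores for four slides from three pathologists.
example = {
    "slide_a": [10, 10, 10],
    "slide_b": [5, 50, 90],
    "slide_c": [20, 25, 20],
    "slide_d": [30, 30, 35],
}
print(most_discordant_slides(example))
```

With 84 slides, a 25% cut by rank would flag 21 slides for review.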

Association between sTILs and response to NACT
For the 72 patients, the median sTIL score was 20% (range 1-80%) in circulation 1 and 15% (range 1-80%) in circulation 2. The distribution of sTIL categories across the 72 patients is shown in Table 2. The median sTIL score for each case from circulation 2 was used to examine the association between sTILs and response to NACT. Increasing sTILs was paralleled by an increased likelihood of both pCR breast and pCR breast/axilla by univariate and multivariable analysis (Table 3). Increasing 10% increments of sTILs improved the likelihood of both a pCR breast and pCR breast/axilla by over 40% on univariate analysis (p-value = 0.020 and p-value = 0.022, respectively). LPBC was associated with the greatest likelihood of a pCR breast and pCR breast/axilla (OR 9.1, 95% CI 1.07-77.2, p-value = 0.043; OR 11.8, 95% CI 1.39-100.6, p-value = 0.024, respectively), with the caveat that there were only nine LPBCs and a very wide 95% CI was observed. By multivariable analysis, increasing 10% increments of sTILs was an independent predictor of both pCR endpoints when adjusted for age at diagnosis, tumour grade and tumour type. Again, the magnitude of the association between sTILs and pCR on multivariable analysis was greatest for LPBC and significant for pCR breast/axilla. LPBC was associated with a higher rate of pCR breast and pCR breast/axilla (both 89%; n = 8) than non-LPBC (pCR breast 47%, n = 29; Pearson χ² = 5.59, p-value = 0.018; pCR breast/axilla 40%, n = 25; Pearson χ² = 7.45, p-value = 0.006).
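The Pearson χ² statistics and the unadjusted odds ratios for LPBC can be illustrated from a 2 × 2 table. The cell counts in this sketch are not stated in the text; they are back-calculated from the reported rates on the assumption that 71 patients received NACT (9 LPBC with 8 pCRs; 62 non-LPBC with 29 pCR breast and 25 pCR breast/axilla) and should be checked against Table 3. The function names are hypothetical, and the confidence interval is a Wald interval on the log odds ratio, which is an assumption about how the reported intervals were derived.

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]], with its p-value on 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Crude odds ratio with a Wald 95% CI computed on the log scale."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Assumed counts (see lead-in): rows are LPBC and non-LPBC,
# columns are pCR and no pCR.
pcr_breast = (8, 1, 29, 33)
pcr_breast_axilla = (8, 1, 25, 37)

for label, table in [("pCR breast", pcr_breast),
                     ("pCR breast/axilla", pcr_breast_axilla)]:
    chi2, p = pearson_chi2_2x2(*table)
    odds_ratio, lower, upper = odds_ratio_wald(*table)
    print(f"{label}: chi2 = {chi2:.2f}, p = {p:.3f}, "
          f"OR = {odds_ratio:.1f} (95% CI {lower:.2f}-{upper:.1f})")
```

Under these assumed counts the sketch gives values close to those reported (χ² of about 5.59 and 7.45; odds ratios of about 9.1 and 11.8). The width of the intervals is driven almost entirely by the single LPBC case without a pCR, which contributes a term of 1 to the variance of the log odds ratio.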

Discussion
sTILs have emerged as a potential prognostic and predictive marker in TNBC in the adjuvant and neoadjuvant setting. The consistency of scoring sTILs varies, with excellent reproducibility reported when a software tool is used along with guidance from an expert group [19]. In our study, the reproducibility of sTILs assessment in TNBC was examined using this guidance but without the software tool in order to examine reproducibility of a methodology that could be easily applied in routine practice. Our data affirm the predictive importance of sTILs in the neoadjuvant setting whereby increasing levels of sTILs are associated with increased odds of a pCR following treatment with anthracycline-based NACT. However, there was only moderate agreement at best between experienced pathologists for scoring sTILs.
Fig. 1 Distribution of sTILs scores given by each of the 16 participating pathologists for the 84 slides in circulation 1 (a) and in circulation 2 (b). There was greater variation in the range of scores given by the 16 pathologists in circulation 1 than in circulation 2 and the range of scores given by pathologists changed between the two circulations. Pathologist 1 gave a narrow range of scores relative to other participants in circulation 1 and gave a wider range in circulation 2 that was more in line with that of other participants; the converse was observed for pathologist 11. The range of scores given by pathologist 10 was wide relative to others in both circulations. The distribution of scores given by pathologists 14, 15, and 16 converged to become very similar in circulation 2.
The distribution of sTIL scores in our series was similar to that reported by others. The median sTIL value of 15% (range 1-80%) in circulation 2 is in line with other reports of a median of 15-23% in TNBC [9, 11, 12, 14, 15, 17] and higher than that observed by some [13]. LPBC was observed in 12% of cases, which was within the range of 4.4-28% noted by others in TNBCs [7, 9-13, 17]. Increasing sTILs was paralleled by an increased likelihood of a pCR on both univariate and multivariable analysis. Each 10% increase in sTILs improved the likelihood of a pCR by over 40%, which is higher than 15-23% described by others [7, 9, 11-15]. Although the number of LPBCs was small, our data suggest that the predictive relevance of sTILs may be greatest for these tumours. This is consistent with data pertaining to the TNBC subset of the Geparsixto study, in which LPBC was associated with the greatest increase in likelihood of pCR (OR 2.17, CI 1.27-3.73, p-value = 0.05) [7]. Despite the strong favourable association between sTILs and response to NACT, the inter-observer agreement in our study suggests that sTIL evaluation using this methodology is not sufficiently reproducible for application in TNBCs in routine practice. Inter-observer agreement for the absolute percentage of sTILs was only moderately reproducible, reflected by an ICC of 0.683 with a lower limit of the 95% CI of 0.601.
Fig. 2 Distribution of sTIL scores for all slides in circulation 1 (a) and in circulation 2 (b). The distribution of sTIL scores was less heterogeneous in circulation 2 than in circulation 1. In circulation 2, there were fewer outlier scores and there was a narrow range of scores for those cases with a low sTIL population (sTILs < 20%) indicating a better level of agreement for these cases.
Higher levels of agreement for absolute sTIL values are reported by others but in studies involving only two or three pathologists with recorded ICC values of 0.92 [7] and 0.97 [17]; 85% agreement [13]; and strong correlation [14]. However, in studies involving more participants, reproducibility was comparable to that achieved in this work with an ICC value of 0.62 between four pathologists [24]; and an ICC of 0.71 in a ring study with 32 pathologists [20]. Both our study and the latter ring study [20] included a large number of participants and scored sTILs on full face sections of pre-treatment NCBs. Case mix and selection differed between the studies in that the ring study included cases from the Geparsixto trial of both TNBCs and HER2-positive tumours that were selected to ensure equal representation of tumours with different levels of sTILs. Our cases comprised consecutive TNBC biopsies from routine practice with no pre-selection criteria other than sufficient tumour for diagnosis. Thus, the slightly lower level of reproducibility for absolute sTIL values reported here may be more reflective of what is achievable in routine practice for TNBCs than that reported in the ring study. Despite the recommendation to score sTILs as a continuous variable [20], and good reproducibility for scoring absolute sTIL values, the clinical utility of the absolute percentage of sTILs is uncertain and has only a negligible effect on response to NACT. In contrast, 10% incremental increases in sTILs are consistently associated with response to NACT but the utility of this measure is likely to be confounded by the poor intra-observer agreement that we observed (κ = 0.24).
Denkert et al. reported excellent reproducibility when an interactive software tool was used to aid sTILs assessment [20]. This was achieved for absolute (ICC 0.89) and for categorical measures in a cohort pre-selected to include equal numbers of cases with low, intermediate and high sTIL levels. These data are very promising and the software has now become accessible online (http://www.tilsinbreastcancer.org). The software gives the scorer integrated feedback by showing pre-calibrated reference images against which an image from a case is assessed. However, this approach increases the complexity of the evaluation process and the time taken; the user is required to create and upload three high-power static images from each case to generate a sTIL score. In our study, even without this step, the median time taken to score each slide was 4 min, which is not insignificant in the context of a busy routine practice; others report a time of between three and nine minutes to score a case using the working group guidance [25]. This is significantly longer than the time taken to score other biomarkers in breast cancer, e.g., oestrogen or progesterone receptor and HER2, where estimates of positivity are given around pre-defined thresholds.
Some of the features that we observed in cases with the greatest variation in sTIL scores were highlighted by the expert group, i.e., necrosis and difficulty delineating the border [19]; others were not, e.g., fragmentation of the core and tumour cellularity. Many of these features are not uncommon in TNBC biopsies and, consequently, may hamper attempts to improve reproducibility of scoring in this subtype. Intra-tumoural heterogeneity was seen most often and we observed only moderate agreement (ranging from weak to strong) between sTIL scores in paired biopsies from the same tumour. It will be difficult to mitigate the effect of heterogeneity on the analytic and clinical validity of sTILs in TNBCs because of the potential for sampling bias arising from the reliance on NCBs in the neoadjuvant setting. The working group guidance recommends giving an average sTIL score for a case; however, there have been no formal studies examining either the effect of heterogeneity on reproducibility or the relative clinical importance of a sTIL score derived from the average sTIL population, the hot-spots or the area with the lowest sTILs.
Our study has limitations. The number of cases was small and, as in many studies, the small number of LPBCs hampered the interpretation of their significance. We restricted our analysis to TNBC, in which the potential clinical relevance of sTILs has been shown most consistently, and our data may not be pertinent to other subtypes of breast cancer. For example, in a previous study on 454 breast cancers treated with NACT, we showed that the presence or absence of sTILs, with a cut off of 1%, significantly impacted on response to treatment in the Luminal B oestrogen receptor-positive/HER2-negative subtype [26]. Nonetheless, our series is an accurate reflection of TNBCs that are diagnosed in routine practice with no pre-selection applied. Participating pathologists did not receive formal training in scoring sTILs in the interval between the two circulations. Formal training of pathologists is emphasised to improve the consistency of reporting many of the newer predictive immunohistochemical biomarkers in other cancer types [27] and it could be argued that training could have improved consistency for scoring sTILs in this work. In our second circulation, we observed slightly better agreement between the sixteen pathologists who had already undertaken the exercise in the first circulation (ICC 0.683, 95% CI 0.601-0.767, p < 0.001) than we observed when scores from three new participants were included (ICC 0.660, 95% CI 0.577-0.747, p < 0.001). Notwithstanding, the participants were experienced breast pathologists from across Europe who would be representative of international best practice. Finally, the aim of our study was to assess concordance in measuring the whole sTIL population; we did not examine subpopulations of lymphoid cells which may provide a more functional assessment of the immune infiltrate [28][29][30][31].
In conclusion, our data affirm the predictive significance of sTILs with respect to pCR in TNBC. Quantification of sTILs by light microscopy is simple and would be suitable for widespread clinical application; however, our data show considerable inter- and intra-observer variability between experienced breast pathologists in the assessment of sTILs using the immuno-oncology biomarker working group guidance alone. Tumour heterogeneity contributed to reproducibility issues. Other methodologies may improve the consistency for scoring sTILs and should continue to be explored. The software tool of the immuno-oncology biomarker working group has the potential to improve standardisation but at the expense of decreased ease of use. Further studies will need to validate this tool for scoring sTILs in cases from routine practice, without the pre-selection of cases applied in the ring study [21], and to determine whether it can overcome the effect of heterogeneity on reproducibility. Formal studies evaluating the clinical importance of sTIL heterogeneity, with data to support guidance on how heterogeneous cases should be scored, are required. Methodological studies aimed at improving the consistency of reporting sTILs may need to be designed with specific tumour subtypes and clinical endpoints in mind.