sTILs have emerged as a potential prognostic and predictive marker in TNBC in the adjuvant and neoadjuvant settings. The consistency of scoring sTILs varies, with excellent reproducibility reported when a software tool is used along with guidance from an expert group. In our study, the reproducibility of sTILs assessment in TNBC was examined using this guidance but without the software tool, in order to examine the reproducibility of a methodology that could be easily applied in routine practice. Our data affirm the predictive importance of sTILs in the neoadjuvant setting, whereby increasing levels of sTILs are associated with increased odds of a pCR following treatment with anthracycline-based NACT. However, there was only moderate agreement at best between experienced pathologists for scoring sTILs.
The distribution of sTIL scores in our series was similar to that reported by others. The median sTIL value of 15% (range 1–80%) in circulation 2 is in line with other reports of a median of 15–23% in TNBC [9,11,12,14,15,17] and higher than that observed by some [13]. LPBC was observed in 12% of cases, which was within the range of 4.4–28% noted by others in TNBCs [7,9,10,11,12,13,17]. Increasing sTILs was paralleled by an increased likelihood of a pCR on both univariate and multivariable analysis. Each 10% increase in sTILs improved the likelihood of a pCR by over 40%, which is higher than the 15–23% described by others [7,9,11,12,13,14,15]. Although the number of LPBCs was small, our data suggest that the predictive relevance of sTILs may be greatest for these tumours. This is consistent with data pertaining to the TNBC subset of the Geparsixto study, in which LPBC was associated with the greatest increase in likelihood of pCR (OR 2.17, CI 1.27–3.73, p-value = 0.05).
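To make the "per 10% increase" effect concrete, the following minimal sketch shows how an odds ratio per 10-point increment scales to larger increments and how it shifts a baseline probability of pCR. It assumes a logistic model in which sTILs enter as a continuous, log-linear term. The odds ratio of 1.4 is an illustrative stand-in for "over 40%", the baseline pCR probability of 0.30 is hypothetical, and the function names are ours rather than part of any analysis reported here.

```python
def odds_ratio_for_increase(or_per_10_points: float, increase_points: float) -> float:
    """Scale an odds ratio quoted per 10 percentage points of sTILs
    to an arbitrary increase, assuming a log-linear (logistic) model."""
    return or_per_10_points ** (increase_points / 10.0)


def shift_probability(baseline_prob: float, odds_ratio: float) -> float:
    """Apply an odds ratio to a baseline probability and return the new probability."""
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Illustrative values only (assumptions, not study estimates).
OR_PER_10 = 1.4        # "over 40%" increase in the odds of pCR per 10% sTILs
BASELINE_PCR = 0.30    # hypothetical baseline probability of pCR

or_30 = odds_ratio_for_increase(OR_PER_10, 30)     # odds ratio for a 30-point increase
p_new = shift_probability(BASELINE_PCR, OR_PER_10)  # probability after a 10-point increase

print(f"OR for +30 points: {or_30:.3f}")
print(f"pCR probability after +10 points: {p_new:.3f}")
```

Note that a 40% increase in the odds is not a 40% increase in the probability: under these illustrative values, the probability of pCR rises from 0.30 to 0.375.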
Despite the strong favourable association between sTILs and response to NACT, the inter-observer agreement in our study suggests that sTIL evaluation using this methodology is not sufficiently reproducible for application in TNBCs in routine practice. Inter-observer agreement for the absolute percentage of sTILs was only moderate, reflected by an ICC of 0.683 with a lower limit of the 95% CI of 0.601. Higher levels of agreement for absolute sTIL values are reported by others, but in studies involving only two or three pathologists, with recorded ICC values of 0.92 and 0.97; 85% agreement; and strong correlation. However, in studies involving more participants, reproducibility was comparable to that achieved in this work, with an ICC value of 0.62 between four pathologists and an ICC of 0.71 in a ring study with 32 pathologists. Both our study and the latter ring study included a large number of participants and scored sTILs on full-face sections of pre-treatment NCBs. Case mix and selection differed between the studies in that the ring study included cases from the Geparsixto trial of both TNBCs and HER2-positive tumours that were selected to ensure equal representation of tumours with different levels of sTILs. Our cases comprised consecutive TNBC biopsies from routine practice with no pre-selection criteria other than sufficient tumour for diagnosis. Thus, the slightly lower level of reproducibility for absolute sTIL values reported here may be more reflective of what is achievable in routine practice for TNBCs than that reported in the ring study. Despite the recommendation to score sTILs as a continuous variable, and good reproducibility for scoring absolute sTIL values, the clinical utility of the absolute percentage of sTILs is uncertain and has only a negligible effect on response to NACT.
In contrast, 10% incremental increases in sTILs are consistently associated with response to NACT but the utility of this measure is likely to be confounded by the poor intra-observer agreement that we observed (κ = 0.24).
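For readers less familiar with the ICC, the sketch below shows one way such a coefficient can be computed from a cases-by-raters table of scores. It assumes the two-way random-effects, absolute-agreement, single-rater form, ICC(2,1) in the notation of Shrout and Fleiss; the ICC model used for the values quoted above is not specified in this section, so this choice is an assumption. The function name and the small data set are purely illustrative and are not data from our study.

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per case, one column per rater,
    with no missing values.
    """
    n = len(ratings)      # number of cases
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between cases
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Illustrative scores: 6 cases (rows) scored by 4 raters (columns).
example_scores = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

icc = icc2_1(example_scores)
print(f"ICC(2,1) = {icc:.2f}")  # approximately 0.29 for this example
```

Because this form penalises systematic differences between raters as well as random disagreement, a rater who consistently scores higher or lower than colleagues lowers the coefficient even when the ranking of cases is preserved.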
Denkert et al. reported excellent reproducibility when an interactive software tool was used to aid sTILs assessment. This was achieved for absolute (ICC 0.89) and for categorical measures in a cohort pre-selected to include equal numbers of cases with low, intermediate and high sTIL levels. These data are very promising and the software has now become accessible online (http://www.tilsinbreastcancer.org). The software gives the scorer integrated feedback by showing pre-calibrated reference images against which an image from a case is assessed. However, this approach increases the complexity of the evaluation process and the time taken; the user is required to create and upload three high-power static images from each case to generate a sTIL score. In our study, even without this step, the median time taken to score each slide was 4 min, which is not insignificant in the context of a busy routine practice; others report a time of between three and nine minutes to score a case using the working group guidance. This is significantly longer than the time taken to score other biomarkers in breast cancer, e.g., oestrogen or progesterone receptor and HER2, where estimates of positivity are given around pre-defined thresholds.
Some of the features that we observed in cases with the greatest variation in sTIL scores were highlighted by the expert group, i.e., necrosis and difficulty delineating the border; others were not, e.g., fragmentation of the core and tumour cellularity. Many of these features are not uncommon in TNBC biopsies and, consequently, may hamper attempts to improve reproducibility of scoring in this subtype. Intra-tumoural heterogeneity was seen most often and we observed only moderate agreement (ranging from weak to strong) between sTIL scores in paired biopsies from the same tumour. It will be difficult to mitigate the effect of heterogeneity on the analytic and clinical validity of sTILs in TNBCs because of the potential for sampling bias arising from the reliance on NCBs in the neoadjuvant setting. The working group guidance recommends giving an average sTIL score for a case; however, there have been no formal studies examining either the effect of heterogeneity on reproducibility or the relative clinical importance of a sTIL score derived from the average sTIL population, the hot-spots or the area with the lowest sTILs.
Our study has limitations. The number of cases was small and, as in many studies, the small number of LPBCs hampered the interpretation of their significance. We restricted our analysis to TNBC, in which the potential clinical relevance of sTILs has been shown most consistently, and our data may not be pertinent to other subtypes of breast cancer. For example, in a previous study on 454 breast cancers treated with NACT, we showed that the presence or absence of sTILs, with a cut-off of 1%, significantly impacted on response to treatment in the Luminal B oestrogen receptor-positive/HER2-negative subtype. Nonetheless, our series is an accurate reflection of TNBCs that are diagnosed in routine practice with no pre-selection applied. Participating pathologists did not receive formal training in scoring sTILs in the interval between the two circulations. Formal training of pathologists is emphasised to improve the consistency of reporting many of the newer predictive immunohistochemical biomarkers in other cancer types, and it could be argued that training could have improved consistency for scoring sTILs in this work. In our second circulation, we observed slightly better agreement between the sixteen pathologists who had already undertaken the exercise in the first circulation (ICC 0.683, 95% CI 0.601–0.767, p < 0.001) than we observed when scores from three new participants were included (ICC 0.660, 95% CI 0.577–0.747, p < 0.001). Notwithstanding this, the participants were experienced breast pathologists from across Europe who would be representative of international best practice. Finally, the aim of our study was to assess concordance in measuring the whole sTIL population; we did not examine subpopulations of lymphoid cells, which may provide a more functional assessment of the immune infiltrate [28,29,30,31].
In conclusion, our data affirm the predictive significance of sTILs with respect to pCR in TNBC. Quantification of sTILs by light microscopy is simple and would be suitable for widespread clinical application; however, our data show considerable inter- and intra-observer variability between experienced breast pathologists in the assessment of sTILs using the immuno-oncology biomarker working group guidance alone. Tumour heterogeneity contributed to reproducibility issues. Other methodologies may improve the consistency of scoring sTILs and should continue to be explored. The software tool of the immuno-oncology biomarker working group has the potential to improve standardisation but at the expense of decreased ease of use. Further studies will need to validate this tool for scoring sTILs in cases from routine practice, without the pre-selection of cases applied in the ring study [21], and to determine if it can overcome the effect of heterogeneity on reproducibility. Formal studies evaluating the clinical importance of sTIL heterogeneity, with data to support guidance on how heterogeneous cases should be scored, are required. Methodological studies aimed at improving the consistency of reporting sTILs may need to be designed with specific tumour subtypes and clinical endpoints in mind.