Introduction

Breast cancer (BC) is the most common cancer in women and is a significant threat to public health worldwide [1]. The role of immune cells in the primary tumor in BC disease progression should not be underestimated since, by modeling the immunological context, they reflect the immune response determined at the time of diagnosis [2, 3]. The ability to estimate the prognostic value of immune cells has been refined further with the rise of more reliable methods to determine immune cell functional status and phenotypes. These findings demonstrate that immune cells in BC can affect cancer cell behavior and that the malignant disease arises [4], in part, from a bidirectional relationship between cancer cells and their immunological microenvironment [5,6,7,8].

Among the various immune populations, lymphocytes have been identified as one of those with the strongest prognostic implications [9]. Higher concentrations of tumor-infiltrating lymphocytes (TILs), lymphocytes that migrate into the tumor microenvironment, have been associated with better prognosis and treatment response in BC, especially in two subtypes: HER2-positive and triple-negative breast cancer (TNBC) [10, 11].

An initial finding of Aaltomaa et al. has prompted significant and ongoing TIL assessment and research [12, 13], leading to the establishment of the TIL Working Group (TIL-WG), an international group of experts whose main objective is to update TIL assessment guidelines and, in particular, to standardize Visual TIL Assessment (VTA) [14]. The TIL-WG strives to minimize the subjectivity in evaluation that arises from the high level of inter-observer variation between specialists [2, 15,16,17].

By studying the spatial distribution and quantity of TILs, assessed by a histopathological examination using hematoxylin and eosin (H&E)-stained images [2], experts can gain valuable insights into patients’ prognosis and response to treatment [18].

For a comprehensive understanding of the immune landscape of BC, we can also take into consideration immunohistochemistry (IHC)-stained images of biomarkers related to TILs, such as CD4, CD8, FOXP3, CD20, and CD22 [19,20,21,22,23]. The first three of these highlight subclasses of a type of immune cell, the T lymphocyte, that appears in the presence of infected or cancerous cells [24]. The latter two highlight B lymphocytes [25], which are also evaluated as TILs in H&E-stained images. Note that only H&E-stained images are used to evaluate TILs in clinical practice, whereas IHC-stained images are still used for research purposes only [2].

The evaluation of biomarkers related to TILs in H&E or IHC-stained images has shown significant potential when based on advanced artificial intelligence (AI) techniques, such as deep learning (DL) [26,27,28]. These methods have the potential to yield promising results that overcome the problem of inter-observer variation, but more data and more parameters need to be incorporated to create more accurate and powerful models.

DL-based approaches in BC research should be trustworthy and reliable. Efforts should therefore be made to collect a variety of representative datasets that are free from bias, and to introduce data quality-control procedures that verify the consistency of the annotations within the datasets and so minimize errors made during the annotation process. An example of such a procedure is the verification of relations defined in a spatial logic, obtained by applying Discrete Mereotopology techniques to Mathematical Morphology [29, 30].

To reinforce the value of TILs, we will first address the importance of the immune response, which plays a significant part in cancer evolution. We will describe the assessment of TILs using histological image analysis employing digital pathology, a promising and growing area of research in the field of immune responses. We will then discuss the development of computer-assisted pathology (CAP) tools, which require massive amounts of data to train models using techniques involving advanced AI algorithms. In addition, we will suggest ways to make the TIL annotation process less time-consuming in future datasets while preserving its reliability as ground truth. We will then turn to the primary aim of our review, which is to provide a perspective on the datasets used for TIL assessment, in particular those of publicly available histological images, to determine whether TIL research relies on the same pool of data. To broaden the spectrum of the datasets and to allow the models to be more widely generalizable, we will also look for more available data on TILs in other tumors. For each dataset, we will then evaluate the main approaches for assessing TILs based on the types of annotations provided. Finally, we will discuss which datasets could contribute most substantively to the research by filling in gaps or addressing limitations evident in the current literature.

TIL Assessment for Evaluating Cancer Progression

TILs are predictive prognostic markers in BC because they provide a snapshot of the tumor scenario and are one of the best examples of the association between natural defenses and carcinogenesis [31]. Thus, we can conceive of TILs as an unloaded weapon whose drug-induced reactivation can lead to the restoration of formerly fully operational natural anti-cancer defenses [31].

TILs are essential for analyzing the immunological environment of BC and other malignancies like colorectal cancer or other kinds of solid tumors [17, 32, 33].

TILs should be frequently evaluated as novel prognostic and therapy-predicting markers, particularly in the most aggressive breast lesions, such as the triple-negative and HER2-positive molecular subvariants [31, 34]. Several publications have studied TILs, independently of immune blockade agents, as prognostic indicators of BC outcomes in chemotherapy trials [31, 35, 36].

However, TIL assessment does not yet feature in pathology reports, even though TILs are accepted as a prognostic factor, as stated in the St. Gallen 2023 guidelines [37]. The role of TILs in treatment decisions remains unclear because the data on TILs are still considered insufficient to support a reliable choice of specific therapy regimens or a decision to withhold treatment [31, 38]. As a result, clinicians are currently advised not to base their therapy decisions solely on TILs [38]. For this reason, TILs should not be treated as an independent variable [39, 40], but rather interpreted in conjunction with other prognostic variables like tumor and lymph node status to provide clinicians with all the prognostic information they need to examine treatment options reliably with their patients [38].

In the near future, TIL research will be able to guarantee a novel standardization of TIL assessment by improving on the approximate semiquantitative evaluation that is currently practiced, which is affected by a substantial degree of inter-observer variation [2, 14].

Enhancing Diagnosis Through Computer-Assisted Pathology

One of the primary advantages of digital slides over traditional glass slides is the ability to apply quantitative automatic image analysis algorithms with the introduction of AI techniques, leading to the creation of computer-assisted pathology (CAP) tools [41].

By using these instruments, it is possible to reduce inter-observer error and subjectivity of pathologists and thereby help them with the assessment process [14].

CAP tools, with AI integration, offer support in a range of tasks related to computer vision; the “Computer Vision Tasks in CAP Tools” section describes possible applications. However, to fully exploit the power of AI in CAP tools, large-scale annotated datasets are indispensable. Given the time-consuming nature of dataset creation, in the “Optimize the Annotation Phase Time” section we will delve into methods to expedite and enhance this process.

Computer Vision Tasks in CAP Tools

Given the advances in computer vision and AI, CAP tools are becoming increasingly important in digital pathology tasks [41], such as automatic tissue segmentation and nucleus detection [42,43,44]. Even though certain CAP tools can quantify specific nuclei, such as TILs, they do so with varying degrees of difficulty. In the case of TILs, a classification strategy can determine whether TILs are present in a given image. A localization approach can additionally specify the regions where TILs are located, for example as box-shaped regions [45].

This strategy can be beneficial to weakly supervised learning, in which a whole slide image (WSI) labeled as containing TILs can be split into tiles and the locations of the TILs then checked [46]. In this way, we can gain an approximate idea of where the TILs are located, even if this is not sufficient to allow them to be quantified [47].
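The tiling step can be sketched as follows. This is a minimal illustration rather than the method of the cited works: the tile size, the stride, the threshold, and the `score_tile` callable (a stand-in for any tile-level TIL classifier) are assumptions introduced here.

```python
from typing import Callable, Iterator, List, Tuple


def tile_coordinates(width: int, height: int,
                     tile: int = 256, stride: int = 256) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (x, y) corner of every full tile that fits in the slide."""
    for y in range(0, height - tile + 1, stride):
        for x in range(0, width - tile + 1, stride):
            yield (x, y)


def localize_positive_tiles(width: int, height: int,
                            score_tile: Callable[[int, int], float],
                            threshold: float = 0.5,
                            tile: int = 256, stride: int = 256) -> List[Tuple[int, int]]:
    """Return the corners of the tiles whose (hypothetical) TIL score reaches the threshold.

    The result localizes TIL-rich areas only at tile resolution; it does not count cells.
    """
    return [(x, y)
            for (x, y) in tile_coordinates(width, height, tile, stride)
            if score_tile(x, y) >= threshold]


# Toy usage: a fake scorer that marks one column of tiles as TIL-positive.
positive = localize_positive_tiles(1000, 600, lambda x, y: 0.9 if x == 256 else 0.1)
print(positive)
```

Tiles that would extend beyond the slide border are simply skipped in this sketch; a real pipeline would typically pad them or discard background tiles first.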

Another helpful method for quantification, more complex than combining classification and localization, involves drawing a box around each TIL and counting the boxes [48, 49]. A variant of the method consists of placing a point over the object instead of a bounding box. This is useful when, for example, there are many small objects (as is the case in TIL detection), since the many overlapping boxes would then clutter the image and give unclear results [50].
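To make the contrast between box and point annotations concrete, the following sketch measures how strongly two neighbouring boxes overlap (intersection over union) and reduces each box to its center point. The box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def box_center(box: Box) -> Tuple[float, float]:
    """Center point of a box, usable as a point annotation."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def count_and_points(boxes: List[Box]) -> Tuple[int, List[Tuple[float, float]]]:
    """The TIL count is the number of boxes; the points are their centers."""
    return len(boxes), [box_center(b) for b in boxes]


# Two adjacent lymphocytes: the boxes overlap, the points do not.
cells = [(0, 0, 10, 10), (5, 0, 15, 10)]
print(iou(cells[0], cells[1]))
print(count_and_points(cells))
```

The count is the same in both representations; the point form only removes the visual and geometric overlap between neighbouring annotations.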

The next stage is semantic segmentation, which involves delineating objects at the pixel level: every pixel in an image is labeled with the class to which it belongs [51]. In this manner, it is feasible to define which pixels are part of a TIL and which are not [52, 53]. However, if several TILs lie close together, we will not be able to estimate their number accurately; instead, we will see a single region containing TILs [48]. Segmentation masks are therefore mainly beneficial for identifying tissue regions, an objective that is not directly related to quantification. Nevertheless, they can be of practical value for distinguishing between stromal and intratumoral TILs, which is essential for obtaining a TIL score in BC [2, 14].
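As an illustration of how a region mask and a TIL mask could be combined, the sketch below computes the percentage of stromal pixels that are covered by TIL pixels. This is a simplified, pixel-based reading of a stromal TIL score, not the visual guideline itself; the integer encoding of the region map (0 = other, 1 = tumor, 2 = stroma) and the function name are assumptions made here.

```python
from typing import List

STROMA = 2  # assumed label for stromal pixels in the region map


def stromal_til_percentage(region_map: List[List[int]],
                           til_mask: List[List[int]]) -> float:
    """Percentage of stromal pixels that are also marked as TIL pixels.

    TIL pixels that fall outside the stroma (for example inside tumor nests)
    are ignored, which is why the region mask is needed in the first place.
    """
    stroma_pixels = 0
    til_in_stroma = 0
    for region_row, til_row in zip(region_map, til_mask):
        for region, is_til in zip(region_row, til_row):
            if region == STROMA:
                stroma_pixels += 1
                if is_til:
                    til_in_stroma += 1
    if stroma_pixels == 0:
        raise ValueError("the region map contains no stromal pixels")
    return 100.0 * til_in_stroma / stroma_pixels


# Toy 3x4 example: one TIL pixel lies in tumor and is not counted.
regions = [[1, 1, 2, 2],
           [1, 2, 2, 2],
           [0, 2, 2, 2]]
tils = [[1, 0, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0]]
print(stromal_til_percentage(regions, tils))
```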

Finally, there is a process, known as instance segmentation, that advances semantic segmentation. Rather than giving all objects in a class equal pixel values, this process aims to segment and display various instances of the same class [54]. By doing so, we may establish more exact boundaries for semantic segmentation and object identification, by which means we can determine the number of objects [55,56,57].
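The difference in counting ability can be shown on a toy mask. In the sketch below, touching cells merge into one connected region of a binary (semantic) mask, whereas an instance mask, assumed here to store a distinct positive integer per cell and 0 for background, preserves the count. The helper names and the 4-connectivity choice are illustrative assumptions.

```python
from typing import List


def count_connected_regions(mask: List[List[int]]) -> int:
    """Count 4-connected foreground regions of a binary mask with a flood fill."""
    height = len(mask)
    width = len(mask[0]) if mask else 0
    seen = [[False] * width for _ in range(height)]
    regions = 0
    for r in range(height):
        for c in range(width):
            if mask[r][c] and not seen[r][c]:
                regions += 1
                seen[r][c] = True
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return regions


def count_instances(instance_mask: List[List[int]]) -> int:
    """Count the distinct non-zero instance identifiers."""
    return len({value for row in instance_mask for value in row if value != 0})


# Cells 1 and 2 touch each other; cell 3 is isolated.
instances = [[1, 1, 2, 2, 0],
             [1, 1, 2, 2, 0],
             [0, 0, 0, 0, 3]]
semantic = [[1 if value else 0 for value in row] for row in instances]
print(count_connected_regions(semantic), count_instances(instances))
```

On this example the semantic mask yields two regions for three cells, which is the undercounting effect that instance segmentation avoids.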

Optimize the Annotation Phase Time

To effectively train CAP tools, it is essential to annotate WSIs. Pathologists typically begin this process by manually annotating a limited number of WSIs [58]. These annotations serve as the initial labeled data for CAP tool training [59, 60]. Once this first step has been completed, the remaining WSIs can be annotated using a semi-supervised learning approach, which reduces the amount of manual annotation by allowing pathologists to intervene only to ensure the accuracy and refinement of the generated annotations [61, 62].

With the advent of DL techniques, the demand for large numbers of annotations has increased [61]. DL models have a considerable capacity to learn detailed patterns and features from data due to their complex architectures. While DL models require a large amount of training data, they have the potential to yield outstanding performance levels.

As a result, having a large dataset with numerous annotations becomes critical if the promise of DL in pathology is to be fully realized [63].

Collecting these data can be time-consuming, particularly when manual annotations, which can also become monotonous and repetitive, are involved. For these reasons, there is an urgent need to identify novel techniques to improve the stage of dataset creation [58, 61, 64].

We may discover smart approaches to simplifying the process by carefully reviewing the annotation methods used in some of the datasets mentioned hereafter.

First, we note that a common approach to achieving large numbers of annotations consists simply of having more people make the annotations. Since it is difficult to find several expert pathologists who are available for the task, structured crowdsourcing is a possible approach, whereby people with less expertise make annotations in accordance with their level of skill, and their work is supervised and, where necessary, corrected by expert pathologists [65, 66].

The review phase consists of correcting and giving an overlay of the segmentation. In [65], the review phase was mainly exploited for annotating non-predominant or challenging classes. Production of the latest version of this dataset, which is mentioned in the work of Amgad et al., involved an intriguing new technique that includes non-pathologist nucleus labels [64]. Two main approaches were employed in this work: one focused on breadth, gathering single-rater annotations over many fields of view (FOVs) to obtain the majority of the data in the study, while the other assessed interrater reliability and agreement by gathering annotations from numerous non-pathologists for a smaller selection of common FOVs. Pathologists also provided annotations for these FOVs to assess non-pathologists’ reliability [64].
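Agreement between two annotators on the same set of nuclei can be quantified, for example, with Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The source does not state which agreement statistic was used in [64]; kappa is chosen here only as a common option, and the labels in the example are invented placeholders.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy usage: a non-pathologist compared with a pathologist on four nuclei.
non_pathologist = ["til", "til", "other", "other"]
pathologist = ["til", "til", "other", "til"]
print(cohen_kappa(non_pathologist, pathologist))
```

With more than two annotators per FOV, a multi-rater statistic would be needed instead; this sketch covers only the pairwise case.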

To lower the labeling burden, the method of initial labeling followed by a review by professional pathologists is employed. However, ensuring the accuracy of labeling by non-pathologists remains a challenge. The re-examination process is still time-consuming and labor-intensive if the initial annotation is not of high quality, and it requires the involvement of multiple specialists to prevent subjective errors [64].

Amgad et al. also introduced algorithmic suggestions for nucleus boundaries and classes: annotators click on the nuclei whose suggested boundaries are correct and annotate the remaining nuclei with bounding boxes [64].

Adding an automatic proposal, made, for instance, by a DL model, provides an iterative learning strategy whereby each iteration produces better annotation suggestions that require less manual adjustment [61, 67].

We suggest further approaches to the introduction of annotations by considering different people with variable levels of expertise, allowing non-experts to carry out the main tasks at their level of skill under the supervision of expert pathologists. Tools like MONAI [68, 69] and Quick Annotator [58] are available to make this manual adjustment and thereby facilitate better automatic annotation proposals. These tools, which use an active learning framework for continuous learning, can be integrated into digital pathology and WSI analysis platforms like QuPath [70].
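One simple way in which an active learning framework can decide what to show the annotator next is uncertainty sampling: the items on which the current model is least confident are queued for manual review first. The sketch below is a generic illustration under that assumption; it does not reproduce the selection strategy of MONAI or Quick Annotator, which the text above does not describe.

```python
from typing import List, Sequence


def select_most_uncertain(probabilities: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k items whose predicted probability is closest to 0.5.

    `probabilities` holds the model's estimated probability that each unlabeled
    item (for example a nucleus or a tile) belongs to the positive class.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


# Toy usage: five unlabeled nuclei scored by a hypothetical model.
scores = [0.95, 0.52, 0.10, 0.45, 0.70]
print(select_most_uncertain(scores, k=2))
```

After the expert corrects the selected items, they are added to the training set and the model is retrained, so that the next round of proposals needs less adjustment.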

Using these tools makes it possible to take advantage of the efficiency of weak labeling methods, which need substantially less time and resources. We can extract further annotations from unlabeled images by starting with a small set of annotations made by domain experts [47]. This idea of generating annotations from unlabeled data was previously investigated in traditional CAP tools, which used image-processing-based approaches like thresholding to extract annotations [71]. These rule-based techniques, however, are extremely task-specific and require domain expertise for troubleshooting and optimization.
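A rule-based extraction of this kind can be sketched with Otsu's method, which picks the gray-level threshold that best separates two intensity groups. Otsu's method is used here as one representative thresholding technique; the source does not say which variant the traditional tools in [71] relied on, and the assumption that nuclei are the darker pixels of an 8-bit grayscale image is also ours.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 8-bit threshold t that maximizes the between-class variance.

    Pixels with value <= t form the dark class, the others the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    weight_dark = 0
    sum_dark = 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_bright = total - weight_dark
        if weight_bright == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_bright = (sum_all - sum_dark) / weight_bright
        between_var = weight_dark * weight_bright * (mean_dark - mean_bright) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def dark_pixel_mask(pixels: Sequence[int]) -> List[int]:
    """Mark pixels at or below the Otsu threshold as candidate nuclei (1) or not (0)."""
    t = otsu_threshold(pixels)
    return [1 if p <= t else 0 for p in pixels]


# Toy usage on a flattened image with dark nuclei (20) on bright stroma (200).
image = [20] * 50 + [200] * 50
print(sum(dark_pixel_mask(image)))
```

A single global threshold of this kind is sensitive to staining and scanner variation, which is one reason such rules need task-specific tuning.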

Comparative Analysis of Datasets for TIL Research

This section aims to provide an overview of the publicly available datasets for TIL assessment using H&E images (Table 1). Table 1 presents detailed information, such as the magnification and size of each dataset, to clarify their composition and potential usability.

Table 1 Public and available datasets for TIL assessment

For a more complete perspective of the available datasets, we extend the search to encompass cancer types in addition to BC. We also examine not only datasets based on TILs but also those for lymphocytes and inflammatory cells. TILs are specific kinds of lymphocytes, so we can gather additional information for TIL assessment, such as morphological features, from datasets for lymphocyte evaluation [72]. We then look for datasets that assess inflammatory cells because these can be relevant for TIL scoring when performed on round inflammatory cells, omitting polymorphonuclear cells solely in the intratumoral region in cutaneous melanoma [73]. As a result, we can use inflammatory cell datasets from different cancer types to generalize predictive models [74].

As can be seen in Table 1, some datasets are a combination, or a selection of some parts, of other datasets. Figure 1 outlines the dependencies of the datasets described in Table 1.

Fig. 1 Dependencies in the TIL datasets described in Table 1. The years indicate when datasets were created; the arrows indicate the dependencies between them

We recognize that there are other online datasets, but access to them is either restricted or requires prior registration.

One important dataset that requires prior registration is the ATLAS of Histopathology database, a large-scale, patch-level annotation of different components of tissues created in 2019 by the Multimedia Lab of the University of Toronto. This dataset requires an End User License Agreement (EULA) to be accepted [93].

There are datasets for TIL assessment without annotation that feature only associated clinical data. One such example is that proposed by Shvetsov and coworkers, called UiT-TILs, that can be used to clinically validate TIL classifications [94]. The UiT-TILs dataset contains 1189 image patches from 87 non-small cell lung cancer (NSCLC) patients with matched clinical data, and it is a subset of another dataset, reported by Rakaee et al. [95]. A similar dataset was presented by Fassler’s team, in which they processed data not only from the TCGA, but also from the UNC CBCS Phase 3 cohort, which contains 2998 cases with 1138 diagnostic WSIs from representative blocks, and related follow-up recurrence and survival data [96, 97].

Dataset Goals Over the Years

In the first part of the “Comparative Analysis of Datasets for TIL Research” section, we analyzed the datasets for evaluating TILs, lymphocytes and inflammatory cells. This raises questions about whether the focus of these datasets has changed over time and what subjects and issues they target. In search of answers, we extracted the keywords of each abstract through the ChatGPT API since this is a fast and intuitive method. We decided not to use author-entered keywords, both to minimize potential bias and to ensure uniform coverage, since some dataset articles do not include them.

First, we extracted the texts of the abstracts from the papers relating to the datasets shown in Table 1. We then applied OpenAI APIs to them in order to ask ChatGPT to identify the main words in each paper’s abstract. We did this because we assumed that most of the important keywords are almost always mentioned in a paper’s abstract. Finally, we grouped the keywords from the dataset articles by year and carried out topic modeling for each year using a Latent Dirichlet Allocation (LDA) model [98], which enabled us to discover the hidden relationships in the keyword collection and ultimately the main topic for each year [99, 100].
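The grouping and topic-modeling steps can be sketched as follows. This is a compact, standard-library implementation of LDA fitted by collapsed Gibbs sampling, given only to illustrate the technique: the text above does not specify the LDA implementation, the number of topics, or the hyperparameters that were used, so `n_topics`, `alpha`, `beta`, `n_iter`, and all function names are assumptions, and the keywords in the usage example are invented placeholders rather than the keywords actually extracted.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def lda_gibbs(docs: Sequence[Sequence[str]], n_topics: int, n_iter: int = 200,
              alpha: float = 0.1, beta: float = 0.01, seed: int = 0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (vocab, topic_word_counts, doc_topic_counts).
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments: List[List[int]] = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        row = []
        for word in doc:
            k = rng.randrange(n_topics)
            row.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(row)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [(doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                           / (topic_total[t] + n_words * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return vocab, topic_word, doc_topic


def top_words(vocab: List[str], topic_word: List[List[int]], k: int = 3) -> List[List[str]]:
    """For each topic, list the k words with the highest counts."""
    return [[vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:k]]
            for row in topic_word]


def topics_by_year(keywords_by_paper: Sequence[Tuple[int, Sequence[str]]],
                   n_topics: int = 1, n_iter: int = 200,
                   seed: int = 0) -> Dict[int, List[List[str]]]:
    """Group each paper's keyword list by year and fit one LDA model per year."""
    by_year: Dict[int, List[List[str]]] = defaultdict(list)
    for year, keywords in keywords_by_paper:
        by_year[year].append(list(keywords))
    result = {}
    for year, docs in sorted(by_year.items()):
        vocab, topic_word, _ = lda_gibbs(docs, n_topics, n_iter=n_iter, seed=seed)
        result[year] = top_words(vocab, topic_word)
    return result


# Toy usage with placeholder keywords (not the ones extracted in this review).
papers = [(2019, ["lymphocyte", "detection"]),
          (2019, ["lymphocyte", "segmentation"]),
          (2021, ["crowdsourcing", "nucleus", "annotation"])]
print(topics_by_year(papers, n_topics=1, n_iter=20))
```

With a single topic per year the model reduces to ranking keywords by frequency; the sampler only becomes informative when several topics are requested and each year contains enough papers.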

Figure 2 shows how the dataset goals change over time, as revealed by the analysis of the LDA model outcome. It is clear that the research community has continued to present ever-larger datasets over the years because the advent of DL has meant that we now need substantial quantities of data to train neural networks.

Fig. 2 Change of topics over the years in the TIL assessment datasets analyzed

We note that the research community has invested significant resources and will continue to do so in order to make annotated datasets for training CAP tools. While these technologies use efficient and innovative approaches to save time, the process remains complex and demands effective communication among experts from diverse sectors [41].

The aforementioned trend shows that TIL evaluation is becoming more dependent on the contribution of CAP tools. This is crucial because it will progressively reduce the differences among expert assessments over time [14]. As a result, in the TIL evaluation scenario, we note a growing inclination to merge the experience of experts from different fields [2, 14].

Conclusions and Proposals for Future Challenges

This review compares datasets used for TIL assessment and encompasses datasets about lymphocytes and inflammatory cells, since, as mentioned in the “Comparative Analysis of Datasets for TIL Research” section, they are related to the TIL assessment scenario.

The research community is working to make larger datasets for TIL assessment and should also provide novel TIL datasets for different tumor types in addition to BC to enhance this area of investigation [92]. It is essential to include annotated images from various scanners in order to leverage models trained on these datasets effectively. This approach ensures improved generalization of the TIL assessment, making the models helpful across a variety of scanners rather than being limited to the performance of a specific one [101,102,103].

By adopting this strategy, CAP tools will become more valuable to experts and enable them to carry out TIL assessments in which they can have greater confidence.

As stated in the “Introduction” section, TIL assessment is performed using H&E images. Nevertheless, there is a supplementary and non-standardized method for measuring TILs, consisting of quantifying immune biomarkers of specific subpopulations of TILs such as CD4, CD8, FOXP3, CD20, and CD22 [19,20,21,22,23]. Information about these immune markers can provide more insight into TILs and into their distribution and spatial relationships, as shown by two studies [104, 105]. However, this evaluation requires IHC images of the markers, obtained, for example, from sections of the paraffin block. It should be noted that the sections, even if very close (e.g., 4 μm [106]) and from the same paraffin block, can vary slightly [107]. There are also a large number of variables that influence antigen staining in paraffin-embedded tissues, such as the type of fixative, fixation time, tissue processing, the level of antigen expression and preservation, and also the clone and the dilution of the antibody used, the antigen-retrieval method, and the detection system and chromogen [108, 109]. Other procedures use multiplexed IHC images to apply diverse IHC staining in a single section. However, these are more expensive and can be beset with problems of antibody compatibility and tissue penetration [110].

To broaden the scope of TIL assessment research, the research community should make these IHC images public and widely available. By doing so, or, better yet, by offering the ground truth, we can rise to exciting challenges like the one already met in distinguishing HER2-positive from HER2-negative BC specimens solely through the evaluation of H&E slides [111]. Thus, a novel aim should be to quantify immune markers directly on H&E-stained images.