Immune infiltrate diversity confers a good prognosis in follicular lymphoma

Background Follicular lymphoma (FL) prognosis is influenced by the composition of the tumour microenvironment. We tested an automated approach to quantitatively assess the phenotypic and spatial immune infiltrate diversity as a prognostic biomarker for FL patients. Methods Diagnostic biopsies were collected from 127 FL patients initially treated with rituximab-based therapy (52%), radiotherapy (28%), or active surveillance (20%). Tissue microarrays were constructed and stained using multiplex immunofluorescence (CD4, CD8, FOXP3, CD21, PD-1, CD68, and DAPI). Subsequently, sections underwent automated cell scoring and analysis of spatial interactions, defined as cells co-occurring within 30 μm. Shannon’s entropy, a metric describing species biodiversity in ecological habitats, was applied to quantify immune infiltrate diversity of cell types and spatial interactions. Immune infiltrate diversity indices were tested in multivariable Cox regression and Kaplan–Meier analysis for overall (OS) and progression-free survival (PFS). Results Increased diversity of cell types (HR = 0.19, 95% CI 0.06–0.65, p = 0.008) and cell spatial interactions (HR = 0.39, 95% CI 0.20–0.75, p = 0.005) was associated with favourable OS, independent of the Follicular Lymphoma International Prognostic Index. In the rituximab-treated subset, the favourable trend between diversity and PFS did not reach statistical significance. Conclusion Multiplex immunofluorescence and Shannon’s entropy can objectively quantify immune infiltrate diversity and generate prognostic information in FL. This automated approach warrants validation in additional FL cohorts, and its applicability as a pre-treatment biomarker to identify high-risk patients should be further explored. The multiplex image dataset generated by this study is shared publicly to encourage further research on the FL microenvironment.
Supplementary Information The online version contains supplementary material available at 10.1007/s00262-021-02945-0.
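As an illustration of the diversity index named in the abstract, the following minimal sketch computes Shannon's entropy from a list of per-cell phenotype labels. The function name `shannon_entropy`, the use of the natural logarithm, and the absence of any normalisation are assumptions made for illustration; the exact formulation used in the study is not stated in this section.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon's entropy H = -sum(p_i * ln p_i) over the observed categories.

    `labels` is a hypothetical list with one phenotype (or interaction type)
    per cell. The natural logarithm is an assumption, not the study's stated choice.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no cells: treated here as zero diversity (a convention)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# A core dominated by one phenotype has lower entropy than an evenly mixed one.
uniform_core = ["CD4", "CD8", "FOXP3", "CD68"] * 25
skewed_core = ["CD4"] * 97 + ["CD8", "FOXP3", "CD68"]
print(shannon_entropy(uniform_core) > shannon_entropy(skewed_core))
```

The same function could be applied to labels describing pairs of cell types found within 30 μm of each other, which is one possible way to express the diversity of spatial interactions.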


Fig. 1 Kaplan-Meier analysis of overall survival (months) for all requested samples in the FL cohort (N=262), stratified by tumour sample availability. Censored data are marked with crosses (log rank test p=0.374).

Fig. 2 Sequential TMA sections setup for multi-plex experiment validation.
Each single-plex assay is compared to an adjacent multi-plex assay.
Experimental assay development and validation were performed using sequential sections from an FFPE TMA constructed from 44 FL cases retrospectively collected from The Christie archives. These patients were diagnosed in the 1980s and 1990s and treated using historical protocols.
A detailed version of the multi-plex immunofluorescent experiment can be found in Tsakiroglou et al. [33]. To establish agreement between single-plex and multi-plex assays, we stained pairs of 4 μm thick, sequential sections from a follicular lymphoma TMA block. In each pair, one section was stained using the multi-plex protocol and the other using a single-plex protocol (Supplementary Fig. 2). DAPI was added in both the single-plex and multi-plex experiments to quantify the whole tissue area in each core. Slides were scanned multispectrally on the Vectra microscope (Akoya Biosciences, software version 3.5) at 20x magnification, and the exposure times were set according to the observed signal strength of each filter. In the single-plex experiments, exposure times were adjusted only for the relevant filters, while the default settings were applied to the remaining filters: 40 ms for the overview and 150 ms for the multi-spectral scan. A spectral library was built and spectral unmixing of all sections was carried out in inForm 2.4 software (Akoya Biosciences).
Image analysis was subsequently performed in HALO software (Indica Labs, Albuquerque, NM, USA). Using the Multi-plex Fluorescent Area Quantification module, automated thresholding of pixel intensities in each channel identified the percentage of stained area. A demonstration of automated area quantification is shown in Supplementary Fig. 3. This algorithm requires the user to specify a minimum true signal intensity. These settings for the single-plex and multi-plex sequential sections were chosen by the same user, leaving a "washout" period of 3 days between them. Cores with artefacts, such as bubbles and blood vessels, were excluded from the analysis. In some cases, a core was missing from one of the two sequential sections (the tissue was broken or torn), and these cores were excluded as well.
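The idea of threshold-based area quantification can be sketched as follows. This is not the HALO algorithm, whose internals are not described here; it is a minimal illustration that assumes an unmixed single-channel image, a precomputed boolean tissue mask, and a user-chosen minimum signal intensity. The names `percent_stained_area`, `channel`, `tissue_mask` and `min_intensity` are hypothetical.

```python
import numpy as np


def percent_stained_area(channel, tissue_mask, min_intensity):
    """Percentage of tissue pixels whose intensity reaches `min_intensity`.

    channel:       2D array of unmixed intensities for one marker (assumed input)
    tissue_mask:   2D boolean array marking tissue pixels (assumed input)
    min_intensity: user-specified minimum true signal intensity
    """
    channel = np.asarray(channel, dtype=float)
    tissue_mask = np.asarray(tissue_mask, dtype=bool)
    tissue_pixels = tissue_mask.sum()
    if tissue_pixels == 0:
        return float("nan")  # undefined when the core contains no tissue
    stained = (channel >= min_intensity) & tissue_mask
    return 100.0 * stained.sum() / tissue_pixels


# Toy 2x2 "core": two of the four tissue pixels reach the threshold of 10.
toy_channel = [[0, 5], [10, 20]]
toy_mask = [[True, True], [True, True]]
print(percent_stained_area(toy_channel, toy_mask, min_intensity=10))
```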
Comparisons between single-plex and multi-plex experiments demonstrated satisfactory linear correlations, as shown in Supplementary Fig. 4. For most markers, slightly lower staining expression was observed in the multi-plex than in the single-plex experiments. This was not observed for CD4, the first antibody placed on the tissue. Lower expression may derive from incomplete stripping between staining cycles, which may lead to steric obstruction and slightly decreased antibody binding. This effect was, however, not significant, as seen in the Bland-Altman plots (Supplementary Fig. 5). The mean difference between the two experiments was usually close to zero, with ≥95% of data points lying within the limits of agreement (mean ± 1.96 standard deviations) for all markers.

Comparison of % tissue area stained by each marker in two sequential 4 μm TMA sections, a multi-plex and a single-plex. The single-plex was also stained with DAPI, and both sections were scanned multi-spectrally at 20x and unmixed with the same spectral library. Each point represents a TMA core.

Fig. 3 Bland-Altman plot comparisons between single-plex and multi-plex immunofluorescent assays for each antibody.
Antibody expression is measured as a percentage of the positively stained tissue area. Each point represents a TMA core. The dotted lines represent the limits of agreement (mean difference ± 1.96 standard deviations of the difference).
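The limits of agreement shown in the Bland-Altman plots can be computed along the following lines. This is a minimal sketch: the function name `bland_altman` and the input arrays (one % stained area value per TMA core and assay) are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np


def bland_altman(single_plex, multi_plex):
    """Return mean difference, lower/upper limits of agreement, and the
    fraction of cores whose difference lies within those limits."""
    a = np.asarray(single_plex, dtype=float)
    b = np.asarray(multi_plex, dtype=float)
    diff = a - b                      # per-core difference between assays
    mean_diff = diff.mean()
    sd = diff.std(ddof=1)             # sample standard deviation (assumption)
    lower = mean_diff - 1.96 * sd
    upper = mean_diff + 1.96 * sd
    within = np.mean((diff >= lower) & (diff <= upper))
    return mean_diff, lower, upper, within


# Toy values for four cores (illustrative only, not study data).
mean_diff, lower, upper, within = bland_altman([10, 20, 30, 40], [9, 21, 29, 41])
print(mean_diff, lower, upper, within)
```

A mean difference near zero indicates no systematic bias between the assays, while the width of the limits reflects the core-to-core scatter.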

Cell detection
To assess segmentation performance, we considered the average precision $\mathrm{AP} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$, where true positive (TP) predictions are defined as predicted nuclei for which a ground truth (GT) nucleus with sufficient overlap exists. Sufficient overlap was defined as an intersection over union (IoU) > 30%. False positives (FP) were the unmatched predicted nuclei, while false negatives (FN) were the unmatched ground truth nuclei. There were 3 regions of interest (ROIs; 883 nuclei) in the test set, 3 ROIs (906 nuclei) in the validation set and 35 ROIs (67991 nuclei) in the training set. The average precision for the test set of nuclei was AP = 0.827. The worst image in the test set is presented in Supplementary Fig. 6 (AP = 0.733), and Supplementary Table 2 gives the AP on the test set for different IoU thresholds.

After nuclear segmentation, simulated membranes were grown around the nuclei by a maximum of 1.5 μm to represent whole cells (see Supplementary Fig. 7), and the median intensity of every stain was measured in each cell compartment (nucleus, membrane). All images in the dataset were manually examined, and areas that presented artefacts because of folded tissue, bubbles or blood vessels were excluded from further analysis.
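The average precision described under Cell detection can be illustrated with the sketch below. Several details are assumptions not stated in the text: each nucleus is represented as a set of pixel coordinates, and predicted and ground truth nuclei are paired one-to-one by greedy matching in order of decreasing IoU. The function names `iou` and `average_precision` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two nuclei given as sets of (row, col) pixels."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def average_precision(predicted, ground_truth, iou_threshold=0.3):
    """AP = TP / (TP + FP + FN) at a single IoU threshold.

    Matching is greedy by decreasing IoU (an assumption); each nucleus
    can be matched at most once.
    """
    pairs = sorted(
        ((iou(p, g), i, j)
         for i, p in enumerate(predicted)
         for j, g in enumerate(ground_truth)),
        reverse=True,
    )
    matched_pred, matched_gt = set(), set()
    for score, i, j in pairs:
        if score <= iou_threshold:
            break  # remaining pairs overlap too little to count as matches
        if i not in matched_pred and j not in matched_gt:
            matched_pred.add(i)
            matched_gt.add(j)
    tp = len(matched_pred)
    fp = len(predicted) - tp      # unmatched predicted nuclei
    fn = len(ground_truth) - tp   # unmatched ground truth nuclei
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")


# Toy example: one good match, one spurious prediction, one missed nucleus.
gt = [{(0, 0), (0, 1)}, {(5, 5)}]
pred = [{(0, 0), (0, 1), (0, 2)}, {(9, 9)}]
print(average_precision(pred, gt))  # TP=1, FP=1, FN=1
```

Evaluating the same function over a range of `iou_threshold` values gives a table of AP against IoU threshold.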

Positive cell scoring
A validation set of 10 images, each containing a whole core, was selected to determine a positivity cut-off for each stain, based on the median stain intensity of the relevant compartment (nuclear for FOXP3 and membrane for all other stains). The positivity cut-offs were determined as follows. First, for each stain, intensities were scaled onto a consistent colour map across all images so that equal intensity levels were represented by equal brightness. A cut-off threshold was then selected per image core and stain by two independent annotators (a non-expert [A.M.T] and a trainee pathologist [M. D.]) to separate positive from negative cells. Agreement between the two annotators is shown in Supplementary Table 3. Finally, a single threshold was selected as the positivity cut-off for each stain by averaging all thresholds selected for the images in the validation set by both annotators.
Agreement was assessed by using the thresholds to classify the cells as positive or negative and calculating the F1 score (harmonic mean of precision and recall) between the labels generated by different annotators. The fact that a single threshold across all images was mostly adequate to separate positive from negative cells (0.68 ≤ F1 score ≤ 0.92) indicates low staining variation across different patients and TMA blocks. These single cut-off thresholds were finally applied to phenotype cells in the entire data set.
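The agreement measure and the consensus cut-off described above can be sketched as follows. The function names `f1_agreement` and `consensus_cutoff`, the use of `>=` for positivity, and the convention of returning 1.0 when neither annotator labels any cell positive are assumptions made for illustration.

```python
def f1_agreement(intensities, threshold_a, threshold_b):
    """F1 score between the positive/negative labels implied by two thresholds.

    Annotator A's labels are treated as the reference; F1 is unchanged if the
    roles are swapped, because FP and FN simply exchange places.
    """
    labels_a = [x >= threshold_a for x in intensities]
    labels_b = [x >= threshold_b for x in intensities]
    tp = sum(a and b for a, b in zip(labels_a, labels_b))
    fp = sum(b and not a for a, b in zip(labels_a, labels_b))
    fn = sum(a and not b for a, b in zip(labels_a, labels_b))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # no positives at all: full agreement


def consensus_cutoff(thresholds_a, thresholds_b):
    """Single cut-off per stain: mean of all per-image thresholds from both annotators."""
    all_thresholds = list(thresholds_a) + list(thresholds_b)
    return sum(all_thresholds) / len(all_thresholds)


# Toy median intensities for six cells and one threshold per annotator.
cells = [1, 2, 3, 4, 5, 6]
print(f1_agreement(cells, threshold_a=3, threshold_b=4))  # TP=3, FP=0, FN=1
print(consensus_cutoff([3, 5], [4, 6]))
```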

Table 3 Agreement for cell labels generated by selecting a positivity cut-off per image in the validation set.
Agreement is calculated as the F1 score, representing the harmonic mean of precision and recall for the binary classification task of assigning a cell as positive or negative for each stain.

This method was applied to identify cells positive for CD4, FOXP3, CD8, CD68 and PD-1. It was not adopted for CD21, because CD21+ staining followed a non-convex meshwork pattern that would be difficult to reproduce accurately by simply growing simulated membranes around the nuclei.

Fig. 6
Dendritic meshwork areas were annotated manually by drawing around the CD21+ meshwork pattern regions. (A) CD21 (red) and DAPI (blue) view of a multi-plex TMA core image. (B) Manual annotation of dendritic meshwork areas, overlaid in grey.

Q25 and Q75: 25th and 75th quantile, respectively. CoV: The average intra-patient coefficient of variation. †Immune infiltrate ratio is calculated as the total immune cells (positive for any marker) divided by the number of cells that expressed only DAPI.