Evaluation Challenges in the Validation of B7-H3 as Oral Tongue Cancer Prognosticator

B7-H3 was the only molecule identified with prognostic potential from a recent systematic review of the prognostic value of immune checkpoints in oral cancer. We aimed to validate this finding in a multicenter international cohort. We retrospectively retrieved 323 oral tongue squamous cell carcinoma (OTSCC) samples from three different countries (Brazil, Finland, and Norway) for immunostaining and scoring for B7-H3. We evaluated tumor immunogenicity by analyzing the amount of tumor-infiltrating lymphocytes and divided the tumors into immune hot and cold. To increase the reliability of the results, both digital and manual visual scoring were used. Survival curves were constructed based on the Kaplan-Meier method, and the Cox proportional hazard model was utilized for univariate and multivariate survival analysis. B7-H3 expression was not significantly associated with overall or disease-specific survival in the whole OTSCC cohort. When divided into immune hot and cold tumors, high B7-H3 expression was significantly associated with poor disease-specific and overall survival in the immune hot group, depending on the scoring method and the country of the cohort. This was achieved only in the univariate analysis. In conclusion, B7-H3 was a negative prognosticator for OTSCC patient survival in the subgroup of immune hot tumors, and was not validated as a prognosticator in the full cohort. Our findings suggest that the immune activity of the tumor should be considered when testing immune checkpoints as biomarkers.
Electronic supplementary material: The online version of this article (doi:10.1007/s12105-020-01222-3) contains supplementary material, which is available to authorized users.


Introduction
The incidence of oral together with lip squamous cell carcinoma (OSCC) is unfortunately increasing. In 2018, the number of new cases worldwide was approximately 350,000 with an annual mortality of approximately 180,000 [1]. This increase is not associated with an increase in the 5-year survival rate, which remains at approximately 50% for most countries [1,2]. Therefore, there is a need for new treatment and therapeutic approaches. OSCC arising from the tongue (OTSCC) is the most aggressive subgroup of oral cancers and is characterized by high rates of metastasis and mortality [3]. OTSCC carcinogenesis is traditionally associated with heavy alcohol and tobacco use [4]. In addition, evasion of the host immune response to the tumor has been recognized as a key feature in the carcinogenesis process, and tumors with low infiltration of immune cells respond more poorly to immune-based cancer therapies [5]. These findings have led the way to a novel classification of tumors into two categories, "hot" (or inflamed) and "cold" (or non-inflamed), based on quantification of tumor-infiltrating lymphocytes (TILs) [6]. Head and neck cancers are generally highly infiltrated by lymphocytes and are thus immune hot; however, the poor patient survival suggests that the anti-tumor immune response is ineffective [7].
Aini Hyytiäinen and Rabeia Almahmoudi have contributed equally to this manuscript.
Immune checkpoints play a predominant role in the initiation of CD4+ and CD8+ T cell-dependent immune responses by regulating interactions between co-stimulatory ligands and their receptors [8]. Ligand members of the B7/CD28 superfamily, such as B7-H1 (PD-L1) and B7-H3, can modulate the initiation by either amplifying or inhibiting co-stimulatory signals. PD-L1 overexpression inhibits the activation of functional T cells [9] and PD-1/PD-L1 axis inhibition has been adopted as a therapeutic approach for OSCC [10]. On the other hand, B7-H3 has no identified receptors and is theorized to be involved in both co-stimulation and co-inhibition of T cells [11]. In vitro, B7-H3 increases activity of CD8+ T cells but also inhibits T-cell proliferation and reduces secretion of relevant immune mediators such as interferon-γ, tumor necrosis factor α, and other cytokines [12,13].
Several studies have been conducted to determine the prognostic value of immune checkpoints in oral cancer [14]. In our recent systematic review, B7-H3 showed evidence as an adverse prognostic factor in OSCC, while other immune checkpoints were either studied once or had controversial results [14]. According to Almangush and co-authors, hundreds of biomarkers have been studied as prognostic markers for OSCC, but none are in clinical use [15]. This may be due to several factors, mainly missing validation, as among the 12 immune-modulating molecules investigated thus far, only four had been studied more than once.
Since B7-H3 was the only immune checkpoint molecule that showed a potential role in the prognostication of OSCC, we sought to validate this result in a multicenter international cohort study.

Methods and Materials
This study was performed according to the REMARK guidelines for tumor marker prognostic studies [16].

Immunohistochemical Staining
To optimize the staining protocol, we used the following two antibodies for B7-H3: rabbit anti-human B7-H3 (D9M2L, 1:200, Cell Signaling Technology, Leiden, Netherlands) and goat anti-human B7-H3 (AF1027, 1:1000, R&D Systems, Minneapolis, MN, USA). Antibody selection was based on two published articles [8,17]. Three researchers (M.S., junior trainee; A.A-S., senior trainee; and T.S., oral pathologist) evaluated the staining with an optical microscope (Leica DM6000 together with Leica DFC365-FX camera, Leica Biosystems, Newcastle, UK). Both antibodies had the same staining pattern with slight differences in staining intensity (Online Resource 1); therefore, both were used in this study. The Finnish samples were stained with the rabbit anti-human antibody, and the Norwegian and Brazilian samples were stained with the goat anti-human antibody.
For the rabbit antibody, Dako Real EnVision Detection system K5007 kit (Dako, Carpinteria, CA) was used for staining. After deparaffinization, epitopes were retrieved in Tris-EDTA buffer (pH 9) for 15 minutes using a microwave and followed by cooling at room temperature for 20 minutes. Dako Peroxidase blocking solution S2023 was next applied for 15 minutes. Sections were then incubated with the rabbit B7-H3 primary antibody for 1 hour at room temperature followed by Dako HRP for 30 minutes at room temperature.
For the goat-based antibody, we used a goat on rodent HRP-polymer detection kit (Biocare Medical, Pacheco, CA). After deparaffinization, antigens were retrieved in citrate buffer (Dako) for 15 minutes using a microwave and followed by cooling at room temperature for 20 minutes. Dako peroxidase blocking solution S2023 was then applied for 15 minutes. Sections were then incubated with the goat B7-H3 primary antibody for 30 minutes. Goat probe from the detection kit was added for 15 minutes and followed by goat on rodent HRP polymer for 15 minutes.
Both rabbit and goat sections were then incubated with chromogen DAB for color formation for 15 minutes and washed in dH2O for 5 minutes. The slides were then counterstained with Mayer's hematoxylin solution (Sigma-Aldrich, St. Louis, MO, USA) and mounted in Mountex (HistoLab, Gothenburg, Sweden).
Slides were scanned using a Leica Aperio AT2 (Leica Biosystems) to be analyzed using QuPath software [18]. The specificity of each staining was confirmed with staining controls.

trainee) evaluated all scanned samples independently and then jointly for consensus while blinded to any clinical data. Staining intensity was evaluated as 0-3 (0: negative, 1: weak, 2: moderate, and 3: strong; Fig. 1) and the staining area was evaluated as 0-3 (0: 0%, 1: >0-25%, 2: >25-50%, 3: >50%). The staining index was calculated as the sum of the two scores. The Norwegian TMA samples did not allow a meaningful evaluation of the staining area; thus, only staining intensity is reported for these samples.
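As an illustration only, the manual staining index described above could be computed as in the following Python sketch. The function names are hypothetical, and the handling of the boundary values (exactly 25% and 50% assigned to the lower category) is an assumption, since the text does not state it explicitly.

```python
def area_score(percent_stained: float) -> int:
    """Map the percentage of stained area to the 0-3 area score.

    Assumption: 25% and 50% fall into the lower category.
    """
    if not 0 <= percent_stained <= 100:
        raise ValueError("percent_stained must be between 0 and 100")
    if percent_stained == 0:
        return 0
    if percent_stained <= 25:
        return 1
    if percent_stained <= 50:
        return 2
    return 3


def staining_index(intensity: int, percent_stained: float) -> int:
    """Sum of the intensity score (0-3) and the area score (0-3)."""
    if intensity not in (0, 1, 2, 3):
        raise ValueError("intensity must be 0, 1, 2, or 3")
    return intensity + area_score(percent_stained)
```

For example, a sample with moderate intensity (2) and 30% stained area would receive an area score of 2 and a staining index of 4 under these assumptions.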

Assessment of B7-H3 Expression using Digital Scoring
In addition to the traditional manual visual scoring, we sought to validate our results by using a free, automated analysis software, QuPath [18]. Two researchers (M.S. and P.C.) with coding experience developed the automated scoring protocol. First, the program was calibrated to detect colors by estimating the staining vectors. All FFPE and TMA samples had similar modal RGB and DAB values. Second, the classifier was taught to recognize cancer and stromal cells by choosing five areas of tumor and five areas of stroma for ten slides. Cell and membrane detection was performed according to the developer instructions (Online Resource 2). Third, the classifier was calibrated by comparing different mean DAB OD values for different slides from different countries to determine the thresholds to be used. Both researchers performed the analysis first independently and then agreed on the values to be used on all slides (Online Resource 2). The classifier was then saved and the script was coded. One researcher for each cohort (Finland: M.S.; Brazil: P.C.) selected five representative areas of the invasive front in FFPE samples (TMA samples were analyzed as a whole) and ran the automated software. The scripts are available in Online Resources 3 and 4. Results were in the form of H-score and were extracted for survival analysis.
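To make the H-score output more concrete, the following Python sketch shows the conventional H-score calculation (1 × % weak + 2 × % moderate + 3 × % strong cells, range 0-300) applied to per-cell mean DAB optical density values. It is not the study's QuPath script: the function name and the three thresholds are hypothetical placeholders, and the thresholds actually used are those reported in Online Resource 2.

```python
def h_score(dab_od_values, thresholds=(0.2, 0.4, 0.6)):
    """Compute an H-score (0-300) from per-cell mean DAB OD values.

    thresholds: hypothetical cut-offs separating negative/weak,
    weak/moderate, and moderate/strong staining.
    """
    if not dab_od_values:
        raise ValueError("at least one detected cell is required")
    weak_cut, moderate_cut, strong_cut = thresholds
    counts = [0, 0, 0, 0]  # negative, weak (1+), moderate (2+), strong (3+)
    for od in dab_od_values:
        if od >= strong_cut:
            counts[3] += 1
        elif od >= moderate_cut:
            counts[2] += 1
        elif od >= weak_cut:
            counts[1] += 1
        else:
            counts[0] += 1
    weighted = 1 * counts[1] + 2 * counts[2] + 3 * counts[3]
    return 100.0 * weighted / len(dab_od_values)
```

Because the score depends directly on the chosen thresholds, the same slide can yield different H-scores under different settings, which is why the thresholds need to be fixed before all slides are analyzed.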

Tumor-Infiltrating Lymphocytes (TILs) Scoring
Two researchers (A.H., junior trainee; M.S., junior trainee) evaluated the presence of TILs in the Brazilian and Finnish samples independently and divided the cases into immune hot and cold while remaining blinded to any clinical data [6]. Disagreements between evaluators were resolved by an experienced researcher (A.A-S., senior trainee). Scoring was conducted as previously described [19]. In line with that study, only stromal TILs were assessed. The scoring was defined as the percentage of stroma occupied by lymphocytes (0%, 5%, 10%, 20%, 30%, 40%, and ≥ 50%). Only areas directly related to the invasive front were included in the estimation. Areas of fibrosis, central necrosis, or artefacts were excluded. Norwegian cases were not scored for TILs as they were TMAs, which do not allow the full evaluation of the tumor stroma.

Statistical Analysis
After scoring, cases were divided into high and low expression using the median as the cut-off point. We also performed the analysis by calculating the optimal cut-off point [21], but this did not change the results (data not shown). Additionally, the cases were divided into immune hot if TILs were ≥ 20%, and cold if median TILs were < 20%, based on a previous study [19]. The κ coefficient was calculated and the prognosis of patients in relation to overall survival and disease-specific mortality was analyzed using SPSS software program version 21.0 (IBM SPSS Statistics, SPSS INC, Chicago, IL, USA). Life tables were calculated according to the Kaplan-Meier method. Survival curves were compared with the log-rank test. Univariate and multivariate survival analyses were performed with Cox's proportional hazards model. In multivariate analysis, the results were adjusted for age, sex, grade, stage, and lymph node metastasis. Statistical significance was set at p < 0.05.
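The grouping and survival steps above can be sketched in plain Python as follows. This is a minimal illustration, not the SPSS procedure used in the study: the function names are hypothetical, the assignment of scores equal to the median to the "low" group is an assumption, and the Kaplan-Meier function implements only the standard product-limit estimate without confidence intervals or the log-rank test.

```python
from statistics import median


def dichotomize_by_median(scores):
    """Label each score 'high' or 'low' using the cohort median as cut-off.

    Assumption: scores equal to the median are labeled 'low'.
    """
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]


def immune_status(til_percent, cutoff=20):
    """Classify a tumor as immune 'hot' (TILs >= cutoff %) or 'cold'."""
    return "hot" if til_percent >= cutoff else "cold"


def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up times (e.g. months); events: 1 = death, 0 = censored.
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve
```

In such a workflow, the curves for the "high" and "low" groups would be estimated separately within the hot and cold subgroups and then compared with a log-rank test in a statistics package.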

B7-H3 Expression in OTSCC Samples
B7-H3 was mainly expressed at the membrane of the cancer cells (Fig. 2a). Staining was also seen in the cytoplasm in some heavily stained samples (Fig. 2b). The staining was mainly concentrated at the periphery of the tumor islands (Fig. 2c). However, the whole tumor island was positive in some cases (Fig. 2d).

B7-H3 Expression is not Associated with Survival of OTSCC Patients
During follow-up, 120 patients died of OTSCC, 56 patients died of other causes, and 147 patients were alive at the end of the follow-up period. Median follow-up time was 40 months (range: 0-252 months). B7-H3 expression was not significantly associated with OTSCC mortality (Table 2). We performed the analysis for each country separately to determine if differences in population or laboratories had any impact on the results. All subgroups showed similar results to the combined data, which indicated that B7-H3 is not significantly associated with disease-specific or overall survival (Table 2). Even with the optimal cut-off points, no significance was found in any of the subgroups (data not shown).

High B7-H3 Expression is Associated with Poor Survival in the Immune Hot Subgroup of OTSCC Patients
As B7-H3 mainly exerts its effects on lymphocytes, we divided the cancer samples into immune hot or cold (high or low amount of TILs, respectively) and performed the analysis for each group separately. The TMA samples from Norway could not be separated into these two groups. In immune hot cases, high B7-H3 expression was associated with low overall survival in digitally scored Brazilian samples, and with low disease-specific survival in manually visually scored (scoring index and area) Finnish cases (Table 3; Fig. 3). The significant association was not observed in multivariate analysis (data not shown). In immune cold cases, no significant association was found between B7-H3 expression and patient survival in any of these analyses (Table 3).

Discussion
This multicenter international study sought to validate the prognostic value of B7-H3 in OTSCC, as this immune checkpoint was reported as a prognostic marker twice in head and neck and OSCC [8,17]. In our OTSCC patient cohort, high B7-H3 expression was associated with poorer prognosis in some national subgroups, depending on the scoring method, but only for patients whose tumors were highly infiltrated by lymphocytes (immune hot); it was not prognostic in the full cohort. A significant association was found only in the univariate analysis, not in the multivariate analysis. Our results highlight a common and serious problem in prognostic marker studies, which can be called the "replication crisis" [22]. Recent systematic reviews of prognostic markers for oral cancer have suggested hundreds of molecules as putative prognostic markers [14,15,23,24]. However, none of them have been adopted into clinical use, and patient management is still mainly based on the clinical TNM staging due to missing or failed validation [15].
In the two previous studies on B7-H3 [8,17], both patient cohorts were from Asia (Taiwan and Wuhan). National subgroups tend to differ in their exposure to risk factors for OSCC, such as heavy tobacco, alcohol, and betel nut use. This may explain why results from OSCC cohorts from one part of the world are not necessarily applicable to others. In addition, the samples analyzed in the two previous articles were obtained from the whole oral cavity and the head and neck area, while our samples were only from the tongue [8,17]. In contrast to OTSCC, HPV infection is recognized as an important risk factor for oropharyngeal cancers, underlining the need to distinguish these cancers [25]. Differences in ethnicity and tumor location could be the reason why we failed to validate B7-H3 in our patient cohort. In this study, we collected samples from three different nations (Brazil, Finland, and Norway) representing OTSCC patients of different ethnicities. This resulted in a large sample size, which helps to reduce statistical bias.
To analyze the quality and repeatability of our scoring, we measured inter-rater reliability for manual visual scoring with a κ coefficient. The highest scores, with almost perfect agreement, were achieved by senior trainees. Junior trainees had the lowest scores, with moderate agreement. Thus, extra care should be taken when selecting those who score the stained slides, and we recommend that slide evaluation be done by pathologists with sufficient expertise in this field.
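For readers who wish to reproduce such an agreement check, the sketch below computes an unweighted Cohen's κ for two raters in Python. The text does not specify which κ variant was used, so the choice of the unweighted two-rater form is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same samples."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set")
    n = len(rater_a)
    # Observed proportion of samples on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's category frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)
```

Because the intensity and area scores are ordinal (0-3), a weighted κ that penalizes larger disagreements more heavily would be a reasonable alternative.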
The field of pathology is rapidly moving towards automated digital scoring and the use of artificial intelligence [26]. In addition to manual visual scoring, we scored the slides using the free automated software QuPath. Use of automated software not only reduces the time for scoring but also increases the reliability of the results and reduces the risk of bias [26]. One of the major challenges of applying this software as a scoring tool is the difference in settings between laboratories and investigators. Thus, we highly recommend that authors publish all adjustable settings to allow others to replicate and validate the work. Even though the digital and manual visual scoring went hand in hand in the majority of cases, in some cases they gave different results, which calls for better digital and manual visual scoring protocols.
Another serious problem in the field of prognostic marker studies and validation is related to variation of antibody specificity. Theoretically, all antibodies should give similar results if the antibody passes the manufacturer's quality control. Unfortunately, in practice there are large variations, not just between different antibodies from different manufacturers, but also between different lots from the same manufacturer. For this reason, we tested two antibodies from two different companies, which, fortunately, gave similar staining patterns.
Immune checkpoints, including B7-H3, are a group of molecules with effects on immune cells. Recent advancements in cancer immune therapeutics have resulted in tumors being categorized into hot and cold [6]. Therefore, in this study we investigated lymphocyte infiltration in the invasive area of the tumor tissue. While our survival results were not significant in the entire cohort, we observed significant results in hot tumors in the univariate analysis. Our findings further indicate that without the affected cells (lymphocytes) in the tumor, the prognostic power of B7-H3 (and likely also other immune checkpoints) may be lost. Thus, we stress the necessity of investigating the immune activity of tumors when assessing the prognostic value of B7-H3 expression.
In conclusion, in this multicenter international study, evaluation of B7-H3 expression revealed prognostic potential for patients with tumors that were highly infiltrated with lymphocytes. However, B7-H3 did not have prognostic value in the whole OTSCC cohort. This study highlighted an important issue in the field of prognostic markers, which is the "replication crisis". For prognostic studies on immune checkpoints, we encourage researchers to analyze the immune activity of the tumor samples. We also encourage researchers to publish the scoring protocols (for either manual visual or digital scoring) in detail with all adjustable parameters to allow careful replication and validation of the work. Only immunohistochemical markers with prognostic power validated by several research groups and in cohorts from different countries may have the potential to become a useful tool for universal clinical pathology.
Human Research Ethics Committee of the Piracicaba Dental School, University of Campinas (Brazil), and the Institutional Review Board of Northern Norwegian Regional Committee for Medical Research Ethics (REK Nord) with validated approval for all hospitals (Norway).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.