Introduction

A large number of clinical vaccination trials have been conducted in the last 15 years in cancer patients [1]. Only a few vaccine candidates, however, have advanced to phase III clinical trials, and none have been approved so far [2]. Recent insights into the mechanisms that regulate immune responses against tumors have led to the identification of a broad selection of compounds (antibodies, small molecules, cytokines, TLR ligands) that have the potential to increase the efficacy of vaccine regimens [3]. However, the number of combinations that can be brought to advanced stages of clinical testing is limited due to the complex approval process for investigational new drugs. While it is not yet possible to predict the clinical outcome of therapeutic tumor vaccination from the results of T cell immunomonitoring assays, the development of validated surrogate immunological end points for tumor vaccine activity, and ideally efficacy, could dramatically accelerate clinical development [4]. Consequently, the search for such surrogate immunological assays has become a high-priority task for the field.

For cancer vaccines, this search has led to the continuous development of more powerful and sensitive assays to monitor immune responses in vaccinated patients [5, 6]. Indeed, improved immunization strategies and new monitoring techniques have led to a steady increase of the fraction of cancer patients with observed vaccine-related immune responses [7–10]. Nevertheless, clinical responses occur in a small fraction of patients, which may correlate with the presence of a specific immune response [11–13]. The apparent disconnect between immunomonitoring data and clinical events may be explained by several factors. Measuring a single immune system parameter in one tissue compartment, mostly IFN-gamma secreting T cells in the peripheral blood, as is the case for the majority of clinical trials thus far, might not be sufficient to capture an immunological signature correlating with the development of an appropriate anti-tumor response. A growing body of data suggests that multiple functions of antigen-specific T cells rather than, or in addition to, T cell numbers correlate with clinical outcome, as described in studies of protective immune responses to microbial infections in mice and humans [14–16]. There is further evidence that classical T cell assays will have to be used to measure responses in the most appropriate tissue compartment [17, 18]. An additional important factor to consider for the observed lack of correlation between results from immunomonitoring and clinical events is the use of non-standardized and non-validated immune monitoring assays. The use of such assays precludes the direct comparison of results obtained by labs across institutions [19] and also significantly impacts the ability to compare data obtained from different patients and time points within the same study.
Thus, immune monitoring assays need to be standardized, validated (or at a minimum qualified) and auditable within laboratories before they can be appropriately applied to evaluate samples obtained from clinical trials. Finally, only qualified assays can be used to effectively guide the development of new drugs [20] or serve as surrogate clinical endpoints.

Approximately 3 years ago, two international associations (CVC-CRI and the Association for Immunotherapy of Cancer) initiated programs to address immune assay standardization within the cancer scientific community [21]. A similar effort is ongoing in the field of infectious disease, mainly driven by HIV researchers [22]. The primary goals of the CVC-CRI proficiency panel program are to allow for the harmonization of immune monitoring assays across institutions to a degree needed to represent stable biomarker assays and to provide an external quality assurance resource for laboratories participating in the proficiency panel. The specific aims of the first CVC-CRI multimer proficiency panel presented in this manuscript were to (1) demonstrate the feasibility of such a large international inter-laboratory testing project for HLA-peptide multimers, (2) identify protocol variables, reagent choices and strategies that are relevant in the formulation of harmonization guidelines for assay protocol optimization, (3) provide each participating lab direct feedback about their qualitative and quantitative performance in relation to the other members of the group (external validation), (4) quantify the variation of results reported by such a large number of labs (inter-center variation), and (5) determine the variation of results obtained within the same lab when the same protocol is used to quantify antigen-specific CD8+ T cells in the same samples at two different time points (intra-center variation).

Materials and methods

Panel design

The first CVC multimer panel was conducted with a group of 27 centers. Each lab had to determine the frequency of CD8+ T cells specific for two model antigens in cryopreserved peripheral blood mononuclear cell (PBMC) samples from five donors (D1–D5). All participants used either HLA-peptide tetramers or pentamers that were generously donated by Beckman Coulter (Fullerton, CA) and Proimmune (Oxford, UK), respectively. Two labs used both tetramers and pentamers and generated two separate data sets. Consequently, we were able to collect 29 complete data sets in a first step of testing. All participants were further offered the possibility to receive a second set of PBMC batches to allow for repetition of the experiments in a second step. Nine of the 27 labs made use of this offer and completed a second round of testing of the same PBMC samples with one lab contributing two separate data sets.

Participants and organizational setup

Participating laboratories were located in eight countries (Belgium, Canada, Germany, France, Sweden, Switzerland, UK, and USA). Each laboratory received an individual lab ID number. Panel leadership was provided by two scientific leaders experienced in MHC-peptide multimer staining, in collaboration with the CVC executive office. The Lausanne branch of the Ludwig Institute for Cancer Research (LICR-LB) performed extensive pre-testing of donor samples and selection of appropriate donors.

PBMCs and peptides

PBMCs from healthy donors were obtained from a commercial donor bank and processed under GMP conditions using established standard operating procedures at Cellular Technologies Limited (CTL), Cleveland, OH. PBMCs were frozen using a rate controlled freezer and transferred to the vapor phase of liquid nitrogen. Cell separation procedure and freezing of cells were conducted under validated conditions. It was demonstrated that functionality and viability were maintained throughout the procedure. Each vial of PBMCs contained enough cells to ensure a recovery of 10 million cells or more under CTL’s SOP. Cells were shipped to all participants for overnight delivery on sufficient dry ice for 48 h for centers within the USA. Centers located in Europe and Canada received cells shipped in dry shippers filled with sufficient liquid nitrogen to assure cell integrity for up to 7 days. Shipment of cells was performed by CTL under their existing SOPs.

PBMCs were pretested at the LICR-LB for reactivity against the HLA-A2-restricted Influenza-M1 58–66 (antigen “A1” = GILGFVFTL) [23] and Melan-A/Mart-1 26–35 A27L (antigen “A2” = ELAGIGILTV) [24]. PBMCs from five donors were selected for no, low, medium and strong responses against the two tested antigens. Pre-testing was performed by generating three independent data sets using HLA-peptide tetramers (Beckman Coulter Immunomics), HLA-peptide pentamers (ProImmune) and HLA-peptide tetramers manufactured at the LICR-LB for research use. For the proficiency panel testing, MHC-peptide tetramers (Beckman Coulter) and pentamers (ProImmune) specific for both model antigens were provided in sufficient amounts to perform 50 stainings per center and had to be applied as a 10 μl volume per staining. All cells and reagents sent to participants in both panels were obtained from the same batches.

HLA-peptide multimer staining

One of the main features of the first CVC multimer panel was that all participants (in both steps) used their preferred reagents and locally established protocols to determine the percentage of antigen-specific CD8+ T cells. The only requirement was to use the HLA-peptide multimers provided by the respective commercial sources. Moreover, the organizers avoided offering recommendations, providing protocols, or asking for any other measure of harmonization.

Statistical analysis

The following parameters were calculated for both the overall panel and the individual participant’s performance, using the lab-specific reported percentage of multimer-positive CD8-positive cells: the mean, standard deviation, and coefficient of variation (CV), the median, 25th and 75th percentiles, minimum, and maximum percentage of CD8-positive cells for each donor and antigen.

Due to the extreme variation of the reported values we applied two data filters to remove a large proportion of non-meaningful results from the final analysis. The first filter was based on the coefficient of variation (CV) of the replicates within a lab for a given donor and antigen and was used to eliminate outlier replicate values from the final analysis. Replicate sets that had a CV greater than 75, or that contained only one measurement for a given donor and antigen combination, did not pass this filter. The second filter was based on the dot plots generated from the lab’s analysis of the flow cytometer results. An independent evaluator examined all the dot plots and assigned a score based on the clustering of positive events. A score of 0 was given when there was no clustering, a score of 1 for ambiguous results, and a score of 2 when there was a clearly clustered population of positive events in the upper right quadrant. Labs with replicates that had a total score of less than 4 did not pass this filter. Therefore, at least one staining from a given set of replicates had to show clear clustering to pass the filter. Labs that passed both filters were considered to have detected a response for that donor and antigen. The association between detected response rate and number of counted CD8+ T cells was tested using the Chi square test.
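The two filters described above can be illustrated with a short Python sketch. This is not the code used for the panel analysis. The function names are hypothetical, the CV is assumed to be expressed in percent and computed from the sample standard deviation, and a replicate set with a mean of zero is assumed to fail the CV filter; none of these details is specified in the text.

```python
from statistics import mean, stdev

CV_CUTOFF = 75.0      # cutoff from the text; assumed to be in percent
MIN_TOTAL_SCORE = 4   # minimum summed dot-plot score (0, 1 or 2 per replicate)


def percent_cv(values):
    """Coefficient of variation in percent (sample SD / mean * 100)."""
    m = mean(values)
    if m == 0:
        # Assumption: an all-zero replicate set has no defined CV and fails.
        return float("inf")
    return stdev(values) / m * 100.0


def passes_cv_filter(replicates):
    """Filter 1: at least two measurements and a CV not above the cutoff."""
    if len(replicates) < 2:
        return False
    return percent_cv(replicates) <= CV_CUTOFF


def passes_dotplot_filter(scores):
    """Filter 2: the evaluator's scores must sum to at least 4."""
    return sum(scores) >= MIN_TOTAL_SCORE


def response_detected(replicates, scores):
    """A response counts as detected only if both filters are passed."""
    return passes_cv_filter(replicates) and passes_dotplot_filter(scores)
```

With triplicates, a total score of at least 4 cannot be reached by three ambiguous stainings (1 + 1 + 1 = 3), which is why at least one replicate must show clear clustering.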

The questionnaire responses outlining the various protocols used by all the labs were summarized using frequency tabulations. The following parameters were calculated to summarize the overall recovery and viability for each of the thawed PBMC donor samples: mean, median, minimum, and maximum.

For laboratories that participated in the second step, the individual laboratory results from the second step were presented in the same format as in the first step. Within an individual laboratory, the results from the two steps were compared by calculating the percent difference from the first step, the standard deviation of the means from both steps and the corresponding coefficient of variation (CV). The average CV across the ten donor antigen combinations was computed as a general indication of how similar the results were in the same laboratory at two different time points.
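The comparison between the two steps can be sketched as follows. The function names and the example values in the usage note are hypothetical, and the use of the sample standard deviation of the two step means is an assumption; the text does not state which estimator was applied.

```python
from statistics import mean, stdev


def compare_steps(step1_mean, step2_mean):
    """Compare the mean results of one lab for one donor-antigen combination.

    Returns the percent difference relative to step 1, the standard
    deviation of the two means, and the CV in percent.
    Assumes step1_mean is non-zero.
    """
    diff_pct = (step2_mean - step1_mean) / step1_mean * 100.0
    sd = stdev([step1_mean, step2_mean])
    cv = sd / mean([step1_mean, step2_mean]) * 100.0
    return diff_pct, sd, cv


def average_cv(pairs):
    """Average the CV over all (step 1, step 2) pairs of a laboratory."""
    return mean(compare_steps(a, b)[2] for a, b in pairs)
```

For example, a lab reporting 0.20% in step 1 and 0.30% in step 2 for the same combination would show a 50% difference from step 1 and a CV of about 28%.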

Results

All 27 participating centers received the necessary PBMC samples, commercially available HLA-peptide multimers and instructions, and were able to perform the requested stainings. All 29 expected questionnaires, report forms and dot plots were successfully collected for analysis via a web-based database specifically designed to administer and process large amounts of data from immune monitoring assays. The organizers also received and analyzed ten complete data sets from the second step. After each step, individualized reports were provided to all participating labs.

Human auditing process of final results

Before final analysis, all collected data sets were screened for obvious failures and inconsistencies. These included incorrect gating by some participants, uneven quality of electronic compensation of flow cytometry-based event acquisition (Figs. 1a, b, 2b) and aberrant reported values from two data sets that did not reflect the number of events shown in the corresponding dot plots. The latter originated in a systematic error when entering the event counts for all four quadrants into the spreadsheet (not shown). Together, these findings emphasize the importance of systematic auditing of assay results.

Fig. 1

The figure shows eight selected examples (a–h) of dot plots where gating led to reporting of an increased number of events in the upper right quadrant. All dot plots show the CD8 staining on the x-axis and the staining with the HLA-peptide multimer on the y-axis. Dot plots were chosen from laboratories ID01, ID10, ID28 and ID30. Under each dot plot the expected versus the reported (underlined) frequency of multimer-positive CD8 cells is indicated

Fig. 2

Panels a–e show representative dot plots from centers ID05, 17, 20, 21 and 22 reported for staining samples from donor 1 with the Influenza-M1 multimer. *Center ID21 set the analytical gate in such a way that multimer-negative cells are shown in the upper right quadrant. **Center ID22 used an atypical gating strategy in which CD8-negative cells were removed at an earlier step of the analysis. The inserted table (f) shows reported values for non-specific binding of the Melan-A/Mart-1-specific multimer in samples from triplicate analysis (T1, T2 and T3) of donor 1 performed by centers ID05, 17 and 20

Overview of assay protocols currently in use at the international level

Each participant provided detailed information about the experimental protocol and reagents used. It became clear that multimer labeling is currently performed using a broad variety of reagents and procedures. Supplementary table 1, which is available online, shows the distribution of labs for 11 variables with the potential to influence the sensitivity of the multimer labeling assay (multimer source, use of DNAse during thawing, counting method, type of flow cytometer used, staining performed in tubes or plates, conjugate staining order, number of fluorochromes, method for dead cell exclusion, anti-CD3 staining, use of a dump channel and antibodies used for co-staining) and provides a comprehensive overview of the protocols that were applied. When looking at a distinct subgroup of labs sharing one variable, it became clear that the expression of the other ten variables was still randomly distributed within the subgroups.

Inter- and intra-center variation

For our group of 27 laboratories we found an unexpectedly high variation among the 29 reported datasets for eight of the ten different donor-antigen combinations, with CVs ranging from 47 to 158 (Table 1). This was the case even for the three highest responses (Influenza in D2/D4/D5 with corresponding CVs of 47.2/93.7/57.1). The even higher variation found in donor 1 results from the fact that no or extremely low numbers of antigen-specific T cells were present in this donor. The liberal design of this panel provides a measure of the variation of results that may be representative of current immune monitoring of antigen-specific CD8+ T cell responses using the multimer-based assay.

Table 1 Percentage of CD8-specific multimer binding based on the mean of the triplicates

It is well established that the validation process of any diagnostic test, including cellular assays, should include the prior determination of accuracy, specificity, sensitivity, reliability, linearity and range determination as well as important precision parameters such as the intra-assay and inter-assay variation of results [25]. Nine labs repeated the panel at a separate time point. As one center generated two separate data sets with HLA-peptide tetramers as well as pentamers the whole group submitted ten complete datasets. In order to quantify the intra-lab variation we calculated the mean CD8+ specific T cell binding at each time and compared these means by determining the absolute and percentage-wise difference between the results from both steps (not shown) as well as the coefficient of variation of both results (details are shown in Supplementary table 2). Five participating centers reported pairs of results that were very close to each other (average intra-center CV < 30) for most of the eight depicted antigen-donor combinations showing that reproducibility within an acceptable range can be obtained for a broad range of detected antigen-specific T cell responses. The other five laboratories reported results for step 2 that were quite different from step 1 (average intra-center CV > 30). Consequently, intra-center variation was unexpectedly large for these labs.

Limitations of the multimer assay

A small group of five centers were able to detect a small but distinct population of Influenza-M1-specific CD8+ cells in donor D1 and reported an average value of 0.11% multimer-positive CD8+ T cells in this donor. Representative dot plots are shown in Fig. 2a–e. On the other hand none of the participating centers was able to detect Melan-A/Mart-1-specific CD8+ T cells in the same donor. We made use of the reported data from this donor to get more insights into the technical limitations of the technology. Our aim was to determine if the reported mean value of 0.11% Influenza-M1-specific CD8 T cells could indeed be considered as being above the normal variation of the background usually found in an average lab. ID21 set the analytical gates in such a way that a high number of clearly multimer-negative events fell into the upper right quadrant (Fig. 2d, Supplementary figures 1 and 2). ID22 used an atypical gating strategy which removed dead and CD8-negative cells and reported an extremely low background of non-specific multimer staining (Supplementary figure 2a). Three centers (ID05, ID17 and ID20) used a classical gating strategy and reported values for the non-specific binding of the Melan-A/Mart-1 multimer that were similar to results obtained by a large number of other participants and therefore appropriately represent the background staining normally found in a number of commonly used protocols (Supplementary figure 2a). These three centers reported a mean of 0.06% multimer-positive CD8+ T cells using the Melan-A/Mart-1 multimer in donor 1 with a standard deviation of 0.01% (Fig. 2f). Based on these results we roughly estimated a limit of detection (LOD) of 0.09% which resulted from adding three times the standard deviation to the mean. In addition we estimated a limit of quantification (LOQ) of 0.16% which resulted from adding ten times the standard deviation to the mean.
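The arithmetic behind these two estimates can be reproduced with a minimal Python sketch. The function name is hypothetical; the multipliers of 3 and 10 standard deviations and the background values (mean 0.06%, standard deviation 0.01%) are taken from the text.

```python
def detection_limits(background_mean, background_sd, k_lod=3, k_loq=10):
    """Estimate LOD and LOQ as background mean plus k standard deviations."""
    lod = background_mean + k_lod * background_sd
    loq = background_mean + k_loq * background_sd
    return lod, loq


# Background of the Melan-A/Mart-1 multimer in donor 1 (centers ID05, ID17, ID20)
lod, loq = detection_limits(0.06, 0.01)
print(f"LOD = {lod:.2f}%, LOQ = {loq:.2f}%")  # LOD = 0.09%, LOQ = 0.16%
```

By this estimate, the mean value of 0.11% reported for Influenza-M1 in donor D1 lies above the LOD but below the LOQ.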

Individual laboratory performance

A total of nine positive responses could have been detected in the five donors (for all details see Table 2) from which 1 was considered to be below the estimated limit of quantification (defined as <0.16%), 2 were low (defined as <0.2%), 5 were moderate (defined as ≥0.2 and ≤0.5%) and 1 was high (defined as >0.5%). Table 1 indicates the median, 25th, 75th percentile and mean, for each of the ten measurements. The report from the first step of the panel displayed the percentages for antigen-specific CD8+ cells from all 29 protocols for all ten antigen-donor combinations (Supplementary figures 1 and 2).

Table 2 Expected percentage of antigen-specific CD8 T cells within the distributed samples

To define if a participant had successfully detected a response, two acceptance criteria were introduced and applied. On one hand, replicates were excluded from further analysis when they had a high variation (defined as CV > 75). A total of 24 replicates (8.2%) were found to have unacceptably high variation and were therefore rejected (details are shown in Supplementary table 3). On the other hand, replicates that did not show clear and clustered populations of events in the upper right quadrants indicating CD8+ multimer+ cells, as assessed by an independent evaluator, were also excluded (details are shown in Supplementary table 4). Due to this rule a total of 101 replicates (34%) were discarded. Based on this definition of response, the whole group was able to detect 66% of all possibly detectable responses, with some labs not able to detect any responses and only a few detecting all nine responses (Table 3). Donor D1 with Melan-A/Mart-1 was a negative control donor. No lab reported a false positive response for this antigen-donor combination. Supplementary figure 3, which is also available as electronic supplementary material, shows selected results from the optical evaluation to allow the reader to discern the amount of clustering that was considered to be clearly negative (0), ambiguous result (1) or clearly positive (2). The figure also makes clear that optical evaluation of dot plots is a very subjective approach as an objective definition of a “clustered population” does not exist. Consequently, interpretation of results by optical evaluation should be regarded with caution, especially when very small populations are judged. However, one independent evaluator examined and rated all of the dot plots and applied the same criterion to all dot plots. Hence, there was a uniform criterion applied to all dot plots even though a subjective definition was used.

Table 3 Number of labs that detected each possible number of accepted responses

Subgroup-analysis

One aim of the CVC multimer panel was to deduce recommendations for initial harmonization guidelines of the assay. One clear finding from the panel was that the number of CD8+ cells counted per replicate critically determines the chance that an immune response is detected. The majority of results were based on collection of 30,000–100,000 CD8 positive (CD8+) events. In the subgroup in which less than 10,000 CD8+ cells were collected, only 13% detected a response (Table 4). In contrast, 87% of replicates with more than 100,000 accumulated CD8+ cells detected a response correctly. There was a statistically significant difference (P = 0.001, Chi square test) when comparing the response detection rates in triplicates with at least 100,000 CD8+ cells counted (39/45 = 87%) versus less than 100,000 CD8+ cells counted (133/218 = 61%).

Table 4 Number of CD8-positive cells counted and proportion of responses detected

We then focused only on the results obtained for the four lowest responses (Melan-A/Mart-1 response for donors D2, D4 or D5 and Influenza-M1 response for donor D1). The lower part of Table 4 shows that 12% (4/33) of the lower responses were detected by triplicates with less than 30,000 positive CD8+ cells counted, whereas 80% were detected by triplicates with more than 100,000 CD8+ cells counted. There was a statistically significant difference (P = 0.001, Chi square test) when comparing the response detection rates in triplicates with at least 100,000 CD8+ cells counted (16/20 = 80%) versus less than 100,000 CD8+ cells counted (38/96 = 40%).
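The two comparisons above can be re-computed from the reported counts with a short Python sketch. It assumes the uncorrected Pearson Chi square statistic for a 2 × 2 table with one degree of freedom; the text does not state whether a continuity correction was applied, and the function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson Chi square test (no continuity correction) for [[a, b], [c, d]].

    Rows are the two groups; columns are detected / not detected.
    Returns the statistic and the p-value for one degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For df = 1 the Chi square survival function equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# All responses: 39/45 detected (>= 100,000 CD8+ cells) vs. 133/218 (< 100,000)
chi2_all, p_all = chi_square_2x2(39, 45 - 39, 133, 218 - 133)

# Four lowest responses: 16/20 detected vs. 38/96
chi2_low, p_low = chi_square_2x2(16, 20 - 16, 38, 96 - 38)

print(f"all responses: chi2 = {chi2_all:.2f}, p = {p_all:.4f}")
print(f"low responses: chi2 = {chi2_low:.2f}, p = {p_low:.4f}")
```

Under this assumption both p-values come out at roughly 0.001, in line with the reported values.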

A second source of variation was related to electronic gate setting during data analysis. Most of the labs that reported out-of-range high values of specific multimer binding did so because they set the gates too low, thereby increasing the number of events in the upper right quadrant (Fig. 1). Thus, a method that helps to set the gates correctly would lead to a reduction of the inter-lab variation.

Although the majority of labs (66%) were not able to detect eight or nine responses, ten of the participating labs were able to do so. Analysis of the protocols used by these ten high performing labs indicated that comparable results could be obtained with protocols that differed widely from each other (not shown). For instance, high performing labs were found among those that (1) reported low or high cell viability after thawing, (2) used tetramers as well as pentamers, (3) used or did not use DNAse for thawing, (4) used manual or automated cell counting methods, (5) stained in tubes or in plates, (6) first stained with the multimer or simultaneously stained with multimer and antibodies for co-staining, (7) used three or four fluorochromes, (8) used or did not use dead cell staining, (9) used or did not use anti-CD3 staining, (10) used or did not use a dump channel, or (11) reported low, medium or high values for non-specific multimer binding. The only subgroup that did not include any of the ten high performing labs was the one that only used two fluorochromes (anti-CD8 vs. HLA-peptide multimer) for the staining (two laboratories).

To identify additional variables that may influence the sensitivity of the assay, we used the results from the questionnaires to define 2–3 subgroups for 11 protocol variables. We then compared the ability to successfully detect one of the nine responses for all different subgroups of labs. Supplementary table 5 summarizes the lab characteristics and reports the number of detected responses and the average proportion of detected responses for each subgroup. Responses were more likely to be detected if labs had more than 75% viability (vs. less than 75%), used pentamers (vs. tetramers), used DNAse (vs. no DNAse), used a manual counting method (vs. machine), did multimer and then co-staining (vs. simultaneously), used three or four colors (vs. two colors), did not do CD3 staining (vs. did CD3 staining) or used a dump channel (vs. no dump channel). It is important to stress that the labs within each subgroup applied protocols that differed widely from each other. This means that differences found between some of the subgroups in our panel do not formally prove that the variable that was selected to define the subgroup was the defining factor that influenced the performance of those labs. Although we were not able to deduce additional recommendations from this analysis we classify the subgroups that showed the highest differences as interesting targets for more systematic analysis in future proficiency panels.

Discussion

The specific aims of the first CVC-CRI multimer proficiency panel presented in this manuscript were to establish the foundations for developing a robust and validated MHC multimer assay which can be harmonized across immune monitoring laboratories by: (1) demonstrating the feasibility of such a large international inter-laboratory testing project for HLA-peptide multimers, (2) identifying protocol variables, reagent choices and strategies that are relevant for assay protocol optimization, (3) providing each participating lab direct feedback about their qualitative and quantitative performance in relation to the other members of the group, (4) quantifying the variation of results reported by such a large number of labs (inter-center variation), and (5) determining the variation of results obtained within the same lab when the same protocol is applied to the same samples at two different time points (intra-center variation).

The results from this panel demonstrated the feasibility of such a large inter-laboratory testing project for MHC-peptide multimer staining. Importantly, they reveal the wide variety of different protocols and strategies used with multimers, which appears representative of the most commonly used protocols worldwide. This unique situation allowed us to determine the average amount of non-specific multimer binding normally observed in a broad range of commonly applied protocols (Fig. 2f, Supplementary figure 2a). Based on these data we estimated an LOD and LOQ. Such performance-related thresholds determined by data sets obtained in large-scale quality assurance programs are valuable tools to more objectively rank the performance of participating labs and might be used for assay certification purposes in the future.

The results obtained in this CVC-CRI multimer panel allowed, for the first time, the quantification of the actual variation in results between different labs across the United States and Europe. The high inter- and intra-center variation were unexpected and stand in contrast to a report from a smaller European group [26]. Although the performance of cellular immunoassays tested in single centers or within selected groups that either used one common protocol [27] or went through intensified validation and harmonization steps appears quite robust [28, 29], it is important to point out that the performance of this assay in a non-selected group of labs is clearly not robust and results are subject to a large degree of variation. The results generated by this large international panel can now serve the scientific community as a benchmark for future similar projects. Every newly introduced measure to harmonize the technique should lead to a reduction of the variation found in this panel.

Our results emphasize the urgent need for harmonization of the multimer labeling technique. Despite this conclusion, it is important to state that the presented data should not be used to justify the uncritical discarding of data obtained by HLA-peptide multimer staining. Recently a systematic analysis of cell-based immunoassays clearly showed that ELISPOT, cytokine flow cytometry as well as HLA-peptide multimer staining can be used to precisely and stably quantify antigen-specific T cell responses within a broad range of frequencies within highly validated laboratories [30]. The fact that a third of the labs involved in this proficiency panel were able to detect eight or nine of all responses included in the distributed samples and that five out of ten labs were able to (qualitatively and quantitatively) reproduce their results in a second step throughout a broad range of T cell responses of different frequencies demonstrates that staining with HLA-peptide multimers is a sensitive technique that can be reliably used to quantify antigen-specific T cells in clinical trials. At the same time, these observations pose major challenges for assay harmonization as it is clear that two-thirds of the labs involved had difficulty in accurately and reproducibly measuring specific T cell frequencies.

Initial harmonization guidelines to improve assay performance

Based on the results of this multimer panel, we outline initial assay harmonization guidelines addressing four areas of immunomonitoring by HLA-peptide multimers (Table 5). These guidelines include basic recommendations (a) for the staining procedure, (b) the analysis of data, (c) the auditing of data obtained before they are released and (d) the qualification of staff members involved in the analysis. Use of these guidelines may substantially reduce assay variability while allowing individual laboratories to use their respective assay protocols.

Table 5 MHC-peptide multimer harmonization guidelines to optimize assay performance

Standard operating procedure for HLA-peptide multimer staining

Three clear recommendations for the staining procedure can be deduced from the analysis of inter-laboratory result variation. These are to (1) use a protocol that includes more than two colors for staining, (2) evaluate at least 100,000 CD8+ T cells and (3) use a background control to set the gates.

Although a simple two-color staining is adequate to characterize the HLA-peptide multimer binding capacity of T cell lines or clones, it is clear that such a simple protocol does not allow reliable quantification of the frequency of antigen-specific T cells in peripheral blood specimens, which include non-CD8 T cell subsets that either specifically or non-specifically bind specific detection reagents. Staining of non-CD8 T cells becomes a major problem when low frequencies of antigen-specific T cells need to be visualized, which is the case for ex vivo detection of most virus-specific T cell responses that are regularly found to represent less than 1% of CD8+ T cells in chronic viral infections [31] or tumor-specific T cells that are present in even lower numbers [32, 33]. One clear recommendation from this panel is to use additional fluorochromes and establish a methodology which minimizes background; this conclusion is supported by an earlier report elaborating the relevant factors for staining with HLA-peptide multimers [34]. The use of additional fluorochromes is technically feasible and easily allows gating to remove dead cells, B cells, monocytes, CD4+ cells or NK cells to focus more specifically on the cell population of interest, thereby reducing the chance of displaying cells that can contribute non-relevant events.

The second recommendation, to increase the number of counted cells, confirms data from an inter-laboratory comparison conducted by the European CIMT Monitoring panel [35], which demonstrated the high impact of this parameter.
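The effect of the number of counted cells on precision can be illustrated with a minimal sketch. It assumes that the count of multimer-positive events follows Poisson statistics, so that the coefficient of variation (CV) from counting alone is approximately one over the square root of the expected number of positive events. The function names and the frequency of 0.1% are hypothetical and chosen only for illustration; they are not taken from the panel data.

```python
import math


def expected_positive_events(n_cd8_events: int, frequency: float) -> float:
    """Expected number of multimer-positive events among acquired CD8+ events."""
    if n_cd8_events <= 0:
        raise ValueError("n_cd8_events must be positive")
    if not 0.0 < frequency <= 1.0:
        raise ValueError("frequency must be in (0, 1]")
    return n_cd8_events * frequency


def counting_cv(n_cd8_events: int, frequency: float) -> float:
    """Approximate CV from counting alone, assuming Poisson statistics.

    Under this assumption the CV is 1 / sqrt(expected positive events).
    """
    return 1.0 / math.sqrt(expected_positive_events(n_cd8_events, frequency))


if __name__ == "__main__":
    assumed_frequency = 0.001  # hypothetical: 0.1% of CD8+ T cells
    for n_events in (10_000, 100_000, 1_000_000):
        cv = counting_cv(n_events, assumed_frequency)
        print(f"{n_events:>9} CD8+ events: counting CV ~ {cv:.1%}")
```

Under these assumptions, acquiring 100,000 CD8+ events at a frequency of 0.1% yields about 100 positive events and a counting CV of about 10%, whereas 10,000 events yield about 10 positive events and a CV of about 32%. This estimate covers counting error only and ignores variation introduced by staining and gating.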

A strategy that helps to set gates correctly would prevent the errors found in this panel and may greatly reduce inter-laboratory variation and optimize the levels of detection and quantification, as these critically depend on the observed background noise. In this regard, the use of a “negative” control staining to define the background multimer labeling remains controversial. Centers that already apply negative control stainings use one of the following three strategies. (1) Fluorescence minus one (“FMO”) stainings that include all antibodies/dead cell markers but no HLA-peptide multimer. (2) Negative control stainings with HLA-peptide multimers specific for well defined peptides from human immunodeficiency virus (HIV). (3) Commercially available preparations of irrelevant HLA-peptide multimers. An alternative strategy may be to search for new negative control peptides and to systematically test HLA-peptide multimers containing these putative candidates for their suitability as negative controls in patients with cancer, infectious and autoimmune diseases. All strategies that are commonly used to define the background of HLA-peptide multimer staining have specific advantages and disadvantages, and as yet they have not been systematically compared with each other. Regardless, it is clear that a negative control defining the background staining of HLA-peptide multimers should be used as a guide for standardized gate setting. Valuable software solutions that apply mathematical algorithms to set gates independent of the operator’s subjective view are in development but are so far not available for broad use [36].
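One way a negative control could guide gate setting is sketched below. This is an illustration under stated assumptions, not the procedure used by any participating center: it assumes that per-event multimer fluorescence intensities of the gated CD8+ population are available as plain lists, and it places the gate at a chosen percentile of the negative control (nearest-rank method). The function names and the percentile default are hypothetical.

```python
import math
from typing import Sequence


def gate_threshold(control_intensities: Sequence[float],
                   percentile: float = 99.9) -> float:
    """Nearest-rank percentile of the negative-control multimer signal.

    Assumption: events above this value are treated as multimer-positive.
    """
    if not control_intensities:
        raise ValueError("control_intensities must not be empty")
    if not 0.0 < percentile <= 100.0:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(control_intensities)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def multimer_positive_frequency(intensities: Sequence[float],
                                threshold: float) -> float:
    """Fraction of events whose multimer signal exceeds the threshold."""
    if not intensities:
        raise ValueError("intensities must not be empty")
    positives = sum(1 for value in intensities if value > threshold)
    return positives / len(intensities)


if __name__ == "__main__":
    # Hypothetical toy intensities, for illustration only.
    control = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    sample = [5, 9, 10, 12]
    threshold = gate_threshold(control, percentile=90)
    print("gate threshold:", threshold)
    print("sample frequency:", multimer_positive_frequency(sample, threshold))
    print("control frequency:", multimer_positive_frequency(control, threshold))
```

Applying the same threshold to the control and to the test staining makes the residual background frequency explicit, so that it can be reported alongside the frequency measured in the sample. Which type of negative control supplies the intensities (FMO, HIV-specific or irrelevant multimer) is left open here, as in the text above.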

In addition to these three initial recommendations, we identified variables that are highly likely to influence test sensitivity as primary targets for systematic analysis (Supplementary table 5). The influence of these variables will be systematically addressed in future proficiency panels.

Training of staff members

Although staining with MHC-peptide multimers seems, at first sight, to be a simple and straightforward procedure, our results have shown that even one and the same batch of HLA-peptide multimers can lead to different results if applied in different staining protocols. The high values for intra-center variability detected in five of the ten labs participating in a repetition of the experiment showed that even repeating a staining in the same lab with the same protocol does not automatically lead to similar results (Supplementary table 2). Each step, from the collection of cell material to the reporting of results, adds sources of variation that separately contribute to the variation of results obtained either by different labs worldwide or by the same lab at different time points. It is thus necessary that staff members performing the measurements be aware of the sources of variation within their responsibility and be trained in all measures needed to keep variation as low as possible. The regular and specific education of staff members and mechanisms to control their performance are well established, mandatory requirements for all laboratories working under rules defining good manufacturing practice, but they have so far not been systematically applied to immunomonitoring of clinical samples. A previous ELISPOT proficiency panel conducted by the CVC showed that establishing SOPs for assay validation and staff training clearly led to an increased performance of participating labs [37]. We therefore propose to implement formalized, assay-specific education of all staff members involved in immune monitoring with HLA-peptide multimers before they are allowed to process clinical material.

In conclusion, our study reveals the variety of HLA-peptide multimer-based assay formats currently in use in the tumor immunology and immunotherapy community. Moreover, it has allowed us to measure the magnitude of both inter- and intra-laboratory result variation. We have gained detailed insight into the technical limitations of commonly applied protocols and provide estimates of the expected limits of detection and quantification. Finally, we derived recommendations that should guide the future assay harmonization efforts spearheaded by the CVC-CRI as well as other consortia active in this domain.