Statistical Analysis of ELISPOT Assays
Cytokine ELISPOT assays have emerged as a powerful tool for the detection of rare antigen-specific T cells in freshly isolated cell material, such as blood. While ELISPOT assays allow one to directly visualize and count extremely low frequencies of cytokine-secreting T cells among millions of nonsecreting bystander cells, the interpretation of ELISPOT data can become ambiguous when (a) spot numbers in antigen-containing wells are low, (b) spot counts in negative control wells are elevated, and particularly (c) when both of the above occur simultaneously. Thus, the primary task, even before statistics are employed, must be the optimization of the basic assay parameters and reagents such that the assay yields low background signal in the negative-control wells and the maximal number of antigen-induced spots in test wells, i.e., the signal-to-noise ratio is maximized. Furthermore, the use of proper spot-size gating parameters for data analysis is indispensable for screening out irrelevant background spots, and thus increasing the signal-to-noise ratio. The goal of most ELISPOT experiments is to identify positive T-cell responses as defined by a significantly elevated spot count in antigen-stimulated wells over the nonstimulated medium-control or negative-control antigen. In this chapter, we conclude that – with some limitations – the T-Test and related statistical methods which rely on the assumption of normal distribution are suitable for identifying positive ELISPOT results.
Key words: ELISPOT, Statistical analysis, Signal to noise, Spot-size gating parameters, PBMC, ImmunoSpot, MATLAB, ANOVA, Wilcoxon Rank Sum Test, T-Test, Poisson distribution
Because ELISPOT typically aims for the detection of rare antigen-specific cells within a variable background, the notion of “fuzzy in, fuzzy out” applies to ELISPOT data – perhaps more so than for other immunoassays. Statistical analysis cannot itself substitute for experimental stringency in performing ELISPOT assays which provide clear, unambiguous results. Preceding the statistical analysis of ELISPOT data, therefore, great attention must be given to establishing spot counts (which are in part based on the statistical analysis of the spot-size distributions) before the counts obtained within an experiment are subject to further statistical analysis.
1.1 “Precise in”: Optimizing Signal-to-Noise
1.1.1 The Choice of Membrane
In cytokine ELISPOT assays, cytokine secretion from individual cells is measured. In typical experiments, 96-well PVDF membrane plates are precoated with a cytokine-specific capture antibody, e.g., a suitable anti-IFN-γ antibody. Due to their fractal surface and high hydrophobicity, PVDF membranes have been found to outperform most other membranes previously used for cytokine ELISPOT assays (1), and their use is therefore highly recommended for obtaining optimal results. Historically, cytokine ELISPOT assays were fragile and hard to reproduce before the introduction of PVDF membranes.
1.1.2 The Choice of Cell Numbers Plated
In the next step of the assay, the test cell material (e.g., human PBMC) is plated into the precoated wells, both with and without antigen. For T-cell assays, the test cells need to be dense enough to allow for optimal contact between T cells and antigen-presenting cells (APC), but the cells must not be overcrowded, as it is essential that each T cell sits directly on the membrane so that its secretory product can be effectively captured. For human PBMC, 100,000–800,000 cells can be plated per well into 96-well plates with the number of antigen-induced spots per number of PBMC plated following a linear relationship (2). (In contrast, when a monolayer of APC is provided, even single T cells can be plated per well and tested (3)).
To economize on cells, ELISPOT assays are frequently performed with 100,000 PBMC per well. However, when low-frequency T cells are to be assayed, increasing cell numbers up to 800,000 per well increases the signal in a directly linear fashion without disproportionately introducing noise into the system. When ambiguous results are obtained through the testing of low cell numbers (e.g., 100,000 per well in 96-well plates), retesting these PBMC with higher cell numbers can provide clear results. The cells can be readily retested, since protocols have been developed that permit an investigator to freeze PBMC such that their functionality in ELISPOT assays is retained and is effectively identical to that of fresh cells upon thawing (4). Also, 6-well PVDF membrane plates are being introduced that increase the sample size per well tenfold relative to 96-well plates, thus increasing the signal-to-noise resolution of ELISPOT assays tenfold.
1.1.3 Establishing the Correct Spot Counts in Antigen-Stimulated Wells
Upon antigen stimulation, the antigen-specific T cells engage in the secretion of cytokine which is captured around the cell by the membrane-bound capture antibody. Overall, the size and density of the resulting cytokine “spot” reflects the quantity of cytokine produced by the cell, and the morphology of the spots reveals the secretion kinetics: fuzzy spots reflect rapid secretion kinetics, whereas sharp spots are indicative of slow analyte release (5) (see also Chapter 11 in this volume). Invariably, T-cell ELISPOT assays provide a wide spectrum of spot sizes and morphologies, irrespective of whether T-cell clones or primary antigen-specific T-cell populations are studied, and of the analyte (3, 6). Thus, the question arises as to how to reliably establish spot counts, i.e., how to distinguish whether larger spots result from cell clustering, and how to determine the smallest spots which still are to be counted. Tremendous variations in counts can be seen in the absence of clear, unambiguous criteria for counting. Such criteria can be objectively established, however, by understanding the rules that underlie spot-size distributions. For T-cell clones, as well as primary T cells secreting various cytokines, it has been determined that the spot-size distribution always follows a log-normal distribution (3, 6). Thus, proper analysis establishes the spot-size distribution for an assay and subjects the size distribution curve to statistical analysis, automatically setting the gates at 99.7% confidence to establish upper and lower limits, respectively, for the largest and smallest spot sizes which still belong to that distribution. In this way, spots that exceed the upper size gate can be recognized as clusters, and spots that are smaller than the lower limit are gated out. Such statistical analysis permits objective, user-independent counting, while spot counts established without such analysis are subjective, unreliable numbers contributing to “fuzzy in.”
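The gating principle described above can be sketched in a few lines of Python. This is only a minimal illustration, not the algorithm of any particular reader software: it assumes that spot sizes are log-normally distributed, estimates the mean and standard deviation of the log-transformed sizes, and places the gates at the mean ± 3 SD on the log scale, which covers about 99.7% of a normal distribution. The function names `size_gates` and `classify` and the example sizes are hypothetical.

```python
import math
import statistics

def size_gates(spot_sizes, n_sd=3.0):
    """Return (lower, upper) size gates at mean +/- n_sd SD of the log sizes.

    Assumes log-normally distributed spot sizes; n_sd=3 corresponds to
    roughly 99.7% coverage of the fitted distribution.
    """
    logs = [math.log(s) for s in spot_sizes]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)
    return math.exp(mu - n_sd * sigma), math.exp(mu + n_sd * sigma)

def classify(size, lower, upper):
    """Label a single spot relative to the size gates."""
    if size < lower:
        return "gated out"   # smaller than the lower limit
    if size > upper:
        return "cluster"     # larger than the upper limit
    return "counted"

# Hypothetical spot sizes (arbitrary units) from one antigen-stimulated well.
example_sizes = [12.0, 15.5, 9.8, 20.1, 14.2, 11.7, 18.3, 13.0]
lower, upper = size_gates(example_sizes)
print(f"size gates: {lower:.1f} to {upper:.1f}")
print(classify(500.0, lower, upper))
```

Note that in this simple sketch extreme spots are included when the distribution is estimated, which widens the gates; a production implementation would need a more robust fit.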
1.1.4 Correct Spot Counts in Negative-Control Wells
ELISPOT assays aim at establishing frequencies of antigen-specific T cells, i.e., measuring antigen-induced spots over medium-control or irrelevant antigen. The negative-control wells, however, can also contain cytokine-secreting cells (which generate spots). Such background spots typically are produced by cells of the innate immune system which also secrete the cytokine in question. One of the primary challenges of ELISPOT analysis is to differentiate between such background spots and antigen-induced, T cell-derived spots. This frequently can be accomplished by size gating. T cells typically have a substantially higher cytokine secretion rate (resulting in larger spot-size distributions) than cells of the innate immune system (resulting in a smaller spot-size spectrum) (7). Thus, analogous to gating in flow cytometry, the smaller spot-size distributions seen in the negative controls can be gated out, allowing for identification of the larger, T cell-derived spots in the antigen-stimulated wells. ELISPOT gating can be done automatically (and thus, objectively) following statistical principles (for more on this topic, please refer to Chapter 13 in this volume). Medium-control and antigen-stimulated spot counts generated without such counting principles are literally meaningless numbers (much like ungated flow cytometry frequencies), and provide a “fuzzy in” for which subsequent statistical analysis cannot compensate.
1.1.5 Lowering the Medium Background: Thus Increasing Signal to Noise
ELISPOT assays, like most cellular assays, have traditionally been performed using culture media which contain serum. Serum, however, is a major assay variable. It contains cytokines that stimulate or suppress the test PBMC, often resulting in increased background spot production (by triggering cytokine secretion by cells of the innate immune system) that is not necessarily accompanied by an increased antigen-specific response by T cells. Because the cytokines present in serum bind to high-affinity receptors on PBMC, even brief exposure of PBMC to a stimulatory or suppressive serum during freezing/thawing or washing the PBMC can effectively ruin an ELISPOT assay. The use of serum-free media for freezing, washing, and testing is, therefore, highly recommended, as it typically provides higher signal-to-noise ratios than even the “best” sera selected for this purpose (2).
1.2 Testing Antigen- and Medium-Control Wells in Replicates
Because in PBMC, antigen-specific T cells typically occur in low frequencies, ELISPOT assays frequently need to detect small increases in antigen-induced spot counts versus the negative control. Therefore, to increase the statistical relevancy of the resulting spot counts, it has become convention to measure cytokine production in medium-control and antigen-containing wells in replicate wells, typically triplicates. After proper counting of replicate wells for each condition (assuring a “high resolution in” – see above), further analysis can be done to establish whether there are indeed more spots seen in the antigen-containing wells. Many times, the spot counts in replicate wells are consistent, and the differences between medium-control and antigen-stimulated wells are large – in such cases, the interpretation of the results is clear-cut. However, the interpretation of ELISPOT data can become ambiguous when (a) the spot numbers in antigen-induced wells are low, (b) spot counts in negative-control wells are elevated, and (c) when both of the former occur simultaneously. Here is where the grey area of ELISPOT data interpretation lies, with different solutions, mostly empirical, propagated to resolve the problem (8). Further, the choice of triplicates is entirely empirical, with its statistical foundation not having been sufficiently clearly delineated – more on this below.
1.3 Counting Apoptotic Cells in Addition to Live and Dead
Many times, immunizations cause only a moderate increase in the frequency of antigen-specific T cells. Such differences can be detected when comparing antigen-induced responses in preimmunization PBMC samples with PBMC obtained at various time points following immunization. Such longitudinal testing is mostly done with cells that have been stored/shipped for many hours, sometimes days, before being frozen and/or tested. Protracted handling can damage the cells, resulting in increased numbers of dead as well as apoptotic cells. Standard Trypan Blue cell counting can discern between live and dead cells, but does not detect cells that have entered the irreversible path of apoptosis. Apoptotic cells are still alive, and are counted as such with Trypan Blue and other live/dead counting methods. Apoptotic T cells should be reckoned as effectively dead for functional assays, however, because they are refractory to antigen stimulation and will be dead before they could produce cytokine. ELISPOT assay harmonization panels have, therefore, repeatedly emphasized the need to count apoptotic cells, in addition to live and dead cells, and to correct the live count by subtracting the apoptotic count. Various dyes are available to stain apoptotic, live, and dead cells with different colors for visual counting using standard hemocytometers and UV microscopy or, more conveniently and precisely, by image analysis or flow cytometry.
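The correction of the live count described above is simple arithmetic. The following sketch illustrates it; the function name and the example counts are hypothetical, and it assumes that apoptotic cells were included in the live count, as is the case with Trypan Blue.

```python
def functional_cell_count(live, apoptotic):
    """Live count corrected by subtracting the apoptotic count.

    Both counts must be in the same units (e.g., cells per mL). Assumes
    apoptotic cells were scored as live by the live/dead method.
    """
    if apoptotic > live:
        raise ValueError("apoptotic count cannot exceed the live count")
    return live - apoptotic

# Hypothetical sample: 2.0 million live cells per mL, 0.4 million apoptotic.
live_per_ml = 2_000_000
apoptotic_per_ml = 400_000
functional = functional_cell_count(live_per_ml, apoptotic_per_ml)
overestimate = live_per_ml / functional
print(f"functional cells per mL: {functional}")
print(f"uncorrected count overestimates functional cells {overestimate:.2f}-fold")
```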
1.4 Empirical Versus Statistical Evaluation of Medium-Control Versus Antigen-Induced Spot Numbers
A key issue in ELISPOT data evaluation is setting the threshold beyond which a response should be considered positive. Several different approaches are described in the literature, which may be classified into two main categories: empirical and statistical (8). Empirical approaches generally do not have sound theoretical justifications, but can rather be regarded as rules of thumb. Commonly used approaches of this kind require an arbitrarily designated minimum difference between the mean spot counts for the antigen-containing wells and the negative-control wells, mostly medium. Other empirical rules demand that the ratio of antigen-elicited spots to spot numbers in the medium control exceed a certain value – here, a twofold increase is often considered to be the cutoff for a positive response. Various combinations of these rules have also been proposed, such as combining a minimum threshold R0 for the spot count per 10^6 cells with a minimum threshold C0 for the ratio of antigen to control; responses exceeding both values are designated as positive, and the two thresholds may be chosen in such a way as to obtain a false-positive rate <0.01 (9). The advantage of empirical rules is their simplicity. Their major drawback is their lack of reliability when it comes to detecting weak responses. Because they do not rely on explicit model assumptions that could be tested, and thus have no precise theoretical foundation, empirical rules have no claim to universality or general applicability.
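To make the structure of such empirical rules concrete, the following sketch combines a minimum difference with a minimum ratio. The function name and the threshold values (`min_diff=10`, `min_ratio=2`) are hypothetical and chosen only for illustration; they are not the thresholds of any cited study.

```python
from statistics import mean

def empirical_call(antigen, control, min_diff=10.0, min_ratio=2.0):
    """Empirical positivity rule on mean spot counts (illustrative only).

    A response is called positive when the mean antigen count exceeds the
    mean control count by at least min_diff spots AND by at least a factor
    of min_ratio. Thresholds are hypothetical.
    """
    a, c = mean(antigen), mean(control)
    diff_ok = (a - c) >= min_diff
    ratio_ok = a >= min_ratio * c  # written this way to avoid dividing by zero
    return diff_ok and ratio_ok

control = [5, 6, 4]           # mean 5
tight_antigen = [19, 20, 21]  # mean 20, consistent replicates
noisy_antigen = [2, 8, 50]    # mean 20, highly variable replicates

print(empirical_call(tight_antigen, control))
print(empirical_call(noisy_antigen, control))
```

Both antigen triplicates receive the same call because the rule looks only at the means; the replicate-to-replicate variability does not enter the decision.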
In contrast to empirical rules, statistical tests rely on a theoretical background. Here, the observed data (or a statistic derived from the data, such as the mean) is tested for its compatibility with a (usually, purely random) null model (null hypothesis) to obtain a p-value. The p-value is the probability, under the chosen null model, of observing data at least as extreme as those actually observed; low p-values indicate evidence that is highly improbable under the null model and suggest a rejection of the null hypothesis. Thus, the p-value itself delivers a quantitative measurement of probability, and not a binary yes/no decision for the identification of a positive response. A positive response is then usually called when the p-value is below a certain threshold called the significance level α, for which values of 0.05 or 0.01 are typically chosen. Hence, given that the null hypothesis is true, the probability of a Type I error (false-positive call) would be 5% (for α = 0.05) or 1% (for α = 0.01).
Note that with this definition the p-value always needs to be interpreted in relation to the particular null model under consideration; in fact, it could also be regarded as a measurement of distance from a certain null model. Moreover, the null model often relies on specific assumptions, and the violation of these assumptions leads to invalid results. Exactly this point is presently the bottleneck for scientifically validated statistical analyses of ELISPOT data, since no data or predictions are currently available regarding the distribution and inter-/independence of ELISPOT data points (spots).
Since distributional assumptions are not yet available for ELISPOT data evaluation, some researchers advocate the use of resampling-based statistical approaches that do not involve such assumptions (10). These approaches allow single or multiple testing corrections to be performed against a randomly permuted null distribution in one step. Simulation studies with data drawn from Poisson and Negative Binomial Distributions (which account for an overdispersion effect) have been conducted for various assays. The authors advocate the usage of permutation tests, which require the generation of a background distribution using a large number (e.g., 10,000) of random permutations and comparison of the value of the test statistic observed for the real (unpermuted) data against this background distribution. An empirical p-value can then be computed as the fraction of test statistics in the permuted background dataset with a value greater than or equal to the observed value. To generate a sufficiently large background distribution (in the case of low replicate numbers), the permutations can be performed over all different antigens tested (10). A general problem with the global permutation approach is that the p-value of one antigen is dependent on the responses to other antigens, so the method may fail to recognize a moderate response in the presence of a large response against one of the other antigens (11). Thus, the permutation of the controls and of each antigen separately has been proposed (8, 11). Generally, resampling-based techniques are rather computationally intensive (which, however, is not a great problem with modern computers) and often not as easily available to many experimental researchers as standard testing methods.
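The per-antigen permutation idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the cited authors: it uses the difference in mean spot counts as the test statistic, computes a one-sided p-value, and enumerates all possible relabelings exactly instead of drawing 10,000 random permutations, which is feasible because the replicate numbers are small. The function name and the spot counts are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(antigen, control):
    """One-sided exact permutation p-value for mean(antigen) - mean(control).

    Every way of assigning the pooled wells to an 'antigen' group of the
    original size is enumerated; the p-value is the fraction of relabelings
    whose statistic is at least as large as the observed one.
    """
    pooled = list(antigen) + list(control)
    n = len(antigen)
    observed = mean(antigen) - mean(control)
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, n):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        stat = mean(group_a) - mean(group_b)
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
        total += 1
    return hits / total

# Hypothetical triplicates: antigen-stimulated wells vs. medium control.
p = permutation_p_value([30, 35, 28], [5, 8, 6])
print(f"one-sided permutation p-value: {p:.3f}")
```

The observed labeling is always one of the enumerated relabelings, so the p-value can never be zero.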
Furthermore, when only a low number of replicates is available, as is typically the case in ELISPOT experiments, the p-value distribution can become “granular” and, due to the limited number of different permutations possible, it may become impossible to obtain p-values below a certain threshold. Empirical evaluation of ELISPOT data or resampling-based statistics is a valid approach to ELISPOT analysis only when the distributional properties of spots are not known. Therefore, we set out to establish those experimentally, allowing for the introduction of assumption-based statistics.
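The granularity of permutation p-values with few replicates can be quantified directly. When one antigen is permuted against its own controls, the number of distinct relabelings is the binomial coefficient C(n + m, n) for n antigen wells and m control wells, so the smallest attainable one-sided p-value is its reciprocal. The sketch below tabulates this bound; the function name is hypothetical.

```python
from math import comb

def min_permutation_p(n_antigen, n_control):
    """Smallest one-sided p-value an exact two-group permutation test can give."""
    return 1 / comb(n_antigen + n_control, n_antigen)

# Equal numbers of antigen and control replicates.
for n in (3, 4, 5, 6):
    print(f"{n} vs {n} replicates: {comb(2 * n, n)} relabelings, "
          f"minimum p = {min_permutation_p(n, n):.4f}")
```

With triplicates there are only 20 relabelings, so a one-sided p-value cannot fall below 0.05, and a threshold of 0.01 can only be reached with at least five replicates per group.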
Regular T-Test statistics can be computed using standard spreadsheet software, such as Microsoft Excel which is included in the Microsoft Office Suite.
For advanced statistical analysis of ELISPOT data (like fitting linear models, testing distributional assumptions, or permutation tests), dedicated statistical software packages are required. A prominent example is the freely available statistical software package R (12), which can be downloaded from http://cran.r-project.org/. R is a very powerful statistical programming language that can be extended by a wide range of additional packages. To exploit the full power of R, some skills in computer programming are recommended. Similarly, the commercial MATLAB software (http://www.mathworks.com/) includes a statistics toolbox which may also be used for analyzing ELISPOT data.
Alternatively, other commercial statistical software packages featuring a graphical user interface are available, like GraphPad Prism® (http://www.graphpad.com/prism/prism.htm), SPSS (http://www.spss.com), SAS (http://www.sas.com), or Statistica (http://www.statsoft.com/).
3.1 Normal Distribution of ELISPOT Data for >30 Spots per Well
3.2 Poisson Distribution of ELISPOT Data with 15 Spots or Fewer per Well
The use of the Negative Binomial Distribution has been proposed to model count data for ELISPOT counts (10). This distribution can be regarded as an extension of the Poisson distribution. It has two parameters which allow modeling of both the variance and the mean, and it is thus able to handle the overdispersion problem. A commonly used remedy in cases of assumed deviations from the normal distribution is the usage of nonparametric tests, such as the well-known Wilcoxon Rank Sum Test. This might entail a loss of power in the case of normally distributed data, but can provide improved power over the T-Test in cases of considerable deviation from the normal distribution. A further concern is that data originating from two different distributions need to be compared when the medium background falls into the <20 spots category while the antigen-induced spots are >30 (see Note 1).
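How the second parameter of the Negative Binomial Distribution accommodates overdispersion can be seen from its first two moments. The expression below assumes one common parameterization, with mean μ and dispersion parameter r; the source does not specify which parameterization was used in (10).

```latex
\mathrm{E}[X] = \mu, \qquad
\mathrm{Var}[X] = \mu + \frac{\mu^{2}}{r}, \qquad r > 0
```

The variance always exceeds the mean, and the excess term μ²/r vanishes as r grows, so that the Poisson case (variance equal to the mean) is recovered in the limit r → ∞.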
3.3 The Pragmatic Solution: The Applicability of the T-Test for ELISPOT Data Analysis
For pragmatic purposes, it is important to know how the Poisson distribution of low-count data affects the performance of the T-Test (which requires normally distributed data) in the low-count range. Our data from simulations indicate that even for Poisson-distributed data in the low-count range, no increased number of false positives for a significance level of 0.05 is to be expected. However, the T-Test may fail to identify weak responses. The extent to which the power of the T-Test is affected by a Poisson distribution in the low-signal range needs further investigation. Increasing the number of replicates in any case increases the probability of detecting weak responses. The triplicates typically used are certainly a minimum. The determination of a reasonable number of replicates, however, depends on several variables and often requires an advanced statistical power analysis for a particular experimental setup and expected effect size (see Note 2).
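A simulation of the kind referred to above can be sketched as follows. This is not the authors' simulation; all settings are assumptions made for illustration: triplicate wells for both antigen and control, both drawn from the same Poisson distribution (so every significant result is a false positive), a two-sided pooled-variance T-Test at α = 0.05 (critical value 2.776 for 4 degrees of freedom), and comparisons with zero pooled variance treated as not testable and counted as negative.

```python
import math
import random
from statistics import mean, variance

T_CRIT_DF4 = 2.776  # two-sided 5% critical value of Student's t, 4 df

def poisson(rng, lam):
    """Draw one Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def t_statistic(a, b):
    """Pooled-variance two-sample t statistic; None if the variance is zero."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        return None  # undefined; treated as not testable by the caller
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def false_positive_rate(lam, n_sim=20000, seed=1):
    """Fraction of null comparisons (same Poisson mean) called significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        a = [poisson(rng, lam) for _ in range(3)]
        b = [poisson(rng, lam) for _ in range(3)]
        t = t_statistic(a, b)
        if t is not None and abs(t) > T_CRIT_DF4:
            hits += 1
    return hits / n_sim

# Hypothetical low-count spot means per well.
for lam in (2.0, 5.0, 10.0):
    print(f"mean {lam:>4} spots/well: false-positive rate "
          f"{false_positive_rate(lam):.3f}")
```

The same scaffold could be extended to estimate power by drawing the antigen wells from a Poisson distribution with a higher mean than the control wells.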
In conclusion, the usage of empirical rules, such as using fold changes with a fixed threshold, should in general be discouraged. Albeit enjoying some popularity, such empirical rules do not incorporate any information about variance, and thus provide no measurement of confidence. Instead, the application of solid statistical tests is highly recommended. The question of which statistical tests are best suited for ELISPOT evaluation is an important one, and is certainly not completely resolved (see Note 3). Our results from transfected cell lines indeed suggest a Poisson distribution, but only for data in the low-count range. For spot counts >15, no significant deviation from the normal distribution could be observed, indicating that spot counts in that range can be assumed to be normally distributed. This implies that most standard statistical tests, which are valid for normally distributed data, like the T-Test and ANOVA models, are applicable in this setting. In any case, more investigations (experimental data as well as computer simulations) are needed not only to assess potential effects of deviations from normality in the low spot-count range, but also to develop novel methods for the analysis of ELISPOT data.
Although it is essentially a nonparametric test, the Wilcoxon Rank Sum Test requires the data to be independent and identically distributed (iid). While the first assumption of independence is not affected, the second assumption of identical distributions is obviously violated. Whether this has practical implications for the statistical analysis of ELISPOT data has not yet been investigated, and still requires more in-depth research.
On a practical note, it is recommended to perform all tests using at least triplicates, and if the results fall along the borderline, to retest cryopreserved aliquots at higher cell numbers, and with more replicates.
An important aspect here is the understanding of the distribution of ELISPOT data. Since the assay provides counts of individual antigen-specific T cells that are randomly distributed during pipetting into the wells, on theoretical grounds, one might expect ELISPOT data to follow a Poisson distribution.
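For reference, the Poisson model mentioned above has a single parameter λ, the expected number of spots per well, which is both its mean and its variance:

```latex
P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots, \qquad
\mathrm{E}[X] = \mathrm{Var}[X] = \lambda, \qquad
\mathrm{CV} = \frac{\sqrt{\lambda}}{\lambda} = \frac{1}{\sqrt{\lambda}}
```

As a worked example, a well with an expected count of λ = 4 spots has a standard deviation of 2 spots (a coefficient of variation of 50%), whereas λ = 100 gives a standard deviation of 10 spots (10%). Under this model, the relative scatter between replicate wells is therefore largest in the low-count range, and the distribution becomes increasingly symmetric and close to normal as λ grows.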
- 2. Zhang, W., Caspell, R., Karulin, A. Y., Ahmad, M., Haicheur, N., Abdelsalam, A., et al. (2009) ELISPOT assays provide reproducible results among different laboratories for T-cell immune monitoring – even in hands of ELISPOT-inexperienced investigators. J Immunotoxicol 6, 227–234.
- 7. Guerkov, R. E., Targoni, O. S., Kreher, C. R., Boehm, B. O., Herrera, M. T., Tary-Lehmann, M., et al. (2003) Detection of low-frequency antigen-specific IL-10-producing CD4(+) T cells via ELISPOT in PBMC: cognate vs. nonspecific production of the cytokine. J Immunol Methods 279, 111–121.
- 9. Dubey, S., Clair, J., Fu, T. M., Guan, L., Long, R., Mogg, R., et al. (2007) Detection of HIV vaccine-induced cell-mediated immunity in HIV-seronegative clinical trial participants using an optimized and validated enzyme-linked immunospot assay. J Acquir Immune Defic Syndr 45, 20–27.
- 12. R Development Core Team. (2009) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.